* [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches
2022-12-22 2:34 [Patch v3 0/9] NUMA aware page table's pages allocation Vipin Sharma
@ 2022-12-22 2:34 ` Vipin Sharma
2022-12-27 18:37 ` Ben Gardon
` (2 more replies)
2022-12-22 2:34 ` [Patch v3 2/9] KVM: x86/mmu: Remove zapped_obsolete_pages from struct kvm_arch{} Vipin Sharma
` (7 subsequent siblings)
8 siblings, 3 replies; 46+ messages in thread
From: Vipin Sharma @ 2022-12-22 2:34 UTC (permalink / raw)
To: seanjc, pbonzini, bgardon, dmatlack; +Cc: kvm, linux-kernel, Vipin Sharma
mmu_shrink_scan() is very disruptive to VMs. It picks the first
VM in the vm_list, zaps the oldest page which is most likely an upper
level SPTE and most likely to be reused. Prior to TDP MMU, this is even
more disruptive in the nested VMs case, considering L1 SPTEs will be the
oldest even though most of the entries are for L2 SPTEs.
As discussed in
https://lore.kernel.org/lkml/Y45dldZnI6OIf+a5@google.com/
shrinker logic has not been very useful in actually keeping VMs performant
and reducing memory usage.
Change mmu_shrink_scan() to free pages from the vCPU's shadow page
cache. Freeing pages from cache doesn't cause vCPU exits, therefore, a
VM's performance should not be affected.
This also allows to change cache capacities without worrying too much
about high memory usage in cache.
Tested this change by running dirty_log_perf_test while dropping cache
via "echo 2 > /proc/sys/vm/drop_caches" at 1 second interval
continuously. There were WARN_ON(!mc->nobjs) messages printed in kernel
logs from kvm_mmu_memory_cache_alloc(), which is expected.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
arch/x86/include/asm/kvm_host.h | 5 +
arch/x86/kvm/mmu/mmu.c | 163 +++++++++++++++++++-------------
arch/x86/kvm/mmu/mmu_internal.h | 2 +
arch/x86/kvm/mmu/tdp_mmu.c | 3 +-
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 11 ++-
6 files changed, 114 insertions(+), 71 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index aa4eb8cfcd7e..89cc809e4a00 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -786,6 +786,11 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
+ /*
+ * Protects change in size of mmu_shadow_page_cache cache.
+ */
+ spinlock_t mmu_shadow_page_cache_lock;
+
/*
* QEMU userspace and the guest each have their own FPU state.
* In vcpu_run, we switch between the user and guest FPU contexts.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 254bc46234e0..157417e1cb6e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -164,7 +164,10 @@ struct kvm_shadow_walk_iterator {
static struct kmem_cache *pte_list_desc_cache;
struct kmem_cache *mmu_page_header_cache;
-static struct percpu_counter kvm_total_used_mmu_pages;
+/*
+ * Total number of unused pages in MMU shadow page cache.
+ */
+static struct percpu_counter kvm_total_unused_mmu_pages;
static void mmu_spte_set(u64 *sptep, u64 spte);
@@ -655,6 +658,22 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
}
}
+static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
+ spinlock_t *cache_lock)
+{
+ int orig_nobjs;
+ int r;
+
+ spin_lock(cache_lock);
+ orig_nobjs = cache->nobjs;
+ r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL);
+ if (orig_nobjs != cache->nobjs)
+ percpu_counter_add(&kvm_total_unused_mmu_pages,
+ (cache->nobjs - orig_nobjs));
+ spin_unlock(cache_lock);
+ return r;
+}
+
static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
{
int r;
@@ -664,8 +683,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
if (r)
return r;
- r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
- PT64_ROOT_MAX_LEVEL);
+ r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
+ &vcpu->arch.mmu_shadow_page_cache_lock);
if (r)
return r;
if (maybe_indirect) {
@@ -678,10 +697,25 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
PT64_ROOT_MAX_LEVEL);
}
+static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
+ spinlock_t *cache_lock)
+{
+ int orig_nobjs;
+
+ spin_lock(cache_lock);
+ orig_nobjs = cache->nobjs;
+ kvm_mmu_free_memory_cache(cache);
+ if (orig_nobjs)
+ percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
+
+ spin_unlock(cache_lock);
+}
+
static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
{
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
- kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
+ mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
+ &vcpu->arch.mmu_shadow_page_cache_lock);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
}
@@ -1693,27 +1727,15 @@ static int is_empty_shadow_page(u64 *spt)
}
#endif
-/*
- * This value is the sum of all of the kvm instances's
- * kvm->arch.n_used_mmu_pages values. We need a global,
- * aggregate version in order to make the slab shrinker
- * faster
- */
-static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
-{
- kvm->arch.n_used_mmu_pages += nr;
- percpu_counter_add(&kvm_total_used_mmu_pages, nr);
-}
-
static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
- kvm_mod_used_mmu_pages(kvm, +1);
+ kvm->arch.n_used_mmu_pages++;
kvm_account_pgtable_pages((void *)sp->spt, +1);
}
static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
- kvm_mod_used_mmu_pages(kvm, -1);
+ kvm->arch.n_used_mmu_pages--;
kvm_account_pgtable_pages((void *)sp->spt, -1);
}
@@ -2150,8 +2172,31 @@ struct shadow_page_caches {
struct kvm_mmu_memory_cache *page_header_cache;
struct kvm_mmu_memory_cache *shadow_page_cache;
struct kvm_mmu_memory_cache *shadowed_info_cache;
+ /*
+ * Protects change in size of shadow_page_cache cache.
+ */
+ spinlock_t *shadow_page_cache_lock;
};
+void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
+ spinlock_t *cache_lock)
+{
+ int orig_nobjs;
+ void *page;
+
+ if (!cache_lock) {
+ spin_lock(cache_lock);
+ orig_nobjs = shadow_page_cache->nobjs;
+ }
+ page = kvm_mmu_memory_cache_alloc(shadow_page_cache);
+ if (!cache_lock) {
+ if (orig_nobjs)
+ percpu_counter_dec(&kvm_total_unused_mmu_pages);
+ spin_unlock(cache_lock);
+ }
+ return page;
+}
+
static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
struct shadow_page_caches *caches,
gfn_t gfn,
@@ -2161,7 +2206,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
struct kvm_mmu_page *sp;
sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
- sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
+ sp->spt = kvm_mmu_sp_memory_cache_alloc(caches->shadow_page_cache,
+ caches->shadow_page_cache_lock);
if (!role.direct)
sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
@@ -2218,6 +2264,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
.page_header_cache = &vcpu->arch.mmu_page_header_cache,
.shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
.shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
+ .shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
};
return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
@@ -5916,6 +5963,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+ spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
vcpu->arch.mmu = &vcpu->arch.root_mmu;
vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
@@ -6051,11 +6099,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
kvm_tdp_mmu_zap_invalidated_roots(kvm);
}
-static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
-{
- return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
-}
-
static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot,
struct kvm_page_track_notifier_node *node)
@@ -6277,6 +6320,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
/* Direct SPs do not require a shadowed_info_cache. */
caches.page_header_cache = &kvm->arch.split_page_header_cache;
caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
+ caches.shadow_page_cache_lock = NULL;
/* Safe to pass NULL for vCPU since requesting a direct SP. */
return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
@@ -6646,66 +6690,49 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
static unsigned long
mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
- struct kvm *kvm;
- int nr_to_scan = sc->nr_to_scan;
+ struct kvm_mmu_memory_cache *cache;
+ struct kvm *kvm, *first_kvm = NULL;
unsigned long freed = 0;
+ /* spinlock for memory cache */
+ spinlock_t *cache_lock;
+ struct kvm_vcpu *vcpu;
+ unsigned long i;
mutex_lock(&kvm_lock);
list_for_each_entry(kvm, &vm_list, vm_list) {
- int idx;
- LIST_HEAD(invalid_list);
-
- /*
- * Never scan more than sc->nr_to_scan VM instances.
- * Will not hit this condition practically since we do not try
- * to shrink more than one VM and it is very unlikely to see
- * !n_used_mmu_pages so many times.
- */
- if (!nr_to_scan--)
+ if (first_kvm == kvm)
break;
- /*
- * n_used_mmu_pages is accessed without holding kvm->mmu_lock
- * here. We may skip a VM instance errorneosly, but we do not
- * want to shrink a VM that only started to populate its MMU
- * anyway.
- */
- if (!kvm->arch.n_used_mmu_pages &&
- !kvm_has_zapped_obsolete_pages(kvm))
- continue;
+ if (!first_kvm)
+ first_kvm = kvm;
+ list_move_tail(&kvm->vm_list, &vm_list);
- idx = srcu_read_lock(&kvm->srcu);
- write_lock(&kvm->mmu_lock);
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ cache = &vcpu->arch.mmu_shadow_page_cache;
+ cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock;
+ if (READ_ONCE(cache->nobjs)) {
+ spin_lock(cache_lock);
+ freed += kvm_mmu_empty_memory_cache(cache);
+ spin_unlock(cache_lock);
+ }
- if (kvm_has_zapped_obsolete_pages(kvm)) {
- kvm_mmu_commit_zap_page(kvm,
- &kvm->arch.zapped_obsolete_pages);
- goto unlock;
}
- freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
-
-unlock:
- write_unlock(&kvm->mmu_lock);
- srcu_read_unlock(&kvm->srcu, idx);
-
- /*
- * unfair on small ones
- * per-vm shrinkers cry out
- * sadness comes quickly
- */
- list_move_tail(&kvm->vm_list, &vm_list);
- break;
+ if (freed >= sc->nr_to_scan)
+ break;
}
+ if (freed)
+ percpu_counter_sub(&kvm_total_unused_mmu_pages, freed);
mutex_unlock(&kvm_lock);
+ percpu_counter_sync(&kvm_total_unused_mmu_pages);
return freed;
}
static unsigned long
mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
{
- return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
+ return percpu_counter_sum_positive(&kvm_total_unused_mmu_pages);
}
static struct shrinker mmu_shrinker = {
@@ -6820,7 +6847,7 @@ int kvm_mmu_vendor_module_init(void)
if (!mmu_page_header_cache)
goto out;
- if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
+ if (percpu_counter_init(&kvm_total_unused_mmu_pages, 0, GFP_KERNEL))
goto out;
ret = register_shrinker(&mmu_shrinker, "x86-mmu");
@@ -6830,7 +6857,7 @@ int kvm_mmu_vendor_module_init(void)
return 0;
out_shrinker:
- percpu_counter_destroy(&kvm_total_used_mmu_pages);
+ percpu_counter_destroy(&kvm_total_unused_mmu_pages);
out:
mmu_destroy_caches();
return ret;
@@ -6847,7 +6874,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
void kvm_mmu_vendor_module_exit(void)
{
mmu_destroy_caches();
- percpu_counter_destroy(&kvm_total_used_mmu_pages);
+ percpu_counter_destroy(&kvm_total_unused_mmu_pages);
unregister_shrinker(&mmu_shrinker);
}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index ac00bfbf32f6..c2a342028b6a 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -325,4 +325,6 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
+ spinlock_t *cache_lock);
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 764f7c87286f..4974fa96deff 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -264,7 +264,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
struct kvm_mmu_page *sp;
sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
- sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+ sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
+ &vcpu->arch.mmu_shadow_page_cache_lock);
return sp;
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 01aad8b74162..efd9b38ea9a2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1362,6 +1362,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
+int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc);
void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 13e88297f999..f2d762878b97 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -438,8 +438,10 @@ int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
return mc->nobjs;
}
-void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
+int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc)
{
+ int freed = mc->nobjs;
+
while (mc->nobjs) {
if (mc->kmem_cache)
kmem_cache_free(mc->kmem_cache, mc->objects[--mc->nobjs]);
@@ -447,8 +449,13 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
free_page((unsigned long)mc->objects[--mc->nobjs]);
}
- kvfree(mc->objects);
+ return freed;
+}
+void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
+{
+ kvm_mmu_empty_memory_cache(mc);
+ kvfree(mc->objects);
mc->objects = NULL;
mc->capacity = 0;
}
--
2.39.0.314.g84b9a713c41-goog
^ permalink raw reply related [flat|nested] 46+ messages in thread* Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches
2022-12-22 2:34 ` [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches Vipin Sharma
@ 2022-12-27 18:37 ` Ben Gardon
2022-12-28 22:07 ` Vipin Sharma
2022-12-29 21:54 ` David Matlack
2023-01-03 19:32 ` Mingwei Zhang
2 siblings, 1 reply; 46+ messages in thread
From: Ben Gardon @ 2022-12-27 18:37 UTC (permalink / raw)
To: Vipin Sharma; +Cc: seanjc, pbonzini, dmatlack, kvm, linux-kernel
On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
>
> mmu_shrink_scan() is very disruptive to VMs. It picks the first
> VM in the vm_list, zaps the oldest page which is most likely an upper
> level SPTEs and most like to be reused. Prior to TDP MMU, this is even
> more disruptive in nested VMs case, considering L1 SPTEs will be the
> oldest even though most of the entries are for L2 SPTEs.
>
> As discussed in
> https://lore.kernel.org/lkml/Y45dldZnI6OIf+a5@google.com/
> shrinker logic has not be very useful in actually keeping VMs performant
> and reducing memory usage.
>
> Change mmu_shrink_scan() to free pages from the vCPU's shadow page
> cache. Freeing pages from cache doesn't cause vCPU exits, therefore, a
> VM's performance should not be affected.
>
> This also allows to change cache capacities without worrying too much
> about high memory usage in cache.
>
> Tested this change by running dirty_log_perf_test while dropping cache
> via "echo 2 > /proc/sys/vm/drop_caches" at 1 second interval
> continuously. There were WARN_ON(!mc->nobjs) messages printed in kernel
> logs from kvm_mmu_memory_cache_alloc(), which is expected.
Oh, that's not a good thing. I don't think we want to be hitting those
warnings. For one, kernel warnings should not be expected behavior,
probably for many reasons, but at least because Syzbot will find it.
In this particular case, we don't want to hit that because in that
case we'll try to do a GFP_ATOMIC, which can fail, and if it fails,
we'll BUG:
void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
{
void *p;
if (WARN_ON(!mc->nobjs))
p = mmu_memory_cache_alloc_obj(mc, GFP_ATOMIC | __GFP_ACCOUNT);
else
p = mc->objects[--mc->nobjs];
BUG_ON(!p);
return p;
}
Perhaps the risk of actually panicking is small, but it probably
indicates that we need better error handling around failed allocations
from the cache.
Or, the slightly less elegant approach might be to just hold the cache
lock around the cache topup and use of pages from the cache, but
adding better error handling would probably be cleaner.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> ---
> arch/x86/include/asm/kvm_host.h | 5 +
> arch/x86/kvm/mmu/mmu.c | 163 +++++++++++++++++++-------------
> arch/x86/kvm/mmu/mmu_internal.h | 2 +
> arch/x86/kvm/mmu/tdp_mmu.c | 3 +-
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 11 ++-
> 6 files changed, 114 insertions(+), 71 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index aa4eb8cfcd7e..89cc809e4a00 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -786,6 +786,11 @@ struct kvm_vcpu_arch {
> struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> struct kvm_mmu_memory_cache mmu_page_header_cache;
>
> + /*
> + * Protects change in size of mmu_shadow_page_cache cache.
> + */
> + spinlock_t mmu_shadow_page_cache_lock;
> +
> /*
> * QEMU userspace and the guest each have their own FPU state.
> * In vcpu_run, we switch between the user and guest FPU contexts.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 254bc46234e0..157417e1cb6e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -164,7 +164,10 @@ struct kvm_shadow_walk_iterator {
>
> static struct kmem_cache *pte_list_desc_cache;
> struct kmem_cache *mmu_page_header_cache;
> -static struct percpu_counter kvm_total_used_mmu_pages;
> +/*
> + * Total number of unused pages in MMU shadow page cache.
> + */
> +static struct percpu_counter kvm_total_unused_mmu_pages;
>
> static void mmu_spte_set(u64 *sptep, u64 spte);
>
> @@ -655,6 +658,22 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> }
> }
>
> +static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> + spinlock_t *cache_lock)
> +{
> + int orig_nobjs;
> + int r;
> +
> + spin_lock(cache_lock);
> + orig_nobjs = cache->nobjs;
> + r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL);
> + if (orig_nobjs != cache->nobjs)
> + percpu_counter_add(&kvm_total_unused_mmu_pages,
> + (cache->nobjs - orig_nobjs));
> + spin_unlock(cache_lock);
> + return r;
> +}
> +
> static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> {
> int r;
> @@ -664,8 +683,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> if (r)
> return r;
> - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> - PT64_ROOT_MAX_LEVEL);
> + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> + &vcpu->arch.mmu_shadow_page_cache_lock);
> if (r)
> return r;
> if (maybe_indirect) {
> @@ -678,10 +697,25 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> PT64_ROOT_MAX_LEVEL);
> }
>
> +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> + spinlock_t *cache_lock)
> +{
> + int orig_nobjs;
> +
> + spin_lock(cache_lock);
> + orig_nobjs = cache->nobjs;
> + kvm_mmu_free_memory_cache(cache);
> + if (orig_nobjs)
> + percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
> +
> + spin_unlock(cache_lock);
> +}
> +
> static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> {
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> - kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> + &vcpu->arch.mmu_shadow_page_cache_lock);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> }
> @@ -1693,27 +1727,15 @@ static int is_empty_shadow_page(u64 *spt)
> }
> #endif
>
> -/*
> - * This value is the sum of all of the kvm instances's
> - * kvm->arch.n_used_mmu_pages values. We need a global,
> - * aggregate version in order to make the slab shrinker
> - * faster
> - */
> -static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
> -{
> - kvm->arch.n_used_mmu_pages += nr;
> - percpu_counter_add(&kvm_total_used_mmu_pages, nr);
> -}
> -
> static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> - kvm_mod_used_mmu_pages(kvm, +1);
> + kvm->arch.n_used_mmu_pages++;
> kvm_account_pgtable_pages((void *)sp->spt, +1);
> }
>
> static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> - kvm_mod_used_mmu_pages(kvm, -1);
> + kvm->arch.n_used_mmu_pages--;
> kvm_account_pgtable_pages((void *)sp->spt, -1);
> }
>
> @@ -2150,8 +2172,31 @@ struct shadow_page_caches {
> struct kvm_mmu_memory_cache *page_header_cache;
> struct kvm_mmu_memory_cache *shadow_page_cache;
> struct kvm_mmu_memory_cache *shadowed_info_cache;
> + /*
> + * Protects change in size of shadow_page_cache cache.
> + */
> + spinlock_t *shadow_page_cache_lock;
> };
>
> +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> + spinlock_t *cache_lock)
> +{
> + int orig_nobjs;
> + void *page;
> +
> + if (!cache_lock) {
> + spin_lock(cache_lock);
> + orig_nobjs = shadow_page_cache->nobjs;
> + }
I believe this is guaranteed to cause a null pointer dereference.
> + page = kvm_mmu_memory_cache_alloc(shadow_page_cache);
> + if (!cache_lock) {
> + if (orig_nobjs)
> + percpu_counter_dec(&kvm_total_unused_mmu_pages);
> + spin_unlock(cache_lock);
Again, this will cause a null-pointer dereference. The check above
just needs to be inverted.
> + }
> + return page;
> +}
> +
> static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> struct shadow_page_caches *caches,
> gfn_t gfn,
> @@ -2161,7 +2206,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
> - sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
> + sp->spt = kvm_mmu_sp_memory_cache_alloc(caches->shadow_page_cache,
> + caches->shadow_page_cache_lock);
> if (!role.direct)
> sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
>
> @@ -2218,6 +2264,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> + .shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
> };
>
> return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
> @@ -5916,6 +5963,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> + spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
>
> vcpu->arch.mmu = &vcpu->arch.root_mmu;
> vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> @@ -6051,11 +6099,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> kvm_tdp_mmu_zap_invalidated_roots(kvm);
> }
>
> -static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> -{
> - return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> -}
> -
> static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot,
> struct kvm_page_track_notifier_node *node)
> @@ -6277,6 +6320,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
> /* Direct SPs do not require a shadowed_info_cache. */
> caches.page_header_cache = &kvm->arch.split_page_header_cache;
> caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
> + caches.shadow_page_cache_lock = NULL;
>
> /* Safe to pass NULL for vCPU since requesting a direct SP. */
> return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
> @@ -6646,66 +6690,49 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> static unsigned long
> mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> {
> - struct kvm *kvm;
> - int nr_to_scan = sc->nr_to_scan;
> + struct kvm_mmu_memory_cache *cache;
> + struct kvm *kvm, *first_kvm = NULL;
> unsigned long freed = 0;
> + /* spinlock for memory cache */
> + spinlock_t *cache_lock;
> + struct kvm_vcpu *vcpu;
> + unsigned long i;
>
> mutex_lock(&kvm_lock);
>
> list_for_each_entry(kvm, &vm_list, vm_list) {
> - int idx;
> - LIST_HEAD(invalid_list);
> -
> - /*
> - * Never scan more than sc->nr_to_scan VM instances.
> - * Will not hit this condition practically since we do not try
> - * to shrink more than one VM and it is very unlikely to see
> - * !n_used_mmu_pages so many times.
> - */
> - if (!nr_to_scan--)
> + if (first_kvm == kvm)
> break;
> - /*
> - * n_used_mmu_pages is accessed without holding kvm->mmu_lock
> - * here. We may skip a VM instance errorneosly, but we do not
> - * want to shrink a VM that only started to populate its MMU
> - * anyway.
> - */
> - if (!kvm->arch.n_used_mmu_pages &&
> - !kvm_has_zapped_obsolete_pages(kvm))
> - continue;
> + if (!first_kvm)
> + first_kvm = kvm;
> + list_move_tail(&kvm->vm_list, &vm_list);
>
> - idx = srcu_read_lock(&kvm->srcu);
I think we still want to do the SRCU read lock here to prevent
use-after-free on the vCPUs.
> - write_lock(&kvm->mmu_lock);
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + cache = &vcpu->arch.mmu_shadow_page_cache;
> + cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock;
> + if (READ_ONCE(cache->nobjs)) {
> + spin_lock(cache_lock);
> + freed += kvm_mmu_empty_memory_cache(cache);
Would it make sense to just have kvm_mmu_empty_memory_cache()
decrement the per-cpu counter itself? I don't think there's much perf
to be gained by reducing percpu counter updates here and it would
consolidate the bookkeeping.
> + spin_unlock(cache_lock);
> + }
>
> - if (kvm_has_zapped_obsolete_pages(kvm)) {
> - kvm_mmu_commit_zap_page(kvm,
> - &kvm->arch.zapped_obsolete_pages);
> - goto unlock;
> }
>
> - freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
> -
> -unlock:
> - write_unlock(&kvm->mmu_lock);
> - srcu_read_unlock(&kvm->srcu, idx);
> -
> - /*
> - * unfair on small ones
> - * per-vm shrinkers cry out
> - * sadness comes quickly
> - */
Nooooo, don't delete the beautiful poem!
> - list_move_tail(&kvm->vm_list, &vm_list);
> - break;
> + if (freed >= sc->nr_to_scan)
> + break;
> }
>
> + if (freed)
> + percpu_counter_sub(&kvm_total_unused_mmu_pages, freed);
> mutex_unlock(&kvm_lock);
> + percpu_counter_sync(&kvm_total_unused_mmu_pages);
> return freed;
> }
>
> static unsigned long
> mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> {
> - return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
> + return percpu_counter_sum_positive(&kvm_total_unused_mmu_pages);
This will return 0 if the sum of all the per-cpu counters is negative.
It should never be negative though. Might be nice to add a warning if
we would get a negative sum.
> }
>
> static struct shrinker mmu_shrinker = {
> @@ -6820,7 +6847,7 @@ int kvm_mmu_vendor_module_init(void)
> if (!mmu_page_header_cache)
> goto out;
>
> - if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
> + if (percpu_counter_init(&kvm_total_unused_mmu_pages, 0, GFP_KERNEL))
> goto out;
>
> ret = register_shrinker(&mmu_shrinker, "x86-mmu");
> @@ -6830,7 +6857,7 @@ int kvm_mmu_vendor_module_init(void)
> return 0;
>
> out_shrinker:
> - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> out:
> mmu_destroy_caches();
> return ret;
> @@ -6847,7 +6874,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> void kvm_mmu_vendor_module_exit(void)
> {
> mmu_destroy_caches();
> - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> unregister_shrinker(&mmu_shrinker);
> }
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index ac00bfbf32f6..c2a342028b6a 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -325,4 +325,6 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> + spinlock_t *cache_lock);
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 764f7c87286f..4974fa96deff 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -264,7 +264,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> - sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> + sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> + &vcpu->arch.mmu_shadow_page_cache_lock);
>
> return sp;
> }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 01aad8b74162..efd9b38ea9a2 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1362,6 +1362,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
> int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
> int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
> int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
> +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc);
> void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 13e88297f999..f2d762878b97 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -438,8 +438,10 @@ int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
> return mc->nobjs;
> }
>
> -void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc)
> {
> + int freed = mc->nobjs;
> +
> while (mc->nobjs) {
> if (mc->kmem_cache)
> kmem_cache_free(mc->kmem_cache, mc->objects[--mc->nobjs]);
> @@ -447,8 +449,13 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> free_page((unsigned long)mc->objects[--mc->nobjs]);
> }
>
> - kvfree(mc->objects);
> + return freed;
> +}
>
> +void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> +{
> + kvm_mmu_empty_memory_cache(mc);
> + kvfree(mc->objects);
> mc->objects = NULL;
> mc->capacity = 0;
> }
> --
> 2.39.0.314.g84b9a713c41-goog
>
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches
2022-12-27 18:37 ` Ben Gardon
@ 2022-12-28 22:07 ` Vipin Sharma
2022-12-29 21:15 ` David Matlack
0 siblings, 1 reply; 46+ messages in thread
From: Vipin Sharma @ 2022-12-28 22:07 UTC (permalink / raw)
To: Ben Gardon; +Cc: seanjc, pbonzini, dmatlack, kvm, linux-kernel
On Tue, Dec 27, 2022 at 10:37 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
> >
> > mmu_shrink_scan() is very disruptive to VMs. It picks the first
> > VM in the vm_list, zaps the oldest page which is most likely an upper
> > level SPTEs and most likely to be reused. Prior to TDP MMU, this is even
> > more disruptive in nested VMs case, considering L1 SPTEs will be the
> > oldest even though most of the entries are for L2 SPTEs.
> >
> > As discussed in
> > https://lore.kernel.org/lkml/Y45dldZnI6OIf+a5@google.com/
> > shrinker logic has not been very useful in actually keeping VMs performant
> > and reducing memory usage.
> >
> > Change mmu_shrink_scan() to free pages from the vCPU's shadow page
> > cache. Freeing pages from cache doesn't cause vCPU exits, therefore, a
> > VM's performance should not be affected.
> >
> > This also allows changing cache capacities without worrying too much
> > about high memory usage in cache.
> >
> > Tested this change by running dirty_log_perf_test while dropping cache
> > via "echo 2 > /proc/sys/vm/drop_caches" at 1 second interval
> > continuously. There were WARN_ON(!mc->nobjs) messages printed in kernel
> > logs from kvm_mmu_memory_cache_alloc(), which is expected.
>
> Oh, that's not a good thing. I don't think we want to be hitting those
> warnings. For one, kernel warnings should not be expected behavior,
> probably for many reasons, but at least because Syzbot will find it.
> In this particular case, we don't want to hit that because in that
> case we'll try to do a GFP_ATOMIC, which can fail, and if it fails,
> we'll BUG:
>
> void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
> {
> void *p;
>
> if (WARN_ON(!mc->nobjs))
> p = mmu_memory_cache_alloc_obj(mc, GFP_ATOMIC | __GFP_ACCOUNT);
> else
> p = mc->objects[--mc->nobjs];
> BUG_ON(!p);
> return p;
> }
>
> Perhaps the risk of actually panicking is small, but it probably
> indicates that we need better error handling around failed allocations
> from the cache.
> Or, the slightly less elegant approach might be to just hold the cache
> lock around the cache topup and use of pages from the cache, but
> adding better error handling would probably be cleaner.
I was counting on the fact that shrinker will ideally run only in
extreme cases, i.e. host is running on low memory. So, this WARN_ON
will only be rarely used. I was not aware of Syzbot, it seems like it
will be a concern if it does this kind of testing.
I thought about keeping a mutex, taking it during topup and releasing
it after the whole operation is done but I stopped it as the duration
of holding mutex will be long and might block the memory shrinker
longer. I am not sure though, if this is a valid concern.
I can't think of a better error handling in this situation. I can
change logic to hold mutex if the above mutex hold duration concern
won't be an issue compared to the current WARN_ON() approach.
>
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > ---
> > arch/x86/include/asm/kvm_host.h | 5 +
> > arch/x86/kvm/mmu/mmu.c | 163 +++++++++++++++++++-------------
> > arch/x86/kvm/mmu/mmu_internal.h | 2 +
> > arch/x86/kvm/mmu/tdp_mmu.c | 3 +-
> > include/linux/kvm_host.h | 1 +
> > virt/kvm/kvm_main.c | 11 ++-
> > 6 files changed, 114 insertions(+), 71 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index aa4eb8cfcd7e..89cc809e4a00 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -786,6 +786,11 @@ struct kvm_vcpu_arch {
> > struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> > struct kvm_mmu_memory_cache mmu_page_header_cache;
> >
> > + /*
> > + * Protects change in size of mmu_shadow_page_cache cache.
> > + */
> > + spinlock_t mmu_shadow_page_cache_lock;
> > +
> > /*
> > * QEMU userspace and the guest each have their own FPU state.
> > * In vcpu_run, we switch between the user and guest FPU contexts.
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 254bc46234e0..157417e1cb6e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -164,7 +164,10 @@ struct kvm_shadow_walk_iterator {
> >
> > static struct kmem_cache *pte_list_desc_cache;
> > struct kmem_cache *mmu_page_header_cache;
> > -static struct percpu_counter kvm_total_used_mmu_pages;
> > +/*
> > + * Total number of unused pages in MMU shadow page cache.
> > + */
> > +static struct percpu_counter kvm_total_unused_mmu_pages;
> >
> > static void mmu_spte_set(u64 *sptep, u64 spte);
> >
> > @@ -655,6 +658,22 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> > }
> > }
> >
> > +static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > + spinlock_t *cache_lock)
> > +{
> > + int orig_nobjs;
> > + int r;
> > +
> > + spin_lock(cache_lock);
> > + orig_nobjs = cache->nobjs;
> > + r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL);
> > + if (orig_nobjs != cache->nobjs)
> > + percpu_counter_add(&kvm_total_unused_mmu_pages,
> > + (cache->nobjs - orig_nobjs));
> > + spin_unlock(cache_lock);
> > + return r;
> > +}
> > +
> > static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > {
> > int r;
> > @@ -664,8 +683,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> > if (r)
> > return r;
> > - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > - PT64_ROOT_MAX_LEVEL);
> > + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > + &vcpu->arch.mmu_shadow_page_cache_lock);
> > if (r)
> > return r;
> > if (maybe_indirect) {
> > @@ -678,10 +697,25 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > PT64_ROOT_MAX_LEVEL);
> > }
> >
> > +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > + spinlock_t *cache_lock)
> > +{
> > + int orig_nobjs;
> > +
> > + spin_lock(cache_lock);
> > + orig_nobjs = cache->nobjs;
> > + kvm_mmu_free_memory_cache(cache);
> > + if (orig_nobjs)
> > + percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
> > +
> > + spin_unlock(cache_lock);
> > +}
> > +
> > static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> > {
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> > - kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> > + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > + &vcpu->arch.mmu_shadow_page_cache_lock);
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> > }
> > @@ -1693,27 +1727,15 @@ static int is_empty_shadow_page(u64 *spt)
> > }
> > #endif
> >
> > -/*
> > - * This value is the sum of all of the kvm instances's
> > - * kvm->arch.n_used_mmu_pages values. We need a global,
> > - * aggregate version in order to make the slab shrinker
> > - * faster
> > - */
> > -static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
> > -{
> > - kvm->arch.n_used_mmu_pages += nr;
> > - percpu_counter_add(&kvm_total_used_mmu_pages, nr);
> > -}
> > -
> > static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> > {
> > - kvm_mod_used_mmu_pages(kvm, +1);
> > + kvm->arch.n_used_mmu_pages++;
> > kvm_account_pgtable_pages((void *)sp->spt, +1);
> > }
> >
> > static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> > {
> > - kvm_mod_used_mmu_pages(kvm, -1);
> > + kvm->arch.n_used_mmu_pages--;
> > kvm_account_pgtable_pages((void *)sp->spt, -1);
> > }
> >
> > @@ -2150,8 +2172,31 @@ struct shadow_page_caches {
> > struct kvm_mmu_memory_cache *page_header_cache;
> > struct kvm_mmu_memory_cache *shadow_page_cache;
> > struct kvm_mmu_memory_cache *shadowed_info_cache;
> > + /*
> > + * Protects change in size of shadow_page_cache cache.
> > + */
> > + spinlock_t *shadow_page_cache_lock;
> > };
> >
> > +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> > + spinlock_t *cache_lock)
> > +{
> > + int orig_nobjs;
> > + void *page;
> > +
> > + if (!cache_lock) {
> > + spin_lock(cache_lock);
> > + orig_nobjs = shadow_page_cache->nobjs;
> > + }
>
> I believe this is guaranteed to cause a null pointer dereference.
>
> > + page = kvm_mmu_memory_cache_alloc(shadow_page_cache);
> > + if (!cache_lock) {
> > + if (orig_nobjs)
> > + percpu_counter_dec(&kvm_total_unused_mmu_pages);
> > + spin_unlock(cache_lock);
>
> Again, this will cause a null-pointer dereference. The check above
> just needs to be inverted.
Yes, I forgot to change it in the commit and one patch later in the
series removes this whole "if(!cache_lock)" condition so it skipped my
attention. Thanks for catching it.
>
> > + }
> > + return page;
> > +}
> > +
> > static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> > struct shadow_page_caches *caches,
> > gfn_t gfn,
> > @@ -2161,7 +2206,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> > struct kvm_mmu_page *sp;
> >
> > sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
> > - sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
> > + sp->spt = kvm_mmu_sp_memory_cache_alloc(caches->shadow_page_cache,
> > + caches->shadow_page_cache_lock);
> > if (!role.direct)
> > sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
> >
> > @@ -2218,6 +2264,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> > .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> > .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> > .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> > + .shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
> > };
> >
> > return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
> > @@ -5916,6 +5963,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> >
> > vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > + spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
> >
> > vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> > @@ -6051,11 +6099,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> > kvm_tdp_mmu_zap_invalidated_roots(kvm);
> > }
> >
> > -static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> > -{
> > - return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> > -}
> > -
> > static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> > struct kvm_memory_slot *slot,
> > struct kvm_page_track_notifier_node *node)
> > @@ -6277,6 +6320,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
> > /* Direct SPs do not require a shadowed_info_cache. */
> > caches.page_header_cache = &kvm->arch.split_page_header_cache;
> > caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
> > + caches.shadow_page_cache_lock = NULL;
> >
> > /* Safe to pass NULL for vCPU since requesting a direct SP. */
> > return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
> > @@ -6646,66 +6690,49 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> > static unsigned long
> > mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > {
> > - struct kvm *kvm;
> > - int nr_to_scan = sc->nr_to_scan;
> > + struct kvm_mmu_memory_cache *cache;
> > + struct kvm *kvm, *first_kvm = NULL;
> > unsigned long freed = 0;
> > + /* spinlock for memory cache */
> > + spinlock_t *cache_lock;
> > + struct kvm_vcpu *vcpu;
> > + unsigned long i;
> >
> > mutex_lock(&kvm_lock);
> >
> > list_for_each_entry(kvm, &vm_list, vm_list) {
> > - int idx;
> > - LIST_HEAD(invalid_list);
> > -
> > - /*
> > - * Never scan more than sc->nr_to_scan VM instances.
> > - * Will not hit this condition practically since we do not try
> > - * to shrink more than one VM and it is very unlikely to see
> > - * !n_used_mmu_pages so many times.
> > - */
> > - if (!nr_to_scan--)
> > + if (first_kvm == kvm)
> > break;
> > - /*
> > - * n_used_mmu_pages is accessed without holding kvm->mmu_lock
> > - * here. We may skip a VM instance errorneosly, but we do not
> > - * want to shrink a VM that only started to populate its MMU
> > - * anyway.
> > - */
> > - if (!kvm->arch.n_used_mmu_pages &&
> > - !kvm_has_zapped_obsolete_pages(kvm))
> > - continue;
> > + if (!first_kvm)
> > + first_kvm = kvm;
> > + list_move_tail(&kvm->vm_list, &vm_list);
> >
> > - idx = srcu_read_lock(&kvm->srcu);
>
> I think we still want to do the SRCU read lock here to prevent
> use-after-free on the vCPUs.
Since I am in mutex_lock(&kvm_lock), a kvm will not be removed from
kvm->vm_list, this will block kvm_destroy_vm() moving further to
destroy vcpus via kvm_arch_destroy_vm() > kvm_destroy_vcpus(). Do we
still need the srcu_read_lock()? Also, kvm_for_each_vcpu() using
xa_for_each_range() which uses RCU for traversing the loop, won't
these two be sufficient to avoid needing srcu_read_lock() here?
>
> > - write_lock(&kvm->mmu_lock);
> > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > + cache = &vcpu->arch.mmu_shadow_page_cache;
> > + cache_lock = vcpu->arch.mmu_shadow_page_cache_lock;
> > + if (READ_ONCE(cache->nobjs)) {
> > + spin_lock(cache_lock);
> > + freed += kvm_mmu_empty_memory_cache(cache);
>
> Would it make sense to just have kvm_mmu_empty_memory_cache()
> decrement the per-cpu counter itself? I don't think there's much perf
> to be gained by reducing percpu counter updates here and it would
> consolidate the bookkeeping.
kvm_mmu_empty_memory_cache() is also used by other caches for which
we are not keeping the count.
>
> > + spin_unlock(cache_lock);
> > + }
> >
> > - if (kvm_has_zapped_obsolete_pages(kvm)) {
> > - kvm_mmu_commit_zap_page(kvm,
> > - &kvm->arch.zapped_obsolete_pages);
> > - goto unlock;
> > }
> >
> > - freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
> > -
> > -unlock:
> > - write_unlock(&kvm->mmu_lock);
> > - srcu_read_unlock(&kvm->srcu, idx);
> > -
> > - /*
> > - * unfair on small ones
> > - * per-vm shrinkers cry out
> > - * sadness comes quickly
> > - */
>
> Nooooo, don't delete the beautiful poem!
I will fix this mistake in the next version, pardon my ignorance :)
>
> > - list_move_tail(&kvm->vm_list, &vm_list);
> > - break;
> > + if (freed >= sc->nr_to_scan)
> > + break;
> > }
> >
> > + if (freed)
> > + percpu_counter_sub(&kvm_total_unused_mmu_pages, freed);
> > mutex_unlock(&kvm_lock);
> > + percpu_counter_sync(&kvm_total_unused_mmu_pages);
> > return freed;
> > }
> >
> > static unsigned long
> > mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> > {
> > - return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
> > + return percpu_counter_sum_positive(&kvm_total_unused_mmu_pages);
>
> This will return 0 if the sum of all the per-cpu counters is negative.
> It should never be negative though. Might be nice to add a warning if
> we would get a negative sum.
>
Sounds good.
> > }
> >
> > static struct shrinker mmu_shrinker = {
> > @@ -6820,7 +6847,7 @@ int kvm_mmu_vendor_module_init(void)
> > if (!mmu_page_header_cache)
> > goto out;
> >
> > - if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
> > + if (percpu_counter_init(&kvm_total_unused_mmu_pages, 0, GFP_KERNEL))
> > goto out;
> >
> > ret = register_shrinker(&mmu_shrinker, "x86-mmu");
> > @@ -6830,7 +6857,7 @@ int kvm_mmu_vendor_module_init(void)
> > return 0;
> >
> > out_shrinker:
> > - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> > + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> > out:
> > mmu_destroy_caches();
> > return ret;
> > @@ -6847,7 +6874,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> > void kvm_mmu_vendor_module_exit(void)
> > {
> > mmu_destroy_caches();
> > - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> > + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> > unregister_shrinker(&mmu_shrinker);
> > }
> >
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index ac00bfbf32f6..c2a342028b6a 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -325,4 +325,6 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >
> > +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> > + spinlock_t *cache_lock);
> > #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 764f7c87286f..4974fa96deff 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -264,7 +264,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> > struct kvm_mmu_page *sp;
> >
> > sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> > - sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> > + sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> > + &vcpu->arch.mmu_shadow_page_cache_lock);
> >
> > return sp;
> > }
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 01aad8b74162..efd9b38ea9a2 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1362,6 +1362,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
> > int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
> > int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
> > int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
> > +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc);
> > void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> > void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > #endif
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 13e88297f999..f2d762878b97 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -438,8 +438,10 @@ int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
> > return mc->nobjs;
> > }
> >
> > -void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> > +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc)
> > {
> > + int freed = mc->nobjs;
> > +
> > while (mc->nobjs) {
> > if (mc->kmem_cache)
> > kmem_cache_free(mc->kmem_cache, mc->objects[--mc->nobjs]);
> > @@ -447,8 +449,13 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> > free_page((unsigned long)mc->objects[--mc->nobjs]);
> > }
> >
> > - kvfree(mc->objects);
> > + return freed;
> > +}
> >
> > +void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> > +{
> > + kvm_mmu_empty_memory_cache(mc);
> > + kvfree(mc->objects);
> > mc->objects = NULL;
> > mc->capacity = 0;
> > }
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches
2022-12-28 22:07 ` Vipin Sharma
@ 2022-12-29 21:15 ` David Matlack
2023-01-03 17:38 ` Vipin Sharma
0 siblings, 1 reply; 46+ messages in thread
From: David Matlack @ 2022-12-29 21:15 UTC (permalink / raw)
To: Vipin Sharma; +Cc: Ben Gardon, seanjc, pbonzini, kvm, linux-kernel
On Wed, Dec 28, 2022 at 02:07:49PM -0800, Vipin Sharma wrote:
> On Tue, Dec 27, 2022 at 10:37 AM Ben Gardon <bgardon@google.com> wrote:
> > On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
> > >
> > > Tested this change by running dirty_log_perf_test while dropping cache
> > > via "echo 2 > /proc/sys/vm/drop_caches" at 1 second interval
> > > continuously. There were WARN_ON(!mc->nobjs) messages printed in kernel
> > > logs from kvm_mmu_memory_cache_alloc(), which is expected.
> >
> > Oh, that's not a good thing. I don't think we want to be hitting those
> > warnings. For one, kernel warnings should not be expected behavior,
> > probably for many reasons, but at least because Syzbot will find it.
> > In this particular case, we don't want to hit that because in that
> > case we'll try to do a GFP_ATOMIC, which can fail, and if it fails,
> > we'll BUG:
> >
> > void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
> > {
> > void *p;
> >
> > if (WARN_ON(!mc->nobjs))
> > p = mmu_memory_cache_alloc_obj(mc, GFP_ATOMIC | __GFP_ACCOUNT);
> > else
> > p = mc->objects[--mc->nobjs];
> > BUG_ON(!p);
> > return p;
> > }
> >
> > Perhaps the risk of actually panicking is small, but it probably
> > indicates that we need better error handling around failed allocations
> > from the cache.
> > Or, the slightly less elegant approach might be to just hold the cache
> > lock around the cache topup and use of pages from the cache, but
> > adding better error handling would probably be cleaner.
>
> I was counting on the fact that shrinker will ideally run only in
> extreme cases, i.e. host is running on low memory. So, this WARN_ON
> will only be rarely used. I was not aware of Syzbot, it seems like it
> will be a concern if it does this kind of testing.
In an extreme low-memory situation, forcing vCPUS to do GFP_ATOMIC
allocations to handle page faults is risky. Plus it's a waste of time to
free that memory since it's just going to get immediately reallocated.
>
> I thought about keeping a mutex, taking it during topup and releasing
> it after the whole operation is done but I stopped it as the duration
> of holding mutex will be long and might block the memory shrinker
> longer. I am not sure though, if this is a valid concern.
Use mutex_trylock() to skip any vCPUs that are currently handling page
faults.
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches
2022-12-29 21:15 ` David Matlack
@ 2023-01-03 17:38 ` Vipin Sharma
0 siblings, 0 replies; 46+ messages in thread
From: Vipin Sharma @ 2023-01-03 17:38 UTC (permalink / raw)
To: David Matlack; +Cc: Ben Gardon, seanjc, pbonzini, kvm, linux-kernel
On Thu, Dec 29, 2022 at 1:15 PM David Matlack <dmatlack@google.com> wrote:
>
> On Wed, Dec 28, 2022 at 02:07:49PM -0800, Vipin Sharma wrote:
> > On Tue, Dec 27, 2022 at 10:37 AM Ben Gardon <bgardon@google.com> wrote:
> > > On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
> > > >
> > > > Tested this change by running dirty_log_perf_test while dropping cache
> > > > via "echo 2 > /proc/sys/vm/drop_caches" at 1 second interval
> > > > continuously. There were WARN_ON(!mc->nobjs) messages printed in kernel
> > > > logs from kvm_mmu_memory_cache_alloc(), which is expected.
> > >
> > > Oh, that's not a good thing. I don't think we want to be hitting those
> > > warnings. For one, kernel warnings should not be expected behavior,
> > > probably for many reasons, but at least because Syzbot will find it.
> > > In this particular case, we don't want to hit that because in that
> > > case we'll try to do a GFP_ATOMIC, which can fail, and if it fails,
> > > we'll BUG:
> > >
> > > void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
> > > {
> > > void *p;
> > >
> > > if (WARN_ON(!mc->nobjs))
> > > p = mmu_memory_cache_alloc_obj(mc, GFP_ATOMIC | __GFP_ACCOUNT);
> > > else
> > > p = mc->objects[--mc->nobjs];
> > > BUG_ON(!p);
> > > return p;
> > > }
> > >
> > > Perhaps the risk of actually panicking is small, but it probably
> > > indicates that we need better error handling around failed allocations
> > > from the cache.
> > > Or, the slightly less elegant approach might be to just hold the cache
> > > lock around the cache topup and use of pages from the cache, but
> > > adding better error handling would probably be cleaner.
> >
> > I was counting on the fact that shrinker will ideally run only in
> > extreme cases, i.e. host is running on low memory. So, this WARN_ON
> > will only be rarely used. I was not aware of Syzbot, it seems like it
> > will be a concern if it does this kind of testing.
>
> In an extreme low-memory situation, forcing vCPUS to do GFP_ATOMIC
> allocations to handle page faults is risky. Plus it's a waste of time to
> free that memory since it's just going to get immediately reallocated.
>
> >
> > I thought about keeping a mutex, taking it during topup and releasing
> > it after the whole operation is done but I stopped it as the duration
> > of holding mutex will be long and might block the memory shrinker
> > longer. I am not sure though, if this is a valid concern.
>
> Use mutex_trylock() to skip any vCPUs that are currently handling page
> faults.
oh yeah! Thanks.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches
2022-12-22 2:34 ` [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches Vipin Sharma
2022-12-27 18:37 ` Ben Gardon
@ 2022-12-29 21:54 ` David Matlack
2023-01-03 18:01 ` Vipin Sharma
2023-01-03 19:32 ` Mingwei Zhang
2 siblings, 1 reply; 46+ messages in thread
From: David Matlack @ 2022-12-29 21:54 UTC (permalink / raw)
To: Vipin Sharma; +Cc: seanjc, pbonzini, bgardon, kvm, linux-kernel
On Wed, Dec 21, 2022 at 06:34:49PM -0800, Vipin Sharma wrote:
> mmu_shrink_scan() is very disruptive to VMs. It picks the first
> VM in the vm_list, zaps the oldest page which is most likely an upper
> level SPTEs and most likely to be reused. Prior to TDP MMU, this is even
> more disruptive in nested VMs case, considering L1 SPTEs will be the
> oldest even though most of the entries are for L2 SPTEs.
>
> As discussed in
> https://lore.kernel.org/lkml/Y45dldZnI6OIf+a5@google.com/
> shrinker logic has not been very useful in actually keeping VMs performant
> and reducing memory usage.
>
> Change mmu_shrink_scan() to free pages from the vCPU's shadow page
> cache. Freeing pages from cache doesn't cause vCPU exits, therefore, a
> VM's performance should not be affected.
Can you split this commit up? e.g. First drop the old shrinking logic in
one commit (but leave the shrinking infrastructure in place). Then a
commit to make the shrinker free the per-vCPU shadow page caches. And
then perhaps another to make the shrinker free the per-VM shadow page
cache used for eager splitting.
>
> This also allows changing cache capacities without worrying too much
> about high memory usage in cache.
>
> Tested this change by running dirty_log_perf_test while dropping cache
> via "echo 2 > /proc/sys/vm/drop_caches" at 1 second interval
> continuously. There were WARN_ON(!mc->nobjs) messages printed in kernel
> logs from kvm_mmu_memory_cache_alloc(), which is expected.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> ---
> arch/x86/include/asm/kvm_host.h | 5 +
> arch/x86/kvm/mmu/mmu.c | 163 +++++++++++++++++++-------------
> arch/x86/kvm/mmu/mmu_internal.h | 2 +
> arch/x86/kvm/mmu/tdp_mmu.c | 3 +-
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 11 ++-
> 6 files changed, 114 insertions(+), 71 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index aa4eb8cfcd7e..89cc809e4a00 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -786,6 +786,11 @@ struct kvm_vcpu_arch {
> struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> struct kvm_mmu_memory_cache mmu_page_header_cache;
>
> + /*
> + * Protects change in size of mmu_shadow_page_cache cache.
> + */
> + spinlock_t mmu_shadow_page_cache_lock;
> +
> /*
> * QEMU userspace and the guest each have their own FPU state.
> * In vcpu_run, we switch between the user and guest FPU contexts.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 254bc46234e0..157417e1cb6e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -164,7 +164,10 @@ struct kvm_shadow_walk_iterator {
>
> static struct kmem_cache *pte_list_desc_cache;
> struct kmem_cache *mmu_page_header_cache;
> -static struct percpu_counter kvm_total_used_mmu_pages;
> +/*
> + * Total number of unused pages in MMU shadow page cache.
> + */
> +static struct percpu_counter kvm_total_unused_mmu_pages;
>
> static void mmu_spte_set(u64 *sptep, u64 spte);
>
> @@ -655,6 +658,22 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> }
> }
>
> +static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> + spinlock_t *cache_lock)
> +{
> + int orig_nobjs;
> + int r;
> +
> + spin_lock(cache_lock);
> + orig_nobjs = cache->nobjs;
> + r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL);
> + if (orig_nobjs != cache->nobjs)
> + percpu_counter_add(&kvm_total_unused_mmu_pages,
> + (cache->nobjs - orig_nobjs));
> + spin_unlock(cache_lock);
> + return r;
> +}
> +
> static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> {
> int r;
> @@ -664,8 +683,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> if (r)
> return r;
> - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> - PT64_ROOT_MAX_LEVEL);
> + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> + &vcpu->arch.mmu_shadow_page_cache_lock);
> if (r)
> return r;
> if (maybe_indirect) {
> @@ -678,10 +697,25 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> PT64_ROOT_MAX_LEVEL);
> }
>
> +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> + spinlock_t *cache_lock)
> +{
> + int orig_nobjs;
> +
> + spin_lock(cache_lock);
> + orig_nobjs = cache->nobjs;
> + kvm_mmu_free_memory_cache(cache);
> + if (orig_nobjs)
> + percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
> +
> + spin_unlock(cache_lock);
> +}
It would be nice to avoid adding these wrapper functions.
Once you add a mutex to protect the caches from being freed while vCPUs
are in the middle of a page fault you can drop the spin lock. After that
the only reason to have these wrappers is to update
kvm_total_unused_mmu_pages.
Do we really need kvm_total_unused_mmu_pages? Why not just dynamically
calculate the number of of unused pages in mmu_shrink_count()? Or just
estimate the count, e.g. num_vcpus * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE?
Or have per-VM or per-vCPU shrinkers to avoid needing to do any
aggregation?
> +
> static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> {
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> - kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> + &vcpu->arch.mmu_shadow_page_cache_lock);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
mmu_shadowed_info_cache can be freed by the shrinker as well.
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> }
> @@ -1693,27 +1727,15 @@ static int is_empty_shadow_page(u64 *spt)
> }
> #endif
>
> -/*
> - * This value is the sum of all of the kvm instances's
> - * kvm->arch.n_used_mmu_pages values. We need a global,
> - * aggregate version in order to make the slab shrinker
> - * faster
> - */
> -static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
> -{
> - kvm->arch.n_used_mmu_pages += nr;
> - percpu_counter_add(&kvm_total_used_mmu_pages, nr);
> -}
> -
> static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> - kvm_mod_used_mmu_pages(kvm, +1);
> + kvm->arch.n_used_mmu_pages++;
> kvm_account_pgtable_pages((void *)sp->spt, +1);
> }
>
> static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> - kvm_mod_used_mmu_pages(kvm, -1);
> + kvm->arch.n_used_mmu_pages--;
> kvm_account_pgtable_pages((void *)sp->spt, -1);
> }
>
> @@ -2150,8 +2172,31 @@ struct shadow_page_caches {
> struct kvm_mmu_memory_cache *page_header_cache;
> struct kvm_mmu_memory_cache *shadow_page_cache;
> struct kvm_mmu_memory_cache *shadowed_info_cache;
> + /*
> + * Protects change in size of shadow_page_cache cache.
> + */
> + spinlock_t *shadow_page_cache_lock;
> };
>
> +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> + spinlock_t *cache_lock)
> +{
> + int orig_nobjs;
> + void *page;
> +
> + if (!cache_lock) {
> + spin_lock(cache_lock);
> + orig_nobjs = shadow_page_cache->nobjs;
> + }
> + page = kvm_mmu_memory_cache_alloc(shadow_page_cache);
> + if (!cache_lock) {
> + if (orig_nobjs)
> + percpu_counter_dec(&kvm_total_unused_mmu_pages);
> + spin_unlock(cache_lock);
> + }
> + return page;
> +}
> +
> static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> struct shadow_page_caches *caches,
> gfn_t gfn,
> @@ -2161,7 +2206,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
> - sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
> + sp->spt = kvm_mmu_sp_memory_cache_alloc(caches->shadow_page_cache,
> + caches->shadow_page_cache_lock);
> if (!role.direct)
> sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
>
> @@ -2218,6 +2264,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> + .shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
> };
>
> return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
> @@ -5916,6 +5963,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> + spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
>
> vcpu->arch.mmu = &vcpu->arch.root_mmu;
> vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> @@ -6051,11 +6099,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> kvm_tdp_mmu_zap_invalidated_roots(kvm);
> }
>
> -static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> -{
> - return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> -}
> -
> static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot,
> struct kvm_page_track_notifier_node *node)
> @@ -6277,6 +6320,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
> /* Direct SPs do not require a shadowed_info_cache. */
> caches.page_header_cache = &kvm->arch.split_page_header_cache;
> caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
> + caches.shadow_page_cache_lock = NULL;
>
> /* Safe to pass NULL for vCPU since requesting a direct SP. */
> return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
> @@ -6646,66 +6690,49 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> static unsigned long
> mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> {
> - struct kvm *kvm;
> - int nr_to_scan = sc->nr_to_scan;
> + struct kvm_mmu_memory_cache *cache;
> + struct kvm *kvm, *first_kvm = NULL;
> unsigned long freed = 0;
> + /* spinlock for memory cache */
> + spinlock_t *cache_lock;
> + struct kvm_vcpu *vcpu;
> + unsigned long i;
>
> mutex_lock(&kvm_lock);
>
> list_for_each_entry(kvm, &vm_list, vm_list) {
> - int idx;
> - LIST_HEAD(invalid_list);
> -
> - /*
> - * Never scan more than sc->nr_to_scan VM instances.
> - * Will not hit this condition practically since we do not try
> - * to shrink more than one VM and it is very unlikely to see
> - * !n_used_mmu_pages so many times.
> - */
> - if (!nr_to_scan--)
> + if (first_kvm == kvm)
> break;
> - /*
> - * n_used_mmu_pages is accessed without holding kvm->mmu_lock
> - * here. We may skip a VM instance errorneosly, but we do not
> - * want to shrink a VM that only started to populate its MMU
> - * anyway.
> - */
> - if (!kvm->arch.n_used_mmu_pages &&
> - !kvm_has_zapped_obsolete_pages(kvm))
> - continue;
> + if (!first_kvm)
> + first_kvm = kvm;
> + list_move_tail(&kvm->vm_list, &vm_list);
>
> - idx = srcu_read_lock(&kvm->srcu);
> - write_lock(&kvm->mmu_lock);
> + kvm_for_each_vcpu(i, vcpu, kvm) {
What protects this from racing with vCPU creation/deletion?
> + cache = &vcpu->arch.mmu_shadow_page_cache;
> + cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock;
> + if (READ_ONCE(cache->nobjs)) {
> + spin_lock(cache_lock);
> + freed += kvm_mmu_empty_memory_cache(cache);
> + spin_unlock(cache_lock);
> + }
What about freeing kvm->arch.split_shadow_page_cache as well?
>
> - if (kvm_has_zapped_obsolete_pages(kvm)) {
> - kvm_mmu_commit_zap_page(kvm,
> - &kvm->arch.zapped_obsolete_pages);
> - goto unlock;
> }
>
> - freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
> -
> -unlock:
> - write_unlock(&kvm->mmu_lock);
> - srcu_read_unlock(&kvm->srcu, idx);
> -
> - /*
> - * unfair on small ones
> - * per-vm shrinkers cry out
> - * sadness comes quickly
> - */
> - list_move_tail(&kvm->vm_list, &vm_list);
> - break;
> + if (freed >= sc->nr_to_scan)
> + break;
> }
>
> + if (freed)
> + percpu_counter_sub(&kvm_total_unused_mmu_pages, freed);
> mutex_unlock(&kvm_lock);
> + percpu_counter_sync(&kvm_total_unused_mmu_pages);
> return freed;
> }
>
> static unsigned long
> mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> {
> - return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
> + return percpu_counter_sum_positive(&kvm_total_unused_mmu_pages);
> }
>
> static struct shrinker mmu_shrinker = {
> @@ -6820,7 +6847,7 @@ int kvm_mmu_vendor_module_init(void)
> if (!mmu_page_header_cache)
> goto out;
>
> - if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
> + if (percpu_counter_init(&kvm_total_unused_mmu_pages, 0, GFP_KERNEL))
> goto out;
>
> ret = register_shrinker(&mmu_shrinker, "x86-mmu");
> @@ -6830,7 +6857,7 @@ int kvm_mmu_vendor_module_init(void)
> return 0;
>
> out_shrinker:
> - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> out:
> mmu_destroy_caches();
> return ret;
> @@ -6847,7 +6874,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> void kvm_mmu_vendor_module_exit(void)
> {
> mmu_destroy_caches();
> - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> unregister_shrinker(&mmu_shrinker);
> }
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index ac00bfbf32f6..c2a342028b6a 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -325,4 +325,6 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> + spinlock_t *cache_lock);
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 764f7c87286f..4974fa96deff 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -264,7 +264,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> - sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> + sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> + &vcpu->arch.mmu_shadow_page_cache_lock);
>
> return sp;
> }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 01aad8b74162..efd9b38ea9a2 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1362,6 +1362,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
> int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
> int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
> int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
> +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc);
> void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 13e88297f999..f2d762878b97 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -438,8 +438,10 @@ int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
> return mc->nobjs;
> }
>
> -void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc)
> {
> + int freed = mc->nobjs;
> +
> while (mc->nobjs) {
> if (mc->kmem_cache)
> kmem_cache_free(mc->kmem_cache, mc->objects[--mc->nobjs]);
> @@ -447,8 +449,13 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> free_page((unsigned long)mc->objects[--mc->nobjs]);
> }
>
> - kvfree(mc->objects);
> + return freed;
> +}
>
> +void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> +{
> + kvm_mmu_empty_memory_cache(mc);
> + kvfree(mc->objects);
> mc->objects = NULL;
> mc->capacity = 0;
> }
> --
> 2.39.0.314.g84b9a713c41-goog
>
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches
2022-12-29 21:54 ` David Matlack
@ 2023-01-03 18:01 ` Vipin Sharma
2023-01-04 0:25 ` Vipin Sharma
0 siblings, 1 reply; 46+ messages in thread
From: Vipin Sharma @ 2023-01-03 18:01 UTC (permalink / raw)
To: David Matlack; +Cc: seanjc, pbonzini, bgardon, kvm, linux-kernel
On Thu, Dec 29, 2022 at 1:55 PM David Matlack <dmatlack@google.com> wrote:
>
> On Wed, Dec 21, 2022 at 06:34:49PM -0800, Vipin Sharma wrote:
> > mmu_shrink_scan() is very disruptive to VMs. It picks the first
> > VM in the vm_list, zaps the oldest page which is most likely an upper
> > level SPTEs and most like to be reused. Prior to TDP MMU, this is even
> > more disruptive in nested VMs case, considering L1 SPTEs will be the
> > oldest even though most of the entries are for L2 SPTEs.
> >
> > As discussed in
> > https://lore.kernel.org/lkml/Y45dldZnI6OIf+a5@google.com/
> > shrinker logic has not be very useful in actually keeping VMs performant
> > and reducing memory usage.
> >
> > Change mmu_shrink_scan() to free pages from the vCPU's shadow page
> > cache. Freeing pages from cache doesn't cause vCPU exits, therefore, a
> > VM's performance should not be affected.
>
> Can you split this commit up? e.g. First drop the old shrinking logic in
> one commit (but leave the shrinking infrastructure in place). Then a
> commit to make the shrinker free the per-vCPU shadow page caches. And
> then perhaps another to make the shrinker free the per-VM shadow page
> cache used for eager splitting.
>
Sounds good, I will separate it in two parts, one for dropping old
logic, one for adding per vcpu shadow page caches. Patch 3 is enabling
the shrinker to free the per-VM shadow page cache.
> >
> > This also allows to change cache capacities without worrying too much
> > about high memory usage in cache.
> >
> > Tested this change by running dirty_log_perf_test while dropping cache
> > via "echo 2 > /proc/sys/vm/drop_caches" at 1 second interval
> > continuously. There were WARN_ON(!mc->nobjs) messages printed in kernel
> > logs from kvm_mmu_memory_cache_alloc(), which is expected.
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > ---
> > arch/x86/include/asm/kvm_host.h | 5 +
> > arch/x86/kvm/mmu/mmu.c | 163 +++++++++++++++++++-------------
> > arch/x86/kvm/mmu/mmu_internal.h | 2 +
> > arch/x86/kvm/mmu/tdp_mmu.c | 3 +-
> > include/linux/kvm_host.h | 1 +
> > virt/kvm/kvm_main.c | 11 ++-
> > 6 files changed, 114 insertions(+), 71 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index aa4eb8cfcd7e..89cc809e4a00 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -786,6 +786,11 @@ struct kvm_vcpu_arch {
> > struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> > struct kvm_mmu_memory_cache mmu_page_header_cache;
> >
> > + /*
> > + * Protects change in size of mmu_shadow_page_cache cache.
> > + */
> > + spinlock_t mmu_shadow_page_cache_lock;
> > +
> > /*
> > * QEMU userspace and the guest each have their own FPU state.
> > * In vcpu_run, we switch between the user and guest FPU contexts.
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 254bc46234e0..157417e1cb6e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -164,7 +164,10 @@ struct kvm_shadow_walk_iterator {
> >
> > static struct kmem_cache *pte_list_desc_cache;
> > struct kmem_cache *mmu_page_header_cache;
> > -static struct percpu_counter kvm_total_used_mmu_pages;
> > +/*
> > + * Total number of unused pages in MMU shadow page cache.
> > + */
> > +static struct percpu_counter kvm_total_unused_mmu_pages;
> >
> > static void mmu_spte_set(u64 *sptep, u64 spte);
> >
> > @@ -655,6 +658,22 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> > }
> > }
> >
> > +static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > + spinlock_t *cache_lock)
> > +{
> > + int orig_nobjs;
> > + int r;
> > +
> > + spin_lock(cache_lock);
> > + orig_nobjs = cache->nobjs;
> > + r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL);
> > + if (orig_nobjs != cache->nobjs)
> > + percpu_counter_add(&kvm_total_unused_mmu_pages,
> > + (cache->nobjs - orig_nobjs));
> > + spin_unlock(cache_lock);
> > + return r;
> > +}
> > +
> > static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > {
> > int r;
> > @@ -664,8 +683,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> > if (r)
> > return r;
> > - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > - PT64_ROOT_MAX_LEVEL);
> > + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > + &vcpu->arch.mmu_shadow_page_cache_lock);
> > if (r)
> > return r;
> > if (maybe_indirect) {
> > @@ -678,10 +697,25 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > PT64_ROOT_MAX_LEVEL);
> > }
> >
> > +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > + spinlock_t *cache_lock)
> > +{
> > + int orig_nobjs;
> > +
> > + spin_lock(cache_lock);
> > + orig_nobjs = cache->nobjs;
> > + kvm_mmu_free_memory_cache(cache);
> > + if (orig_nobjs)
> > + percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
> > +
> > + spin_unlock(cache_lock);
> > +}
>
> It would be nice to avoid adding these wrapper functions.
>
> Once you add a mutex to protect the caches from being freed while vCPUs
> are in the middle of a page fault you can drop the spin lock. After that
> the only reason to have these wrappers is to update
> kvm_total_unused_mmu_pages.
>
> Do we really need kvm_total_unused_mmu_pages? Why not just dynamically
> calculate the number of unused pages in mmu_shrink_count()? Or just
> estimate the count, e.g. num_vcpus * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE?
> Or have per-VM or per-vCPU shrinkers to avoid needing to do any
> aggregation?
>
I think we can drop this. By default we can return num_kvms *
num_vcpus * nodes * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE.
Whenever mmu_shrink_scan() is called, if there are no pages to free
then return SHRINK_STOP, which will stop any subsequent calls during
that time.
> > +
> > static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> > {
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> > - kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> > + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > + &vcpu->arch.mmu_shadow_page_cache_lock);
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
>
> mmu_shadowed_info_cache can be freed by the shrinker as well.
>
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> > }
> > @@ -1693,27 +1727,15 @@ static int is_empty_shadow_page(u64 *spt)
> > }
> > #endif
> >
> > -/*
> > - * This value is the sum of all of the kvm instances's
> > - * kvm->arch.n_used_mmu_pages values. We need a global,
> > - * aggregate version in order to make the slab shrinker
> > - * faster
> > - */
> > -static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
> > -{
> > - kvm->arch.n_used_mmu_pages += nr;
> > - percpu_counter_add(&kvm_total_used_mmu_pages, nr);
> > -}
> > -
> > static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> > {
> > - kvm_mod_used_mmu_pages(kvm, +1);
> > + kvm->arch.n_used_mmu_pages++;
> > kvm_account_pgtable_pages((void *)sp->spt, +1);
> > }
> >
> > static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> > {
> > - kvm_mod_used_mmu_pages(kvm, -1);
> > + kvm->arch.n_used_mmu_pages--;
> > kvm_account_pgtable_pages((void *)sp->spt, -1);
> > }
> >
> > @@ -2150,8 +2172,31 @@ struct shadow_page_caches {
> > struct kvm_mmu_memory_cache *page_header_cache;
> > struct kvm_mmu_memory_cache *shadow_page_cache;
> > struct kvm_mmu_memory_cache *shadowed_info_cache;
> > + /*
> > + * Protects change in size of shadow_page_cache cache.
> > + */
> > + spinlock_t *shadow_page_cache_lock;
> > };
> >
> > +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> > + spinlock_t *cache_lock)
> > +{
> > + int orig_nobjs;
> > + void *page;
> > +
> > + if (!cache_lock) {
> > + spin_lock(cache_lock);
> > + orig_nobjs = shadow_page_cache->nobjs;
> > + }
> > + page = kvm_mmu_memory_cache_alloc(shadow_page_cache);
> > + if (!cache_lock) {
> > + if (orig_nobjs)
> > + percpu_counter_dec(&kvm_total_unused_mmu_pages);
> > + spin_unlock(cache_lock);
> > + }
> > + return page;
> > +}
> > +
> > static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> > struct shadow_page_caches *caches,
> > gfn_t gfn,
> > @@ -2161,7 +2206,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> > struct kvm_mmu_page *sp;
> >
> > sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
> > - sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
> > + sp->spt = kvm_mmu_sp_memory_cache_alloc(caches->shadow_page_cache,
> > + caches->shadow_page_cache_lock);
> > if (!role.direct)
> > sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
> >
> > @@ -2218,6 +2264,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> > .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> > .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> > .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> > + .shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
> > };
> >
> > return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
> > @@ -5916,6 +5963,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> >
> > vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > + spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
> >
> > vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> > @@ -6051,11 +6099,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> > kvm_tdp_mmu_zap_invalidated_roots(kvm);
> > }
> >
> > -static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> > -{
> > - return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> > -}
> > -
> > static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> > struct kvm_memory_slot *slot,
> > struct kvm_page_track_notifier_node *node)
> > @@ -6277,6 +6320,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
> > /* Direct SPs do not require a shadowed_info_cache. */
> > caches.page_header_cache = &kvm->arch.split_page_header_cache;
> > caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
> > + caches.shadow_page_cache_lock = NULL;
> >
> > /* Safe to pass NULL for vCPU since requesting a direct SP. */
> > return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
> > @@ -6646,66 +6690,49 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> > static unsigned long
> > mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > {
> > - struct kvm *kvm;
> > - int nr_to_scan = sc->nr_to_scan;
> > + struct kvm_mmu_memory_cache *cache;
> > + struct kvm *kvm, *first_kvm = NULL;
> > unsigned long freed = 0;
> > + /* spinlock for memory cache */
> > + spinlock_t *cache_lock;
> > + struct kvm_vcpu *vcpu;
> > + unsigned long i;
> >
> > mutex_lock(&kvm_lock);
> >
> > list_for_each_entry(kvm, &vm_list, vm_list) {
> > - int idx;
> > - LIST_HEAD(invalid_list);
> > -
> > - /*
> > - * Never scan more than sc->nr_to_scan VM instances.
> > - * Will not hit this condition practically since we do not try
> > - * to shrink more than one VM and it is very unlikely to see
> > - * !n_used_mmu_pages so many times.
> > - */
> > - if (!nr_to_scan--)
> > + if (first_kvm == kvm)
> > break;
> > - /*
> > - * n_used_mmu_pages is accessed without holding kvm->mmu_lock
> > - * here. We may skip a VM instance errorneosly, but we do not
> > - * want to shrink a VM that only started to populate its MMU
> > - * anyway.
> > - */
> > - if (!kvm->arch.n_used_mmu_pages &&
> > - !kvm_has_zapped_obsolete_pages(kvm))
> > - continue;
> > + if (!first_kvm)
> > + first_kvm = kvm;
> > + list_move_tail(&kvm->vm_list, &vm_list);
> >
> > - idx = srcu_read_lock(&kvm->srcu);
> > - write_lock(&kvm->mmu_lock);
> > + kvm_for_each_vcpu(i, vcpu, kvm) {
>
> What protects this from racing with vCPU creation/deletion?
>
> > + cache = &vcpu->arch.mmu_shadow_page_cache;
> > + cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock;
> > + if (READ_ONCE(cache->nobjs)) {
> > + spin_lock(cache_lock);
> > + freed += kvm_mmu_empty_memory_cache(cache);
> > + spin_unlock(cache_lock);
> > + }
>
> What about freeing kvm->arch.split_shadow_page_cache as well?
>
> >
> > - if (kvm_has_zapped_obsolete_pages(kvm)) {
> > - kvm_mmu_commit_zap_page(kvm,
> > - &kvm->arch.zapped_obsolete_pages);
> > - goto unlock;
> > }
> >
> > - freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
> > -
> > -unlock:
> > - write_unlock(&kvm->mmu_lock);
> > - srcu_read_unlock(&kvm->srcu, idx);
> > -
> > - /*
> > - * unfair on small ones
> > - * per-vm shrinkers cry out
> > - * sadness comes quickly
> > - */
> > - list_move_tail(&kvm->vm_list, &vm_list);
> > - break;
> > + if (freed >= sc->nr_to_scan)
> > + break;
> > }
> >
> > + if (freed)
> > + percpu_counter_sub(&kvm_total_unused_mmu_pages, freed);
> > mutex_unlock(&kvm_lock);
> > + percpu_counter_sync(&kvm_total_unused_mmu_pages);
> > return freed;
> > }
> >
> > static unsigned long
> > mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> > {
> > - return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
> > + return percpu_counter_sum_positive(&kvm_total_unused_mmu_pages);
> > }
> >
> > static struct shrinker mmu_shrinker = {
> > @@ -6820,7 +6847,7 @@ int kvm_mmu_vendor_module_init(void)
> > if (!mmu_page_header_cache)
> > goto out;
> >
> > - if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
> > + if (percpu_counter_init(&kvm_total_unused_mmu_pages, 0, GFP_KERNEL))
> > goto out;
> >
> > ret = register_shrinker(&mmu_shrinker, "x86-mmu");
> > @@ -6830,7 +6857,7 @@ int kvm_mmu_vendor_module_init(void)
> > return 0;
> >
> > out_shrinker:
> > - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> > + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> > out:
> > mmu_destroy_caches();
> > return ret;
> > @@ -6847,7 +6874,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> > void kvm_mmu_vendor_module_exit(void)
> > {
> > mmu_destroy_caches();
> > - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> > + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> > unregister_shrinker(&mmu_shrinker);
> > }
> >
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index ac00bfbf32f6..c2a342028b6a 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -325,4 +325,6 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >
> > +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> > + spinlock_t *cache_lock);
> > #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 764f7c87286f..4974fa96deff 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -264,7 +264,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> > struct kvm_mmu_page *sp;
> >
> > sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> > - sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> > + sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> > + &vcpu->arch.mmu_shadow_page_cache_lock);
> >
> > return sp;
> > }
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 01aad8b74162..efd9b38ea9a2 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1362,6 +1362,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
> > int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
> > int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
> > int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
> > +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc);
> > void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> > void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > #endif
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 13e88297f999..f2d762878b97 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -438,8 +438,10 @@ int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
> > return mc->nobjs;
> > }
> >
> > -void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> > +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc)
> > {
> > + int freed = mc->nobjs;
> > +
> > while (mc->nobjs) {
> > if (mc->kmem_cache)
> > kmem_cache_free(mc->kmem_cache, mc->objects[--mc->nobjs]);
> > @@ -447,8 +449,13 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> > free_page((unsigned long)mc->objects[--mc->nobjs]);
> > }
> >
> > - kvfree(mc->objects);
> > + return freed;
> > +}
> >
> > +void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> > +{
> > + kvm_mmu_empty_memory_cache(mc);
> > + kvfree(mc->objects);
> > mc->objects = NULL;
> > mc->capacity = 0;
> > }
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches
2023-01-03 18:01 ` Vipin Sharma
@ 2023-01-04 0:25 ` Vipin Sharma
2023-01-18 17:43 ` Sean Christopherson
0 siblings, 1 reply; 46+ messages in thread
From: Vipin Sharma @ 2023-01-04 0:25 UTC (permalink / raw)
To: David Matlack, seanjc, pbonzini; +Cc: bgardon, kvm, linux-kernel
On Tue, Jan 3, 2023 at 10:01 AM Vipin Sharma <vipinsh@google.com> wrote:
>
> On Thu, Dec 29, 2022 at 1:55 PM David Matlack <dmatlack@google.com> wrote:
> >
> > On Wed, Dec 21, 2022 at 06:34:49PM -0800, Vipin Sharma wrote:
> > > mmu_shrink_scan() is very disruptive to VMs. It picks the first
> > > VM in the vm_list, zaps the oldest page which is most likely an upper
> > > level SPTEs and most likely to be reused. Prior to TDP MMU, this is even
> > > more disruptive in nested VMs case, considering L1 SPTEs will be the
> > > oldest even though most of the entries are for L2 SPTEs.
> > >
> > > As discussed in
> > > https://lore.kernel.org/lkml/Y45dldZnI6OIf+a5@google.com/
> > > shrinker logic has not been very useful in actually keeping VMs performant
> > > and reducing memory usage.
> > >
> > > Change mmu_shrink_scan() to free pages from the vCPU's shadow page
> > > cache. Freeing pages from cache doesn't cause vCPU exits, therefore, a
> > > VM's performance should not be affected.
> >
> > Can you split this commit up? e.g. First drop the old shrinking logic in
> > one commit (but leave the shrinking infrastructure in place). Then a
> > commit to make the shrinker free the per-vCPU shadow page caches. And
> > then perhaps another to make the shrinker free the per-VM shadow page
> > cache used for eager splitting.
> >
>
> Sounds good, I will separate it into two parts, one for dropping the old
> logic, one for adding per-vCPU shadow page caches. Patch 3 is enabling
> the shrinker to free the per-VM shadow page.
>
> > >
> > > This also allows to change cache capacities without worrying too much
> > > about high memory usage in cache.
> > >
> > > Tested this change by running dirty_log_perf_test while dropping cache
> > > via "echo 2 > /proc/sys/vm/drop_caches" at 1 second interval
> > > continuously. There were WARN_ON(!mc->nobjs) messages printed in kernel
> > > logs from kvm_mmu_memory_cache_alloc(), which is expected.
> > >
> > > Suggested-by: Sean Christopherson <seanjc@google.com>
> > > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > > ---
> > > arch/x86/include/asm/kvm_host.h | 5 +
> > > arch/x86/kvm/mmu/mmu.c | 163 +++++++++++++++++++-------------
> > > arch/x86/kvm/mmu/mmu_internal.h | 2 +
> > > arch/x86/kvm/mmu/tdp_mmu.c | 3 +-
> > > include/linux/kvm_host.h | 1 +
> > > virt/kvm/kvm_main.c | 11 ++-
> > > 6 files changed, 114 insertions(+), 71 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index aa4eb8cfcd7e..89cc809e4a00 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -786,6 +786,11 @@ struct kvm_vcpu_arch {
> > > struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> > > struct kvm_mmu_memory_cache mmu_page_header_cache;
> > >
> > > + /*
> > > + * Protects change in size of mmu_shadow_page_cache cache.
> > > + */
> > > + spinlock_t mmu_shadow_page_cache_lock;
> > > +
> > > /*
> > > * QEMU userspace and the guest each have their own FPU state.
> > > * In vcpu_run, we switch between the user and guest FPU contexts.
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 254bc46234e0..157417e1cb6e 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -164,7 +164,10 @@ struct kvm_shadow_walk_iterator {
> > >
> > > static struct kmem_cache *pte_list_desc_cache;
> > > struct kmem_cache *mmu_page_header_cache;
> > > -static struct percpu_counter kvm_total_used_mmu_pages;
> > > +/*
> > > + * Total number of unused pages in MMU shadow page cache.
> > > + */
> > > +static struct percpu_counter kvm_total_unused_mmu_pages;
> > >
> > > static void mmu_spte_set(u64 *sptep, u64 spte);
> > >
> > > @@ -655,6 +658,22 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> > > }
> > > }
> > >
> > > +static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > > + spinlock_t *cache_lock)
> > > +{
> > > + int orig_nobjs;
> > > + int r;
> > > +
> > > + spin_lock(cache_lock);
> > > + orig_nobjs = cache->nobjs;
> > > + r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL);
> > > + if (orig_nobjs != cache->nobjs)
> > > + percpu_counter_add(&kvm_total_unused_mmu_pages,
> > > + (cache->nobjs - orig_nobjs));
> > > + spin_unlock(cache_lock);
> > > + return r;
> > > +}
> > > +
> > > static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > > {
> > > int r;
> > > @@ -664,8 +683,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > > 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> > > if (r)
> > > return r;
> > > - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > > - PT64_ROOT_MAX_LEVEL);
> > > + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > > + &vcpu->arch.mmu_shadow_page_cache_lock);
> > > if (r)
> > > return r;
> > > if (maybe_indirect) {
> > > @@ -678,10 +697,25 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > > PT64_ROOT_MAX_LEVEL);
> > > }
> > >
> > > +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > > + spinlock_t *cache_lock)
> > > +{
> > > + int orig_nobjs;
> > > +
> > > + spin_lock(cache_lock);
> > > + orig_nobjs = cache->nobjs;
> > > + kvm_mmu_free_memory_cache(cache);
> > > + if (orig_nobjs)
> > > + percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
> > > +
> > > + spin_unlock(cache_lock);
> > > +}
> >
> > It would be nice to avoid adding these wrapper functions.
> >
> > Once you add a mutex to protect the caches from being freed while vCPUs
> > are in the middle of a page fault you can drop the spin lock. After that
> > the only reason to have these wrappers is to update
> > kvm_total_unused_mmu_pages.
> >
> > Do we really need kvm_total_unused_mmu_pages? Why not just dynamically
> > calculate the number of of unused pages in mmu_shrink_count()? Or just
> > estimate the count, e.g. num_vcpus * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE?
> > Or have per-VM or per-vCPU shrinkers to avoid needing to do any
> > aggregation?
> >
>
> I think we can drop this, by default we can return num_kvms *
> num_vcpus * nodes * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
>
> Whenever mmu_shrink_scan() is called if there are no pages to free
> then return SHRINK_STOP which will stop any subsequent calls during
> that time.
>
>
> > > +
> > > static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> > > {
> > > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> > > - kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> > > + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > > + &vcpu->arch.mmu_shadow_page_cache_lock);
> > > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> >
> > mmu_shadowed_info_cache can be freed by the shrinker as well.
> >
Yes, I can do that as well.
> > > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> > > }
> > > @@ -1693,27 +1727,15 @@ static int is_empty_shadow_page(u64 *spt)
> > > }
> > > #endif
> > >
> > > -/*
> > > - * This value is the sum of all of the kvm instances's
> > > - * kvm->arch.n_used_mmu_pages values. We need a global,
> > > - * aggregate version in order to make the slab shrinker
> > > - * faster
> > > - */
> > > -static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
> > > -{
> > > - kvm->arch.n_used_mmu_pages += nr;
> > > - percpu_counter_add(&kvm_total_used_mmu_pages, nr);
> > > -}
> > > -
> > > static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> > > {
> > > - kvm_mod_used_mmu_pages(kvm, +1);
> > > + kvm->arch.n_used_mmu_pages++;
> > > kvm_account_pgtable_pages((void *)sp->spt, +1);
> > > }
> > >
> > > static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> > > {
> > > - kvm_mod_used_mmu_pages(kvm, -1);
> > > + kvm->arch.n_used_mmu_pages--;
> > > kvm_account_pgtable_pages((void *)sp->spt, -1);
> > > }
> > >
> > > @@ -2150,8 +2172,31 @@ struct shadow_page_caches {
> > > struct kvm_mmu_memory_cache *page_header_cache;
> > > struct kvm_mmu_memory_cache *shadow_page_cache;
> > > struct kvm_mmu_memory_cache *shadowed_info_cache;
> > > + /*
> > > + * Protects change in size of shadow_page_cache cache.
> > > + */
> > > + spinlock_t *shadow_page_cache_lock;
> > > };
> > >
> > > +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> > > + spinlock_t *cache_lock)
> > > +{
> > > + int orig_nobjs;
> > > + void *page;
> > > +
> > > + if (!cache_lock) {
> > > + spin_lock(cache_lock);
> > > + orig_nobjs = shadow_page_cache->nobjs;
> > > + }
> > > + page = kvm_mmu_memory_cache_alloc(shadow_page_cache);
> > > + if (!cache_lock) {
> > > + if (orig_nobjs)
> > > + percpu_counter_dec(&kvm_total_unused_mmu_pages);
> > > + spin_unlock(cache_lock);
> > > + }
> > > + return page;
> > > +}
> > > +
> > > static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> > > struct shadow_page_caches *caches,
> > > gfn_t gfn,
> > > @@ -2161,7 +2206,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> > > struct kvm_mmu_page *sp;
> > >
> > > sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
> > > - sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
> > > + sp->spt = kvm_mmu_sp_memory_cache_alloc(caches->shadow_page_cache,
> > > + caches->shadow_page_cache_lock);
> > > if (!role.direct)
> > > sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
> > >
> > > @@ -2218,6 +2264,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> > > .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> > > .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> > > .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> > > + .shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
> > > };
> > >
> > > return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
> > > @@ -5916,6 +5963,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > > vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> > >
> > > vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > > + spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
> > >
> > > vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > > vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> > > @@ -6051,11 +6099,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> > > kvm_tdp_mmu_zap_invalidated_roots(kvm);
> > > }
> > >
> > > -static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> > > -{
> > > - return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> > > -}
> > > -
> > > static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> > > struct kvm_memory_slot *slot,
> > > struct kvm_page_track_notifier_node *node)
> > > @@ -6277,6 +6320,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
> > > /* Direct SPs do not require a shadowed_info_cache. */
> > > caches.page_header_cache = &kvm->arch.split_page_header_cache;
> > > caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
> > > + caches.shadow_page_cache_lock = NULL;
> > >
> > > /* Safe to pass NULL for vCPU since requesting a direct SP. */
> > > return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
> > > @@ -6646,66 +6690,49 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> > > static unsigned long
> > > mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > > {
> > > - struct kvm *kvm;
> > > - int nr_to_scan = sc->nr_to_scan;
> > > + struct kvm_mmu_memory_cache *cache;
> > > + struct kvm *kvm, *first_kvm = NULL;
> > > unsigned long freed = 0;
> > > + /* spinlock for memory cache */
> > > + spinlock_t *cache_lock;
> > > + struct kvm_vcpu *vcpu;
> > > + unsigned long i;
> > >
> > > mutex_lock(&kvm_lock);
> > >
> > > list_for_each_entry(kvm, &vm_list, vm_list) {
> > > - int idx;
> > > - LIST_HEAD(invalid_list);
> > > -
> > > - /*
> > > - * Never scan more than sc->nr_to_scan VM instances.
> > > - * Will not hit this condition practically since we do not try
> > > - * to shrink more than one VM and it is very unlikely to see
> > > - * !n_used_mmu_pages so many times.
> > > - */
> > > - if (!nr_to_scan--)
> > > + if (first_kvm == kvm)
> > > break;
> > > - /*
> > > - * n_used_mmu_pages is accessed without holding kvm->mmu_lock
> > > - * here. We may skip a VM instance errorneosly, but we do not
> > > - * want to shrink a VM that only started to populate its MMU
> > > - * anyway.
> > > - */
> > > - if (!kvm->arch.n_used_mmu_pages &&
> > > - !kvm_has_zapped_obsolete_pages(kvm))
> > > - continue;
> > > + if (!first_kvm)
> > > + first_kvm = kvm;
> > > + list_move_tail(&kvm->vm_list, &vm_list);
> > >
> > > - idx = srcu_read_lock(&kvm->srcu);
> > > - write_lock(&kvm->mmu_lock);
> > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> >
> > What protects this from racing with vCPU creation/deletion?
> >
vCPU deletion:
We take kvm_lock in mmu_shrink_scan(), the same lock is taken in
kvm_destroy_vm() to remove a vm from vm_list. So, once we are
iterating vm_list we will not see any VM removal, which means no
vCPU removal.
I didn't find any other code for vCPU deletion except failures during
VM and VCPU set up. A VM is only added to vm_list after successful
creation.
vCPU creation:
I think it will work.
kvm_vm_ioctl_create_vcpu() initializes the vcpu, adds it to
kvm->vcpu_array which is of the type xarray and is managed by RCU.
After this online_vcpus is incremented. So, kvm_for_each_vcpu(), which
uses RCU to read entries, will also see all of the vCPU initialization
if it sees the incremented online_vcpus value.
@Sean, Paolo
Is the above explanation correct, kvm_for_each_vcpu() is safe without any lock?
> > > + cache = &vcpu->arch.mmu_shadow_page_cache;
> > > + cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock;
> > > + if (READ_ONCE(cache->nobjs)) {
> > > + spin_lock(cache_lock);
> > > + freed += kvm_mmu_empty_memory_cache(cache);
> > > + spin_unlock(cache_lock);
> > > + }
> >
> > What about freeing kvm->arch.split_shadow_page_cache as well?
> >
I am doing this in patch 3.
> > >
> > > - if (kvm_has_zapped_obsolete_pages(kvm)) {
> > > - kvm_mmu_commit_zap_page(kvm,
> > > - &kvm->arch.zapped_obsolete_pages);
> > > - goto unlock;
> > > }
> > >
> > > - freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
> > > -
> > > -unlock:
> > > - write_unlock(&kvm->mmu_lock);
> > > - srcu_read_unlock(&kvm->srcu, idx);
> > > -
> > > - /*
> > > - * unfair on small ones
> > > - * per-vm shrinkers cry out
> > > - * sadness comes quickly
> > > - */
> > > - list_move_tail(&kvm->vm_list, &vm_list);
> > > - break;
> > > + if (freed >= sc->nr_to_scan)
> > > + break;
> > > }
> > >
> > > + if (freed)
> > > + percpu_counter_sub(&kvm_total_unused_mmu_pages, freed);
> > > mutex_unlock(&kvm_lock);
> > > + percpu_counter_sync(&kvm_total_unused_mmu_pages);
> > > return freed;
> > > }
> > >
> > > static unsigned long
> > > mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> > > {
> > > - return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
> > > + return percpu_counter_sum_positive(&kvm_total_unused_mmu_pages);
> > > }
> > >
> > > static struct shrinker mmu_shrinker = {
> > > @@ -6820,7 +6847,7 @@ int kvm_mmu_vendor_module_init(void)
> > > if (!mmu_page_header_cache)
> > > goto out;
> > >
> > > - if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
> > > + if (percpu_counter_init(&kvm_total_unused_mmu_pages, 0, GFP_KERNEL))
> > > goto out;
> > >
> > > ret = register_shrinker(&mmu_shrinker, "x86-mmu");
> > > @@ -6830,7 +6857,7 @@ int kvm_mmu_vendor_module_init(void)
> > > return 0;
> > >
> > > out_shrinker:
> > > - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> > > + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> > > out:
> > > mmu_destroy_caches();
> > > return ret;
> > > @@ -6847,7 +6874,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> > > void kvm_mmu_vendor_module_exit(void)
> > > {
> > > mmu_destroy_caches();
> > > - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> > > + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> > > unregister_shrinker(&mmu_shrinker);
> > > }
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > > index ac00bfbf32f6..c2a342028b6a 100644
> > > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > > @@ -325,4 +325,6 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > > void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > > void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > >
> > > +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> > > + spinlock_t *cache_lock);
> > > #endif /* __KVM_X86_MMU_INTERNAL_H */
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 764f7c87286f..4974fa96deff 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -264,7 +264,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> > > struct kvm_mmu_page *sp;
> > >
> > > sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> > > - sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> > > + sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> > > + &vcpu->arch.mmu_shadow_page_cache_lock);
> > >
> > > return sp;
> > > }
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 01aad8b74162..efd9b38ea9a2 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -1362,6 +1362,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
> > > int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
> > > int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
> > > int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
> > > +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc);
> > > void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> > > void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > > #endif
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 13e88297f999..f2d762878b97 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -438,8 +438,10 @@ int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
> > > return mc->nobjs;
> > > }
> > >
> > > -void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> > > +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc)
> > > {
> > > + int freed = mc->nobjs;
> > > +
> > > while (mc->nobjs) {
> > > if (mc->kmem_cache)
> > > kmem_cache_free(mc->kmem_cache, mc->objects[--mc->nobjs]);
> > > @@ -447,8 +449,13 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> > > free_page((unsigned long)mc->objects[--mc->nobjs]);
> > > }
> > >
> > > - kvfree(mc->objects);
> > > + return freed;
> > > +}
> > >
> > > +void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> > > +{
> > > + kvm_mmu_empty_memory_cache(mc);
> > > + kvfree(mc->objects);
> > > mc->objects = NULL;
> > > mc->capacity = 0;
> > > }
> > > --
> > > 2.39.0.314.g84b9a713c41-goog
> > >
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches
2023-01-04 0:25 ` Vipin Sharma
@ 2023-01-18 17:43 ` Sean Christopherson
0 siblings, 0 replies; 46+ messages in thread
From: Sean Christopherson @ 2023-01-18 17:43 UTC (permalink / raw)
To: Vipin Sharma; +Cc: David Matlack, pbonzini, bgardon, kvm, linux-kernel
@all, trim your replies!
On Tue, Jan 03, 2023, Vipin Sharma wrote:
> On Tue, Jan 3, 2023 at 10:01 AM Vipin Sharma <vipinsh@google.com> wrote:
> >
> > On Thu, Dec 29, 2022 at 1:55 PM David Matlack <dmatlack@google.com> wrote:
> > > > @@ -6646,66 +6690,49 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> > > > static unsigned long
> > > > mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > > > {
> > > > - struct kvm *kvm;
> > > > - int nr_to_scan = sc->nr_to_scan;
> > > > + struct kvm_mmu_memory_cache *cache;
> > > > + struct kvm *kvm, *first_kvm = NULL;
> > > > unsigned long freed = 0;
> > > > + /* spinlock for memory cache */
> > > > + spinlock_t *cache_lock;
> > > > + struct kvm_vcpu *vcpu;
> > > > + unsigned long i;
> > > >
> > > > mutex_lock(&kvm_lock);
> > > >
> > > > list_for_each_entry(kvm, &vm_list, vm_list) {
> > > > - int idx;
> > > > - LIST_HEAD(invalid_list);
> > > > -
> > > > - /*
> > > > - * Never scan more than sc->nr_to_scan VM instances.
> > > > - * Will not hit this condition practically since we do not try
> > > > - * to shrink more than one VM and it is very unlikely to see
> > > > - * !n_used_mmu_pages so many times.
> > > > - */
> > > > - if (!nr_to_scan--)
> > > > + if (first_kvm == kvm)
> > > > break;
> > > > - /*
> > > > - * n_used_mmu_pages is accessed without holding kvm->mmu_lock
> > > > - * here. We may skip a VM instance errorneosly, but we do not
> > > > - * want to shrink a VM that only started to populate its MMU
> > > > - * anyway.
> > > > - */
> > > > - if (!kvm->arch.n_used_mmu_pages &&
> > > > - !kvm_has_zapped_obsolete_pages(kvm))
> > > > - continue;
> > > > + if (!first_kvm)
> > > > + first_kvm = kvm;
> > > > + list_move_tail(&kvm->vm_list, &vm_list);
> > > >
> > > > - idx = srcu_read_lock(&kvm->srcu);
> > > > - write_lock(&kvm->mmu_lock);
> > > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > >
> > > What protects this from racing with vCPU creation/deletion?
> > >
>
> vCPU deletion:
> We take kvm_lock in mmu_shrink_scan(), the same lock is taken in
> kvm_destroy_vm() to remove a vm from vm_list. So, once we are
> iterating vm_list we will not see any VM removal which will means no
> vcpu removal.
>
> I didn't find any other code for vCPU deletion except failures during
> VM and VCPU set up. A VM is only added to vm_list after successful
> creation.
Yep, KVM doesn't support destroying/freeing a vCPU after it's been added.
> vCPU creation:
> I think it will work.
>
> kvm_vm_ioctl_create_vcpus() initializes the vcpu, adds it to
> kvm->vcpu_array which is of the type xarray and is managed by RCU.
> After this online_vcpus is incremented. So, kvm_for_each_vcpu() which
> uses RCU to read entries, if it sees incremented online_vcpus value
> then it will also see all of the vCPU initialization.
Yep. The shrinker may race with a vCPU creation, e.g. not process a just-created
vCPU, but that's totally ok in this case since the shrinker path is best effort
(and purging the caches of a newly created vCPU is likely pointless).
> @Sean, Paolo
>
> Is the above explanation correct, kvm_for_each_vcpu() is safe without any lock?
Well, in this case, you do need to hold kvm_lock ;-)
But yes, iterating over vCPUs without holding the per-VM kvm->lock is safe, the
caller just needs to ensure the VM can't be destroyed, i.e. either needs to hold
a reference to the VM or needs to hold kvm_lock.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches
2022-12-22 2:34 ` [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches Vipin Sharma
2022-12-27 18:37 ` Ben Gardon
2022-12-29 21:54 ` David Matlack
@ 2023-01-03 19:32 ` Mingwei Zhang
2023-01-04 1:00 ` Vipin Sharma
2 siblings, 1 reply; 46+ messages in thread
From: Mingwei Zhang @ 2023-01-03 19:32 UTC (permalink / raw)
To: Vipin Sharma; +Cc: seanjc, pbonzini, bgardon, dmatlack, kvm, linux-kernel
On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
>
> mmu_shrink_scan() is very disruptive to VMs. It picks the first
> VM in the vm_list, zaps the oldest page which is most likely an upper
> level SPTEs and most likely to be reused. Prior to TDP MMU, this is even
> more disruptive in nested VMs case, considering L1 SPTEs will be the
> oldest even though most of the entries are for L2 SPTEs.
>
> As discussed in
> https://lore.kernel.org/lkml/Y45dldZnI6OIf+a5@google.com/
> shrinker logic has not been very useful in actually keeping VMs performant
> and reducing memory usage.
>
> Change mmu_shrink_scan() to free pages from the vCPU's shadow page
> cache. Freeing pages from cache doesn't cause vCPU exits, therefore, a
> VM's performance should not be affected.
>
> This also allows to change cache capacities without worrying too much
> about high memory usage in cache.
>
> Tested this change by running dirty_log_perf_test while dropping cache
> via "echo 2 > /proc/sys/vm/drop_caches" at 1 second interval
> continuously. There were WARN_ON(!mc->nobjs) messages printed in kernel
> logs from kvm_mmu_memory_cache_alloc(), which is expected.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> ---
> arch/x86/include/asm/kvm_host.h | 5 +
> arch/x86/kvm/mmu/mmu.c | 163 +++++++++++++++++++-------------
> arch/x86/kvm/mmu/mmu_internal.h | 2 +
> arch/x86/kvm/mmu/tdp_mmu.c | 3 +-
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 11 ++-
> 6 files changed, 114 insertions(+), 71 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index aa4eb8cfcd7e..89cc809e4a00 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -786,6 +786,11 @@ struct kvm_vcpu_arch {
> struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> struct kvm_mmu_memory_cache mmu_page_header_cache;
>
> + /*
> + * Protects change in size of mmu_shadow_page_cache cache.
> + */
> + spinlock_t mmu_shadow_page_cache_lock;
> +
> /*
> * QEMU userspace and the guest each have their own FPU state.
> * In vcpu_run, we switch between the user and guest FPU contexts.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 254bc46234e0..157417e1cb6e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -164,7 +164,10 @@ struct kvm_shadow_walk_iterator {
>
> static struct kmem_cache *pte_list_desc_cache;
> struct kmem_cache *mmu_page_header_cache;
> -static struct percpu_counter kvm_total_used_mmu_pages;
> +/*
> + * Total number of unused pages in MMU shadow page cache.
> + */
> +static struct percpu_counter kvm_total_unused_mmu_pages;
>
> static void mmu_spte_set(u64 *sptep, u64 spte);
>
> @@ -655,6 +658,22 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> }
> }
>
> +static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> + spinlock_t *cache_lock)
> +{
> + int orig_nobjs;
> + int r;
> +
> + spin_lock(cache_lock);
> + orig_nobjs = cache->nobjs;
> + r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL);
> + if (orig_nobjs != cache->nobjs)
> + percpu_counter_add(&kvm_total_unused_mmu_pages,
> + (cache->nobjs - orig_nobjs));
> + spin_unlock(cache_lock);
> + return r;
> +}
> +
> static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> {
> int r;
> @@ -664,8 +683,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> if (r)
> return r;
> - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> - PT64_ROOT_MAX_LEVEL);
> + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> + &vcpu->arch.mmu_shadow_page_cache_lock);
> if (r)
> return r;
> if (maybe_indirect) {
> @@ -678,10 +697,25 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> PT64_ROOT_MAX_LEVEL);
> }
>
> +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> + spinlock_t *cache_lock)
> +{
> + int orig_nobjs;
> +
> + spin_lock(cache_lock);
> + orig_nobjs = cache->nobjs;
> + kvm_mmu_free_memory_cache(cache);
> + if (orig_nobjs)
> + percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
> +
> + spin_unlock(cache_lock);
> +}
I think the mmu_cache allocation and deallocation may cause the usage
of GFP_ATOMIC (as observed by other reviewers as well). Adding a new
lock would definitely sound like a plan, but I think it might affect
the performance. Alternatively, I am wondering if we could use a
mmu_cache_sequence similar to mmu_notifier_seq to help avoid the
concurrency issue?
Similar to mmu_notifier_seq, mmu_cache_sequence should be protected by
mmu write lock. In the page fault path, each vcpu has to collect a
snapshot of mmu_cache_sequence before calling into
mmu_topup_memory_caches() and check the value again when holding the
mmu lock. If the value is different, that means the mmu_shrinker has
removed the cache objects and because of that, the vcpu should retry.
> +
> static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> {
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> - kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> + &vcpu->arch.mmu_shadow_page_cache_lock);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> }
> @@ -1693,27 +1727,15 @@ static int is_empty_shadow_page(u64 *spt)
> }
> #endif
>
> -/*
> - * This value is the sum of all of the kvm instances's
> - * kvm->arch.n_used_mmu_pages values. We need a global,
> - * aggregate version in order to make the slab shrinker
> - * faster
> - */
> -static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
> -{
> - kvm->arch.n_used_mmu_pages += nr;
> - percpu_counter_add(&kvm_total_used_mmu_pages, nr);
> -}
> -
> static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> - kvm_mod_used_mmu_pages(kvm, +1);
> + kvm->arch.n_used_mmu_pages++;
> kvm_account_pgtable_pages((void *)sp->spt, +1);
> }
>
> static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> - kvm_mod_used_mmu_pages(kvm, -1);
> + kvm->arch.n_used_mmu_pages--;
> kvm_account_pgtable_pages((void *)sp->spt, -1);
> }
>
> @@ -2150,8 +2172,31 @@ struct shadow_page_caches {
> struct kvm_mmu_memory_cache *page_header_cache;
> struct kvm_mmu_memory_cache *shadow_page_cache;
> struct kvm_mmu_memory_cache *shadowed_info_cache;
> + /*
> + * Protects change in size of shadow_page_cache cache.
> + */
> + spinlock_t *shadow_page_cache_lock;
> };
>
> +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> + spinlock_t *cache_lock)
> +{
> + int orig_nobjs;
> + void *page;
> +
> + if (!cache_lock) {
> + spin_lock(cache_lock);
> + orig_nobjs = shadow_page_cache->nobjs;
> + }
> + page = kvm_mmu_memory_cache_alloc(shadow_page_cache);
> + if (!cache_lock) {
> + if (orig_nobjs)
> + percpu_counter_dec(&kvm_total_unused_mmu_pages);
> + spin_unlock(cache_lock);
> + }
> + return page;
> +}
> +
> static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> struct shadow_page_caches *caches,
> gfn_t gfn,
> @@ -2161,7 +2206,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
> - sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
> + sp->spt = kvm_mmu_sp_memory_cache_alloc(caches->shadow_page_cache,
> + caches->shadow_page_cache_lock);
> if (!role.direct)
> sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
>
> @@ -2218,6 +2264,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> + .shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
> };
>
> return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
> @@ -5916,6 +5963,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> + spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
>
> vcpu->arch.mmu = &vcpu->arch.root_mmu;
> vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> @@ -6051,11 +6099,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> kvm_tdp_mmu_zap_invalidated_roots(kvm);
> }
>
> -static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> -{
> - return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> -}
> -
> static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot,
> struct kvm_page_track_notifier_node *node)
> @@ -6277,6 +6320,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
> /* Direct SPs do not require a shadowed_info_cache. */
> caches.page_header_cache = &kvm->arch.split_page_header_cache;
> caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
> + caches.shadow_page_cache_lock = NULL;
>
> /* Safe to pass NULL for vCPU since requesting a direct SP. */
> return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
> @@ -6646,66 +6690,49 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> static unsigned long
> mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> {
> - struct kvm *kvm;
> - int nr_to_scan = sc->nr_to_scan;
> + struct kvm_mmu_memory_cache *cache;
> + struct kvm *kvm, *first_kvm = NULL;
> unsigned long freed = 0;
> + /* spinlock for memory cache */
> + spinlock_t *cache_lock;
> + struct kvm_vcpu *vcpu;
> + unsigned long i;
>
> mutex_lock(&kvm_lock);
>
> list_for_each_entry(kvm, &vm_list, vm_list) {
> - int idx;
> - LIST_HEAD(invalid_list);
> -
> - /*
> - * Never scan more than sc->nr_to_scan VM instances.
> - * Will not hit this condition practically since we do not try
> - * to shrink more than one VM and it is very unlikely to see
> - * !n_used_mmu_pages so many times.
> - */
> - if (!nr_to_scan--)
> + if (first_kvm == kvm)
> break;
> - /*
> - * n_used_mmu_pages is accessed without holding kvm->mmu_lock
> - * here. We may skip a VM instance errorneosly, but we do not
> - * want to shrink a VM that only started to populate its MMU
> - * anyway.
> - */
> - if (!kvm->arch.n_used_mmu_pages &&
> - !kvm_has_zapped_obsolete_pages(kvm))
> - continue;
> + if (!first_kvm)
> + first_kvm = kvm;
> + list_move_tail(&kvm->vm_list, &vm_list);
>
> - idx = srcu_read_lock(&kvm->srcu);
> - write_lock(&kvm->mmu_lock);
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + cache = &vcpu->arch.mmu_shadow_page_cache;
> + cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock;
> + if (READ_ONCE(cache->nobjs)) {
> + spin_lock(cache_lock);
> + freed += kvm_mmu_empty_memory_cache(cache);
> + spin_unlock(cache_lock);
> + }
>
> - if (kvm_has_zapped_obsolete_pages(kvm)) {
> - kvm_mmu_commit_zap_page(kvm,
> - &kvm->arch.zapped_obsolete_pages);
> - goto unlock;
> }
>
> - freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
> -
> -unlock:
> - write_unlock(&kvm->mmu_lock);
> - srcu_read_unlock(&kvm->srcu, idx);
> -
> - /*
> - * unfair on small ones
> - * per-vm shrinkers cry out
> - * sadness comes quickly
> - */
> - list_move_tail(&kvm->vm_list, &vm_list);
> - break;
> + if (freed >= sc->nr_to_scan)
> + break;
> }
>
> + if (freed)
> + percpu_counter_sub(&kvm_total_unused_mmu_pages, freed);
> mutex_unlock(&kvm_lock);
> + percpu_counter_sync(&kvm_total_unused_mmu_pages);
> return freed;
> }
>
> static unsigned long
> mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> {
> - return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
> + return percpu_counter_sum_positive(&kvm_total_unused_mmu_pages);
> }
>
> static struct shrinker mmu_shrinker = {
> @@ -6820,7 +6847,7 @@ int kvm_mmu_vendor_module_init(void)
> if (!mmu_page_header_cache)
> goto out;
>
> - if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
> + if (percpu_counter_init(&kvm_total_unused_mmu_pages, 0, GFP_KERNEL))
> goto out;
>
> ret = register_shrinker(&mmu_shrinker, "x86-mmu");
> @@ -6830,7 +6857,7 @@ int kvm_mmu_vendor_module_init(void)
> return 0;
>
> out_shrinker:
> - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> out:
> mmu_destroy_caches();
> return ret;
> @@ -6847,7 +6874,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> void kvm_mmu_vendor_module_exit(void)
> {
> mmu_destroy_caches();
> - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> unregister_shrinker(&mmu_shrinker);
> }
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index ac00bfbf32f6..c2a342028b6a 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -325,4 +325,6 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> + spinlock_t *cache_lock);
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 764f7c87286f..4974fa96deff 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -264,7 +264,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> - sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> + sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> + &vcpu->arch.mmu_shadow_page_cache_lock);
>
> return sp;
> }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 01aad8b74162..efd9b38ea9a2 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1362,6 +1362,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
> int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
> int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
> int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
> +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc);
> void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 13e88297f999..f2d762878b97 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -438,8 +438,10 @@ int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
> return mc->nobjs;
> }
>
> -void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc)
> {
> + int freed = mc->nobjs;
> +
> while (mc->nobjs) {
> if (mc->kmem_cache)
> kmem_cache_free(mc->kmem_cache, mc->objects[--mc->nobjs]);
> @@ -447,8 +449,13 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> free_page((unsigned long)mc->objects[--mc->nobjs]);
> }
>
> - kvfree(mc->objects);
> + return freed;
> +}
>
> +void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> +{
> + kvm_mmu_empty_memory_cache(mc);
> + kvfree(mc->objects);
> mc->objects = NULL;
> mc->capacity = 0;
> }
> --
> 2.39.0.314.g84b9a713c41-goog
>
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches
2023-01-03 19:32 ` Mingwei Zhang
@ 2023-01-04 1:00 ` Vipin Sharma
2023-01-04 6:29 ` Mingwei Zhang
0 siblings, 1 reply; 46+ messages in thread
From: Vipin Sharma @ 2023-01-04 1:00 UTC (permalink / raw)
To: Mingwei Zhang; +Cc: seanjc, pbonzini, bgardon, dmatlack, kvm, linux-kernel
On Tue, Jan 3, 2023 at 11:32 AM Mingwei Zhang <mizhang@google.com> wrote:
>
> On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
> >
> > +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > + spinlock_t *cache_lock)
> > +{
> > + int orig_nobjs;
> > +
> > + spin_lock(cache_lock);
> > + orig_nobjs = cache->nobjs;
> > + kvm_mmu_free_memory_cache(cache);
> > + if (orig_nobjs)
> > + percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
> > +
> > + spin_unlock(cache_lock);
> > +}
>
> I think the mmu_cache allocation and deallocation may cause the usage
> of GFP_ATOMIC (as observed by other reviewers as well). Adding a new
> lock would definitely sound like a plan, but I think it might affect
> the performance. Alternatively, I am wondering if we could use a
> mmu_cache_sequence similar to mmu_notifier_seq to help avoid the
> concurrency?
>
Can you explain more about the performance impact? Each vcpu will have
its own mutex. So, only contention will be with the mmu_shrinker. This
shrinker will use mutex_try_lock() which will not block to wait for
the lock, it will just pass on to the next vcpu. While shrinker is
holding the lock, vcpu will be blocked in the page fault path but I
think it should not have a huge impact considering it will execute
rarely and for a small time.
> Similar to mmu_notifier_seq, mmu_cache_sequence should be protected by
> mmu write lock. In the page fault path, each vcpu has to collect a
> snapshot of mmu_cache_sequence before calling into
> mmu_topup_memory_caches() and check the value again when holding the
> mmu lock. If the value is different, that means the mmu_shrinker has
> removed the cache objects and because of that, the vcpu should retry.
>
Yeah, this can be one approach. I think it will come down to the
performance impact of using mutex which I don't think should be a
concern.
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches
2023-01-04 1:00 ` Vipin Sharma
@ 2023-01-04 6:29 ` Mingwei Zhang
2023-01-04 6:57 ` Mingwei Zhang
2023-01-18 17:36 ` Sean Christopherson
0 siblings, 2 replies; 46+ messages in thread
From: Mingwei Zhang @ 2023-01-04 6:29 UTC (permalink / raw)
To: Vipin Sharma; +Cc: seanjc, pbonzini, bgardon, dmatlack, kvm, linux-kernel
On Tue, Jan 3, 2023 at 5:00 PM Vipin Sharma <vipinsh@google.com> wrote:
>
> On Tue, Jan 3, 2023 at 11:32 AM Mingwei Zhang <mizhang@google.com> wrote:
> >
> > On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
> > >
> > > +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > > + spinlock_t *cache_lock)
> > > +{
> > > + int orig_nobjs;
> > > +
> > > + spin_lock(cache_lock);
> > > + orig_nobjs = cache->nobjs;
> > > + kvm_mmu_free_memory_cache(cache);
> > > + if (orig_nobjs)
> > > + percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
> > > +
> > > + spin_unlock(cache_lock);
> > > +}
> >
> > I think the mmu_cache allocation and deallocation may cause the usage
> > of GFP_ATOMIC (as observed by other reviewers as well). Adding a new
> > lock would definitely sound like a plan, but I think it might affect
> > the performance. Alternatively, I am wondering if we could use a
> > mmu_cache_sequence similar to mmu_notifier_seq to help avoid the
> > concurrency?
> >
>
> Can you explain more about the performance impact? Each vcpu will have
> its own mutex. So, only contention will be with the mmu_shrinker. This
> shrinker will use mutex_try_lock() which will not block to wait for
> the lock, it will just pass on to the next vcpu. While shrinker is
> holding the lock, vcpu will be blocked in the page fault path but I
> think it should not have a huge impact considering it will execute
> rarely and for a small time.
>
> > Similar to mmu_notifier_seq, mmu_cache_sequence should be protected by
> > mmu write lock. In the page fault path, each vcpu has to collect a
> > snapshot of mmu_cache_sequence before calling into
> > mmu_topup_memory_caches() and check the value again when holding the
> > mmu lock. If the value is different, that means the mmu_shrinker has
> > removed the cache objects and because of that, the vcpu should retry.
> >
>
> Yeah, this can be one approach. I think it will come down to the
> performance impact of using mutex which I don't think should be a
> concern.
hmm, I think you are right that there is no performance overhead by
adding a mutex and letting the shrinker use mutex_trylock(). The
point of using a sequence counter is to avoid the new lock, since
introducing a new lock will increase management burden. So unless it
is necessary, we probably should choose a simple solution first.
In this case, I think we do have such a choice, since a similar
mechanism has already been used by mmu_notifiers.
best
-Mingwei
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches
2023-01-04 6:29 ` Mingwei Zhang
@ 2023-01-04 6:57 ` Mingwei Zhang
2023-01-18 17:36 ` Sean Christopherson
1 sibling, 0 replies; 46+ messages in thread
From: Mingwei Zhang @ 2023-01-04 6:57 UTC (permalink / raw)
To: Vipin Sharma; +Cc: seanjc, pbonzini, bgardon, dmatlack, kvm, linux-kernel
On Tue, Jan 3, 2023 at 10:29 PM Mingwei Zhang <mizhang@google.com> wrote:
>
> On Tue, Jan 3, 2023 at 5:00 PM Vipin Sharma <vipinsh@google.com> wrote:
> >
> > On Tue, Jan 3, 2023 at 11:32 AM Mingwei Zhang <mizhang@google.com> wrote:
> > >
> > > On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
> > > >
> > > > +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > > > + spinlock_t *cache_lock)
> > > > +{
> > > > + int orig_nobjs;
> > > > +
> > > > + spin_lock(cache_lock);
> > > > + orig_nobjs = cache->nobjs;
> > > > + kvm_mmu_free_memory_cache(cache);
> > > > + if (orig_nobjs)
> > > > + percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
> > > > +
> > > > + spin_unlock(cache_lock);
> > > > +}
> > >
> > > I think the mmu_cache allocation and deallocation may cause the usage
> > > of GFP_ATOMIC (as observed by other reviewers as well). Adding a new
> > > lock would definitely sound like a plan, but I think it might affect
> > > the performance. Alternatively, I am wondering if we could use a
> > > mmu_cache_sequence similar to mmu_notifier_seq to help avoid the
> > > concurrency?
> > >
> >
> > Can you explain more about the performance impact? Each vcpu will have
> > its own mutex. So, only contention will be with the mmu_shrinker. This
> > shrinker will use mutex_try_lock() which will not block to wait for
> > the lock, it will just pass on to the next vcpu. While shrinker is
> > holding the lock, vcpu will be blocked in the page fault path but I
> > think it should not have a huge impact considering it will execute
> > rarely and for a small time.
> >
> > > Similar to mmu_notifier_seq, mmu_cache_sequence should be protected by
> > > mmu write lock. In the page fault path, each vcpu has to collect a
> > > snapshot of mmu_cache_sequence before calling into
> > > mmu_topup_memory_caches() and check the value again when holding the
> > > mmu lock. If the value is different, that means the mmu_shrinker has
> > > removed the cache objects and because of that, the vcpu should retry.
> > >
> >
> > Yeah, this can be one approach. I think it will come down to the
> > performance impact of using mutex which I don't think should be a
> > concern.
>
> hmm, I think you are right that there is no performance overhead by
> adding a mutex and letting the shrinker using mutex_trylock(). The
> point of using a sequence counter is to avoid the new lock, since
> introducing a new lock will increase management burden. So unless it
> is necessary, we probably should choose a simple solution first.
>
> In this case, I think we do have such a choice and since a similar
> mechanism has already been used by mmu_notifiers.
>
Let me take it back. The per-vcpu sequence number in this case has to
be protected by a VM level mmu write lock. I think this might be less
performant than using a per-vcpu mutex.
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches
2023-01-04 6:29 ` Mingwei Zhang
2023-01-04 6:57 ` Mingwei Zhang
@ 2023-01-18 17:36 ` Sean Christopherson
1 sibling, 0 replies; 46+ messages in thread
From: Sean Christopherson @ 2023-01-18 17:36 UTC (permalink / raw)
To: Mingwei Zhang
Cc: Vipin Sharma, pbonzini, bgardon, dmatlack, kvm, linux-kernel
On Tue, Jan 03, 2023, Mingwei Zhang wrote:
> On Tue, Jan 3, 2023 at 5:00 PM Vipin Sharma <vipinsh@google.com> wrote:
> > > I think the mmu_cache allocation and deallocation may cause the usage
> > > of GFP_ATOMIC (as observed by other reviewers as well). Adding a new
> > > lock would definitely sound like a plan, but I think it might affect
> > > the performance. Alternatively, I am wondering if we could use a
> > > mmu_cache_sequence similar to mmu_notifier_seq to help avoid the
> > > concurrency?
> > >
> >
> > Can you explain more about the performance impact? Each vcpu will have
> > its own mutex. So, only contention will be with the mmu_shrinker. This
> > shrinker will use mutex_try_lock() which will not block to wait for
> > the lock, it will just pass on to the next vcpu. While shrinker is
> > holding the lock, vcpu will be blocked in the page fault path but I
> > think it should not have a huge impact considering it will execute
> > rarely and for a small time.
> >
> > > Similar to mmu_notifier_seq, mmu_cache_sequence should be protected by
> > > mmu write lock. In the page fault path, each vcpu has to collect a
> > > snapshot of mmu_cache_sequence before calling into
> > > mmu_topup_memory_caches() and check the value again when holding the
> > > mmu lock. If the value is different, that means the mmu_shrinker has
> > > removed the cache objects and because of that, the vcpu should retry.
> > >
> >
> > Yeah, this can be one approach. I think it will come down to the
> > performance impact of using mutex which I don't think should be a
> > concern.
>
> hmm, I think you are right that there is no performance overhead by
> adding a mutex and letting the shrinker using mutex_trylock(). The
> point of using a sequence counter is to avoid the new lock, since
> introducing a new lock will increase management burden.
No, more locks doesn't necessarily mean higher maintenance cost. More complexity
definitely means more maintenance, but additional locks doesn't necessarily equate
to increased complexity.
Lockless algorithms are almost always more difficult to reason about, i.e. trying
to use a sequence counter for this case would be more complex than using a per-vCPU
mutex. The only complexity in adding another mutex is understanding why an additional
lock necessary, and IMO that's fairly easy to explain/understand (the shrinker will
almost never succeed if it has to wait for vcpu->mutex to be dropped).
> So unless it is necessary, we probably should choose a simple solution first.
>
> In this case, I think we do have such a choice and since a similar
> mechanism has already been used by mmu_notifiers.
The mmu_notifier case is very different. The invalidations affect the entire VM,
notifiers _must_ succeed, may or may not allowing sleeping, the readers (vCPUs)
effectively need protection while running in the guest, and practically speaking
holding a per-VM (or global) lock of any kind while a vCPU is running in the guest
is not viable, e.g. even holding kvm->srcu is disallowed.
In other words, using a traditional locking scheme to serialize guest accesses
with host-initiated page table (or memslot) updates is simply not an option.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [Patch v3 2/9] KVM: x86/mmu: Remove zapped_obsolete_pages from struct kvm_arch{}
2022-12-22 2:34 [Patch v3 0/9] NUMA aware page table's pages allocation Vipin Sharma
2022-12-22 2:34 ` [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches Vipin Sharma
@ 2022-12-22 2:34 ` Vipin Sharma
2022-12-29 21:59 ` David Matlack
2022-12-22 2:34 ` [Patch v3 3/9] KVM: x86/mmu: Shrink split_shadow_page_cache via KVM MMU shrinker Vipin Sharma
` (6 subsequent siblings)
8 siblings, 1 reply; 46+ messages in thread
From: Vipin Sharma @ 2022-12-22 2:34 UTC (permalink / raw)
To: seanjc, pbonzini, bgardon, dmatlack; +Cc: kvm, linux-kernel, Vipin Sharma
zapped_obsolete_pages list was used in struct kvm_arch{} to provide
pages for KVM MMU shrinker. This is not needed now as KVM MMU shrinker
has been repurposed to free shadow page caches and not
zapped_obsolete_pages.
Remove zapped_obsolete_pages from struct kvm_arch{} and use local list
in kvm_zap_obsolete_pages().
Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
arch/x86/include/asm/kvm_host.h | 1 -
arch/x86/kvm/mmu/mmu.c | 8 ++++----
2 files changed, 4 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 89cc809e4a00..f89f02e18080 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1215,7 +1215,6 @@ struct kvm_arch {
u8 mmu_valid_gen;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
struct list_head active_mmu_pages;
- struct list_head zapped_obsolete_pages;
/*
* A list of kvm_mmu_page structs that, if zapped, could possibly be
* replaced by an NX huge page. A shadow page is on this list if its
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 157417e1cb6e..3364760a1695 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5987,6 +5987,7 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
{
struct kvm_mmu_page *sp, *node;
int nr_zapped, batch = 0;
+ LIST_HEAD(zapped_pages);
bool unstable;
restart:
@@ -6019,8 +6020,8 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
goto restart;
}
- unstable = __kvm_mmu_prepare_zap_page(kvm, sp,
- &kvm->arch.zapped_obsolete_pages, &nr_zapped);
+ unstable = __kvm_mmu_prepare_zap_page(kvm, sp, &zapped_pages,
+ &nr_zapped);
batch += nr_zapped;
if (unstable)
@@ -6036,7 +6037,7 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
* kvm_mmu_load()), and the reload in the caller ensure no vCPUs are
* running with an obsolete MMU.
*/
- kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
+ kvm_mmu_commit_zap_page(kvm, &zapped_pages);
}
/*
@@ -6112,7 +6113,6 @@ int kvm_mmu_init_vm(struct kvm *kvm)
int r;
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
- INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
--
2.39.0.314.g84b9a713c41-goog
^ permalink raw reply related [flat|nested] 46+ messages in thread* Re: [Patch v3 2/9] KVM: x86/mmu: Remove zapped_obsolete_pages from struct kvm_arch{}
2022-12-22 2:34 ` [Patch v3 2/9] KVM: x86/mmu: Remove zapped_obsolete_pages from struct kvm_arch{} Vipin Sharma
@ 2022-12-29 21:59 ` David Matlack
0 siblings, 0 replies; 46+ messages in thread
From: David Matlack @ 2022-12-29 21:59 UTC (permalink / raw)
To: Vipin Sharma; +Cc: seanjc, pbonzini, bgardon, kvm, linux-kernel
On Wed, Dec 21, 2022 at 06:34:50PM -0800, Vipin Sharma wrote:
> zapped_obsolete_pages list was used in struct kvm_arch{} to provide
> pages for KVM MMU shrinker. This is not needed now as KVM MMU shrinker
> has been repurposed to free shadow page caches and not
> zapped_obsolete_pages.
>
> Remove zapped_obsolete_pages from struct kvm_arch{} and use local list
> in kvm_zap_obsolete_pages().
>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
> ---
> arch/x86/include/asm/kvm_host.h | 1 -
> arch/x86/kvm/mmu/mmu.c | 8 ++++----
> 2 files changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 89cc809e4a00..f89f02e18080 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1215,7 +1215,6 @@ struct kvm_arch {
> u8 mmu_valid_gen;
> struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> struct list_head active_mmu_pages;
> - struct list_head zapped_obsolete_pages;
> /*
> * A list of kvm_mmu_page structs that, if zapped, could possibly be
> * replaced by an NX huge page. A shadow page is on this list if its
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 157417e1cb6e..3364760a1695 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5987,6 +5987,7 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
> {
> struct kvm_mmu_page *sp, *node;
> int nr_zapped, batch = 0;
> + LIST_HEAD(zapped_pages);
optional nit: The common name of this is invalid_list (see other callers
of __kvm_mmu_prepare_zap_page()).
^ permalink raw reply [flat|nested] 46+ messages in thread
* [Patch v3 3/9] KVM: x86/mmu: Shrink split_shadow_page_cache via KVM MMU shrinker
2022-12-22 2:34 [Patch v3 0/9] NUMA aware page table's pages allocation Vipin Sharma
2022-12-22 2:34 ` [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches Vipin Sharma
2022-12-22 2:34 ` [Patch v3 2/9] KVM: x86/mmu: Remove zapped_obsolete_pages from struct kvm_arch{} Vipin Sharma
@ 2022-12-22 2:34 ` Vipin Sharma
2022-12-22 2:34 ` [Patch v3 4/9] KVM: Add module param to make page tables NUMA aware Vipin Sharma
` (5 subsequent siblings)
8 siblings, 0 replies; 46+ messages in thread
From: Vipin Sharma @ 2022-12-22 2:34 UTC (permalink / raw)
To: seanjc, pbonzini, bgardon, dmatlack; +Cc: kvm, linux-kernel, Vipin Sharma
split_shadow_page_cache is not used after dirty log is disabled. It is a
good candidate to free memory in case mmu_shrink_scan kicks in.
Account for split_shadow_page_cache via kvm_total_unused_mmu_pages and
use it in mmu_shrink_scan.
Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
arch/x86/include/asm/kvm_host.h | 5 +++
arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++--------------
2 files changed, 42 insertions(+), 26 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f89f02e18080..293994fabae3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1413,6 +1413,11 @@ struct kvm_arch {
struct kvm_mmu_memory_cache split_shadow_page_cache;
struct kvm_mmu_memory_cache split_page_header_cache;
+ /*
+ * Protects change in size of split_shadow_page_cache cache.
+ */
+ spinlock_t split_shadow_page_cache_lock;
+
/*
* Memory cache used to allocate pte_list_desc structs while splitting
* huge pages. In the worst case, to split one huge page, 512
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3364760a1695..6f6a10d7a871 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -659,14 +659,15 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
}
static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
- spinlock_t *cache_lock)
+ spinlock_t *cache_lock,
+ int min)
{
int orig_nobjs;
int r;
spin_lock(cache_lock);
orig_nobjs = cache->nobjs;
- r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL);
+ r = kvm_mmu_topup_memory_cache(cache, min);
if (orig_nobjs != cache->nobjs)
percpu_counter_add(&kvm_total_unused_mmu_pages,
(cache->nobjs - orig_nobjs));
@@ -684,7 +685,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
if (r)
return r;
r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
- &vcpu->arch.mmu_shadow_page_cache_lock);
+ &vcpu->arch.mmu_shadow_page_cache_lock,
+ PT64_ROOT_MAX_LEVEL);
if (r)
return r;
if (maybe_indirect) {
@@ -2184,16 +2186,12 @@ void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cac
int orig_nobjs;
void *page;
- if (!cache_lock) {
- spin_lock(cache_lock);
- orig_nobjs = shadow_page_cache->nobjs;
- }
+ spin_lock(cache_lock);
+ orig_nobjs = shadow_page_cache->nobjs;
page = kvm_mmu_memory_cache_alloc(shadow_page_cache);
- if (!cache_lock) {
- if (orig_nobjs)
- percpu_counter_dec(&kvm_total_unused_mmu_pages);
- spin_unlock(cache_lock);
- }
+ if (orig_nobjs)
+ percpu_counter_dec(&kvm_total_unused_mmu_pages);
+ spin_unlock(cache_lock);
return page;
}
@@ -6130,6 +6128,7 @@ int kvm_mmu_init_vm(struct kvm *kvm)
kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
+ spin_lock_init(&kvm->arch.split_shadow_page_cache_lock);
kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
@@ -6141,7 +6140,8 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
{
kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
- kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
+ mmu_free_sp_memory_cache(&kvm->arch.split_shadow_page_cache,
+ &kvm->arch.split_shadow_page_cache_lock);
}
void kvm_mmu_uninit_vm(struct kvm *kvm)
@@ -6295,7 +6295,9 @@ static int topup_split_caches(struct kvm *kvm)
if (r)
return r;
- return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
+ return mmu_topup_sp_memory_cache(&kvm->arch.split_shadow_page_cache,
+ &kvm->arch.split_shadow_page_cache_lock,
+ 1);
}
static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
@@ -6320,7 +6322,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
/* Direct SPs do not require a shadowed_info_cache. */
caches.page_header_cache = &kvm->arch.split_page_header_cache;
caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
- caches.shadow_page_cache_lock = NULL;
+ caches.shadow_page_cache_lock = &kvm->arch.split_shadow_page_cache_lock;
/* Safe to pass NULL for vCPU since requesting a direct SP. */
return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
@@ -6687,14 +6689,23 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
}
}
+static unsigned long mmu_shrink_cache(struct kvm_mmu_memory_cache *cache,
+ spinlock_t *cache_lock)
+{
+ unsigned long freed = 0;
+
+ spin_lock(cache_lock);
+ if (cache->nobjs)
+ freed = kvm_mmu_empty_memory_cache(cache);
+ spin_unlock(cache_lock);
+ return freed;
+}
+
static unsigned long
mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
- struct kvm_mmu_memory_cache *cache;
struct kvm *kvm, *first_kvm = NULL;
unsigned long freed = 0;
- /* spinlock for memory cache */
- spinlock_t *cache_lock;
struct kvm_vcpu *vcpu;
unsigned long i;
@@ -6707,15 +6718,15 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
first_kvm = kvm;
list_move_tail(&kvm->vm_list, &vm_list);
- kvm_for_each_vcpu(i, vcpu, kvm) {
- cache = &vcpu->arch.mmu_shadow_page_cache;
- cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock;
- if (READ_ONCE(cache->nobjs)) {
- spin_lock(cache_lock);
- freed += kvm_mmu_empty_memory_cache(cache);
- spin_unlock(cache_lock);
- }
+ freed += mmu_shrink_cache(&kvm->arch.split_shadow_page_cache,
+ &kvm->arch.split_shadow_page_cache_lock);
+ if (freed >= sc->nr_to_scan)
+ break;
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ freed += mmu_shrink_cache(&vcpu->arch.mmu_shadow_page_cache,
+ &vcpu->arch.mmu_shadow_page_cache_lock);
}
if (freed >= sc->nr_to_scan)
--
2.39.0.314.g84b9a713c41-goog
^ permalink raw reply related [flat|nested] 46+ messages in thread* [Patch v3 4/9] KVM: Add module param to make page tables NUMA aware
2022-12-22 2:34 [Patch v3 0/9] NUMA aware page table's pages allocation Vipin Sharma
` (2 preceding siblings ...)
2022-12-22 2:34 ` [Patch v3 3/9] KVM: x86/mmu: Shrink split_shadow_page_cache via KVM MMU shrinker Vipin Sharma
@ 2022-12-22 2:34 ` Vipin Sharma
2022-12-29 22:05 ` David Matlack
2022-12-22 2:34 ` [Patch v3 5/9] KVM: x86/mmu: Allocate TDP page table's page on correct NUMA node on split Vipin Sharma
` (4 subsequent siblings)
8 siblings, 1 reply; 46+ messages in thread
From: Vipin Sharma @ 2022-12-22 2:34 UTC (permalink / raw)
To: seanjc, pbonzini, bgardon, dmatlack; +Cc: kvm, linux-kernel, Vipin Sharma
Add a numa_aware_pagetable module param to make page tables NUMA aware.
Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
include/linux/kvm_host.h | 2 ++
virt/kvm/kvm_main.c | 22 ++++++++++++++++++++++
2 files changed, 24 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index efd9b38ea9a2..d48064503b88 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1358,6 +1358,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool usermode_vcpu_not_eligible);
void kvm_flush_remote_tlbs(struct kvm *kvm);
+void *kvm_mmu_get_free_page(int nid, gfp_t gfp);
+
#ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f2d762878b97..d96c8146e9ba 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -93,6 +93,13 @@ unsigned int halt_poll_ns_shrink;
module_param(halt_poll_ns_shrink, uint, 0644);
EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);
+/*
+ * If possible, allocate page table's pages on the same node the underlying
+ * physical page is pointing to.
+ */
+static bool __read_mostly numa_aware_pagetable = true;
+module_param_named(numa_aware_pagetable, numa_aware_pagetable, bool, 0644);
+
/*
* Ordering of locks:
*
@@ -384,6 +391,21 @@ static void kvm_flush_shadow_all(struct kvm *kvm)
kvm_arch_guest_memory_reclaimed(kvm);
}
+void *kvm_mmu_get_free_page(int nid, gfp_t gfp)
+{
+ #ifdef CONFIG_NUMA
+ struct page *spt_page;
+
+ if (numa_aware_pagetable) {
+ spt_page = alloc_pages_node(nid, gfp, 0);
+ if (spt_page)
+ return page_address(spt_page);
+ }
+ #endif // CONFIG_NUMA
+
+ return (void *)__get_free_page(gfp);
+}
+
#ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
gfp_t gfp_flags)
--
2.39.0.314.g84b9a713c41-goog
^ permalink raw reply related [flat|nested] 46+ messages in thread* [Patch v3 5/9] KVM: x86/mmu: Allocate TDP page table's page on correct NUMA node on split
2022-12-22 2:34 [Patch v3 0/9] NUMA aware page table's pages allocation Vipin Sharma
` (3 preceding siblings ...)
2022-12-22 2:34 ` [Patch v3 4/9] KVM: Add module param to make page tables NUMA aware Vipin Sharma
@ 2022-12-22 2:34 ` Vipin Sharma
2022-12-27 19:02 ` Ben Gardon
2022-12-29 22:30 ` David Matlack
2022-12-22 2:34 ` [Patch v3 6/9] KVM: Provide NUMA node support to kvm_mmu_memory_cache{} Vipin Sharma
` (3 subsequent siblings)
8 siblings, 2 replies; 46+ messages in thread
From: Vipin Sharma @ 2022-12-22 2:34 UTC (permalink / raw)
To: seanjc, pbonzini, bgardon, dmatlack; +Cc: kvm, linux-kernel, Vipin Sharma
When dirty log is enabled, huge pages are split. Page table's pages
during the split are allocated based on the current thread NUMA node or
mempolicy. This causes inefficient page table accesses if the underlying
page is on a different NUMA node.
Allocate page table's pages on the same NUMA node as the underlying huge
page when dirty log is enabled and huge pages are split.
The performance gain during the pre-copy phase of live migrations of a
416 vCPUs and 11 TiB memory VM on a 8 node host was seen in the range
of 130% to 150%.
Suggested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++----
include/linux/kvm_host.h | 18 ++++++++++++++++++
2 files changed, 26 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 4974fa96deff..376b8dceb3f9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1403,7 +1403,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
return spte_set;
}
-static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
+static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp)
{
struct kvm_mmu_page *sp;
@@ -1413,7 +1413,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
if (!sp)
return NULL;
- sp->spt = (void *)__get_free_page(gfp);
+ sp->spt = kvm_mmu_get_free_page(nid, gfp);
+
if (!sp->spt) {
kmem_cache_free(mmu_page_header_cache, sp);
return NULL;
@@ -1427,6 +1428,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
bool shared)
{
struct kvm_mmu_page *sp;
+ int nid;
+
+ nid = kvm_pfn_to_page_table_nid(spte_to_pfn(iter->old_spte));
/*
* Since we are allocating while under the MMU lock we have to be
@@ -1437,7 +1441,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
* If this allocation fails we drop the lock and retry with reclaim
* allowed.
*/
- sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
+ sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_NOWAIT | __GFP_ACCOUNT);
if (sp)
return sp;
@@ -1449,7 +1453,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
write_unlock(&kvm->mmu_lock);
iter->yielded = true;
- sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
+ sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_KERNEL_ACCOUNT);
if (shared)
read_lock(&kvm->mmu_lock);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d48064503b88..a262e15ebd19 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1583,6 +1583,24 @@ void kvm_arch_sync_events(struct kvm *kvm);
int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn);
+
+/*
+ * Tells the appropriate NUMA node location of the page table's page based on
+ * pfn it will point to.
+ *
+ * Return the nid of the page if pfn is valid and backed by a refcounted page,
+ * otherwise, return the nearest memory node for the current CPU.
+ */
+static inline int kvm_pfn_to_page_table_nid(kvm_pfn_t pfn)
+{
+ struct page *page = kvm_pfn_to_refcounted_page(pfn);
+
+ if (page)
+ return page_to_nid(page);
+ else
+ return numa_mem_id();
+}
+
bool kvm_is_zone_device_page(struct page *page);
struct kvm_irq_ack_notifier {
--
2.39.0.314.g84b9a713c41-goog
^ permalink raw reply related [flat|nested] 46+ messages in thread* Re: [Patch v3 5/9] KVM: x86/mmu: Allocate TDP page table's page on correct NUMA node on split
2022-12-22 2:34 ` [Patch v3 5/9] KVM: x86/mmu: Allocate TDP page table's page on correct NUMA node on split Vipin Sharma
@ 2022-12-27 19:02 ` Ben Gardon
2022-12-28 22:07 ` Vipin Sharma
2022-12-29 22:30 ` David Matlack
1 sibling, 1 reply; 46+ messages in thread
From: Ben Gardon @ 2022-12-27 19:02 UTC (permalink / raw)
To: Vipin Sharma; +Cc: seanjc, pbonzini, dmatlack, kvm, linux-kernel
On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
>
> When dirty log is enabled, huge pages are split. Page table's pages
Nit: Suggest "When huge pages are split for dirty log" since this can
happen at various points during dirty logging.
Same below.
> during the split are allocated based on the current thread NUMA node or
> mempolicy. This causes inefficient page table accesses if underlying
> page is on a different NUMA node
>
> Allocate page table's pages on the same NUMA node as the underlying huge
> page when dirty log is enabled and huge pages are split.
>
> The performance gain during the pre-copy phase of live migrations of a
> 416 vCPUs and 11 TiB memory VM on a 8 node host was seen in the range
> of 130% to 150%.
>
> Suggested-by: David Matlack <dmatlack@google.com>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++----
> include/linux/kvm_host.h | 18 ++++++++++++++++++
> 2 files changed, 26 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 4974fa96deff..376b8dceb3f9 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1403,7 +1403,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
> return spte_set;
> }
>
> -static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> +static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp)
> {
> struct kvm_mmu_page *sp;
>
> @@ -1413,7 +1413,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> if (!sp)
> return NULL;
>
> - sp->spt = (void *)__get_free_page(gfp);
> + sp->spt = kvm_mmu_get_free_page(nid, gfp);
> +
Just so that kvm_mmu_get_free_page isn't dead code in the previous
commit, I'd do this refactor there and just pass NUMA_NO_NODE here.
> if (!sp->spt) {
> kmem_cache_free(mmu_page_header_cache, sp);
> return NULL;
> @@ -1427,6 +1428,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> bool shared)
> {
> struct kvm_mmu_page *sp;
> + int nid;
> +
> + nid = kvm_pfn_to_page_table_nid(spte_to_pfn(iter->old_spte));
>
> /*
> * Since we are allocating while under the MMU lock we have to be
> @@ -1437,7 +1441,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> * If this allocation fails we drop the lock and retry with reclaim
> * allowed.
> */
> - sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
> + sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_NOWAIT | __GFP_ACCOUNT);
> if (sp)
> return sp;
>
> @@ -1449,7 +1453,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> write_unlock(&kvm->mmu_lock);
>
> iter->yielded = true;
> - sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
> + sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_KERNEL_ACCOUNT);
>
> if (shared)
> read_lock(&kvm->mmu_lock);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d48064503b88..a262e15ebd19 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1583,6 +1583,24 @@ void kvm_arch_sync_events(struct kvm *kvm);
> int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
>
> struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn);
> +
> +/*
> + * Tells the appropriate NUMA node location of the page table's page based on
> + * pfn it will point to.
> + *
> + * Return the nid of the page if pfn is valid and backed by a refcounted page,
> + * otherwise, return the nearest memory node for the current CPU.
Nit: Should this be "current thread"?
> + */
> +static inline int kvm_pfn_to_page_table_nid(kvm_pfn_t pfn)
This could just be kvm_pfn_nid (or even better kvm_pfn_node_id) since
this really has nothing to do with page tables. We just want to know
which NUMA node backs the given PFN.
> +{
> + struct page *page = kvm_pfn_to_refcounted_page(pfn);
> +
> + if (page)
> + return page_to_nid(page);
> + else
> + return numa_mem_id();
> +}
> +
> bool kvm_is_zone_device_page(struct page *page);
>
> struct kvm_irq_ack_notifier {
> --
> 2.39.0.314.g84b9a713c41-goog
>
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 5/9] KVM: x86/mmu: Allocate TDP page table's page on correct NUMA node on split
2022-12-27 19:02 ` Ben Gardon
@ 2022-12-28 22:07 ` Vipin Sharma
0 siblings, 0 replies; 46+ messages in thread
From: Vipin Sharma @ 2022-12-28 22:07 UTC (permalink / raw)
To: Ben Gardon; +Cc: seanjc, pbonzini, dmatlack, kvm, linux-kernel
On Tue, Dec 27, 2022 at 11:02 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
> >
> > When dirty log is enabled, huge pages are split. Page table's pages
>
> Nit: Suggest "When huge pages are split for dirty log" since this can
> happen at various points during dirty logging.
> Same below.
>
Yeah, this should be updated.
> > during the split are allocated based on the current thread NUMA node or
> > mempolicy. This causes inefficient page table accesses if underlying
> > page is on a different NUMA node
> >
> > Allocate page table's pages on the same NUMA node as the underlying huge
> > page when dirty log is enabled and huge pages are split.
> >
> > The performance gain during the pre-copy phase of live migrations of a
> > 416 vCPUs and 11 TiB memory VM on a 8 node host was seen in the range
> > of 130% to 150%.
> >
> > Suggested-by: David Matlack <dmatlack@google.com>
> > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > ---
> > arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++----
> > include/linux/kvm_host.h | 18 ++++++++++++++++++
> > 2 files changed, 26 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 4974fa96deff..376b8dceb3f9 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1403,7 +1403,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
> > return spte_set;
> > }
> >
> > -static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> > +static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp)
> > {
> > struct kvm_mmu_page *sp;
> >
> > @@ -1413,7 +1413,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> > if (!sp)
> > return NULL;
> >
> > - sp->spt = (void *)__get_free_page(gfp);
> > + sp->spt = kvm_mmu_get_free_page(nid, gfp);
> > +
>
> Just so that kvm_mmu_get_free_page isn't dead code in the previous
> commit, I'd do this refactor there and just pass NUMA_NO_NODE here.
>
Agreed.
> > if (!sp->spt) {
> > kmem_cache_free(mmu_page_header_cache, sp);
> > return NULL;
> > @@ -1427,6 +1428,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> > bool shared)
> > {
> > struct kvm_mmu_page *sp;
> > + int nid;
> > +
> > + nid = kvm_pfn_to_page_table_nid(spte_to_pfn(iter->old_spte));
> >
> > /*
> > * Since we are allocating while under the MMU lock we have to be
> > @@ -1437,7 +1441,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> > * If this allocation fails we drop the lock and retry with reclaim
> > * allowed.
> > */
> > - sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
> > + sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_NOWAIT | __GFP_ACCOUNT);
> > if (sp)
> > return sp;
> >
> > @@ -1449,7 +1453,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> > write_unlock(&kvm->mmu_lock);
> >
> > iter->yielded = true;
> > - sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
> > + sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_KERNEL_ACCOUNT);
> >
> > if (shared)
> > read_lock(&kvm->mmu_lock);
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index d48064503b88..a262e15ebd19 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1583,6 +1583,24 @@ void kvm_arch_sync_events(struct kvm *kvm);
> > int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
> >
> > struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn);
> > +
> > +/*
> > + * Tells the appropriate NUMA node location of the page table's page based on
> > + * pfn it will point to.
> > + *
> > + * Return the nid of the page if pfn is valid and backed by a refcounted page,
> > + * otherwise, return the nearest memory node for the current CPU.
>
> Nit: Should this be "current thread"?
I will say "current thread CPU". As memory nodes are near to CPUs
whereas threads can execute on multiple CPUs throughout their lifetimes.
>
> > + */
> > +static inline int kvm_pfn_to_page_table_nid(kvm_pfn_t pfn)
>
> This could just be kvm_pfn_nid (or even better kvm_pfn_node_id) since
> this really has nothing to do with page tables. We just want to know
> which NUMA node backs the given PFN.
Apart from NUMA node backing the given PFN, it can also return the
nearest NUMA node via numa_mem_id(). So, it is actually telling which
NUMA node is the best one for the page table's page, given a PFN.
>
> > +{
> > + struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > +
> > + if (page)
> > + return page_to_nid(page);
> > + else
> > + return numa_mem_id();
> > +}
> > +
> > bool kvm_is_zone_device_page(struct page *page);
> >
> > struct kvm_irq_ack_notifier {
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Patch v3 5/9] KVM: x86/mmu: Allocate TDP page table's page on correct NUMA node on split
2022-12-22 2:34 ` [Patch v3 5/9] KVM: x86/mmu: Allocate TDP page table's page on correct NUMA node on split Vipin Sharma
2022-12-27 19:02 ` Ben Gardon
@ 2022-12-29 22:30 ` David Matlack
2023-01-03 18:26 ` Vipin Sharma
1 sibling, 1 reply; 46+ messages in thread
From: David Matlack @ 2022-12-29 22:30 UTC (permalink / raw)
To: Vipin Sharma; +Cc: seanjc, pbonzini, bgardon, kvm, linux-kernel
On Wed, Dec 21, 2022 at 06:34:53PM -0800, Vipin Sharma wrote:
> When dirty log is enabled, huge pages are split. Page table's pages
> during the split are allocated based on the current thread NUMA node or
> mempolicy. This causes inefficient page table accesses if underlying
> page is on a different NUMA node
>
> Allocate page table's pages on the same NUMA node as the underlying huge
> page when dirty log is enabled and huge pages are split.
>
> The performance gain during the pre-copy phase of live migrations of a
> 416 vCPUs and 11 TiB memory VM on a 8 node host was seen in the range
> of 130% to 150%.
Can you be more specific about this. "The performance" is vague. I know
it's an internal workload and fully explaining it would be difficult,
but you can give readers a slightly more specific idea of what improved.
e.g.
When testing with a synthetic write-heavy workload in a 416 vCPU VM on
an 8 NUMA node host, the throughput increased by 150% from X to Y
operations per second.
It's also necessary to characterize the improvement relative to the
performance when dirty logging is not enabled. Whithout that information
it would be hard for an unfamiliar reader to understand how useful this
change really is.
For example, let's say the throughput of your workload is 100,000
operations per second before dirty logging is enabled, and that drops
down to 1,000 operations per second after dirty logging is enabled. This
commit could increase that by 150% to 2,500 operations per second, but
that's actually not a very meaningful improvement since, either way,
guest performance is degraded by 95+% during dirty logging.
On the other hand, if performance goes from 100,000 to 30,000 normally,
and this commit increases that 30,000 to 75,000 (150%), that's a much
more meaningful improvement.
>
> Suggested-by: David Matlack <dmatlack@google.com>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++----
> include/linux/kvm_host.h | 18 ++++++++++++++++++
> 2 files changed, 26 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 4974fa96deff..376b8dceb3f9 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1403,7 +1403,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
> return spte_set;
> }
>
> -static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> +static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp)
> {
> struct kvm_mmu_page *sp;
>
> @@ -1413,7 +1413,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> if (!sp)
> return NULL;
>
> - sp->spt = (void *)__get_free_page(gfp);
> + sp->spt = kvm_mmu_get_free_page(nid, gfp);
> +
> if (!sp->spt) {
> kmem_cache_free(mmu_page_header_cache, sp);
> return NULL;
> @@ -1427,6 +1428,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> bool shared)
> {
> struct kvm_mmu_page *sp;
> + int nid;
> +
> + nid = kvm_pfn_to_page_table_nid(spte_to_pfn(iter->old_spte));
>
> /*
> * Since we are allocating while under the MMU lock we have to be
> @@ -1437,7 +1441,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> * If this allocation fails we drop the lock and retry with reclaim
> * allowed.
> */
> - sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
> + sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_NOWAIT | __GFP_ACCOUNT);
> if (sp)
> return sp;
>
> @@ -1449,7 +1453,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> write_unlock(&kvm->mmu_lock);
>
> iter->yielded = true;
> - sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
> + sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_KERNEL_ACCOUNT);
>
> if (shared)
> read_lock(&kvm->mmu_lock);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d48064503b88..a262e15ebd19 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1583,6 +1583,24 @@ void kvm_arch_sync_events(struct kvm *kvm);
> int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
>
> struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn);
> +
> +/*
> + * Tells the appropriate NUMA node location of the page table's page based on
> + * pfn it will point to.
I know what you are trying to say but the wording is a bit awkward. e.g.
"Tells" instead of "Returns", "location" is redundant, "page table's
page", etc. Suggest this:
/*
* Returns an appropriate NUMA node on which to allocate a page table that
* maps @pfn.
*/
> + *
> + * Return the nid of the page if pfn is valid and backed by a refcounted page,
> + * otherwise, return the nearest memory node for the current CPU.
I would just drop this as it's just restating the code, which is already
very readable.
> + */
> +static inline int kvm_pfn_to_page_table_nid(kvm_pfn_t pfn)
> +{
> + struct page *page = kvm_pfn_to_refcounted_page(pfn);
> +
> + if (page)
> + return page_to_nid(page);
> + else
> + return numa_mem_id();
> +}
> +
> bool kvm_is_zone_device_page(struct page *page);
>
> struct kvm_irq_ack_notifier {
> --
> 2.39.0.314.g84b9a713c41-goog
>
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 5/9] KVM: x86/mmu: Allocate TDP page table's page on correct NUMA node on split
2022-12-29 22:30 ` David Matlack
@ 2023-01-03 18:26 ` Vipin Sharma
0 siblings, 0 replies; 46+ messages in thread
From: Vipin Sharma @ 2023-01-03 18:26 UTC (permalink / raw)
To: David Matlack; +Cc: seanjc, pbonzini, bgardon, kvm, linux-kernel
On Thu, Dec 29, 2022 at 2:30 PM David Matlack <dmatlack@google.com> wrote:
>
> On Wed, Dec 21, 2022 at 06:34:53PM -0800, Vipin Sharma wrote:
> > When dirty log is enabled, huge pages are split. Page table's pages
> > during the split are allocated based on the current thread NUMA node or
> > mempolicy. This causes inefficient page table accesses if underlying
> > page is on a different NUMA node
> >
> > Allocate page table's pages on the same NUMA node as the underlying huge
> > page when dirty log is enabled and huge pages are split.
> >
> > The performance gain during the pre-copy phase of live migrations of a
> > 416 vCPUs and 11 TiB memory VM on a 8 node host was seen in the range
> > of 130% to 150%.
>
> Can you be more specific about this. "The performance" is vague. I know
> it's an internal workload and fully explaining it would be difficult,
> but you can give readers a slightly more specific idea of what improved.
> e.g.
>
> When testing with a synthetic write-heavy workload in a 416 vCPU VM on
> an 8 NUMA node host, the throughput increased by 150% from X to Y
> operations per second.
>
> It's also necessary to characterize the improvement relative to the
> performance when dirty logging is not enabled. Whithout that information
> it would be hard for an unfamiliar reader to understand how useful this
> change really is.
>
> For example, let's say the throughput of your workload is 100,000
> operations per second before dirty logging is enabled, and that drops
> down to 1,000 operations per second after dirty logging is enabled. This
> commit could increase that by 150% to 2,500 operations per second, but
> that's actually not a very meaningful improvement since, either way,
> guest performance is degraded by 95+% during dirty logging.
>
> On the other hand, if performance goes from 100,000 to 30,000 normally,
> and this commit increases that 30,000 to 75,000 (150%), that's a much
> more meaningful improvement.
>
Yeah, I will provide more insight in the next version.
> >
> > Suggested-by: David Matlack <dmatlack@google.com>
> > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > ---
> > arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++----
> > include/linux/kvm_host.h | 18 ++++++++++++++++++
> > 2 files changed, 26 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 4974fa96deff..376b8dceb3f9 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1403,7 +1403,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
> > return spte_set;
> > }
> >
> > -static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> > +static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp)
> > {
> > struct kvm_mmu_page *sp;
> >
> > @@ -1413,7 +1413,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> > if (!sp)
> > return NULL;
> >
> > - sp->spt = (void *)__get_free_page(gfp);
> > + sp->spt = kvm_mmu_get_free_page(nid, gfp);
> > +
> > if (!sp->spt) {
> > kmem_cache_free(mmu_page_header_cache, sp);
> > return NULL;
> > @@ -1427,6 +1428,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> > bool shared)
> > {
> > struct kvm_mmu_page *sp;
> > + int nid;
> > +
> > + nid = kvm_pfn_to_page_table_nid(spte_to_pfn(iter->old_spte));
> >
> > /*
> > * Since we are allocating while under the MMU lock we have to be
> > @@ -1437,7 +1441,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> > * If this allocation fails we drop the lock and retry with reclaim
> > * allowed.
> > */
> > - sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
> > + sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_NOWAIT | __GFP_ACCOUNT);
> > if (sp)
> > return sp;
> >
> > @@ -1449,7 +1453,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> > write_unlock(&kvm->mmu_lock);
> >
> > iter->yielded = true;
> > - sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
> > + sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_KERNEL_ACCOUNT);
> >
> > if (shared)
> > read_lock(&kvm->mmu_lock);
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index d48064503b88..a262e15ebd19 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1583,6 +1583,24 @@ void kvm_arch_sync_events(struct kvm *kvm);
> > int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
> >
> > struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn);
> > +
> > +/*
> > + * Tells the appropriate NUMA node location of the page table's page based on
> > + * pfn it will point to.
>
> I know what you are trying to say but the wording is a bit awkward. e.g.
> "Tells" instead of "Returns", "location" is redundant, "page table's
> page", etc. Suggest this:
>
> /*
> * Returns an appropriate NUMA node on which to allocate a page table that
> * maps @pfn.
> */
>
> > + *
> > + * Return the nid of the page if pfn is valid and backed by a refcounted page,
> > + * otherwise, return the nearest memory node for the current CPU.
>
> I would just drop this as it's just restating the code, which is already
> very readable.
>
Okay.
> > + */
> > +static inline int kvm_pfn_to_page_table_nid(kvm_pfn_t pfn)
> > +{
> > + struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > +
> > + if (page)
> > + return page_to_nid(page);
> > + else
> > + return numa_mem_id();
> > +}
> > +
> > bool kvm_is_zone_device_page(struct page *page);
> >
> > struct kvm_irq_ack_notifier {
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >
^ permalink raw reply [flat|nested] 46+ messages in thread
* [Patch v3 6/9] KVM: Provide NUMA node support to kvm_mmu_memory_cache{}
2022-12-22 2:34 [Patch v3 0/9] NUMA aware page table's pages allocation Vipin Sharma
` (4 preceding siblings ...)
2022-12-22 2:34 ` [Patch v3 5/9] KVM: x86/mmu: Allocate TDP page table's page on correct NUMA node on split Vipin Sharma
@ 2022-12-22 2:34 ` Vipin Sharma
2022-12-27 19:09 ` Ben Gardon
2022-12-29 23:08 ` David Matlack
2022-12-22 2:34 ` [Patch v3 7/9] KVM: x86/mmu: Allocate page table's pages on NUMA node of the underlying pages Vipin Sharma
` (2 subsequent siblings)
8 siblings, 2 replies; 46+ messages in thread
From: Vipin Sharma @ 2022-12-22 2:34 UTC (permalink / raw)
To: seanjc, pbonzini, bgardon, dmatlack; +Cc: kvm, linux-kernel, Vipin Sharma
Add 'node' variable in kvm_mmu_memory_cache{} to denote which NUMA node
this cache should allocate memory from. Default initialize to
NUMA_NO_NODE in all architectures.
Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
arch/arm64/kvm/arm.c | 2 +-
arch/arm64/kvm/mmu.c | 4 +++-
arch/mips/kvm/mips.c | 2 ++
arch/riscv/kvm/mmu.c | 2 +-
arch/riscv/kvm/vcpu.c | 2 +-
arch/x86/kvm/mmu/mmu.c | 22 ++++++++++++----------
include/linux/kvm_host.h | 6 ++++++
include/linux/kvm_types.h | 2 ++
8 files changed, 28 insertions(+), 14 deletions(-)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 9c5573bc4614..52a41f4532e2 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -340,7 +340,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
vcpu->arch.target = -1;
bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
- vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache, NULL, NUMA_NO_NODE);
/*
* Default value for the FP state, will be overloaded at load
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 31d7fa4c7c14..bd07155e17fa 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -894,12 +894,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
{
phys_addr_t addr;
int ret = 0;
- struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
+ struct kvm_mmu_memory_cache cache;
struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
KVM_PGTABLE_PROT_R |
(writable ? KVM_PGTABLE_PROT_W : 0);
+ INIT_KVM_MMU_MEMORY_CACHE(&cache, NULL, NUMA_NO_NODE);
+
if (is_protected_kvm_enabled())
return -EPERM;
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index a25e0b73ee70..b017c29a9340 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -304,6 +304,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
HRTIMER_MODE_REL);
vcpu->arch.comparecount_timer.function = kvm_mips_comparecount_wakeup;
+ vcpu->arch.mmu_page_cache.node = NUMA_NO_NODE;
+
/*
* Allocate space for host mode exception handlers that handle
* guest mode exits
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 34b57e0be2ef..119de4520cc6 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -353,9 +353,9 @@ int kvm_riscv_gstage_ioremap(struct kvm *kvm, gpa_t gpa,
phys_addr_t addr, end;
struct kvm_mmu_memory_cache pcache = {
.gfp_custom = (in_atomic) ? GFP_ATOMIC | __GFP_ACCOUNT : 0,
- .gfp_zero = __GFP_ZERO,
};
+ INIT_KVM_MMU_MEMORY_CACHE(&pcache, NULL, NUMA_NO_NODE);
end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
pfn = __phys_to_pfn(hpa);
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 7c08567097f0..189b14feb365 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -161,7 +161,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
/* Mark this VCPU never ran */
vcpu->arch.ran_atleast_once = false;
- vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache, NULL, NUMA_NO_NODE);
bitmap_zero(vcpu->arch.isa, RISCV_ISA_EXT_MAX);
/* Setup ISA features available to VCPU */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6f6a10d7a871..23a3b82b2384 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5954,13 +5954,14 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
{
int ret;
- vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
- vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache,
+ pte_list_desc_cache, NUMA_NO_NODE);
- vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
- vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache,
+ mmu_page_header_cache, NUMA_NO_NODE);
- vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache,
+ NULL, NUMA_NO_NODE);
spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
vcpu->arch.mmu = &vcpu->arch.root_mmu;
@@ -6124,14 +6125,15 @@ int kvm_mmu_init_vm(struct kvm *kvm)
node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
kvm_page_track_register_notifier(kvm, node);
- kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
- kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
+ INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache,
+ mmu_page_header_cache, NUMA_NO_NODE);
- kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
+ INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache,
+ NULL, NUMA_NO_NODE);
spin_lock_init(&kvm->arch.split_shadow_page_cache_lock);
- kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
- kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
+ INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache,
+ pte_list_desc_cache, NUMA_NO_NODE);
return 0;
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a262e15ebd19..719687a37ef7 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2302,4 +2302,10 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
/* Max number of entries allowed for each kvm dirty ring */
#define KVM_DIRTY_RING_MAX_ENTRIES 65536
+#define INIT_KVM_MMU_MEMORY_CACHE(_cache, _kmem_cache, _node) ({ \
+ (_cache)->kmem_cache = _kmem_cache; \
+ (_cache)->gfp_zero = __GFP_ZERO; \
+ (_cache)->node = _node; \
+})
+
#endif
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 76de36e56cdf..9c70ce95e51f 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -97,6 +97,8 @@ struct kvm_mmu_memory_cache {
struct kmem_cache *kmem_cache;
int capacity;
void **objects;
+ /* Node on which memory should be allocated by default */
+ int node;
};
#endif
--
2.39.0.314.g84b9a713c41-goog
^ permalink raw reply related [flat|nested] 46+ messages in thread* Re: [Patch v3 6/9] KVM: Provide NUMA node support to kvm_mmu_memory_cache{}
2022-12-22 2:34 ` [Patch v3 6/9] KVM: Provide NUMA node support to kvm_mmu_memory_cache{} Vipin Sharma
@ 2022-12-27 19:09 ` Ben Gardon
2022-12-28 22:07 ` Vipin Sharma
2022-12-29 23:08 ` David Matlack
1 sibling, 1 reply; 46+ messages in thread
From: Ben Gardon @ 2022-12-27 19:09 UTC (permalink / raw)
To: Vipin Sharma; +Cc: seanjc, pbonzini, dmatlack, kvm, linux-kernel
On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
>
> Add 'node' variable in kvm_mmu_memory_cache{} to denote which NUMA node
> this cache should allocate memory from. Default initialize to
> NUMA_NO_NODE in all architectures.
>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> ---
> arch/arm64/kvm/arm.c | 2 +-
> arch/arm64/kvm/mmu.c | 4 +++-
> arch/mips/kvm/mips.c | 2 ++
> arch/riscv/kvm/mmu.c | 2 +-
> arch/riscv/kvm/vcpu.c | 2 +-
> arch/x86/kvm/mmu/mmu.c | 22 ++++++++++++----------
> include/linux/kvm_host.h | 6 ++++++
> include/linux/kvm_types.h | 2 ++
> 8 files changed, 28 insertions(+), 14 deletions(-)
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 9c5573bc4614..52a41f4532e2 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -340,7 +340,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.target = -1;
> bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
>
> - vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache, NULL, NUMA_NO_NODE);
>
> /*
> * Default value for the FP state, will be overloaded at load
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 31d7fa4c7c14..bd07155e17fa 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -894,12 +894,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> {
> phys_addr_t addr;
> int ret = 0;
> - struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
> + struct kvm_mmu_memory_cache cache;
> struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> KVM_PGTABLE_PROT_R |
> (writable ? KVM_PGTABLE_PROT_W : 0);
>
> + INIT_KVM_MMU_MEMORY_CACHE(&cache, NULL, NUMA_NO_NODE);
> +
> if (is_protected_kvm_enabled())
> return -EPERM;
>
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index a25e0b73ee70..b017c29a9340 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -304,6 +304,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> HRTIMER_MODE_REL);
> vcpu->arch.comparecount_timer.function = kvm_mips_comparecount_wakeup;
>
> + vcpu->arch.mmu_page_cache.node = NUMA_NO_NODE;
> +
It looks weird to have MIPS not using the initialization MACRO. Should
it just have a GFP_ZERO parameter?
> /*
> * Allocate space for host mode exception handlers that handle
> * guest mode exits
> diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> index 34b57e0be2ef..119de4520cc6 100644
> --- a/arch/riscv/kvm/mmu.c
> +++ b/arch/riscv/kvm/mmu.c
> @@ -353,9 +353,9 @@ int kvm_riscv_gstage_ioremap(struct kvm *kvm, gpa_t gpa,
> phys_addr_t addr, end;
> struct kvm_mmu_memory_cache pcache = {
> .gfp_custom = (in_atomic) ? GFP_ATOMIC | __GFP_ACCOUNT : 0,
> - .gfp_zero = __GFP_ZERO,
> };
>
> + INIT_KVM_MMU_MEMORY_CACHE(&pcache, NULL, NUMA_NO_NODE);
> end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
> pfn = __phys_to_pfn(hpa);
>
> diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> index 7c08567097f0..189b14feb365 100644
> --- a/arch/riscv/kvm/vcpu.c
> +++ b/arch/riscv/kvm/vcpu.c
> @@ -161,7 +161,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>
> /* Mark this VCPU never ran */
> vcpu->arch.ran_atleast_once = false;
> - vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache, NULL, NUMA_NO_NODE);
> bitmap_zero(vcpu->arch.isa, RISCV_ISA_EXT_MAX);
>
> /* Setup ISA features available to VCPU */
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f6a10d7a871..23a3b82b2384 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5954,13 +5954,14 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> {
> int ret;
>
> - vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> - vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache,
> + pte_list_desc_cache, NUMA_NO_NODE);
>
> - vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> - vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache,
> + mmu_page_header_cache, NUMA_NO_NODE);
>
> - vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache,
> + NULL, NUMA_NO_NODE);
> spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
>
> vcpu->arch.mmu = &vcpu->arch.root_mmu;
> @@ -6124,14 +6125,15 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> kvm_page_track_register_notifier(kvm, node);
>
> - kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
> - kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
> + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache,
> + mmu_page_header_cache, NUMA_NO_NODE);
>
> - kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
> + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache,
> + NULL, NUMA_NO_NODE);
> spin_lock_init(&kvm->arch.split_shadow_page_cache_lock);
>
> - kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
> - kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
> + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache,
> + pte_list_desc_cache, NUMA_NO_NODE);
>
> return 0;
> }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index a262e15ebd19..719687a37ef7 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2302,4 +2302,10 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> /* Max number of entries allowed for each kvm dirty ring */
> #define KVM_DIRTY_RING_MAX_ENTRIES 65536
>
> +#define INIT_KVM_MMU_MEMORY_CACHE(_cache, _kmem_cache, _node) ({ \
> + (_cache)->kmem_cache = _kmem_cache; \
> + (_cache)->gfp_zero = __GFP_ZERO; \
> + (_cache)->node = _node; \
> +})
> +
Given that this initialization is probably not happening in a super
hot path, is there any downside to just using a function for the
initialization?
> #endif
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 76de36e56cdf..9c70ce95e51f 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -97,6 +97,8 @@ struct kvm_mmu_memory_cache {
> struct kmem_cache *kmem_cache;
> int capacity;
> void **objects;
> + /* Node on which memory should be allocated by default */
> + int node;
> };
> #endif
>
> --
> 2.39.0.314.g84b9a713c41-goog
>
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 6/9] KVM: Provide NUMA node support to kvm_mmu_memory_cache{}
2022-12-27 19:09 ` Ben Gardon
@ 2022-12-28 22:07 ` Vipin Sharma
2022-12-29 18:22 ` Ben Gardon
0 siblings, 1 reply; 46+ messages in thread
From: Vipin Sharma @ 2022-12-28 22:07 UTC (permalink / raw)
To: Ben Gardon; +Cc: seanjc, pbonzini, dmatlack, kvm, linux-kernel
On Tue, Dec 27, 2022 at 11:10 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
> >
> > Add 'node' variable in kvm_mmu_memory_cache{} to denote which NUMA node
> > this cache should allocate memory from. Default initialize to
> > NUMA_NO_NODE in all architectures.
> >
> > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > ---
> > arch/arm64/kvm/arm.c | 2 +-
> > arch/arm64/kvm/mmu.c | 4 +++-
> > arch/mips/kvm/mips.c | 2 ++
> > arch/riscv/kvm/mmu.c | 2 +-
> > arch/riscv/kvm/vcpu.c | 2 +-
> > arch/x86/kvm/mmu/mmu.c | 22 ++++++++++++----------
> > include/linux/kvm_host.h | 6 ++++++
> > include/linux/kvm_types.h | 2 ++
> > 8 files changed, 28 insertions(+), 14 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 9c5573bc4614..52a41f4532e2 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -340,7 +340,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > vcpu->arch.target = -1;
> > bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
> >
> > - vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache, NULL, NUMA_NO_NODE);
> >
> > /*
> > * Default value for the FP state, will be overloaded at load
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 31d7fa4c7c14..bd07155e17fa 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -894,12 +894,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > {
> > phys_addr_t addr;
> > int ret = 0;
> > - struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
> > + struct kvm_mmu_memory_cache cache;
> > struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> > enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> > KVM_PGTABLE_PROT_R |
> > (writable ? KVM_PGTABLE_PROT_W : 0);
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&cache, NULL, NUMA_NO_NODE);
> > +
> > if (is_protected_kvm_enabled())
> > return -EPERM;
> >
> > diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> > index a25e0b73ee70..b017c29a9340 100644
> > --- a/arch/mips/kvm/mips.c
> > +++ b/arch/mips/kvm/mips.c
> > @@ -304,6 +304,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > HRTIMER_MODE_REL);
> > vcpu->arch.comparecount_timer.function = kvm_mips_comparecount_wakeup;
> >
> > + vcpu->arch.mmu_page_cache.node = NUMA_NO_NODE;
> > +
>
> It looks weird to have MIPS not using the initialization MACRO. Should
> it just have a GFP_ZERO parameter?
MIPS is not setting GFP_ZERO explicitly before my series, so I didn't
make it GFP_ZERO. I am not sure if MIPS needs it or not; I tried to
keep the same functionality in my patch.
May be someone from MIPS can tell more about it.
>
> > /*
> > * Allocate space for host mode exception handlers that handle
> > * guest mode exits
> > diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> > index 34b57e0be2ef..119de4520cc6 100644
> > --- a/arch/riscv/kvm/mmu.c
> > +++ b/arch/riscv/kvm/mmu.c
> > @@ -353,9 +353,9 @@ int kvm_riscv_gstage_ioremap(struct kvm *kvm, gpa_t gpa,
> > phys_addr_t addr, end;
> > struct kvm_mmu_memory_cache pcache = {
> > .gfp_custom = (in_atomic) ? GFP_ATOMIC | __GFP_ACCOUNT : 0,
> > - .gfp_zero = __GFP_ZERO,
> > };
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&pcache, NULL, NUMA_NO_NODE);
> > end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
> > pfn = __phys_to_pfn(hpa);
> >
> > diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> > index 7c08567097f0..189b14feb365 100644
> > --- a/arch/riscv/kvm/vcpu.c
> > +++ b/arch/riscv/kvm/vcpu.c
> > @@ -161,7 +161,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >
> > /* Mark this VCPU never ran */
> > vcpu->arch.ran_atleast_once = false;
> > - vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache, NULL, NUMA_NO_NODE);
> > bitmap_zero(vcpu->arch.isa, RISCV_ISA_EXT_MAX);
> >
> > /* Setup ISA features available to VCPU */
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6f6a10d7a871..23a3b82b2384 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5954,13 +5954,14 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > {
> > int ret;
> >
> > - vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> > - vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache,
> > + pte_list_desc_cache, NUMA_NO_NODE);
> >
> > - vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> > - vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache,
> > + mmu_page_header_cache, NUMA_NO_NODE);
> >
> > - vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache,
> > + NULL, NUMA_NO_NODE);
> > spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
> >
> > vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > @@ -6124,14 +6125,15 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> > node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> > kvm_page_track_register_notifier(kvm, node);
> >
> > - kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
> > - kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
> > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache,
> > + mmu_page_header_cache, NUMA_NO_NODE);
> >
> > - kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache,
> > + NULL, NUMA_NO_NODE);
> > spin_lock_init(&kvm->arch.split_shadow_page_cache_lock);
> >
> > - kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
> > - kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
> > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache,
> > + pte_list_desc_cache, NUMA_NO_NODE);
> >
> > return 0;
> > }
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index a262e15ebd19..719687a37ef7 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2302,4 +2302,10 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> > /* Max number of entries allowed for each kvm dirty ring */
> > #define KVM_DIRTY_RING_MAX_ENTRIES 65536
> >
> > +#define INIT_KVM_MMU_MEMORY_CACHE(_cache, _kmem_cache, _node) ({ \
> > + (_cache)->kmem_cache = _kmem_cache; \
> > + (_cache)->gfp_zero = __GFP_ZERO; \
> > + (_cache)->node = _node; \
> > +})
> > +
>
> Given that this initialization is probably not happening in a super
> hot path, is there any downside to just using a function for the
> initialization?
>
It can totally be a function as well. I will make it a function in the
next version.
> > #endif
> > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > index 76de36e56cdf..9c70ce95e51f 100644
> > --- a/include/linux/kvm_types.h
> > +++ b/include/linux/kvm_types.h
> > @@ -97,6 +97,8 @@ struct kvm_mmu_memory_cache {
> > struct kmem_cache *kmem_cache;
> > int capacity;
> > void **objects;
> > + /* Node on which memory should be allocated by default */
> > + int node;
> > };
> > #endif
> >
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 6/9] KVM: Provide NUMA node support to kvm_mmu_memory_cache{}
2022-12-28 22:07 ` Vipin Sharma
@ 2022-12-29 18:22 ` Ben Gardon
2023-01-03 17:36 ` Vipin Sharma
0 siblings, 1 reply; 46+ messages in thread
From: Ben Gardon @ 2022-12-29 18:22 UTC (permalink / raw)
To: Vipin Sharma; +Cc: seanjc, pbonzini, dmatlack, kvm, linux-kernel
On Wed, Dec 28, 2022 at 2:08 PM Vipin Sharma <vipinsh@google.com> wrote:
>
> On Tue, Dec 27, 2022 at 11:10 AM Ben Gardon <bgardon@google.com> wrote:
> >
> > On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
> > >
> > > Add 'node' variable in kvm_mmu_memory_cache{} to denote which NUMA node
> > > this cache should allocate memory from. Default initialize to
> > > NUMA_NO_NODE in all architectures.
> > >
> > > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > > ---
> > > arch/arm64/kvm/arm.c | 2 +-
> > > arch/arm64/kvm/mmu.c | 4 +++-
> > > arch/mips/kvm/mips.c | 2 ++
> > > arch/riscv/kvm/mmu.c | 2 +-
> > > arch/riscv/kvm/vcpu.c | 2 +-
> > > arch/x86/kvm/mmu/mmu.c | 22 ++++++++++++----------
> > > include/linux/kvm_host.h | 6 ++++++
> > > include/linux/kvm_types.h | 2 ++
> > > 8 files changed, 28 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > index 9c5573bc4614..52a41f4532e2 100644
> > > --- a/arch/arm64/kvm/arm.c
> > > +++ b/arch/arm64/kvm/arm.c
> > > @@ -340,7 +340,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > > vcpu->arch.target = -1;
> > > bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
> > >
> > > - vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> > > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache, NULL, NUMA_NO_NODE);
> > >
> > > /*
> > > * Default value for the FP state, will be overloaded at load
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index 31d7fa4c7c14..bd07155e17fa 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -894,12 +894,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > > {
> > > phys_addr_t addr;
> > > int ret = 0;
> > > - struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
> > > + struct kvm_mmu_memory_cache cache;
> > > struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> > > enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> > > KVM_PGTABLE_PROT_R |
> > > (writable ? KVM_PGTABLE_PROT_W : 0);
> > >
> > > + INIT_KVM_MMU_MEMORY_CACHE(&cache, NULL, NUMA_NO_NODE);
> > > +
> > > if (is_protected_kvm_enabled())
> > > return -EPERM;
> > >
> > > diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> > > index a25e0b73ee70..b017c29a9340 100644
> > > --- a/arch/mips/kvm/mips.c
> > > +++ b/arch/mips/kvm/mips.c
> > > @@ -304,6 +304,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > > HRTIMER_MODE_REL);
> > > vcpu->arch.comparecount_timer.function = kvm_mips_comparecount_wakeup;
> > >
> > > + vcpu->arch.mmu_page_cache.node = NUMA_NO_NODE;
> > > +
> >
> > It looks weird to have MIPS not using the initialization MACRO. Should
> > it just have a GFP_ZERO parameter?
>
> MIPS is not setting GFP_ZERO explicitly before my series, so, I didn't
> make it GFP_ZERO. I am not sure if MIPS needs it or not, I tried to
> keep the same functionality in my patch.
>
> May be someone from MIPS can tell more about it.
That makes sense, I just don't want to see MIPS get left behind
because we move the cache init logic to a macro or function. Folks
might update the init function but forget to update MIPS too.
>
> >
> > > /*
> > > * Allocate space for host mode exception handlers that handle
> > > * guest mode exits
> > > diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> > > index 34b57e0be2ef..119de4520cc6 100644
> > > --- a/arch/riscv/kvm/mmu.c
> > > +++ b/arch/riscv/kvm/mmu.c
> > > @@ -353,9 +353,9 @@ int kvm_riscv_gstage_ioremap(struct kvm *kvm, gpa_t gpa,
> > > phys_addr_t addr, end;
> > > struct kvm_mmu_memory_cache pcache = {
> > > .gfp_custom = (in_atomic) ? GFP_ATOMIC | __GFP_ACCOUNT : 0,
> > > - .gfp_zero = __GFP_ZERO,
> > > };
> > >
> > > + INIT_KVM_MMU_MEMORY_CACHE(&pcache, NULL, NUMA_NO_NODE);
> > > end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
> > > pfn = __phys_to_pfn(hpa);
> > >
> > > diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> > > index 7c08567097f0..189b14feb365 100644
> > > --- a/arch/riscv/kvm/vcpu.c
> > > +++ b/arch/riscv/kvm/vcpu.c
> > > @@ -161,7 +161,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > >
> > > /* Mark this VCPU never ran */
> > > vcpu->arch.ran_atleast_once = false;
> > > - vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> > > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache, NULL, NUMA_NO_NODE);
> > > bitmap_zero(vcpu->arch.isa, RISCV_ISA_EXT_MAX);
> > >
> > > /* Setup ISA features available to VCPU */
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 6f6a10d7a871..23a3b82b2384 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -5954,13 +5954,14 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > > {
> > > int ret;
> > >
> > > - vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> > > - vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
> > > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache,
> > > + pte_list_desc_cache, NUMA_NO_NODE);
> > >
> > > - vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> > > - vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> > > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache,
> > > + mmu_page_header_cache, NUMA_NO_NODE);
> > >
> > > - vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache,
> > > + NULL, NUMA_NO_NODE);
> > > spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
> > >
> > > vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > > @@ -6124,14 +6125,15 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> > > node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> > > kvm_page_track_register_notifier(kvm, node);
> > >
> > > - kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
> > > - kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
> > > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache,
> > > + mmu_page_header_cache, NUMA_NO_NODE);
> > >
> > > - kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache,
> > > + NULL, NUMA_NO_NODE);
> > > spin_lock_init(&kvm->arch.split_shadow_page_cache_lock);
> > >
> > > - kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
> > > - kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
> > > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache,
> > > + pte_list_desc_cache, NUMA_NO_NODE);
> > >
> > > return 0;
> > > }
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index a262e15ebd19..719687a37ef7 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -2302,4 +2302,10 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> > > /* Max number of entries allowed for each kvm dirty ring */
> > > #define KVM_DIRTY_RING_MAX_ENTRIES 65536
> > >
> > > +#define INIT_KVM_MMU_MEMORY_CACHE(_cache, _kmem_cache, _node) ({ \
> > > + (_cache)->kmem_cache = _kmem_cache; \
> > > + (_cache)->gfp_zero = __GFP_ZERO; \
> > > + (_cache)->node = _node; \
> > > +})
> > > +
> >
> > Given that this initialization is probably not happening in a super
> > hot path, is there any downside to just using a function for the
> > initialization?
> >
>
> It can totally be a function as well. I will make it function in the
> next version.
Awesome, thanks.
>
>
> > > #endif
> > > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > > index 76de36e56cdf..9c70ce95e51f 100644
> > > --- a/include/linux/kvm_types.h
> > > +++ b/include/linux/kvm_types.h
> > > @@ -97,6 +97,8 @@ struct kvm_mmu_memory_cache {
> > > struct kmem_cache *kmem_cache;
> > > int capacity;
> > > void **objects;
> > > + /* Node on which memory should be allocated by default */
> > > + int node;
> > > };
> > > #endif
> > >
> > > --
> > > 2.39.0.314.g84b9a713c41-goog
> > >
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 6/9] KVM: Provide NUMA node support to kvm_mmu_memory_cache{}
2022-12-29 18:22 ` Ben Gardon
@ 2023-01-03 17:36 ` Vipin Sharma
0 siblings, 0 replies; 46+ messages in thread
From: Vipin Sharma @ 2023-01-03 17:36 UTC (permalink / raw)
To: chenhuacai, aleksandar.qemu.devel
Cc: seanjc, pbonzini, dmatlack, kvm, linux-kernel, Ben Gardon
On Thu, Dec 29, 2022 at 10:22 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Wed, Dec 28, 2022 at 2:08 PM Vipin Sharma <vipinsh@google.com> wrote:
> >
> > On Tue, Dec 27, 2022 at 11:10 AM Ben Gardon <bgardon@google.com> wrote:
> > >
> > > On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
> > > >
> > > > Add 'node' variable in kvm_mmu_memory_cache{} to denote which NUMA node
> > > > this cache should allocate memory from. Default initialize to
> > > > NUMA_NO_NODE in all architectures.
> > > >
> > > > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > > > ---
> > > > arch/arm64/kvm/arm.c | 2 +-
> > > > arch/arm64/kvm/mmu.c | 4 +++-
> > > > arch/mips/kvm/mips.c | 2 ++
> > > > arch/riscv/kvm/mmu.c | 2 +-
> > > > arch/riscv/kvm/vcpu.c | 2 +-
> > > > arch/x86/kvm/mmu/mmu.c | 22 ++++++++++++----------
> > > > include/linux/kvm_host.h | 6 ++++++
> > > > include/linux/kvm_types.h | 2 ++
> > > > 8 files changed, 28 insertions(+), 14 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > > index 9c5573bc4614..52a41f4532e2 100644
> > > > --- a/arch/arm64/kvm/arm.c
> > > > +++ b/arch/arm64/kvm/arm.c
> > > > @@ -340,7 +340,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > > > vcpu->arch.target = -1;
> > > > bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
> > > >
> > > > - vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> > > > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache, NULL, NUMA_NO_NODE);
> > > >
> > > > /*
> > > > * Default value for the FP state, will be overloaded at load
> > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > index 31d7fa4c7c14..bd07155e17fa 100644
> > > > --- a/arch/arm64/kvm/mmu.c
> > > > +++ b/arch/arm64/kvm/mmu.c
> > > > @@ -894,12 +894,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > > > {
> > > > phys_addr_t addr;
> > > > int ret = 0;
> > > > - struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
> > > > + struct kvm_mmu_memory_cache cache;
> > > > struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> > > > enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> > > > KVM_PGTABLE_PROT_R |
> > > > (writable ? KVM_PGTABLE_PROT_W : 0);
> > > >
> > > > + INIT_KVM_MMU_MEMORY_CACHE(&cache, NULL, NUMA_NO_NODE);
> > > > +
> > > > if (is_protected_kvm_enabled())
> > > > return -EPERM;
> > > >
> > > > diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> > > > index a25e0b73ee70..b017c29a9340 100644
> > > > --- a/arch/mips/kvm/mips.c
> > > > +++ b/arch/mips/kvm/mips.c
> > > > @@ -304,6 +304,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > > > HRTIMER_MODE_REL);
> > > > vcpu->arch.comparecount_timer.function = kvm_mips_comparecount_wakeup;
> > > >
> > > > + vcpu->arch.mmu_page_cache.node = NUMA_NO_NODE;
> > > > +
> > >
> > > It looks weird to have MIPS not using the initialization MACRO. Should
> > > it just have a GFP_ZERO parameter?
> >
> > MIPS is not setting GFP_ZERO explicitly before my series, so, I didn't
> > make it GFP_ZERO. I am not sure if MIPS needs it or not, I tried to
> > keep the same functionality in my patch.
> >
> > May be someone from MIPS can tell more about it.
>
> That makes sense, I just don't want to see MIPS get left behind
> because we move the cache init logic to a macro or function. Folks
> might update the init function but forget to update MIPS too.
>
Hi Huacai, Aleksandar,
I have noticed that MIPS doesn't use the __GFP_ZERO flag for
mmu_page_cache in KVM. Is that intentional? I was wondering whether it
would be useful if I added the zero flag for this cache in this patch
for MIPS. All other architectures seem to use the __GFP_ZERO flag for
their caches.
Thanks
Vipin
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Patch v3 6/9] KVM: Provide NUMA node support to kvm_mmu_memory_cache{}
2022-12-22 2:34 ` [Patch v3 6/9] KVM: Provide NUMA node support to kvm_mmu_memory_cache{} Vipin Sharma
2022-12-27 19:09 ` Ben Gardon
@ 2022-12-29 23:08 ` David Matlack
2022-12-29 23:11 ` David Matlack
1 sibling, 1 reply; 46+ messages in thread
From: David Matlack @ 2022-12-29 23:08 UTC (permalink / raw)
To: Vipin Sharma; +Cc: seanjc, pbonzini, bgardon, kvm, linux-kernel
On Wed, Dec 21, 2022 at 06:34:54PM -0800, Vipin Sharma wrote:
> Add 'node' variable in kvm_mmu_memory_cache{} to denote which NUMA node
> this cache should allocate memory from. Default initialize to
> NUMA_NO_NODE in all architectures.
>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> ---
> arch/arm64/kvm/arm.c | 2 +-
> arch/arm64/kvm/mmu.c | 4 +++-
> arch/mips/kvm/mips.c | 2 ++
> arch/riscv/kvm/mmu.c | 2 +-
> arch/riscv/kvm/vcpu.c | 2 +-
> arch/x86/kvm/mmu/mmu.c | 22 ++++++++++++----------
> include/linux/kvm_host.h | 6 ++++++
> include/linux/kvm_types.h | 2 ++
> 8 files changed, 28 insertions(+), 14 deletions(-)
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 9c5573bc4614..52a41f4532e2 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -340,7 +340,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.target = -1;
> bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
>
> - vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache, NULL, NUMA_NO_NODE);
>
> /*
> * Default value for the FP state, will be overloaded at load
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 31d7fa4c7c14..bd07155e17fa 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -894,12 +894,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> {
> phys_addr_t addr;
> int ret = 0;
> - struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
> + struct kvm_mmu_memory_cache cache;
> struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> KVM_PGTABLE_PROT_R |
> (writable ? KVM_PGTABLE_PROT_W : 0);
>
> + INIT_KVM_MMU_MEMORY_CACHE(&cache, NULL, NUMA_NO_NODE);
This is not any better than setting cache.node = NUMA_NO_NODE directly.
Yes it's fewer lines of code, but it's harder to read (what does NULL
mean here?), and every user of kvm_mmu_memory_cache still has to know to
pass NUMA_NO_NODE.
When I originally gave this suggestion, I intended to suggest that
INIT_KVM_MMU_MEMORY_CACHE() provide just default initialization.
Non-default initialization for gfp_zero, gfp_custom, kmem_cache, and
node would remain as they are.
Yes this adds some more lines, but keeps things readable, and doesn't
require every initialization site of kvm_mmu_memory_cache to know what
to pass for gfp_zero, node, and kmem_cache. It only needs to set the
fields *it* cares about.
Here's what I mean specifically, based on INIT_LIST_HEAD. I don't think
I got all the kvm_mmu_memory_cache users, but you get the point.
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 9c5573bc4614..0e138dcaf4d4 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -340,6 +340,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
vcpu->arch.target = -1;
bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);
vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
/*
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 31d7fa4c7c14..f5fd78a4f084 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -894,12 +894,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
{
phys_addr_t addr;
int ret = 0;
- struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
+ KVM_MMU_MEMORY_CACHE(cache);
struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
KVM_PGTABLE_PROT_R |
(writable ? KVM_PGTABLE_PROT_W : 0);
+ cache.gfp_zero = __GFP_ZERO;
+
if (is_protected_kvm_enabled())
return -EPERM;
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 34b57e0be2ef..7915a5a2d104 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -351,10 +351,11 @@ int kvm_riscv_gstage_ioremap(struct kvm *kvm, gpa_t gpa,
int ret = 0;
unsigned long pfn;
phys_addr_t addr, end;
- struct kvm_mmu_memory_cache pcache = {
- .gfp_custom = (in_atomic) ? GFP_ATOMIC | __GFP_ACCOUNT : 0,
- .gfp_zero = __GFP_ZERO,
- };
+ KVM_MMU_MEMORY_CACHE(pcache);
+
+ pcache.gfp_zero = __GFP_ZERO;
+ if (in_atomic)
+ pcache.gfp_custom = GFP_ATOMIC | __GFP_ACCOUNT;
end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
pfn = __phys_to_pfn(hpa);
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 7c08567097f0..3d73ab3ec9a4 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -161,6 +161,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
/* Mark this VCPU never ran */
vcpu->arch.ran_atleast_once = false;
+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache);
vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
bitmap_zero(vcpu->arch.isa, RISCV_ISA_EXT_MAX);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 254bc46234e0..d4cd8e64cc03 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5909,14 +5909,19 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
{
int ret;
+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache);
vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache);
vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache);
vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadowed_info_cache);
+
vcpu->arch.mmu = &vcpu->arch.root_mmu;
vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
@@ -6083,11 +6088,14 @@ int kvm_mmu_init_vm(struct kvm *kvm)
node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
kvm_page_track_register_notifier(kvm, node);
+ INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache);
kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
+ INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache);
kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
+ INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache);
kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 76de36e56cdf..eb7ff9afa5c7 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -98,6 +98,17 @@ struct kvm_mmu_memory_cache {
int capacity;
void **objects;
};
+
+#define KVM_MMU_MEMORY_CACHE_INIT() (struct kvm_mmu_memory_cache) { \
+}
+
+#define KVM_MMU_MEMORY_CACHE(_name) \
+ struct kvm_mmu_memory_cache _name = KVM_MMU_MEMORY_CACHE_INIT()
+
+static inline void INIT_KVM_MMU_MEMORY_CACHE(struct kvm_mmu_memory_cache *cache)
+{
+ *cache = KVM_MMU_MEMORY_CACHE_INIT();
+}
#endif
#define HALT_POLL_HIST_COUNT 32
> +
> if (is_protected_kvm_enabled())
> return -EPERM;
>
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index a25e0b73ee70..b017c29a9340 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -304,6 +304,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> HRTIMER_MODE_REL);
> vcpu->arch.comparecount_timer.function = kvm_mips_comparecount_wakeup;
>
> + vcpu->arch.mmu_page_cache.node = NUMA_NO_NODE;
> +
> /*
> * Allocate space for host mode exception handlers that handle
> * guest mode exits
> diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> index 34b57e0be2ef..119de4520cc6 100644
> --- a/arch/riscv/kvm/mmu.c
> +++ b/arch/riscv/kvm/mmu.c
> @@ -353,9 +353,9 @@ int kvm_riscv_gstage_ioremap(struct kvm *kvm, gpa_t gpa,
> phys_addr_t addr, end;
> struct kvm_mmu_memory_cache pcache = {
> .gfp_custom = (in_atomic) ? GFP_ATOMIC | __GFP_ACCOUNT : 0,
> - .gfp_zero = __GFP_ZERO,
> };
>
> + INIT_KVM_MMU_MEMORY_CACHE(&pcache, NULL, NUMA_NO_NODE);
> end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
> pfn = __phys_to_pfn(hpa);
>
> diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> index 7c08567097f0..189b14feb365 100644
> --- a/arch/riscv/kvm/vcpu.c
> +++ b/arch/riscv/kvm/vcpu.c
> @@ -161,7 +161,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>
> /* Mark this VCPU never ran */
> vcpu->arch.ran_atleast_once = false;
> - vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache, NULL, NUMA_NO_NODE);
> bitmap_zero(vcpu->arch.isa, RISCV_ISA_EXT_MAX);
>
> /* Setup ISA features available to VCPU */
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f6a10d7a871..23a3b82b2384 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5954,13 +5954,14 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> {
> int ret;
>
> - vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> - vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache,
> + pte_list_desc_cache, NUMA_NO_NODE);
>
> - vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> - vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache,
> + mmu_page_header_cache, NUMA_NO_NODE);
>
> - vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache,
> + NULL, NUMA_NO_NODE);
> spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
>
> vcpu->arch.mmu = &vcpu->arch.root_mmu;
> @@ -6124,14 +6125,15 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> kvm_page_track_register_notifier(kvm, node);
>
> - kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
> - kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
> + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache,
> + mmu_page_header_cache, NUMA_NO_NODE);
>
> - kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
> + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache,
> + NULL, NUMA_NO_NODE);
> spin_lock_init(&kvm->arch.split_shadow_page_cache_lock);
>
> - kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
> - kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
> + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache,
> + pte_list_desc_cache, NUMA_NO_NODE);
>
> return 0;
> }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index a262e15ebd19..719687a37ef7 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2302,4 +2302,10 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> /* Max number of entries allowed for each kvm dirty ring */
> #define KVM_DIRTY_RING_MAX_ENTRIES 65536
>
> +#define INIT_KVM_MMU_MEMORY_CACHE(_cache, _kmem_cache, _node) ({ \
> + (_cache)->kmem_cache = _kmem_cache; \
> + (_cache)->gfp_zero = __GFP_ZERO; \
> + (_cache)->node = _node; \
> +})
> +
> #endif
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 76de36e56cdf..9c70ce95e51f 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -97,6 +97,8 @@ struct kvm_mmu_memory_cache {
> struct kmem_cache *kmem_cache;
> int capacity;
> void **objects;
> + /* Node on which memory should be allocated by default */
> + int node;
> };
> #endif
>
> --
> 2.39.0.314.g84b9a713c41-goog
>
^ permalink raw reply related [flat|nested] 46+ messages in thread* Re: [Patch v3 6/9] KVM: Provide NUMA node support to kvm_mmu_memory_cache{}
2022-12-29 23:08 ` David Matlack
@ 2022-12-29 23:11 ` David Matlack
2023-01-03 18:45 ` Vipin Sharma
0 siblings, 1 reply; 46+ messages in thread
From: David Matlack @ 2022-12-29 23:11 UTC (permalink / raw)
To: Vipin Sharma; +Cc: seanjc, pbonzini, bgardon, kvm, linux-kernel
On Thu, Dec 29, 2022 at 3:08 PM David Matlack <dmatlack@google.com> wrote:
>
> On Wed, Dec 21, 2022 at 06:34:54PM -0800, Vipin Sharma wrote:
> > Add 'node' variable in kvm_mmu_memory_cache{} to denote which NUMA node
> > this cache should allocate memory from. Default initialize to
> > NUMA_NO_NODE in all architectures.
> >
> > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > ---
> > arch/arm64/kvm/arm.c | 2 +-
> > arch/arm64/kvm/mmu.c | 4 +++-
> > arch/mips/kvm/mips.c | 2 ++
> > arch/riscv/kvm/mmu.c | 2 +-
> > arch/riscv/kvm/vcpu.c | 2 +-
> > arch/x86/kvm/mmu/mmu.c | 22 ++++++++++++----------
> > include/linux/kvm_host.h | 6 ++++++
> > include/linux/kvm_types.h | 2 ++
> > 8 files changed, 28 insertions(+), 14 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 9c5573bc4614..52a41f4532e2 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -340,7 +340,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > vcpu->arch.target = -1;
> > bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
> >
> > - vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache, NULL, NUMA_NO_NODE);
> >
> > /*
> > * Default value for the FP state, will be overloaded at load
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 31d7fa4c7c14..bd07155e17fa 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -894,12 +894,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > {
> > phys_addr_t addr;
> > int ret = 0;
> > - struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
> > + struct kvm_mmu_memory_cache cache;
> > struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> > enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> > KVM_PGTABLE_PROT_R |
> > (writable ? KVM_PGTABLE_PROT_W : 0);
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&cache, NULL, NUMA_NO_NODE);
>
> This is not any better than setting cache.node = NUMA_NO_NODE directly.
> Yes it's less lines of code, but it's harder to read (what does NULL
> mean here?), and every user of kvm_mmu_memory_cache still has to know to
> pass NUMA_NO_NODE.
>
> When I originally gave this suggestion, I intended to suggest that
> INIT_KVM_MMU_MEMORY_CACHE() provide just default initialization.
> Non-default initialization for gfp_zero, gfp_custom, kmem_cache, and
> node would remain as they are.
>
> Yes this adds some more lines, but keeps things readable, and doesn't require
> every initialization site of kvm_mmu_memory_cache to know what to pass
> for gfp_zero, node, and kmem_cache. It only needs to set the fields
> *it* cares about.
And to offset the extra lines to call INIT_KVM_MMU_MEMORY_CACHE(), we
could finally invert the meaning of gfp_zero so that caches use
__GFP_ZERO by default. The majority of caches want __GFP_ZERO, so that
should cut down a bunch of lines.
>
> Here's what I mean specifically, based on INIT_LIST_HEAD. I don't think
> I got all the kvm_mmu_memory_cache users, but you get the point.
>
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 9c5573bc4614..0e138dcaf4d4 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -340,6 +340,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.target = -1;
> bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
>
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);
> vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>
> /*
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 31d7fa4c7c14..f5fd78a4f084 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -894,12 +894,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> {
> phys_addr_t addr;
> int ret = 0;
> - struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
> + KVM_MMU_MEMORY_CACHE(cache);
> struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> KVM_PGTABLE_PROT_R |
> (writable ? KVM_PGTABLE_PROT_W : 0);
>
> + cache.gfp_zero = __GFP_ZERO;
> +
> if (is_protected_kvm_enabled())
> return -EPERM;
>
> diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> index 34b57e0be2ef..7915a5a2d104 100644
> --- a/arch/riscv/kvm/mmu.c
> +++ b/arch/riscv/kvm/mmu.c
> @@ -351,10 +351,11 @@ int kvm_riscv_gstage_ioremap(struct kvm *kvm, gpa_t gpa,
> int ret = 0;
> unsigned long pfn;
> phys_addr_t addr, end;
> - struct kvm_mmu_memory_cache pcache = {
> - .gfp_custom = (in_atomic) ? GFP_ATOMIC | __GFP_ACCOUNT : 0,
> - .gfp_zero = __GFP_ZERO,
> - };
> + KVM_MMU_MEMORY_CACHE(pcache);
> +
> + pcache.gfp_zero = __GFP_ZERO;
> + if (in_atomic)
> + pcache.gfp_custom = GFP_ATOMIC | __GFP_ACCOUNT;
>
> end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
> pfn = __phys_to_pfn(hpa);
> diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> index 7c08567097f0..3d73ab3ec9a4 100644
> --- a/arch/riscv/kvm/vcpu.c
> +++ b/arch/riscv/kvm/vcpu.c
> @@ -161,6 +161,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>
> /* Mark this VCPU never ran */
> vcpu->arch.ran_atleast_once = false;
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache);
> vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> bitmap_zero(vcpu->arch.isa, RISCV_ISA_EXT_MAX);
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 254bc46234e0..d4cd8e64cc03 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5909,14 +5909,19 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> {
> int ret;
>
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache);
> vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
>
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache);
> vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache);
> vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
>
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadowed_info_cache);
> +
> vcpu->arch.mmu = &vcpu->arch.root_mmu;
> vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
>
> @@ -6083,11 +6088,14 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> kvm_page_track_register_notifier(kvm, node);
>
> + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache);
> kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
> kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
>
> + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache);
> kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
>
> + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache);
> kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
> kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
>
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 76de36e56cdf..eb7ff9afa5c7 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -98,6 +98,17 @@ struct kvm_mmu_memory_cache {
> int capacity;
> void **objects;
> };
> +
> +#define KVM_MMU_MEMORY_CACHE_INIT() (struct kvm_mmu_memory_cache) { \
> +}
> +
> +#define KVM_MMU_MEMORY_CACHE(_name) \
> + struct kvm_mmu_memory_cache _name = KVM_MMU_MEMORY_CACHE_INIT()
> +
> +static inline void INIT_KVM_MMU_MEMORY_CACHE(struct kvm_mmu_memory_cache *cache)
> +{
> + *cache = KVM_MMU_MEMORY_CACHE_INIT();
> +}
> #endif
>
> #define HALT_POLL_HIST_COUNT 32
>
> > +
> > if (is_protected_kvm_enabled())
> > return -EPERM;
> >
> > diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> > index a25e0b73ee70..b017c29a9340 100644
> > --- a/arch/mips/kvm/mips.c
> > +++ b/arch/mips/kvm/mips.c
> > @@ -304,6 +304,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > HRTIMER_MODE_REL);
> > vcpu->arch.comparecount_timer.function = kvm_mips_comparecount_wakeup;
> >
> > + vcpu->arch.mmu_page_cache.node = NUMA_NO_NODE;
> > +
> > /*
> > * Allocate space for host mode exception handlers that handle
> > * guest mode exits
> > diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> > index 34b57e0be2ef..119de4520cc6 100644
> > --- a/arch/riscv/kvm/mmu.c
> > +++ b/arch/riscv/kvm/mmu.c
> > @@ -353,9 +353,9 @@ int kvm_riscv_gstage_ioremap(struct kvm *kvm, gpa_t gpa,
> > phys_addr_t addr, end;
> > struct kvm_mmu_memory_cache pcache = {
> > .gfp_custom = (in_atomic) ? GFP_ATOMIC | __GFP_ACCOUNT : 0,
> > - .gfp_zero = __GFP_ZERO,
> > };
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&pcache, NULL, NUMA_NO_NODE);
> > end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
> > pfn = __phys_to_pfn(hpa);
> >
> > diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> > index 7c08567097f0..189b14feb365 100644
> > --- a/arch/riscv/kvm/vcpu.c
> > +++ b/arch/riscv/kvm/vcpu.c
> > @@ -161,7 +161,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >
> > /* Mark this VCPU never ran */
> > vcpu->arch.ran_atleast_once = false;
> > - vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache, NULL, NUMA_NO_NODE);
> > bitmap_zero(vcpu->arch.isa, RISCV_ISA_EXT_MAX);
> >
> > /* Setup ISA features available to VCPU */
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6f6a10d7a871..23a3b82b2384 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5954,13 +5954,14 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > {
> > int ret;
> >
> > - vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> > - vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache,
> > + pte_list_desc_cache, NUMA_NO_NODE);
> >
> > - vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> > - vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache,
> > + mmu_page_header_cache, NUMA_NO_NODE);
> >
> > - vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache,
> > + NULL, NUMA_NO_NODE);
> > spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
> >
> > vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > @@ -6124,14 +6125,15 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> > node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> > kvm_page_track_register_notifier(kvm, node);
> >
> > - kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
> > - kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
> > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache,
> > + mmu_page_header_cache, NUMA_NO_NODE);
> >
> > - kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache,
> > + NULL, NUMA_NO_NODE);
> > spin_lock_init(&kvm->arch.split_shadow_page_cache_lock);
> >
> > - kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
> > - kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
> > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache,
> > + pte_list_desc_cache, NUMA_NO_NODE);
> >
> > return 0;
> > }
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index a262e15ebd19..719687a37ef7 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2302,4 +2302,10 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> > /* Max number of entries allowed for each kvm dirty ring */
> > #define KVM_DIRTY_RING_MAX_ENTRIES 65536
> >
> > +#define INIT_KVM_MMU_MEMORY_CACHE(_cache, _kmem_cache, _node) ({ \
> > + (_cache)->kmem_cache = _kmem_cache; \
> > + (_cache)->gfp_zero = __GFP_ZERO; \
> > + (_cache)->node = _node; \
> > +})
> > +
> > #endif
> > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > index 76de36e56cdf..9c70ce95e51f 100644
> > --- a/include/linux/kvm_types.h
> > +++ b/include/linux/kvm_types.h
> > @@ -97,6 +97,8 @@ struct kvm_mmu_memory_cache {
> > struct kmem_cache *kmem_cache;
> > int capacity;
> > void **objects;
> > + /* Node on which memory should be allocated by default */
> > + int node;
> > };
> > #endif
> >
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 6/9] KVM: Provide NUMA node support to kvm_mmu_memory_cache{}
2022-12-29 23:11 ` David Matlack
@ 2023-01-03 18:45 ` Vipin Sharma
2023-01-03 18:55 ` David Matlack
0 siblings, 1 reply; 46+ messages in thread
From: Vipin Sharma @ 2023-01-03 18:45 UTC (permalink / raw)
To: David Matlack; +Cc: seanjc, pbonzini, bgardon, kvm, linux-kernel
On Thu, Dec 29, 2022 at 3:12 PM David Matlack <dmatlack@google.com> wrote:
>
> On Thu, Dec 29, 2022 at 3:08 PM David Matlack <dmatlack@google.com> wrote:
> >
> > On Wed, Dec 21, 2022 at 06:34:54PM -0800, Vipin Sharma wrote:
> > > Add 'node' variable in kvm_mmu_memory_cache{} to denote which NUMA node
> > > this cache should allocate memory from. Default initialize to
> > > NUMA_NO_NODE in all architectures.
> > >
> > > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > > ---
> > > arch/arm64/kvm/arm.c | 2 +-
> > > arch/arm64/kvm/mmu.c | 4 +++-
> > > arch/mips/kvm/mips.c | 2 ++
> > > arch/riscv/kvm/mmu.c | 2 +-
> > > arch/riscv/kvm/vcpu.c | 2 +-
> > > arch/x86/kvm/mmu/mmu.c | 22 ++++++++++++----------
> > > include/linux/kvm_host.h | 6 ++++++
> > > include/linux/kvm_types.h | 2 ++
> > > 8 files changed, 28 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > index 9c5573bc4614..52a41f4532e2 100644
> > > --- a/arch/arm64/kvm/arm.c
> > > +++ b/arch/arm64/kvm/arm.c
> > > @@ -340,7 +340,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > > vcpu->arch.target = -1;
> > > bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
> > >
> > > - vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> > > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache, NULL, NUMA_NO_NODE);
> > >
> > > /*
> > > * Default value for the FP state, will be overloaded at load
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index 31d7fa4c7c14..bd07155e17fa 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -894,12 +894,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > > {
> > > phys_addr_t addr;
> > > int ret = 0;
> > > - struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
> > > + struct kvm_mmu_memory_cache cache;
> > > struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> > > enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> > > KVM_PGTABLE_PROT_R |
> > > (writable ? KVM_PGTABLE_PROT_W : 0);
> > >
> > > + INIT_KVM_MMU_MEMORY_CACHE(&cache, NULL, NUMA_NO_NODE);
> >
> > This is not any better than setting cache.node = NUMA_NO_NODE directly.
> > Yes it's less lines of code, but it's harder to read (what does NULL
> > mean here?), and every user of kvm_mmu_memory_cache still has to know to
> > pass NUMA_NO_NODE.
> >
> > When I originally gave this suggestion, I intended to suggest that
> > INIT_KVM_MMU_MEMORY_CACHE() provide just default initialization.
> > Non-default initialization for gfp_zero, gfp_custom, kmem_cache, and
> > node would remain as they are.
> >
> > Yes this adds some more lines, but keeps things readable, and doesn't require
> > every initialization site of kvm_mmu_memory_cache to know what to pass
> > for gfp_zero, node, and kmem_cache. It only needs to set the fields
> > *it* cares about.
>
> And to offset the extra lines to call INIT_KVM_MMU_MEMORY_CACHE(), we
> could finally invert the meaning of gfp_zero so that caches use
> __GFP_ZERO by default. The majority of caches want __GFP_ZERO, so that
> should cut down a bunch of lines.
>
Can you clarify what you mean by invert?
Caches which don't want __GFP_ZERO will explicitly set gfp_zero to 0.
Is this what you intend?
> >
> > Here's what I mean specifically, based on INIT_LIST_HEAD. I don't think
> > I got all the kvm_mmu_memory_cache users, but you get the point.
> >
> >
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 9c5573bc4614..0e138dcaf4d4 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -340,6 +340,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > vcpu->arch.target = -1;
> > bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);
> > vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> >
> > /*
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 31d7fa4c7c14..f5fd78a4f084 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -894,12 +894,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > {
> > phys_addr_t addr;
> > int ret = 0;
> > - struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
> > + KVM_MMU_MEMORY_CACHE(cache);
> > struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> > enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> > KVM_PGTABLE_PROT_R |
> > (writable ? KVM_PGTABLE_PROT_W : 0);
> >
> > + cache.gfp_zero = __GFP_ZERO;
> > +
> > if (is_protected_kvm_enabled())
> > return -EPERM;
> >
> > diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> > index 34b57e0be2ef..7915a5a2d104 100644
> > --- a/arch/riscv/kvm/mmu.c
> > +++ b/arch/riscv/kvm/mmu.c
> > @@ -351,10 +351,11 @@ int kvm_riscv_gstage_ioremap(struct kvm *kvm, gpa_t gpa,
> > int ret = 0;
> > unsigned long pfn;
> > phys_addr_t addr, end;
> > - struct kvm_mmu_memory_cache pcache = {
> > - .gfp_custom = (in_atomic) ? GFP_ATOMIC | __GFP_ACCOUNT : 0,
> > - .gfp_zero = __GFP_ZERO,
> > - };
> > + KVM_MMU_MEMORY_CACHE(pcache);
> > +
> > + pcache.gfp_zero = __GFP_ZERO;
> > + if (in_atomic)
> > + pcache.gfp_custom = GFP_ATOMIC | __GFP_ACCOUNT;
> >
> > end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
> > pfn = __phys_to_pfn(hpa);
> > diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> > index 7c08567097f0..3d73ab3ec9a4 100644
> > --- a/arch/riscv/kvm/vcpu.c
> > +++ b/arch/riscv/kvm/vcpu.c
> > @@ -161,6 +161,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >
> > /* Mark this VCPU never ran */
> > vcpu->arch.ran_atleast_once = false;
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache);
> > vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> > bitmap_zero(vcpu->arch.isa, RISCV_ISA_EXT_MAX);
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 254bc46234e0..d4cd8e64cc03 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5909,14 +5909,19 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > {
> > int ret;
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache);
> > vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> > vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache);
> > vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> > vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache);
> > vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadowed_info_cache);
> > +
> > vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> >
> > @@ -6083,11 +6088,14 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> > node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> > kvm_page_track_register_notifier(kvm, node);
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache);
> > kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
> > kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache);
> > kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache);
> > kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
> > kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
> >
> > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > index 76de36e56cdf..eb7ff9afa5c7 100644
> > --- a/include/linux/kvm_types.h
> > +++ b/include/linux/kvm_types.h
> > @@ -98,6 +98,17 @@ struct kvm_mmu_memory_cache {
> > int capacity;
> > void **objects;
> > };
> > +
> > +#define KVM_MMU_MEMORY_CACHE_INIT() (struct kvm_mmu_memory_cache) { \
> > +}
> > +
> > +#define KVM_MMU_MEMORY_CACHE(_name) \
> > + struct kvm_mmu_memory_cache _name = KVM_MMU_MEMORY_CACHE_INIT()
> > +
> > +static inline void INIT_KVM_MMU_MEMORY_CACHE(struct kvm_mmu_memory_cache *cache)
> > +{
> > + *cache = KVM_MMU_MEMORY_CACHE_INIT();
> > +}
> > #endif
> >
> > #define HALT_POLL_HIST_COUNT 32
> >
> > > +
> > > if (is_protected_kvm_enabled())
> > > return -EPERM;
> > >
> > > diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> > > index a25e0b73ee70..b017c29a9340 100644
> > > --- a/arch/mips/kvm/mips.c
> > > +++ b/arch/mips/kvm/mips.c
> > > @@ -304,6 +304,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > > HRTIMER_MODE_REL);
> > > vcpu->arch.comparecount_timer.function = kvm_mips_comparecount_wakeup;
> > >
> > > + vcpu->arch.mmu_page_cache.node = NUMA_NO_NODE;
> > > +
> > > /*
> > > * Allocate space for host mode exception handlers that handle
> > > * guest mode exits
> > > diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> > > index 34b57e0be2ef..119de4520cc6 100644
> > > --- a/arch/riscv/kvm/mmu.c
> > > +++ b/arch/riscv/kvm/mmu.c
> > > @@ -353,9 +353,9 @@ int kvm_riscv_gstage_ioremap(struct kvm *kvm, gpa_t gpa,
> > > phys_addr_t addr, end;
> > > struct kvm_mmu_memory_cache pcache = {
> > > .gfp_custom = (in_atomic) ? GFP_ATOMIC | __GFP_ACCOUNT : 0,
> > > - .gfp_zero = __GFP_ZERO,
> > > };
> > >
> > > + INIT_KVM_MMU_MEMORY_CACHE(&pcache, NULL, NUMA_NO_NODE);
> > > end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
> > > pfn = __phys_to_pfn(hpa);
> > >
> > > diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> > > index 7c08567097f0..189b14feb365 100644
> > > --- a/arch/riscv/kvm/vcpu.c
> > > +++ b/arch/riscv/kvm/vcpu.c
> > > @@ -161,7 +161,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > >
> > > /* Mark this VCPU never ran */
> > > vcpu->arch.ran_atleast_once = false;
> > > - vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> > > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache, NULL, NUMA_NO_NODE);
> > > bitmap_zero(vcpu->arch.isa, RISCV_ISA_EXT_MAX);
> > >
> > > /* Setup ISA features available to VCPU */
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 6f6a10d7a871..23a3b82b2384 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -5954,13 +5954,14 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > > {
> > > int ret;
> > >
> > > - vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> > > - vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
> > > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache,
> > > + pte_list_desc_cache, NUMA_NO_NODE);
> > >
> > > - vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> > > - vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> > > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache,
> > > + mmu_page_header_cache, NUMA_NO_NODE);
> > >
> > > - vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache,
> > > + NULL, NUMA_NO_NODE);
> > > spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
> > >
> > > vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > > @@ -6124,14 +6125,15 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> > > node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> > > kvm_page_track_register_notifier(kvm, node);
> > >
> > > - kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
> > > - kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
> > > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache,
> > > + mmu_page_header_cache, NUMA_NO_NODE);
> > >
> > > - kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache,
> > > + NULL, NUMA_NO_NODE);
> > > spin_lock_init(&kvm->arch.split_shadow_page_cache_lock);
> > >
> > > - kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
> > > - kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
> > > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache,
> > > + pte_list_desc_cache, NUMA_NO_NODE);
> > >
> > > return 0;
> > > }
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index a262e15ebd19..719687a37ef7 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -2302,4 +2302,10 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> > > /* Max number of entries allowed for each kvm dirty ring */
> > > #define KVM_DIRTY_RING_MAX_ENTRIES 65536
> > >
> > > +#define INIT_KVM_MMU_MEMORY_CACHE(_cache, _kmem_cache, _node) ({ \
> > > + (_cache)->kmem_cache = _kmem_cache; \
> > > + (_cache)->gfp_zero = __GFP_ZERO; \
> > > + (_cache)->node = _node; \
> > > +})
> > > +
> > > #endif
> > > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > > index 76de36e56cdf..9c70ce95e51f 100644
> > > --- a/include/linux/kvm_types.h
> > > +++ b/include/linux/kvm_types.h
> > > @@ -97,6 +97,8 @@ struct kvm_mmu_memory_cache {
> > > struct kmem_cache *kmem_cache;
> > > int capacity;
> > > void **objects;
> > > + /* Node on which memory should be allocated by default */
> > > + int node;
> > > };
> > > #endif
> > >
> > > --
> > > 2.39.0.314.g84b9a713c41-goog
> > >
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 6/9] KVM: Provide NUMA node support to kvm_mmu_memory_cache{}
2023-01-03 18:45 ` Vipin Sharma
@ 2023-01-03 18:55 ` David Matlack
0 siblings, 0 replies; 46+ messages in thread
From: David Matlack @ 2023-01-03 18:55 UTC (permalink / raw)
To: Vipin Sharma; +Cc: seanjc, pbonzini, bgardon, kvm, linux-kernel
On Tue, Jan 3, 2023 at 10:46 AM Vipin Sharma <vipinsh@google.com> wrote:
>
> On Thu, Dec 29, 2022 at 3:12 PM David Matlack <dmatlack@google.com> wrote:
> >
> > On Thu, Dec 29, 2022 at 3:08 PM David Matlack <dmatlack@google.com> wrote:
> > >
> > > On Wed, Dec 21, 2022 at 06:34:54PM -0800, Vipin Sharma wrote:
> > > > Add 'node' variable in kvm_mmu_memory_cache{} to denote which NUMA node
> > > > this cache should allocate memory from. Default initialize to
> > > > NUMA_NO_NODE in all architectures.
> > > >
> > > > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > > > ---
> > > > arch/arm64/kvm/arm.c | 2 +-
> > > > arch/arm64/kvm/mmu.c | 4 +++-
> > > > arch/mips/kvm/mips.c | 2 ++
> > > > arch/riscv/kvm/mmu.c | 2 +-
> > > > arch/riscv/kvm/vcpu.c | 2 +-
> > > > arch/x86/kvm/mmu/mmu.c | 22 ++++++++++++----------
> > > > include/linux/kvm_host.h | 6 ++++++
> > > > include/linux/kvm_types.h | 2 ++
> > > > 8 files changed, 28 insertions(+), 14 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > > index 9c5573bc4614..52a41f4532e2 100644
> > > > --- a/arch/arm64/kvm/arm.c
> > > > +++ b/arch/arm64/kvm/arm.c
> > > > @@ -340,7 +340,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > > > vcpu->arch.target = -1;
> > > > bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
> > > >
> > > > - vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> > > > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache, NULL, NUMA_NO_NODE);
> > > >
> > > > /*
> > > > * Default value for the FP state, will be overloaded at load
> > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > index 31d7fa4c7c14..bd07155e17fa 100644
> > > > --- a/arch/arm64/kvm/mmu.c
> > > > +++ b/arch/arm64/kvm/mmu.c
> > > > @@ -894,12 +894,14 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > > > {
> > > > phys_addr_t addr;
> > > > int ret = 0;
> > > > - struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
> > > > + struct kvm_mmu_memory_cache cache;
> > > > struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> > > > enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> > > > KVM_PGTABLE_PROT_R |
> > > > (writable ? KVM_PGTABLE_PROT_W : 0);
> > > >
> > > > + INIT_KVM_MMU_MEMORY_CACHE(&cache, NULL, NUMA_NO_NODE);
> > >
> > > This is not any better than setting cache.node = NUMA_NO_NODE directly.
> > > Yes it's less lines of code, but it's harder to read (what does NULL
> > > mean here?), and every user of kvm_mmu_memory_cache still has to know to
> > > pass NUMA_NO_NODE.
> > >
> > > When I originally gave this suggestion, I intended to suggest that
> > > INIT_KVM_MMU_MEMORY_CACHE() provide just default initialization.
> > > Non-default initialization for gfp_zero, gfp_custom, kmem_cache, and
> > > node would remain as they are.
> > >
> > > Yes this adds some more lines, but keeps things readable, and doesn't
> > > require every initialization site of kvm_mmu_memory_cache to know what to pass
> > > for gfp_zero, node, and kmem_cache. It only needs to set the fields
> > > *it* cares about.
> >
> > And to offset the extra lines to call INIT_KVM_MMU_MEMORY_CACHE(), we
> > could finally invert the meaning of gfp_zero so that caches use
> > __GFP_ZERO by default. The majority of caches want __GFP_ZERO, so that
> > should cut down a bunch of lines.
> >
>
> Can you clarify what you mean by invert?
>
> Caches which don't want __GFP_ZERO will explicitly set gfp_zero to 0.
> Is this what you intend?
When I wrote that comment I was thinking you can change `gfp_t
gfp_zero` to e.g. `bool skip_gfp_zero` so that the default initialized
value (false/0) means "use __GFP_ZERO".
However, that's silly once we have INIT_KVM_MMU_MEMORY_CACHE(). We can
do what you suggest: set gfp_zero to __GFP_ZERO in
INIT_KVM_MMU_MEMORY_CACHE() and then explicitly set it to 0 in caches
that don't need __GFP_ZERO.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [Patch v3 7/9] KVM: x86/mmu: Allocate page table's pages on NUMA node of the underlying pages
2022-12-22 2:34 [Patch v3 0/9] NUMA aware page table's pages allocation Vipin Sharma
` (5 preceding siblings ...)
2022-12-22 2:34 ` [Patch v3 6/9] KVM: Provide NUMA node support to kvm_mmu_memory_cache{} Vipin Sharma
@ 2022-12-22 2:34 ` Vipin Sharma
2022-12-27 19:34 ` Ben Gardon
2022-12-22 2:34 ` [Patch v3 8/9] KVM: x86/mmu: Make split_shadow_page_cache NUMA aware Vipin Sharma
2022-12-22 2:34 ` [Patch v3 9/9] KVM: x86/mmu: Reduce default cache size in KVM from 40 to PT64_ROOT_MAX_LEVEL Vipin Sharma
8 siblings, 1 reply; 46+ messages in thread
From: Vipin Sharma @ 2022-12-22 2:34 UTC (permalink / raw)
To: seanjc, pbonzini, bgardon, dmatlack; +Cc: kvm, linux-kernel, Vipin Sharma
Page table pages of a VM are currently allocated based on the current
task's NUMA node or its mempolicy. This can cause suboptimal remote
accesses by the vCPU if it is accessing physical pages local to its NUMA
node but the page table pages mapping those physical pages were created
by some other vCPU which was on a different NUMA node or had a different
policy.
Allocate page table pages on the same NUMA node where the underlying
physical page exists. Page tables at levels 5, 4, and 3 might not end up
on the same NUMA node as they can span multiple NUMA nodes.
Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 63 ++++++++++++++++++++++-----------
arch/x86/kvm/mmu/paging_tmpl.h | 4 +--
arch/x86/kvm/mmu/tdp_mmu.c | 11 +++---
virt/kvm/kvm_main.c | 2 +-
5 files changed, 53 insertions(+), 29 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 293994fabae3..b1f319ad6f89 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -782,7 +782,7 @@ struct kvm_vcpu_arch {
struct kvm_mmu *walk_mmu;
struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
- struct kvm_mmu_memory_cache mmu_shadow_page_cache;
+ struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];
struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 23a3b82b2384..511c6ef265ee 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -677,24 +677,29 @@ static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
{
- int r;
+ int r, nid;
/* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
if (r)
return r;
- r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
- &vcpu->arch.mmu_shadow_page_cache_lock,
- PT64_ROOT_MAX_LEVEL);
- if (r)
- return r;
+
+ for_each_online_node(nid) {
+ r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
+ &vcpu->arch.mmu_shadow_page_cache_lock,
+ PT64_ROOT_MAX_LEVEL);
+ if (r)
+ return r;
+ }
+
if (maybe_indirect) {
r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
PT64_ROOT_MAX_LEVEL);
if (r)
return r;
}
+
return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
PT64_ROOT_MAX_LEVEL);
}
@@ -715,9 +720,14 @@ static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
{
+ int nid;
+
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
- mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
- &vcpu->arch.mmu_shadow_page_cache_lock);
+
+ for_each_node(nid)
+ mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
+ &vcpu->arch.mmu_shadow_page_cache_lock);
+
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
}
@@ -2256,11 +2266,12 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
gfn_t gfn,
- union kvm_mmu_page_role role)
+ union kvm_mmu_page_role role,
+ int nid)
{
struct shadow_page_caches caches = {
.page_header_cache = &vcpu->arch.mmu_page_header_cache,
- .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
+ .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache[nid],
.shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
.shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
};
@@ -2316,15 +2327,19 @@ static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
u64 *sptep, gfn_t gfn,
- bool direct, unsigned int access)
+ bool direct, unsigned int access,
+ kvm_pfn_t pfn)
{
union kvm_mmu_page_role role;
+ int nid;
if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
return ERR_PTR(-EEXIST);
role = kvm_mmu_child_role(sptep, direct, access);
- return kvm_mmu_get_shadow_page(vcpu, gfn, role);
+ nid = kvm_pfn_to_page_table_nid(pfn);
+
+ return kvm_mmu_get_shadow_page(vcpu, gfn, role, nid);
}
static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
@@ -3208,7 +3223,8 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
if (it.level == fault->goal_level)
break;
- sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
+ sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true,
+ ACC_ALL, fault->pfn);
if (sp == ERR_PTR(-EEXIST))
continue;
@@ -3636,7 +3652,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
- sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
+ sp = kvm_mmu_get_shadow_page(vcpu, gfn, role, numa_mem_id());
++sp->root_count;
return __pa(sp->spt);
@@ -5952,7 +5968,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
int kvm_mmu_create(struct kvm_vcpu *vcpu)
{
- int ret;
+ int ret, nid;
INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache,
pte_list_desc_cache, NUMA_NO_NODE);
@@ -5960,8 +5976,9 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache,
mmu_page_header_cache, NUMA_NO_NODE);
- INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache,
- NULL, NUMA_NO_NODE);
+ for_each_node(nid)
+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache[nid],
+ NULL, nid);
spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
vcpu->arch.mmu = &vcpu->arch.root_mmu;
@@ -6692,13 +6709,17 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
}
static unsigned long mmu_shrink_cache(struct kvm_mmu_memory_cache *cache,
+ int cache_count,
spinlock_t *cache_lock)
{
unsigned long freed = 0;
+ int nid;
spin_lock(cache_lock);
- if (cache->nobjs)
- freed = kvm_mmu_empty_memory_cache(cache);
+ for (nid = 0; nid < cache_count; nid++) {
+ if (node_online(nid) && cache[nid].nobjs)
+ freed += kvm_mmu_empty_memory_cache(&cache[nid]);
+ }
spin_unlock(cache_lock);
return freed;
}
@@ -6721,13 +6742,15 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
list_move_tail(&kvm->vm_list, &vm_list);
freed += mmu_shrink_cache(&kvm->arch.split_shadow_page_cache,
+ 1,
&kvm->arch.split_shadow_page_cache_lock);
if (freed >= sc->nr_to_scan)
break;
kvm_for_each_vcpu(i, vcpu, kvm) {
- freed += mmu_shrink_cache(&vcpu->arch.mmu_shadow_page_cache,
+ freed += mmu_shrink_cache(vcpu->arch.mmu_shadow_page_cache,
+ MAX_NUMNODES,
&vcpu->arch.mmu_shadow_page_cache_lock);
}
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index e5662dbd519c..1ceca62ec4cf 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -652,7 +652,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
table_gfn = gw->table_gfn[it.level - 2];
access = gw->pt_access[it.level - 2];
sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
- false, access);
+ false, access, fault->pfn);
if (sp != ERR_PTR(-EEXIST)) {
/*
@@ -708,7 +708,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
validate_direct_spte(vcpu, it.sptep, direct_access);
sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
- true, direct_access);
+ true, direct_access, fault->pfn);
if (sp == ERR_PTR(-EEXIST))
continue;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 376b8dceb3f9..b5abae2366dd 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -259,12 +259,12 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
kvm_mmu_page_as_id(_root) != _as_id) { \
} else
-static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
+static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, int nid)
{
struct kvm_mmu_page *sp;
sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
- sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
+ sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache[nid],
&vcpu->arch.mmu_shadow_page_cache_lock);
return sp;
@@ -317,7 +317,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
goto out;
}
- root = tdp_mmu_alloc_sp(vcpu);
+ root = tdp_mmu_alloc_sp(vcpu, numa_mem_id());
tdp_mmu_init_sp(root, NULL, 0, role);
refcount_set(&root->tdp_mmu_root_count, 1);
@@ -1149,7 +1149,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
struct kvm *kvm = vcpu->kvm;
struct tdp_iter iter;
struct kvm_mmu_page *sp;
- int ret = RET_PF_RETRY;
+ int ret = RET_PF_RETRY, nid;
kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -1178,11 +1178,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
!is_large_pte(iter.old_spte))
continue;
+ nid = kvm_pfn_to_page_table_nid(fault->pfn);
/*
* The SPTE is either non-present or points to a huge page that
* needs to be split.
*/
- sp = tdp_mmu_alloc_sp(vcpu);
+ sp = tdp_mmu_alloc_sp(vcpu, nid);
tdp_mmu_init_child_sp(sp, &iter);
sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d96c8146e9ba..4f3db7ffeba8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -415,7 +415,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
if (mc->kmem_cache)
return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
else
- return (void *)__get_free_page(gfp_flags);
+ return kvm_mmu_get_free_page(mc->node, gfp_flags);
}
int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
--
2.39.0.314.g84b9a713c41-goog
^ permalink raw reply related [flat|nested] 46+ messages in thread* Re: [Patch v3 7/9] KVM: x86/mmu: Allocate page table's pages on NUMA node of the underlying pages
2022-12-22 2:34 ` [Patch v3 7/9] KVM: x86/mmu: Allocate page table's pages on NUMA node of the underlying pages Vipin Sharma
@ 2022-12-27 19:34 ` Ben Gardon
2022-12-28 22:08 ` Vipin Sharma
0 siblings, 1 reply; 46+ messages in thread
From: Ben Gardon @ 2022-12-27 19:34 UTC (permalink / raw)
To: Vipin Sharma; +Cc: seanjc, pbonzini, dmatlack, kvm, linux-kernel
On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
>
> Page table pages of a VM are currently allocated based on the current
> task's NUMA node or its mempolicy. This can cause suboptimal remote
> accesses by the vCPU if it is accessing physical pages local to its NUMA
> node but the page table pages mapping those physcal pages were created
> by some other vCPU which was on different NUMA node or had different
> policy.
>
> Allocate page table pages on the same NUMA node where underlying
> physical page exists. Page table at level 5, 4, and 3 might not end up
> on the same NUMA node as they can span multiple NUMA nodes.
A page table at any level could map memory spanning multiple NUMA
nodes, it just becomes more likely at higher levels.
We're only guaranteed that a page table maps memory all on the same
node if it's a split hugepage.
This change can only guarantee that the page table pages are allocated
on the same node as at least some of the memory they map.
Of course in practice, the above is absolutely correct since we'd
expect to have multi-GB continuous ranges of GFNs allocated on the
same node via huge pages.
And since the root pages are allocated based only on where the thread
allocating them is running, they're not actually guaranteed to be on
the same node as any of the memory they map. (Though they probably
will be.)
>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> ---
> arch/x86/include/asm/kvm_host.h | 2 +-
> arch/x86/kvm/mmu/mmu.c | 63 ++++++++++++++++++++++-----------
> arch/x86/kvm/mmu/paging_tmpl.h | 4 +--
> arch/x86/kvm/mmu/tdp_mmu.c | 11 +++---
> virt/kvm/kvm_main.c | 2 +-
> 5 files changed, 53 insertions(+), 29 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 293994fabae3..b1f319ad6f89 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -782,7 +782,7 @@ struct kvm_vcpu_arch {
> struct kvm_mmu *walk_mmu;
>
> struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> - struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> + struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];
> struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> struct kvm_mmu_memory_cache mmu_page_header_cache;
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 23a3b82b2384..511c6ef265ee 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -677,24 +677,29 @@ static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
>
> static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> {
> - int r;
> + int r, nid;
>
> /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
> r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> if (r)
> return r;
> - r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> - &vcpu->arch.mmu_shadow_page_cache_lock,
> - PT64_ROOT_MAX_LEVEL);
> - if (r)
> - return r;
> +
> + for_each_online_node(nid) {
> + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
> + &vcpu->arch.mmu_shadow_page_cache_lock,
> + PT64_ROOT_MAX_LEVEL);
> + if (r)
> + return r;
> + }
> +
> if (maybe_indirect) {
> r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
> PT64_ROOT_MAX_LEVEL);
> if (r)
> return r;
> }
> +
> return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
> PT64_ROOT_MAX_LEVEL);
> }
> @@ -715,9 +720,14 @@ static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
>
> static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> {
> + int nid;
> +
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> - mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> - &vcpu->arch.mmu_shadow_page_cache_lock);
> +
> + for_each_node(nid)
> + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
> + &vcpu->arch.mmu_shadow_page_cache_lock);
> +
Was just trying to think if there could be any issue with memory
leakage if the online nodes changed, though IDK if any hardware does
that.
Still, it might be more robust to use ARRAY_SIZE and cover the whole array.
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> }
> @@ -2256,11 +2266,12 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
>
> static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> gfn_t gfn,
> - union kvm_mmu_page_role role)
> + union kvm_mmu_page_role role,
> + int nid)
> {
> struct shadow_page_caches caches = {
> .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> - .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> + .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache[nid],
> .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> .shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
> };
> @@ -2316,15 +2327,19 @@ static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
>
> static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
> u64 *sptep, gfn_t gfn,
> - bool direct, unsigned int access)
> + bool direct, unsigned int access,
> + kvm_pfn_t pfn)
> {
> union kvm_mmu_page_role role;
> + int nid;
>
> if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
> return ERR_PTR(-EEXIST);
>
> role = kvm_mmu_child_role(sptep, direct, access);
> - return kvm_mmu_get_shadow_page(vcpu, gfn, role);
> + nid = kvm_pfn_to_page_table_nid(pfn);
> +
> + return kvm_mmu_get_shadow_page(vcpu, gfn, role, nid);
> }
>
> static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
> @@ -3208,7 +3223,8 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> if (it.level == fault->goal_level)
> break;
>
> - sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
> + sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true,
> + ACC_ALL, fault->pfn);
> if (sp == ERR_PTR(-EEXIST))
> continue;
>
> @@ -3636,7 +3652,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
> WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
> WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
>
> - sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
> + sp = kvm_mmu_get_shadow_page(vcpu, gfn, role, numa_mem_id());
> ++sp->root_count;
>
> return __pa(sp->spt);
> @@ -5952,7 +5968,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
>
> int kvm_mmu_create(struct kvm_vcpu *vcpu)
> {
> - int ret;
> + int ret, nid;
>
> INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache,
> pte_list_desc_cache, NUMA_NO_NODE);
> @@ -5960,8 +5976,9 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache,
> mmu_page_header_cache, NUMA_NO_NODE);
>
> - INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache,
> - NULL, NUMA_NO_NODE);
> + for_each_node(nid)
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache[nid],
> + NULL, nid);
> spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
>
> vcpu->arch.mmu = &vcpu->arch.root_mmu;
> @@ -6692,13 +6709,17 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> }
>
> static unsigned long mmu_shrink_cache(struct kvm_mmu_memory_cache *cache,
> + int cache_count,
> spinlock_t *cache_lock)
> {
> unsigned long freed = 0;
> + int nid;
>
> spin_lock(cache_lock);
> - if (cache->nobjs)
> - freed = kvm_mmu_empty_memory_cache(cache);
> + for (nid = 0; nid < cache_count; nid++) {
> + if (node_online(nid) && cache[nid].nobjs)
Is there any reason to keep the cache if !node_online(nid)?
Actually, I'd also just drop the cache_count argument and always
iterate over the entire array, only checking nobjs. There's no
guarantee I'm aware of that the set of nodes has a sequential series
of IDs starting at 0 and you'd get a bug if that wasn't the case since
it only iterates to nid < cache_count here but some of the earlier
nids might not have been online.
> + freed += kvm_mmu_empty_memory_cache(&cache[nid]);
> + }
> spin_unlock(cache_lock);
> return freed;
> }
> @@ -6721,13 +6742,15 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> list_move_tail(&kvm->vm_list, &vm_list);
>
> freed += mmu_shrink_cache(&kvm->arch.split_shadow_page_cache,
> + 1,
So lonely.
One.
All by itself,
with only a coma for company.
NIT: This could be merged to the previous or subsequent lines.
> &kvm->arch.split_shadow_page_cache_lock);
>
> if (freed >= sc->nr_to_scan)
> break;
>
> kvm_for_each_vcpu(i, vcpu, kvm) {
> - freed += mmu_shrink_cache(&vcpu->arch.mmu_shadow_page_cache,
> + freed += mmu_shrink_cache(vcpu->arch.mmu_shadow_page_cache,
> + MAX_NUMNODES,
> &vcpu->arch.mmu_shadow_page_cache_lock);
> }
>
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index e5662dbd519c..1ceca62ec4cf 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -652,7 +652,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> table_gfn = gw->table_gfn[it.level - 2];
> access = gw->pt_access[it.level - 2];
> sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
> - false, access);
> + false, access, fault->pfn);
>
> if (sp != ERR_PTR(-EEXIST)) {
> /*
> @@ -708,7 +708,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> validate_direct_spte(vcpu, it.sptep, direct_access);
>
> sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
> - true, direct_access);
> + true, direct_access, fault->pfn);
> if (sp == ERR_PTR(-EEXIST))
> continue;
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 376b8dceb3f9..b5abae2366dd 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -259,12 +259,12 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> kvm_mmu_page_as_id(_root) != _as_id) { \
> } else
>
> -static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> +static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, int nid)
> {
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> - sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> + sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache[nid],
> &vcpu->arch.mmu_shadow_page_cache_lock);
>
> return sp;
> @@ -317,7 +317,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> goto out;
> }
>
> - root = tdp_mmu_alloc_sp(vcpu);
> + root = tdp_mmu_alloc_sp(vcpu, numa_mem_id());
Might be worth calling out somewhere that the root page is just
allocated based on where the thread allocating it runs.
> tdp_mmu_init_sp(root, NULL, 0, role);
>
> refcount_set(&root->tdp_mmu_root_count, 1);
> @@ -1149,7 +1149,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> struct kvm *kvm = vcpu->kvm;
> struct tdp_iter iter;
> struct kvm_mmu_page *sp;
> - int ret = RET_PF_RETRY;
> + int ret = RET_PF_RETRY, nid;
>
> kvm_mmu_hugepage_adjust(vcpu, fault);
>
> @@ -1178,11 +1178,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> !is_large_pte(iter.old_spte))
> continue;
>
> + nid = kvm_pfn_to_page_table_nid(fault->pfn);
> /*
> * The SPTE is either non-present or points to a huge page that
> * needs to be split.
> */
> - sp = tdp_mmu_alloc_sp(vcpu);
> + sp = tdp_mmu_alloc_sp(vcpu, nid);
> tdp_mmu_init_child_sp(sp, &iter);
>
> sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d96c8146e9ba..4f3db7ffeba8 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -415,7 +415,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
> if (mc->kmem_cache)
> return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
> else
> - return (void *)__get_free_page(gfp_flags);
> + return kvm_mmu_get_free_page(mc->node, gfp_flags);
You could do part of this change in the commit that introduced
kvm_mmu_get_free_page too.
> }
>
> int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> --
> 2.39.0.314.g84b9a713c41-goog
>
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 7/9] KVM: x86/mmu: Allocate page table's pages on NUMA node of the underlying pages
2022-12-27 19:34 ` Ben Gardon
@ 2022-12-28 22:08 ` Vipin Sharma
2022-12-29 18:20 ` Ben Gardon
0 siblings, 1 reply; 46+ messages in thread
From: Vipin Sharma @ 2022-12-28 22:08 UTC (permalink / raw)
To: Ben Gardon; +Cc: seanjc, pbonzini, dmatlack, kvm, linux-kernel
On Tue, Dec 27, 2022 at 11:34 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
> >
> > Page table pages of a VM are currently allocated based on the current
> > task's NUMA node or its mempolicy. This can cause suboptimal remote
> > accesses by the vCPU if it is accessing physical pages local to its NUMA
> > node but the page table pages mapping those physcal pages were created
> > by some other vCPU which was on different NUMA node or had different
> > policy.
> >
> > Allocate page table pages on the same NUMA node where underlying
> > physical page exists. Page table at level 5, 4, and 3 might not end up
> > on the same NUMA node as they can span multiple NUMA nodes.
>
> A page table at any level could map memory spanning multiple NUMA
> nodes, it just becomes more likely at higher levels.
> We're only guaranteed that a page table maps memory all on the same
> node if it's a split hugepage.
Even in this case, it is a best effort.
> This change can only guarantee that the page table pages are allocated
> on the same node as at least some of the memory they map.
> Of course in practice, the above is absolutely correct since we'd
> expect to have multi-GB continuous ranges of GFNs allocated on the
> same node via huge pages.
>
> And since the root pages are allocated based only on where the thread
> allocating them is running, they're not actually guaranteed to be on
> the same node as any of the memory they map. (Though they probably
> will be.)
>
I will add more details in the commit in the next version.
> >
> > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > ---
> > arch/x86/include/asm/kvm_host.h | 2 +-
> > arch/x86/kvm/mmu/mmu.c | 63 ++++++++++++++++++++++-----------
> > arch/x86/kvm/mmu/paging_tmpl.h | 4 +--
> > arch/x86/kvm/mmu/tdp_mmu.c | 11 +++---
> > virt/kvm/kvm_main.c | 2 +-
> > 5 files changed, 53 insertions(+), 29 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 293994fabae3..b1f319ad6f89 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -782,7 +782,7 @@ struct kvm_vcpu_arch {
> > struct kvm_mmu *walk_mmu;
> >
> > struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> > - struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> > + struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];
> > struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> > struct kvm_mmu_memory_cache mmu_page_header_cache;
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 23a3b82b2384..511c6ef265ee 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -677,24 +677,29 @@ static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> >
> > static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > {
> > - int r;
> > + int r, nid;
> >
> > /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
> > r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> > 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> > if (r)
> > return r;
> > - r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > - &vcpu->arch.mmu_shadow_page_cache_lock,
> > - PT64_ROOT_MAX_LEVEL);
> > - if (r)
> > - return r;
> > +
> > + for_each_online_node(nid) {
> > + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
> > + &vcpu->arch.mmu_shadow_page_cache_lock,
> > + PT64_ROOT_MAX_LEVEL);
> > + if (r)
> > + return r;
> > + }
> > +
> > if (maybe_indirect) {
> > r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
> > PT64_ROOT_MAX_LEVEL);
> > if (r)
> > return r;
> > }
> > +
> > return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
> > PT64_ROOT_MAX_LEVEL);
> > }
> > @@ -715,9 +720,14 @@ static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> >
> > static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> > {
> > + int nid;
> > +
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> > - mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > - &vcpu->arch.mmu_shadow_page_cache_lock);
> > +
> > + for_each_node(nid)
> > + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
> > + &vcpu->arch.mmu_shadow_page_cache_lock);
> > +
>
> Was just trying to think if there could be any issue with memory
> leakage if the online nodes changed, though IDK if any hardware does
> that.
> Still, it might be more robust to use ARRAY_SIZE and cover the whole array.
for_each_node() goes through all of the possible nodes on the system,
whereas, for_each_online_node() goes through only online nodes.
Current code seems right to me, let me know if I am overlooking
something.
>
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> > }
> > @@ -2256,11 +2266,12 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
> >
> > static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> > gfn_t gfn,
> > - union kvm_mmu_page_role role)
> > + union kvm_mmu_page_role role,
> > + int nid)
> > {
> > struct shadow_page_caches caches = {
> > .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> > - .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> > + .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache[nid],
> > .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> > .shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
> > };
> > @@ -2316,15 +2327,19 @@ static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
> >
> > static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
> > u64 *sptep, gfn_t gfn,
> > - bool direct, unsigned int access)
> > + bool direct, unsigned int access,
> > + kvm_pfn_t pfn)
> > {
> > union kvm_mmu_page_role role;
> > + int nid;
> >
> > if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
> > return ERR_PTR(-EEXIST);
> >
> > role = kvm_mmu_child_role(sptep, direct, access);
> > - return kvm_mmu_get_shadow_page(vcpu, gfn, role);
> > + nid = kvm_pfn_to_page_table_nid(pfn);
> > +
> > + return kvm_mmu_get_shadow_page(vcpu, gfn, role, nid);
> > }
> >
> > static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
> > @@ -3208,7 +3223,8 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > if (it.level == fault->goal_level)
> > break;
> >
> > - sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
> > + sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true,
> > + ACC_ALL, fault->pfn);
> > if (sp == ERR_PTR(-EEXIST))
> > continue;
> >
> > @@ -3636,7 +3652,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
> > WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
> > WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
> >
> > - sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
> > + sp = kvm_mmu_get_shadow_page(vcpu, gfn, role, numa_mem_id());
> > ++sp->root_count;
> >
> > return __pa(sp->spt);
> > @@ -5952,7 +5968,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> >
> > int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > {
> > - int ret;
> > + int ret, nid;
> >
> > INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache,
> > pte_list_desc_cache, NUMA_NO_NODE);
> > @@ -5960,8 +5976,9 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache,
> > mmu_page_header_cache, NUMA_NO_NODE);
> >
> > - INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache,
> > - NULL, NUMA_NO_NODE);
> > + for_each_node(nid)
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache[nid],
> > + NULL, nid);
> > spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
> >
> > vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > @@ -6692,13 +6709,17 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> > }
> >
> > static unsigned long mmu_shrink_cache(struct kvm_mmu_memory_cache *cache,
> > + int cache_count,
> > spinlock_t *cache_lock)
> > {
> > unsigned long freed = 0;
> > + int nid;
> >
> > spin_lock(cache_lock);
> > - if (cache->nobjs)
> > - freed = kvm_mmu_empty_memory_cache(cache);
> > + for (nid = 0; nid < cache_count; nid++) {
> > + if (node_online(nid) && cache[nid].nobjs)
>
> Is there any reason to keep the cache if !node_online(nid)?
> Actually, I'd also just drop the cache_count argument and always
> iterate over the entire array, only checking nobjs. There's no
> guarantee I'm aware of that the set of nodes has a sequential series
> of IDs starting at 0 and you'd get a bug if that wasn't the case since
> it only iterates to nid < cache_count here but some of the earlier
> nids might not have been online.
>
This is just temporary and will be removed in the next patch in the series.
mmu_shrink_cache() is used for both split_shadow_page_cache (single
object) and mmu_shadow_page_cache[MAX_NUMNODES].
In the next patch of this series, I used for_each_online_node(nid); I
will change it to for_each_node() in the next version.
> > + freed += kvm_mmu_empty_memory_cache(&cache[nid]);
> > + }
> > spin_unlock(cache_lock);
> > return freed;
> > }
> > @@ -6721,13 +6742,15 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > list_move_tail(&kvm->vm_list, &vm_list);
> >
> > freed += mmu_shrink_cache(&kvm->arch.split_shadow_page_cache,
> > + 1,
>
> So lonely.
> One.
> All by itself,
> with only a coma for company.
>
> NIT: This could be merged to the previous or subsequent lines.
This is a strong and independent '1'.
>
> > &kvm->arch.split_shadow_page_cache_lock);
> >
> > if (freed >= sc->nr_to_scan)
> > break;
> >
> > kvm_for_each_vcpu(i, vcpu, kvm) {
> > - freed += mmu_shrink_cache(&vcpu->arch.mmu_shadow_page_cache,
> > + freed += mmu_shrink_cache(vcpu->arch.mmu_shadow_page_cache,
> > + MAX_NUMNODES,
> > &vcpu->arch.mmu_shadow_page_cache_lock);
> > }
> >
> > diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> > index e5662dbd519c..1ceca62ec4cf 100644
> > --- a/arch/x86/kvm/mmu/paging_tmpl.h
> > +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> > @@ -652,7 +652,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > table_gfn = gw->table_gfn[it.level - 2];
> > access = gw->pt_access[it.level - 2];
> > sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
> > - false, access);
> > + false, access, fault->pfn);
> >
> > if (sp != ERR_PTR(-EEXIST)) {
> > /*
> > @@ -708,7 +708,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > validate_direct_spte(vcpu, it.sptep, direct_access);
> >
> > sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
> > - true, direct_access);
> > + true, direct_access, fault->pfn);
> > if (sp == ERR_PTR(-EEXIST))
> > continue;
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 376b8dceb3f9..b5abae2366dd 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -259,12 +259,12 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> > kvm_mmu_page_as_id(_root) != _as_id) { \
> > } else
> >
> > -static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> > +static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, int nid)
> > {
> > struct kvm_mmu_page *sp;
> >
> > sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> > - sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> > + sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache[nid],
> > &vcpu->arch.mmu_shadow_page_cache_lock);
> >
> > return sp;
> > @@ -317,7 +317,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> > goto out;
> > }
> >
> > - root = tdp_mmu_alloc_sp(vcpu);
> > + root = tdp_mmu_alloc_sp(vcpu, numa_mem_id());
>
> Might be worth calling out somewhere that the root page is just
> allocated based on where the thread allocating it runs.
>
How about a comment just up here, or do you prefer one at tdp_mmu_roots
in struct kvm_arch{}?
> > tdp_mmu_init_sp(root, NULL, 0, role);
> >
> > refcount_set(&root->tdp_mmu_root_count, 1);
> > @@ -1149,7 +1149,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > struct kvm *kvm = vcpu->kvm;
> > struct tdp_iter iter;
> > struct kvm_mmu_page *sp;
> > - int ret = RET_PF_RETRY;
> > + int ret = RET_PF_RETRY, nid;
> >
> > kvm_mmu_hugepage_adjust(vcpu, fault);
> >
> > @@ -1178,11 +1178,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > !is_large_pte(iter.old_spte))
> > continue;
> >
> > + nid = kvm_pfn_to_page_table_nid(fault->pfn);
> > /*
> > * The SPTE is either non-present or points to a huge page that
> > * needs to be split.
> > */
> > - sp = tdp_mmu_alloc_sp(vcpu);
> > + sp = tdp_mmu_alloc_sp(vcpu, nid);
> > tdp_mmu_init_child_sp(sp, &iter);
> >
> > sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index d96c8146e9ba..4f3db7ffeba8 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -415,7 +415,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
> > if (mc->kmem_cache)
> > return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
> > else
> > - return (void *)__get_free_page(gfp_flags);
> > + return kvm_mmu_get_free_page(mc->node, gfp_flags);
>
> You could do part of this change in the commit that introduced
> kvm_mmu_get_free_page too.
Yeah, I can do it there as well. No strong opinions. I will update in
the next version.
> > }
> >
> > int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 7/9] KVM: x86/mmu: Allocate page table's pages on NUMA node of the underlying pages
2022-12-28 22:08 ` Vipin Sharma
@ 2022-12-29 18:20 ` Ben Gardon
0 siblings, 0 replies; 46+ messages in thread
From: Ben Gardon @ 2022-12-29 18:20 UTC (permalink / raw)
To: Vipin Sharma; +Cc: seanjc, pbonzini, dmatlack, kvm, linux-kernel
On Wed, Dec 28, 2022 at 2:08 PM Vipin Sharma <vipinsh@google.com> wrote:
>
> On Tue, Dec 27, 2022 at 11:34 AM Ben Gardon <bgardon@google.com> wrote:
> >
> > On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
> > >
> > > Page table pages of a VM are currently allocated based on the current
> > > task's NUMA node or its mempolicy. This can cause suboptimal remote
> > > accesses by the vCPU if it is accessing physical pages local to its NUMA
> > > node but the page table pages mapping those physcal pages were created
> > > by some other vCPU which was on different NUMA node or had different
> > > policy.
> > >
> > > Allocate page table pages on the same NUMA node where underlying
> > > physical page exists. Page table at level 5, 4, and 3 might not end up
> > > on the same NUMA node as they can span multiple NUMA nodes.
> >
> > A page table at any level could map memory spanning multiple NUMA
> > nodes, it just becomes more likely at higher levels.
> > We're only guaranteed that a page table maps memory all on the same
> > node if it's a split hugepage.
>
> Even in this case, it is a best effort.
>
> > This change can only guarantee that the page table pages are allocated
> > on the same node as at least some of the memory they map.
> > Of course in practice, the above is absolutely correct since we'd
> > expect to have multi-GB continuous ranges of GFNs allocated on the
> > same node via huge pages.
> >
> > And since the root pages are allocated based only on where the thread
> > allocating them is running, they're not actually guaranteed to be on
> > the same node as any of the memory they map. (Though they probably
> > will be.)
> >
>
> I will add more details in the commit in the next version.
>
> > >
> > > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > > ---
> > > arch/x86/include/asm/kvm_host.h | 2 +-
> > > arch/x86/kvm/mmu/mmu.c | 63 ++++++++++++++++++++++-----------
> > > arch/x86/kvm/mmu/paging_tmpl.h | 4 +--
> > > arch/x86/kvm/mmu/tdp_mmu.c | 11 +++---
> > > virt/kvm/kvm_main.c | 2 +-
> > > 5 files changed, 53 insertions(+), 29 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 293994fabae3..b1f319ad6f89 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -782,7 +782,7 @@ struct kvm_vcpu_arch {
> > > struct kvm_mmu *walk_mmu;
> > >
> > > struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> > > - struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> > > + struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];
> > > struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> > > struct kvm_mmu_memory_cache mmu_page_header_cache;
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 23a3b82b2384..511c6ef265ee 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -677,24 +677,29 @@ static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > >
> > > static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > > {
> > > - int r;
> > > + int r, nid;
> > >
> > > /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
> > > r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> > > 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> > > if (r)
> > > return r;
> > > - r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > > - &vcpu->arch.mmu_shadow_page_cache_lock,
> > > - PT64_ROOT_MAX_LEVEL);
> > > - if (r)
> > > - return r;
> > > +
> > > + for_each_online_node(nid) {
> > > + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
> > > + &vcpu->arch.mmu_shadow_page_cache_lock,
> > > + PT64_ROOT_MAX_LEVEL);
> > > + if (r)
> > > + return r;
> > > + }
> > > +
> > > if (maybe_indirect) {
> > > r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
> > > PT64_ROOT_MAX_LEVEL);
> > > if (r)
> > > return r;
> > > }
> > > +
> > > return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
> > > PT64_ROOT_MAX_LEVEL);
> > > }
> > > @@ -715,9 +720,14 @@ static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > >
> > > static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> > > {
> > > + int nid;
> > > +
> > > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> > > - mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > > - &vcpu->arch.mmu_shadow_page_cache_lock);
> > > +
> > > + for_each_node(nid)
> > > + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
> > > + &vcpu->arch.mmu_shadow_page_cache_lock);
> > > +
> >
> > Was just trying to think if there could be any issue with memory
> > leakage if the online nodes changed, though IDK if any hardware does
> > that.
> > Still, it might be more robust to use ARRAY_SIZE and cover the whole array.
>
> for_each_node() goes through all of the possible nodes on the system,
> whereas, for_each_online_node() goes through only online nodes.
> Current code seems right to me, let me know if I am overlooking
> something.
Ah okay, I didn't see the distinction. That sounds good to me.
>
> >
> > > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> > > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> > > }
> > > @@ -2256,11 +2266,12 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
> > >
> > > static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> > > gfn_t gfn,
> > > - union kvm_mmu_page_role role)
> > > + union kvm_mmu_page_role role,
> > > + int nid)
> > > {
> > > struct shadow_page_caches caches = {
> > > .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> > > - .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> > > + .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache[nid],
> > > .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> > > .shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
> > > };
> > > @@ -2316,15 +2327,19 @@ static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
> > >
> > > static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
> > > u64 *sptep, gfn_t gfn,
> > > - bool direct, unsigned int access)
> > > + bool direct, unsigned int access,
> > > + kvm_pfn_t pfn)
> > > {
> > > union kvm_mmu_page_role role;
> > > + int nid;
> > >
> > > if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
> > > return ERR_PTR(-EEXIST);
> > >
> > > role = kvm_mmu_child_role(sptep, direct, access);
> > > - return kvm_mmu_get_shadow_page(vcpu, gfn, role);
> > > + nid = kvm_pfn_to_page_table_nid(pfn);
> > > +
> > > + return kvm_mmu_get_shadow_page(vcpu, gfn, role, nid);
> > > }
> > >
> > > static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
> > > @@ -3208,7 +3223,8 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > > if (it.level == fault->goal_level)
> > > break;
> > >
> > > - sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
> > > + sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true,
> > > + ACC_ALL, fault->pfn);
> > > if (sp == ERR_PTR(-EEXIST))
> > > continue;
> > >
> > > @@ -3636,7 +3652,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
> > > WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
> > > WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
> > >
> > > - sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
> > > + sp = kvm_mmu_get_shadow_page(vcpu, gfn, role, numa_mem_id());
> > > ++sp->root_count;
> > >
> > > return __pa(sp->spt);
> > > @@ -5952,7 +5968,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> > >
> > > int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > > {
> > > - int ret;
> > > + int ret, nid;
> > >
> > > INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache,
> > > pte_list_desc_cache, NUMA_NO_NODE);
> > > @@ -5960,8 +5976,9 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > > INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache,
> > > mmu_page_header_cache, NUMA_NO_NODE);
> > >
> > > - INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache,
> > > - NULL, NUMA_NO_NODE);
> > > + for_each_node(nid)
> > > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache[nid],
> > > + NULL, nid);
> > > spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
> > >
> > > vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > > @@ -6692,13 +6709,17 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> > > }
> > >
> > > static unsigned long mmu_shrink_cache(struct kvm_mmu_memory_cache *cache,
> > > + int cache_count,
> > > spinlock_t *cache_lock)
> > > {
> > > unsigned long freed = 0;
> > > + int nid;
> > >
> > > spin_lock(cache_lock);
> > > - if (cache->nobjs)
> > > - freed = kvm_mmu_empty_memory_cache(cache);
> > > + for (nid = 0; nid < cache_count; nid++) {
> > > + if (node_online(nid) && cache[nid].nobjs)
> >
> > Is there any reason to keep the cache if !node_online(nid)?
> > Actually, I'd also just drop the cache_count argument and always
> > iterate over the entire array, only checking nobjs. There's no
> > guarantee I'm aware of that the set of nodes has a sequential series
> > of IDs starting at 0 and you'd get a bug if that wasn't the case since
> > it only iterates to nid < cache_count here but some of the earlier
> > nids might not have been online.
> >
>
> This is just temporary and will be removed in the next patch in the series.
>
> mmu_shrink_cache() is used for both split_shadow_page_cache (single
> object) and mmu_shadow_page_cache[MAX_NUMANODES].
>
> In next patch of this series, I used for_each_online_node(nide), I
> will change it to for_each_node() in the next version.
>
> > > + freed += kvm_mmu_empty_memory_cache(&cache[nid]);
> > > + }
> > > spin_unlock(cache_lock);
> > > return freed;
> > > }
> > > @@ -6721,13 +6742,15 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > > list_move_tail(&kvm->vm_list, &vm_list);
> > >
> > > freed += mmu_shrink_cache(&kvm->arch.split_shadow_page_cache,
> > > + 1,
> >
> > So lonely.
> > One.
> > All by itself,
> > with only a coma for company.
> >
> > NIT: This could be merged to the previous or subsequent lines.
>
> This is a strong and independent '1'.
>
> >
> > > &kvm->arch.split_shadow_page_cache_lock);
> > >
> > > if (freed >= sc->nr_to_scan)
> > > break;
> > >
> > > kvm_for_each_vcpu(i, vcpu, kvm) {
> > > - freed += mmu_shrink_cache(&vcpu->arch.mmu_shadow_page_cache,
> > > + freed += mmu_shrink_cache(vcpu->arch.mmu_shadow_page_cache,
> > > + MAX_NUMNODES,
> > > &vcpu->arch.mmu_shadow_page_cache_lock);
> > > }
> > >
> > > diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> > > index e5662dbd519c..1ceca62ec4cf 100644
> > > --- a/arch/x86/kvm/mmu/paging_tmpl.h
> > > +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> > > @@ -652,7 +652,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > > table_gfn = gw->table_gfn[it.level - 2];
> > > access = gw->pt_access[it.level - 2];
> > > sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
> > > - false, access);
> > > + false, access, fault->pfn);
> > >
> > > if (sp != ERR_PTR(-EEXIST)) {
> > > /*
> > > @@ -708,7 +708,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > > validate_direct_spte(vcpu, it.sptep, direct_access);
> > >
> > > sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
> > > - true, direct_access);
> > > + true, direct_access, fault->pfn);
> > > if (sp == ERR_PTR(-EEXIST))
> > > continue;
> > >
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 376b8dceb3f9..b5abae2366dd 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -259,12 +259,12 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> > > kvm_mmu_page_as_id(_root) != _as_id) { \
> > > } else
> > >
> > > -static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> > > +static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, int nid)
> > > {
> > > struct kvm_mmu_page *sp;
> > >
> > > sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> > > - sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> > > + sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache[nid],
> > > &vcpu->arch.mmu_shadow_page_cache_lock);
> > >
> > > return sp;
> > > @@ -317,7 +317,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> > > goto out;
> > > }
> > >
> > > - root = tdp_mmu_alloc_sp(vcpu);
> > > + root = tdp_mmu_alloc_sp(vcpu, numa_mem_id());
> >
> > Might be worth calling out somewhere that the root page is just
> > allocated based on where the thread allocating it runs.
> >
>
> How about a comment just up here or do you prefer at tdp_mmu_roots in
> struct kvm_arch{}?
Here or just in the commit description or cover letter.
Thanks!
>
> > > tdp_mmu_init_sp(root, NULL, 0, role);
> > >
> > > refcount_set(&root->tdp_mmu_root_count, 1);
> > > @@ -1149,7 +1149,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > > struct kvm *kvm = vcpu->kvm;
> > > struct tdp_iter iter;
> > > struct kvm_mmu_page *sp;
> > > - int ret = RET_PF_RETRY;
> > > + int ret = RET_PF_RETRY, nid;
> > >
> > > kvm_mmu_hugepage_adjust(vcpu, fault);
> > >
> > > @@ -1178,11 +1178,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > > !is_large_pte(iter.old_spte))
> > > continue;
> > >
> > > + nid = kvm_pfn_to_page_table_nid(fault->pfn);
> > > /*
> > > * The SPTE is either non-present or points to a huge page that
> > > * needs to be split.
> > > */
> > > - sp = tdp_mmu_alloc_sp(vcpu);
> > > + sp = tdp_mmu_alloc_sp(vcpu, nid);
> > > tdp_mmu_init_child_sp(sp, &iter);
> > >
> > > sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index d96c8146e9ba..4f3db7ffeba8 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -415,7 +415,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
> > > if (mc->kmem_cache)
> > > return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
> > > else
> > > - return (void *)__get_free_page(gfp_flags);
> > > + return kvm_mmu_get_free_page(mc->node, gfp_flags);
> >
> > You could do part of this change in the commit that introduced
> > kvm_mmu_get_free_page too.
>
> Yeah, I can do it there as well. No strong opinions. I will update in
> the next version.
>
> > > }
> > >
> > > int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> > > --
> > > 2.39.0.314.g84b9a713c41-goog
> > >
^ permalink raw reply [flat|nested] 46+ messages in thread
* [Patch v3 8/9] KVM: x86/mmu: Make split_shadow_page_cache NUMA aware
2022-12-22 2:34 [Patch v3 0/9] NUMA aware page table's pages allocation Vipin Sharma
` (6 preceding siblings ...)
2022-12-22 2:34 ` [Patch v3 7/9] KVM: x86/mmu: Allocate page table's pages on NUMA node of the underlying pages Vipin Sharma
@ 2022-12-22 2:34 ` Vipin Sharma
2022-12-27 19:42 ` Ben Gardon
2022-12-29 23:18 ` David Matlack
2022-12-22 2:34 ` [Patch v3 9/9] KVM: x86/mmu: Reduce default cache size in KVM from 40 to PT64_ROOT_MAX_LEVEL Vipin Sharma
8 siblings, 2 replies; 46+ messages in thread
From: Vipin Sharma @ 2022-12-22 2:34 UTC (permalink / raw)
To: seanjc, pbonzini, bgardon, dmatlack; +Cc: kvm, linux-kernel, Vipin Sharma
Make split_shadow_page_cache NUMA aware and allocate page table's pages
during the split based on the underlying physical page's NUMA node.
Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 50 ++++++++++++++++++---------------
2 files changed, 29 insertions(+), 23 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b1f319ad6f89..7b3f36ae37a4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1410,7 +1410,7 @@ struct kvm_arch {
*
* Protected by kvm->slots_lock.
*/
- struct kvm_mmu_memory_cache split_shadow_page_cache;
+ struct kvm_mmu_memory_cache split_shadow_page_cache[MAX_NUMNODES];
struct kvm_mmu_memory_cache split_page_header_cache;
/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 511c6ef265ee..7454bfc49a51 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6126,7 +6126,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
int kvm_mmu_init_vm(struct kvm *kvm)
{
struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
- int r;
+ int r, nid;
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
@@ -6145,8 +6145,9 @@ int kvm_mmu_init_vm(struct kvm *kvm)
INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache,
mmu_page_header_cache, NUMA_NO_NODE);
- INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache,
- NULL, NUMA_NO_NODE);
+ for_each_node(nid)
+ INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache[nid],
+ NULL, NUMA_NO_NODE);
spin_lock_init(&kvm->arch.split_shadow_page_cache_lock);
INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache,
@@ -6157,10 +6158,13 @@ int kvm_mmu_init_vm(struct kvm *kvm)
static void mmu_free_vm_memory_caches(struct kvm *kvm)
{
+ int nid;
+
kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
- mmu_free_sp_memory_cache(&kvm->arch.split_shadow_page_cache,
- &kvm->arch.split_shadow_page_cache_lock);
+ for_each_node(nid)
+ mmu_free_sp_memory_cache(&kvm->arch.split_shadow_page_cache[nid],
+ &kvm->arch.split_shadow_page_cache_lock);
}
void kvm_mmu_uninit_vm(struct kvm *kvm)
@@ -6269,7 +6273,7 @@ static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
}
-static bool need_topup_split_caches_or_resched(struct kvm *kvm)
+static bool need_topup_split_caches_or_resched(struct kvm *kvm, int nid)
{
if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
return true;
@@ -6281,10 +6285,10 @@ static bool need_topup_split_caches_or_resched(struct kvm *kvm)
*/
return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_MIN_NR_OBJECTS) ||
need_topup(&kvm->arch.split_page_header_cache, 1) ||
- need_topup(&kvm->arch.split_shadow_page_cache, 1);
+ need_topup(&kvm->arch.split_shadow_page_cache[nid], 1);
}
-static int topup_split_caches(struct kvm *kvm)
+static int topup_split_caches(struct kvm *kvm, int nid)
{
/*
* Allocating rmap list entries when splitting huge pages for nested
@@ -6314,18 +6318,21 @@ static int topup_split_caches(struct kvm *kvm)
if (r)
return r;
- return mmu_topup_sp_memory_cache(&kvm->arch.split_shadow_page_cache,
+ return mmu_topup_sp_memory_cache(&kvm->arch.split_shadow_page_cache[nid],
&kvm->arch.split_shadow_page_cache_lock,
1);
}
-static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
+static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm,
+ u64 *huge_sptep,
+ u64 huge_spte)
{
struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
struct shadow_page_caches caches = {};
union kvm_mmu_page_role role;
unsigned int access;
gfn_t gfn;
+ int nid;
gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
access = kvm_mmu_page_get_access(huge_sp, spte_index(huge_sptep));
@@ -6338,9 +6345,11 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
*/
role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access);
+ nid = kvm_pfn_to_page_table_nid(spte_to_pfn(huge_spte));
+
/* Direct SPs do not require a shadowed_info_cache. */
caches.page_header_cache = &kvm->arch.split_page_header_cache;
- caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
+ caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache[nid];
caches.shadow_page_cache_lock = &kvm->arch.split_shadow_page_cache_lock;
/* Safe to pass NULL for vCPU since requesting a direct SP. */
@@ -6360,7 +6369,7 @@ static void shadow_mmu_split_huge_page(struct kvm *kvm,
gfn_t gfn;
int index;
- sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep);
+ sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep, huge_spte);
for (index = 0; index < SPTE_ENT_PER_PAGE; index++) {
sptep = &sp->spt[index];
@@ -6398,7 +6407,7 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
u64 *huge_sptep)
{
struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
- int level, r = 0;
+ int level, r = 0, nid;
gfn_t gfn;
u64 spte;
@@ -6406,13 +6415,14 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
level = huge_sp->role.level;
spte = *huge_sptep;
+ nid = kvm_pfn_to_page_table_nid(spte_to_pfn(spte));
if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES) {
r = -ENOSPC;
goto out;
}
- if (need_topup_split_caches_or_resched(kvm)) {
+ if (need_topup_split_caches_or_resched(kvm, nid)) {
write_unlock(&kvm->mmu_lock);
cond_resched();
/*
@@ -6420,7 +6430,7 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
* rmap iterator should be restarted because the MMU lock was
* dropped.
*/
- r = topup_split_caches(kvm) ?: -EAGAIN;
+ r = topup_split_caches(kvm, nid) ?: -EAGAIN;
write_lock(&kvm->mmu_lock);
goto out;
}
@@ -6709,17 +6719,15 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
}
static unsigned long mmu_shrink_cache(struct kvm_mmu_memory_cache *cache,
- int cache_count,
spinlock_t *cache_lock)
{
unsigned long freed = 0;
int nid;
spin_lock(cache_lock);
- for (nid = 0; nid < cache_count; nid++) {
- if (node_online(nid) && cache[nid].nobjs)
+ for_each_online_node(nid)
+ if (cache[nid].nobjs)
freed += kvm_mmu_empty_memory_cache(&cache[nid]);
- }
spin_unlock(cache_lock);
return freed;
}
@@ -6741,8 +6749,7 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
first_kvm = kvm;
list_move_tail(&kvm->vm_list, &vm_list);
- freed += mmu_shrink_cache(&kvm->arch.split_shadow_page_cache,
- 1,
+ freed += mmu_shrink_cache(kvm->arch.split_shadow_page_cache,
&kvm->arch.split_shadow_page_cache_lock);
if (freed >= sc->nr_to_scan)
@@ -6750,7 +6757,6 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
kvm_for_each_vcpu(i, vcpu, kvm) {
freed += mmu_shrink_cache(vcpu->arch.mmu_shadow_page_cache,
- MAX_NUMNODES,
&vcpu->arch.mmu_shadow_page_cache_lock);
}
--
2.39.0.314.g84b9a713c41-goog
^ permalink raw reply related [flat|nested] 46+ messages in thread* Re: [Patch v3 8/9] KVM: x86/mmu: Make split_shadow_page_cache NUMA aware
2022-12-22 2:34 ` [Patch v3 8/9] KVM: x86/mmu: Make split_shadow_page_cache NUMA aware Vipin Sharma
@ 2022-12-27 19:42 ` Ben Gardon
2022-12-28 22:08 ` Vipin Sharma
2022-12-29 23:18 ` David Matlack
1 sibling, 1 reply; 46+ messages in thread
From: Ben Gardon @ 2022-12-27 19:42 UTC (permalink / raw)
To: Vipin Sharma; +Cc: seanjc, pbonzini, dmatlack, kvm, linux-kernel
On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
>
> Make split_shadow_page_cache NUMA aware and allocate page table's pages
> during the split based on the underlying physical page's NUMA node.
>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> ---
> arch/x86/include/asm/kvm_host.h | 2 +-
> arch/x86/kvm/mmu/mmu.c | 50 ++++++++++++++++++---------------
> 2 files changed, 29 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index b1f319ad6f89..7b3f36ae37a4 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1410,7 +1410,7 @@ struct kvm_arch {
> *
> * Protected by kvm->slots_lock.
> */
> - struct kvm_mmu_memory_cache split_shadow_page_cache;
> + struct kvm_mmu_memory_cache split_shadow_page_cache[MAX_NUMNODES];
> struct kvm_mmu_memory_cache split_page_header_cache;
>
> /*
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 511c6ef265ee..7454bfc49a51 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6126,7 +6126,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> int kvm_mmu_init_vm(struct kvm *kvm)
> {
> struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
> - int r;
> + int r, nid;
>
> INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
> INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
> @@ -6145,8 +6145,9 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache,
> mmu_page_header_cache, NUMA_NO_NODE);
>
> - INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache,
> - NULL, NUMA_NO_NODE);
> + for_each_node(nid)
Again, assuming no one sets CONFIG_NODE_SHIFT to a ridiculous value,
it would probably be fine to initialize the entire array here since
that doesn't take any extra memory and we're not in a super hot path.
> + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache[nid],
> + NULL, NUMA_NO_NODE);
> spin_lock_init(&kvm->arch.split_shadow_page_cache_lock);
>
> INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache,
> @@ -6157,10 +6158,13 @@ int kvm_mmu_init_vm(struct kvm *kvm)
>
> static void mmu_free_vm_memory_caches(struct kvm *kvm)
> {
> + int nid;
> +
> kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
> kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
> - mmu_free_sp_memory_cache(&kvm->arch.split_shadow_page_cache,
> - &kvm->arch.split_shadow_page_cache_lock);
> + for_each_node(nid)
Again, could just iterate over the whole array here.
> + mmu_free_sp_memory_cache(&kvm->arch.split_shadow_page_cache[nid],
> + &kvm->arch.split_shadow_page_cache_lock);
> }
>
> void kvm_mmu_uninit_vm(struct kvm *kvm)
> @@ -6269,7 +6273,7 @@ static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
> return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
> }
>
> -static bool need_topup_split_caches_or_resched(struct kvm *kvm)
> +static bool need_topup_split_caches_or_resched(struct kvm *kvm, int nid)
> {
> if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> return true;
> @@ -6281,10 +6285,10 @@ static bool need_topup_split_caches_or_resched(struct kvm *kvm)
> */
> return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_MIN_NR_OBJECTS) ||
> need_topup(&kvm->arch.split_page_header_cache, 1) ||
> - need_topup(&kvm->arch.split_shadow_page_cache, 1);
> + need_topup(&kvm->arch.split_shadow_page_cache[nid], 1);
> }
>
> -static int topup_split_caches(struct kvm *kvm)
> +static int topup_split_caches(struct kvm *kvm, int nid)
> {
> /*
> * Allocating rmap list entries when splitting huge pages for nested
> @@ -6314,18 +6318,21 @@ static int topup_split_caches(struct kvm *kvm)
> if (r)
> return r;
>
> - return mmu_topup_sp_memory_cache(&kvm->arch.split_shadow_page_cache,
> + return mmu_topup_sp_memory_cache(&kvm->arch.split_shadow_page_cache[nid],
> &kvm->arch.split_shadow_page_cache_lock,
> 1);
> }
>
> -static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
> +static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm,
> + u64 *huge_sptep,
> + u64 huge_spte)
These can go on the same line.
> {
> struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
> struct shadow_page_caches caches = {};
> union kvm_mmu_page_role role;
> unsigned int access;
> gfn_t gfn;
> + int nid;
>
> gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
> access = kvm_mmu_page_get_access(huge_sp, spte_index(huge_sptep));
> @@ -6338,9 +6345,11 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
> */
> role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access);
>
> + nid = kvm_pfn_to_page_table_nid(spte_to_pfn(huge_spte));
> +
> /* Direct SPs do not require a shadowed_info_cache. */
> caches.page_header_cache = &kvm->arch.split_page_header_cache;
> - caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
> + caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache[nid];
> caches.shadow_page_cache_lock = &kvm->arch.split_shadow_page_cache_lock;
>
> /* Safe to pass NULL for vCPU since requesting a direct SP. */
> @@ -6360,7 +6369,7 @@ static void shadow_mmu_split_huge_page(struct kvm *kvm,
> gfn_t gfn;
> int index;
>
> - sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep);
> + sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep, huge_spte);
>
> for (index = 0; index < SPTE_ENT_PER_PAGE; index++) {
> sptep = &sp->spt[index];
> @@ -6398,7 +6407,7 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
> u64 *huge_sptep)
> {
> struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
> - int level, r = 0;
> + int level, r = 0, nid;
> gfn_t gfn;
> u64 spte;
>
> @@ -6406,13 +6415,14 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
> gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
> level = huge_sp->role.level;
> spte = *huge_sptep;
> + nid = kvm_pfn_to_page_table_nid(spte_to_pfn(spte));
>
> if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES) {
> r = -ENOSPC;
> goto out;
> }
>
> - if (need_topup_split_caches_or_resched(kvm)) {
> + if (need_topup_split_caches_or_resched(kvm, nid)) {
> write_unlock(&kvm->mmu_lock);
> cond_resched();
> /*
> @@ -6420,7 +6430,7 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
> * rmap iterator should be restarted because the MMU lock was
> * dropped.
> */
> - r = topup_split_caches(kvm) ?: -EAGAIN;
> + r = topup_split_caches(kvm, nid) ?: -EAGAIN;
> write_lock(&kvm->mmu_lock);
> goto out;
> }
> @@ -6709,17 +6719,15 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> }
>
> static unsigned long mmu_shrink_cache(struct kvm_mmu_memory_cache *cache,
> - int cache_count,
> spinlock_t *cache_lock)
> {
> unsigned long freed = 0;
> int nid;
>
> spin_lock(cache_lock);
> - for (nid = 0; nid < cache_count; nid++) {
> - if (node_online(nid) && cache[nid].nobjs)
> + for_each_online_node(nid)
> + if (cache[nid].nobjs)
> freed += kvm_mmu_empty_memory_cache(&cache[nid]);
> - }
> spin_unlock(cache_lock);
> return freed;
> }
> @@ -6741,8 +6749,7 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> first_kvm = kvm;
> list_move_tail(&kvm->vm_list, &vm_list);
>
> - freed += mmu_shrink_cache(&kvm->arch.split_shadow_page_cache,
> - 1,
> + freed += mmu_shrink_cache(kvm->arch.split_shadow_page_cache,
> &kvm->arch.split_shadow_page_cache_lock);
>
> if (freed >= sc->nr_to_scan)
> @@ -6750,7 +6757,6 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
>
> kvm_for_each_vcpu(i, vcpu, kvm) {
> freed += mmu_shrink_cache(vcpu->arch.mmu_shadow_page_cache,
> - MAX_NUMNODES,
> &vcpu->arch.mmu_shadow_page_cache_lock);
> }
>
> --
> 2.39.0.314.g84b9a713c41-goog
>
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 8/9] KVM: x86/mmu: Make split_shadow_page_cache NUMA aware
2022-12-27 19:42 ` Ben Gardon
@ 2022-12-28 22:08 ` Vipin Sharma
0 siblings, 0 replies; 46+ messages in thread
From: Vipin Sharma @ 2022-12-28 22:08 UTC (permalink / raw)
To: Ben Gardon; +Cc: seanjc, pbonzini, dmatlack, kvm, linux-kernel
On Tue, Dec 27, 2022 at 11:43 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
> >
> > Make split_shadow_page_cache NUMA aware and allocate page table's pages
> > during the split based on the underlying physical page's NUMA node.
> >
> > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > ---
> > arch/x86/include/asm/kvm_host.h | 2 +-
> > arch/x86/kvm/mmu/mmu.c | 50 ++++++++++++++++++---------------
> > 2 files changed, 29 insertions(+), 23 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index b1f319ad6f89..7b3f36ae37a4 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1410,7 +1410,7 @@ struct kvm_arch {
> > *
> > * Protected by kvm->slots_lock.
> > */
> > - struct kvm_mmu_memory_cache split_shadow_page_cache;
> > + struct kvm_mmu_memory_cache split_shadow_page_cache[MAX_NUMNODES];
> > struct kvm_mmu_memory_cache split_page_header_cache;
> >
> > /*
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 511c6ef265ee..7454bfc49a51 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -6126,7 +6126,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> > int kvm_mmu_init_vm(struct kvm *kvm)
> > {
> > struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
> > - int r;
> > + int r, nid;
> >
> > INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
> > INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
> > @@ -6145,8 +6145,9 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> > INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache,
> > mmu_page_header_cache, NUMA_NO_NODE);
> >
> > - INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache,
> > - NULL, NUMA_NO_NODE);
> > + for_each_node(nid)
>
> Again, assuming no one sets CONFIG_NODE_SHIFT to a ridiculous value,
> it would probably be fine to initialize the entire array here since
> that doesn't take any extra memory and we're not in a super hot path.
This goes through the entire array. I think you are confusing it with
for_each_online_node().
>
> > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache[nid],
> > + NULL, NUMA_NO_NODE);
> > spin_lock_init(&kvm->arch.split_shadow_page_cache_lock);
> >
> > INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache,
> > @@ -6157,10 +6158,13 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> >
> > static void mmu_free_vm_memory_caches(struct kvm *kvm)
> > {
> > + int nid;
> > +
> > kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
> > kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
> > - mmu_free_sp_memory_cache(&kvm->arch.split_shadow_page_cache,
> > - &kvm->arch.split_shadow_page_cache_lock);
> > + for_each_node(nid)
>
> Again, could just iterate over the whole array here.
>
> > + mmu_free_sp_memory_cache(&kvm->arch.split_shadow_page_cache[nid],
> > + &kvm->arch.split_shadow_page_cache_lock);
> > }
> >
> > void kvm_mmu_uninit_vm(struct kvm *kvm)
> > @@ -6269,7 +6273,7 @@ static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
> > return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
> > }
> >
> > -static bool need_topup_split_caches_or_resched(struct kvm *kvm)
> > +static bool need_topup_split_caches_or_resched(struct kvm *kvm, int nid)
> > {
> > if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> > return true;
> > @@ -6281,10 +6285,10 @@ static bool need_topup_split_caches_or_resched(struct kvm *kvm)
> > */
> > return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_MIN_NR_OBJECTS) ||
> > need_topup(&kvm->arch.split_page_header_cache, 1) ||
> > - need_topup(&kvm->arch.split_shadow_page_cache, 1);
> > + need_topup(&kvm->arch.split_shadow_page_cache[nid], 1);
> > }
> >
> > -static int topup_split_caches(struct kvm *kvm)
> > +static int topup_split_caches(struct kvm *kvm, int nid)
> > {
> > /*
> > * Allocating rmap list entries when splitting huge pages for nested
> > @@ -6314,18 +6318,21 @@ static int topup_split_caches(struct kvm *kvm)
> > if (r)
> > return r;
> >
> > - return mmu_topup_sp_memory_cache(&kvm->arch.split_shadow_page_cache,
> > + return mmu_topup_sp_memory_cache(&kvm->arch.split_shadow_page_cache[nid],
> > &kvm->arch.split_shadow_page_cache_lock,
> > 1);
> > }
> >
> > -static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
> > +static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm,
> > + u64 *huge_sptep,
> > + u64 huge_spte)
>
> These can go on the same line.
Git diff is showing it weirdly. They are aligned with "struct kvm *kvm"
and will remain on separate lines to stay within the 80 character limit.
>
> > {
> > struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
> > struct shadow_page_caches caches = {};
> > union kvm_mmu_page_role role;
> > unsigned int access;
> > gfn_t gfn;
> > + int nid;
> >
> > gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
> > access = kvm_mmu_page_get_access(huge_sp, spte_index(huge_sptep));
> > @@ -6338,9 +6345,11 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
> > */
> > role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access);
> >
> > + nid = kvm_pfn_to_page_table_nid(spte_to_pfn(huge_spte));
> > +
> > /* Direct SPs do not require a shadowed_info_cache. */
> > caches.page_header_cache = &kvm->arch.split_page_header_cache;
> > - caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
> > + caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache[nid];
> > caches.shadow_page_cache_lock = &kvm->arch.split_shadow_page_cache_lock;
> >
> > /* Safe to pass NULL for vCPU since requesting a direct SP. */
> > @@ -6360,7 +6369,7 @@ static void shadow_mmu_split_huge_page(struct kvm *kvm,
> > gfn_t gfn;
> > int index;
> >
> > - sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep);
> > + sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep, huge_spte);
> >
> > for (index = 0; index < SPTE_ENT_PER_PAGE; index++) {
> > sptep = &sp->spt[index];
> > @@ -6398,7 +6407,7 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
> > u64 *huge_sptep)
> > {
> > struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
> > - int level, r = 0;
> > + int level, r = 0, nid;
> > gfn_t gfn;
> > u64 spte;
> >
> > @@ -6406,13 +6415,14 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
> > gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
> > level = huge_sp->role.level;
> > spte = *huge_sptep;
> > + nid = kvm_pfn_to_page_table_nid(spte_to_pfn(spte));
> >
> > if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES) {
> > r = -ENOSPC;
> > goto out;
> > }
> >
> > - if (need_topup_split_caches_or_resched(kvm)) {
> > + if (need_topup_split_caches_or_resched(kvm, nid)) {
> > write_unlock(&kvm->mmu_lock);
> > cond_resched();
> > /*
> > @@ -6420,7 +6430,7 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
> > * rmap iterator should be restarted because the MMU lock was
> > * dropped.
> > */
> > - r = topup_split_caches(kvm) ?: -EAGAIN;
> > + r = topup_split_caches(kvm, nid) ?: -EAGAIN;
> > write_lock(&kvm->mmu_lock);
> > goto out;
> > }
> > @@ -6709,17 +6719,15 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> > }
> >
> > static unsigned long mmu_shrink_cache(struct kvm_mmu_memory_cache *cache,
> > - int cache_count,
> > spinlock_t *cache_lock)
> > {
> > unsigned long freed = 0;
> > int nid;
> >
> > spin_lock(cache_lock);
> > - for (nid = 0; nid < cache_count; nid++) {
> > - if (node_online(nid) && cache[nid].nobjs)
> > + for_each_online_node(nid)
> > + if (cache[nid].nobjs)
> > freed += kvm_mmu_empty_memory_cache(&cache[nid]);
> > - }
> > spin_unlock(cache_lock);
> > return freed;
> > }
> > @@ -6741,8 +6749,7 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > first_kvm = kvm;
> > list_move_tail(&kvm->vm_list, &vm_list);
> >
> > - freed += mmu_shrink_cache(&kvm->arch.split_shadow_page_cache,
> > - 1,
> > + freed += mmu_shrink_cache(kvm->arch.split_shadow_page_cache,
> > &kvm->arch.split_shadow_page_cache_lock);
> >
> > if (freed >= sc->nr_to_scan)
> > @@ -6750,7 +6757,6 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> >
> > kvm_for_each_vcpu(i, vcpu, kvm) {
> > freed += mmu_shrink_cache(vcpu->arch.mmu_shadow_page_cache,
> > - MAX_NUMNODES,
> > &vcpu->arch.mmu_shadow_page_cache_lock);
> > }
> >
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Patch v3 8/9] KVM: x86/mmu: Make split_shadow_page_cache NUMA aware
2022-12-22 2:34 ` [Patch v3 8/9] KVM: x86/mmu: Make split_shadow_page_cache NUMA aware Vipin Sharma
2022-12-27 19:42 ` Ben Gardon
@ 2022-12-29 23:18 ` David Matlack
2023-01-03 18:49 ` Vipin Sharma
1 sibling, 1 reply; 46+ messages in thread
From: David Matlack @ 2022-12-29 23:18 UTC (permalink / raw)
To: Vipin Sharma; +Cc: seanjc, pbonzini, bgardon, kvm, linux-kernel
On Wed, Dec 21, 2022 at 06:34:56PM -0800, Vipin Sharma wrote:
> Make split_shadow_page_cache NUMA aware and allocate page table's pages
> during the split based on the underlying physical page's NUMA node.
>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> ---
> arch/x86/include/asm/kvm_host.h | 2 +-
> arch/x86/kvm/mmu/mmu.c | 50 ++++++++++++++++++---------------
> 2 files changed, 29 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index b1f319ad6f89..7b3f36ae37a4 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1410,7 +1410,7 @@ struct kvm_arch {
> *
> * Protected by kvm->slots_lock.
> */
> - struct kvm_mmu_memory_cache split_shadow_page_cache;
> + struct kvm_mmu_memory_cache split_shadow_page_cache[MAX_NUMNODES];
> struct kvm_mmu_memory_cache split_page_header_cache;
>
> /*
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 511c6ef265ee..7454bfc49a51 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6126,7 +6126,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> int kvm_mmu_init_vm(struct kvm *kvm)
> {
> struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
> - int r;
> + int r, nid;
>
> INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
> INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
> @@ -6145,8 +6145,9 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache,
> mmu_page_header_cache, NUMA_NO_NODE);
>
> - INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache,
> - NULL, NUMA_NO_NODE);
> + for_each_node(nid)
> + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache[nid],
> + NULL, NUMA_NO_NODE);
^^^^^^^^^^^^
Should this be nid?
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 8/9] KVM: x86/mmu: Make split_shadow_page_cache NUMA aware
2022-12-29 23:18 ` David Matlack
@ 2023-01-03 18:49 ` Vipin Sharma
0 siblings, 0 replies; 46+ messages in thread
From: Vipin Sharma @ 2023-01-03 18:49 UTC (permalink / raw)
To: David Matlack; +Cc: seanjc, pbonzini, bgardon, kvm, linux-kernel
On Thu, Dec 29, 2022 at 3:18 PM David Matlack <dmatlack@google.com> wrote:
>
> On Wed, Dec 21, 2022 at 06:34:56PM -0800, Vipin Sharma wrote:
> > Make split_shadow_page_cache NUMA aware and allocate page table's pages
> > during the split based on the underlying physical page's NUMA node.
> >
> > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > ---
> > arch/x86/include/asm/kvm_host.h | 2 +-
> > arch/x86/kvm/mmu/mmu.c | 50 ++++++++++++++++++---------------
> > 2 files changed, 29 insertions(+), 23 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index b1f319ad6f89..7b3f36ae37a4 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1410,7 +1410,7 @@ struct kvm_arch {
> > *
> > * Protected by kvm->slots_lock.
> > */
> > - struct kvm_mmu_memory_cache split_shadow_page_cache;
> > + struct kvm_mmu_memory_cache split_shadow_page_cache[MAX_NUMNODES];
> > struct kvm_mmu_memory_cache split_page_header_cache;
> >
> > /*
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 511c6ef265ee..7454bfc49a51 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -6126,7 +6126,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> > int kvm_mmu_init_vm(struct kvm *kvm)
> > {
> > struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
> > - int r;
> > + int r, nid;
> >
> > INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
> > INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
> > @@ -6145,8 +6145,9 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> > INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache,
> > mmu_page_header_cache, NUMA_NO_NODE);
> >
> > - INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache,
> > - NULL, NUMA_NO_NODE);
> > + for_each_node(nid)
> > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache[nid],
> > + NULL, NUMA_NO_NODE);
> ^^^^^^^^^^^^
> Should this be nid?
Yes, I will fix it in the next version. Thanks
^ permalink raw reply [flat|nested] 46+ messages in thread
* [Patch v3 9/9] KVM: x86/mmu: Reduce default cache size in KVM from 40 to PT64_ROOT_MAX_LEVEL
2022-12-22 2:34 [Patch v3 0/9] NUMA aware page table's pages allocation Vipin Sharma
` (7 preceding siblings ...)
2022-12-22 2:34 ` [Patch v3 8/9] KVM: x86/mmu: Make split_shadow_page_cache NUMA aware Vipin Sharma
@ 2022-12-22 2:34 ` Vipin Sharma
2022-12-27 19:52 ` Ben Gardon
8 siblings, 1 reply; 46+ messages in thread
From: Vipin Sharma @ 2022-12-22 2:34 UTC (permalink / raw)
To: seanjc, pbonzini, bgardon, dmatlack; +Cc: kvm, linux-kernel, Vipin Sharma
KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE is set to 40 without any specific
reason. Reduce default size to PT64_ROOT_MAX_LEVEL, which is currently
5.
Change mmu_pte_list_desc_cache size to what is needed as it is more than
5 but way less than 40.
Tested by running dirty_log_perf_test on both tdp and shadow MMU with 48
vcpu and 2GB/vcpu size on a 2 NUMA node machine. No impact on
performance noticed.
Ran perf on dirty_log_perf_test and found kvm_mmu_get_free_page() calls
reduced by ~3300, which is close to 48 (vcpus) * 2 (nodes) * 35 (cache
size).
Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
arch/x86/include/asm/kvm_types.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 7 ++++---
2 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kvm_types.h b/arch/x86/include/asm/kvm_types.h
index 08f1b57d3b62..752dab218a62 100644
--- a/arch/x86/include/asm/kvm_types.h
+++ b/arch/x86/include/asm/kvm_types.h
@@ -2,6 +2,6 @@
#ifndef _ASM_X86_KVM_TYPES_H
#define _ASM_X86_KVM_TYPES_H
-#define KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE 40
+#define KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE PT64_ROOT_MAX_LEVEL
#endif /* _ASM_X86_KVM_TYPES_H */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7454bfc49a51..f89d933ff380 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -677,11 +677,12 @@ static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
{
- int r, nid;
+ int r, nid, desc_capacity;
/* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
- r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
- 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
+ desc_capacity = 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM;
+ r = __kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
+ desc_capacity, desc_capacity);
if (r)
return r;
--
2.39.0.314.g84b9a713c41-goog
^ permalink raw reply related [flat|nested] 46+ messages in thread* Re: [Patch v3 9/9] KVM: x86/mmu: Reduce default cache size in KVM from 40 to PT64_ROOT_MAX_LEVEL
2022-12-22 2:34 ` [Patch v3 9/9] KVM: x86/mmu: Reduce default cache size in KVM from 40 to PT64_ROOT_MAX_LEVEL Vipin Sharma
@ 2022-12-27 19:52 ` Ben Gardon
2022-12-28 22:08 ` Vipin Sharma
0 siblings, 1 reply; 46+ messages in thread
From: Ben Gardon @ 2022-12-27 19:52 UTC (permalink / raw)
To: Vipin Sharma; +Cc: seanjc, pbonzini, dmatlack, kvm, linux-kernel
On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
>
> KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE is set to 40 without any specific
> reason. Reduce default size to PT64_ROOT_MAX_LEVEL, which is currently
> 5.
>
> Change mmu_pte_list_desc_cache size to what is needed as it is more than
> 5 but way less than 40.
Why do you say more than 5? At least to resolve a page fault we'll
never need more than 4 pages on a system with 5 level paging since the
root is already allocated.
>
> Tested by running dirty_log_perf_test on both tdp and shadow MMU with 48
> vcpu and 2GB/vcpu size on a 2 NUMA node machine. No impact on
> performance noticed.
>
> Ran perf on dirty_log_perf_test and found kvm_mmu_get_free_page() calls
> reduced by ~3300 which is near to 48 (vcpus) * 2 (nodes) * 35 (cache
> size).
>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> ---
> arch/x86/include/asm/kvm_types.h | 2 +-
> arch/x86/kvm/mmu/mmu.c | 7 ++++---
> 2 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_types.h b/arch/x86/include/asm/kvm_types.h
> index 08f1b57d3b62..752dab218a62 100644
> --- a/arch/x86/include/asm/kvm_types.h
> +++ b/arch/x86/include/asm/kvm_types.h
> @@ -2,6 +2,6 @@
> #ifndef _ASM_X86_KVM_TYPES_H
> #define _ASM_X86_KVM_TYPES_H
>
> -#define KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE 40
> +#define KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE PT64_ROOT_MAX_LEVEL
Please add a comment explaining why this value was chosen.
>
> #endif /* _ASM_X86_KVM_TYPES_H */
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7454bfc49a51..f89d933ff380 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -677,11 +677,12 @@ static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
>
> static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> {
> - int r, nid;
> + int r, nid, desc_capacity;
>
> /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
> - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> - 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> + desc_capacity = 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM;
> + r = __kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> + desc_capacity, desc_capacity);
> if (r)
> return r;
>
> --
> 2.39.0.314.g84b9a713c41-goog
>
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [Patch v3 9/9] KVM: x86/mmu: Reduce default cache size in KVM from 40 to PT64_ROOT_MAX_LEVEL
2022-12-27 19:52 ` Ben Gardon
@ 2022-12-28 22:08 ` Vipin Sharma
0 siblings, 0 replies; 46+ messages in thread
From: Vipin Sharma @ 2022-12-28 22:08 UTC (permalink / raw)
To: Ben Gardon; +Cc: seanjc, pbonzini, dmatlack, kvm, linux-kernel
On Tue, Dec 27, 2022 at 11:52 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@google.com> wrote:
> >
> > KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE is set to 40 without any specific
> > reason. Reduce default size to PT64_ROOT_MAX_LEVEL, which is currently
> > 5.
> >
> > Change mmu_pte_list_desc_cache size to what is needed as it is more than
> > 5 but way less than 40.
>
> Why do you say more than 5? At least to resolve a page fault we'll
> never need more than 4 pages on a system with 5 level paging since the
> root is already allocated.
Because of the existing comment in the code:
> > /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
> > - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> > - 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
>
> >
> > Tested by running dirty_log_perf_test on both tdp and shadow MMU with 48
> > vcpu and 2GB/vcpu size on a 2 NUMA node machine. No impact on
> > performance noticed.
> >
> > Ran perf on dirty_log_perf_test and found kvm_mmu_get_free_page() calls
> > reduced by ~3300 which is near to 48 (vcpus) * 2 (nodes) * 35 (cache
> > size).
> >
> > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > ---
> > arch/x86/include/asm/kvm_types.h | 2 +-
> > arch/x86/kvm/mmu/mmu.c | 7 ++++---
> > 2 files changed, 5 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_types.h b/arch/x86/include/asm/kvm_types.h
> > index 08f1b57d3b62..752dab218a62 100644
> > --- a/arch/x86/include/asm/kvm_types.h
> > +++ b/arch/x86/include/asm/kvm_types.h
> > @@ -2,6 +2,6 @@
> > #ifndef _ASM_X86_KVM_TYPES_H
> > #define _ASM_X86_KVM_TYPES_H
> >
> > -#define KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE 40
> > +#define KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE PT64_ROOT_MAX_LEVEL
>
> Please add a comment explaining why this value was chosen.
Okay
>
> >
> > #endif /* _ASM_X86_KVM_TYPES_H */
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 7454bfc49a51..f89d933ff380 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -677,11 +677,12 @@ static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> >
> > static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > {
> > - int r, nid;
> > + int r, nid, desc_capacity;
> >
> > /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
> > - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> > - 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> > + desc_capacity = 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM;
> > + r = __kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> > + desc_capacity, desc_capacity);
> > if (r)
> > return r;
> >
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >
^ permalink raw reply [flat|nested] 46+ messages in thread