* [PATCH 1/5] KVM: arm64: Grab KVM MMU write lock in kvm_arch_flush_shadow_all()
[not found] <20260504224213.1049426-1-jthoughton@google.com>
@ 2026-05-04 22:42 ` James Houghton
2026-05-04 23:10 ` James Houghton
2026-05-06 2:27 ` Bibo Mao
2026-05-04 22:42 ` [PATCH 2/5] KVM: loongarch: Grab MMU " James Houghton
2026-05-04 22:42 ` [PATCH 3/5] KVM: mips: " James Houghton
2 siblings, 2 replies; 12+ messages in thread
From: James Houghton @ 2026-05-04 22:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Gavin Shan, Shaoqin Huang,
Ricardo Koller, Tianrui Zhao, Bibo Mao, Huacai Chen, James Hogan,
linux-arm-kernel, kvmarm, loongarch, linux-mips, kvm,
linux-kernel, James Houghton, stable
kvm_arch_flush_shadow_all() may sometimes be called on the same `kvm`
concurrently in the event that the KVM's `mm` is __mmput() at the
same time that the last reference to the KVM is being dropped.
T1                              T2
KVM_CREATE_VM
                                Get VM file from T1
close VM
exit_mm()                       close VM
T1: exit_mm() -> kvm_mmu_notifier_release() -> kvm_flush_shadow_all(),
with only the KVM srcu read lock held.
T2: kvm_vm_release() ---> mmu_notifier_unregister() ->
kvm_mmu_notifier_release() -> kvm_flush_shadow_all(),
again, with only the KVM srcu read lock held.
This leads to a potential double-free of
kvm->arch.kvm_mmu_free_memory_cache and, now with NV, of
kvm->arch.nested_mmus.
Cc: stable@vger.kernel.org
Fixes: e7bf7a490c68 ("KVM: arm64: Split huge pages when dirty logging is enabled")
Signed-off-by: James Houghton <jthoughton@google.com>
---
arch/arm64/include/asm/kvm_mmu.h | 1 +
arch/arm64/kvm/mmu.c | 23 +++++++++++++++++++----
arch/arm64/kvm/nested.c | 4 +++-
3 files changed, 23 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 01e9c72d6aa7..30d5c24fcebb 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -178,6 +178,7 @@ void stage2_unmap_vm(struct kvm *kvm);
int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long type);
void kvm_uninit_stage2_mmu(struct kvm *kvm);
void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu);
+void kvm_free_stage2_pgd_locked(struct kvm_s2_mmu *mmu);
int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
phys_addr_t pa, unsigned long size, bool writable);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d089c107d9b7..4bab407d43bb 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1021,7 +1021,9 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
void kvm_uninit_stage2_mmu(struct kvm *kvm)
{
- kvm_free_stage2_pgd(&kvm->arch.mmu);
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ kvm_free_stage2_pgd_locked(&kvm->arch.mmu);
kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
}
@@ -1095,12 +1097,14 @@ void stage2_unmap_vm(struct kvm *kvm)
srcu_read_unlock(&kvm->srcu, idx);
}
-void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
+static void __kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu, bool locked)
{
struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
struct kvm_pgtable *pgt = NULL;
- write_lock(&kvm->mmu_lock);
+ if (!locked)
+ write_lock(&kvm->mmu_lock);
+
pgt = mmu->pgt;
if (pgt) {
mmu->pgd_phys = 0;
@@ -1111,7 +1115,8 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
if (kvm_is_nested_s2_mmu(kvm, mmu))
kvm_init_nested_s2_mmu(mmu);
- write_unlock(&kvm->mmu_lock);
+ if (!locked)
+ write_unlock(&kvm->mmu_lock);
if (pgt) {
kvm_stage2_destroy(pgt);
@@ -1119,6 +1124,16 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
}
}
+void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
+{
+ __kvm_free_stage2_pgd(mmu, false);
+}
+
+void kvm_free_stage2_pgd_locked(struct kvm_s2_mmu *mmu)
+{
+ __kvm_free_stage2_pgd(mmu, true);
+}
+
static void hyp_mc_free_fn(void *addr, void *mc)
{
struct kvm_hyp_memcache *memcache = mc;
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 883b6c1008fb..977598bff5e6 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -1190,11 +1190,13 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
int i;
+ guard(write_lock)(&kvm->mmu_lock);
+
for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
if (!WARN_ON(atomic_read(&mmu->refcnt)))
- kvm_free_stage2_pgd(mmu);
+ kvm_free_stage2_pgd_locked(mmu);
}
kvfree(kvm->arch.nested_mmus);
kvm->arch.nested_mmus = NULL;
--
2.54.0.545.g6539524ca2-goog
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 2/5] KVM: loongarch: Grab MMU lock in kvm_arch_flush_shadow_all()
[not found] <20260504224213.1049426-1-jthoughton@google.com>
2026-05-04 22:42 ` [PATCH 1/5] KVM: arm64: Grab KVM MMU write lock in kvm_arch_flush_shadow_all() James Houghton
@ 2026-05-04 22:42 ` James Houghton
2026-05-08 7:21 ` Bibo Mao
2026-05-04 22:42 ` [PATCH 3/5] KVM: mips: " James Houghton
2 siblings, 1 reply; 12+ messages in thread
From: James Houghton @ 2026-05-04 22:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Gavin Shan, Shaoqin Huang,
Ricardo Koller, Tianrui Zhao, Bibo Mao, Huacai Chen, James Hogan,
linux-arm-kernel, kvmarm, loongarch, linux-mips, kvm,
linux-kernel, James Houghton, stable
kvm_arch_flush_shadow_all() may be called concurrently on the same
`kvm`. This could at least result in accounting mistakes (e.g.
underflows on `kvm->stat.*pages`).
Cc: stable@vger.kernel.org
Fixes: 752e2cd7b4fb ("LoongArch: KVM: Implement kvm mmu operations")
Signed-off-by: James Houghton <jthoughton@google.com>
---
Note: This is compile-tested only!
arch/loongarch/kvm/mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c
index a7fa458e3360..5dbce9b18e1c 100644
--- a/arch/loongarch/kvm/mmu.c
+++ b/arch/loongarch/kvm/mmu.c
@@ -486,7 +486,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
- kvm_flush_range(kvm, 0, kvm->arch.gpa_size >> PAGE_SHIFT, 0);
+ kvm_flush_range(kvm, 0, kvm->arch.gpa_size >> PAGE_SHIFT, 1);
}
void kvm_arch_flush_shadow_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
--
2.54.0.545.g6539524ca2-goog
* [PATCH 3/5] KVM: mips: Grab MMU lock in kvm_arch_flush_shadow_all()
[not found] <20260504224213.1049426-1-jthoughton@google.com>
2026-05-04 22:42 ` [PATCH 1/5] KVM: arm64: Grab KVM MMU write lock in kvm_arch_flush_shadow_all() James Houghton
2026-05-04 22:42 ` [PATCH 2/5] KVM: loongarch: Grab MMU " James Houghton
@ 2026-05-04 22:42 ` James Houghton
2 siblings, 0 replies; 12+ messages in thread
From: James Houghton @ 2026-05-04 22:42 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Gavin Shan, Shaoqin Huang,
Ricardo Koller, Tianrui Zhao, Bibo Mao, Huacai Chen, James Hogan,
linux-arm-kernel, kvmarm, loongarch, linux-mips, kvm,
linux-kernel, James Houghton, stable
kvm_mips_flush_gpa_pt() expects the MMU lock to be held; it is not in
this path.
This can lead to a double-free of page table entries if
kvm_arch_flush_shadow_all() is called twice on the same `kvm`
concurrently; such a scenario is possible.
Cc: stable@vger.kernel.org
Fixes: b62091108633 ("KVM: MIPS: Implement kvm_arch_flush_shadow_all/memslot")
Signed-off-by: James Houghton <jthoughton@google.com>
---
Note: This is compile-tested only!
arch/mips/kvm/mips.c | 2 ++
arch/mips/kvm/mmu.c | 2 ++
2 files changed, 4 insertions(+)
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index a53abbba43ea..463b6c4aa62c 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -180,6 +180,8 @@ long kvm_arch_dev_ioctl(struct file *filp, unsigned int ioctl,
void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
+ guard(spinlock)(&kvm->mmu_lock);
+
/* Flush whole GPA */
kvm_mips_flush_gpa_pt(kvm, 0, ~0);
kvm_flush_remote_tlbs(kvm);
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index d2c3b6b41f18..5045833f8116 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -269,6 +269,8 @@ static bool kvm_mips_flush_gpa_pgd(pgd_t *pgd, unsigned long start_gpa,
*/
bool kvm_mips_flush_gpa_pt(struct kvm *kvm, gfn_t start_gfn, gfn_t end_gfn)
{
+ lockdep_assert_held(&kvm->mmu_lock);
+
return kvm_mips_flush_gpa_pgd(kvm->arch.gpa_mm.pgd,
start_gfn << PAGE_SHIFT,
end_gfn << PAGE_SHIFT);
--
2.54.0.545.g6539524ca2-goog
* Re: [PATCH 1/5] KVM: arm64: Grab KVM MMU write lock in kvm_arch_flush_shadow_all()
2026-05-04 22:42 ` [PATCH 1/5] KVM: arm64: Grab KVM MMU write lock in kvm_arch_flush_shadow_all() James Houghton
@ 2026-05-04 23:10 ` James Houghton
2026-05-05 17:05 ` Sean Christopherson
2026-05-06 2:27 ` Bibo Mao
1 sibling, 1 reply; 12+ messages in thread
From: James Houghton @ 2026-05-04 23:10 UTC (permalink / raw)
To: jthoughton
Cc: chenhuacai, gshan, jhogan, joey.gouly, kvm, kvmarm,
linux-arm-kernel, linux-kernel, linux-mips, loongarch, maobibo,
maz, oupton, pbonzini, ricarkol, seanjc, shahuang, stable,
suzuki.poulose, yuzenghui, zhaotianrui
On Mon, May 4, 2026 at 3:42 PM James Houghton <jthoughton@google.com> wrote:
>
> kvm_arch_flush_shadow_all() may sometimes be called on the same `kvm`
> concurrently in the event that the KVM's `mm` is __mmput() at the
> same time that last reference to the KVM is being dropped.
>
> > T1                              T2
> > KVM_CREATE_VM
> >                                 Get VM file from T1
> > close VM
> > exit_mm()                       close VM
>
> T1: exit_mm() -> kvm_mmu_notifier_release() -> kvm_flush_shadow_all(),
> with only the KVM srcu read lock held.
>
> T2: kvm_vm_release() ---> mmu_notifier_unregister() ->
> kvm_mmu_notifier_release() -> kvm_flush_shadow_all(),
> again, with only the KVM srcu read lock held.
>
> This leads to a potential double-free of
> kvm->arch.kvm_mmu_free_memory_cache and now with NV
> kvm->arch.nested_mmus.
>
> Cc: stable@vger.kernel.org
> Fixes: e7bf7a490c68 ("KVM: arm64: Split huge pages when dirty logging is enabled")
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
> arch/arm64/include/asm/kvm_mmu.h | 1 +
> arch/arm64/kvm/mmu.c | 23 +++++++++++++++++++----
> arch/arm64/kvm/nested.c | 4 +++-
> 3 files changed, 23 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 01e9c72d6aa7..30d5c24fcebb 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -178,6 +178,7 @@ void stage2_unmap_vm(struct kvm *kvm);
> int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long type);
> void kvm_uninit_stage2_mmu(struct kvm *kvm);
> void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu);
> +void kvm_free_stage2_pgd_locked(struct kvm_s2_mmu *mmu);
> int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> phys_addr_t pa, unsigned long size, bool writable);
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index d089c107d9b7..4bab407d43bb 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1021,7 +1021,9 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>
> void kvm_uninit_stage2_mmu(struct kvm *kvm)
> {
> - kvm_free_stage2_pgd(&kvm->arch.mmu);
> + lockdep_assert_held_write(&kvm->mmu_lock);
*facepalm*.... this doesn't account for the other callers of
kvm_uninit_stage2_mmu(). They will get lockdep warnings.
I've attached a diff to the bottom of this reply that *does* deal with them.
:( Sorry.
I'm guessing Marc or Oliver will probably want this patch to look quite
different, so I'll wait to hear from them before actually sending a v2.
In the meantime, I'll properly retest with lockdep enabled.
> +
> + kvm_free_stage2_pgd_locked(&kvm->arch.mmu);
> kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
> }
>
> @@ -1095,12 +1097,14 @@ void stage2_unmap_vm(struct kvm *kvm)
> srcu_read_unlock(&kvm->srcu, idx);
> }
>
> -void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> +static void __kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu, bool locked)
> {
> struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
> struct kvm_pgtable *pgt = NULL;
>
> - write_lock(&kvm->mmu_lock);
> + if (!locked)
> + write_lock(&kvm->mmu_lock);
> +
> pgt = mmu->pgt;
> if (pgt) {
> mmu->pgd_phys = 0;
> @@ -1111,7 +1115,8 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> if (kvm_is_nested_s2_mmu(kvm, mmu))
> kvm_init_nested_s2_mmu(mmu);
>
> - write_unlock(&kvm->mmu_lock);
> + if (!locked)
> + write_unlock(&kvm->mmu_lock);
>
> if (pgt) {
> kvm_stage2_destroy(pgt);
> @@ -1119,6 +1124,16 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> }
> }
>
> +void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> +{
> + __kvm_free_stage2_pgd(mmu, false);
> +}
> +
> +void kvm_free_stage2_pgd_locked(struct kvm_s2_mmu *mmu)
> +{
> + __kvm_free_stage2_pgd(mmu, true);
> +}
> +
> static void hyp_mc_free_fn(void *addr, void *mc)
> {
> struct kvm_hyp_memcache *memcache = mc;
> diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
> index 883b6c1008fb..977598bff5e6 100644
> --- a/arch/arm64/kvm/nested.c
> +++ b/arch/arm64/kvm/nested.c
> @@ -1190,11 +1190,13 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
> {
> int i;
>
> + guard(write_lock)(&kvm->mmu_lock);
> +
> for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
> struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
>
> if (!WARN_ON(atomic_read(&mmu->refcnt)))
> - kvm_free_stage2_pgd(mmu);
> + kvm_free_stage2_pgd_locked(mmu);
> }
> kvfree(kvm->arch.nested_mmus);
> kvm->arch.nested_mmus = NULL;
> --
> 2.54.0.545.g6539524ca2-goog
And here is the diff that should fix this patch. (Sorry!!)
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 30d5c24fcebb..e32e844943be 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -177,6 +177,7 @@ void kvm_stage2_wp_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t e
void stage2_unmap_vm(struct kvm *kvm);
int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long type);
void kvm_uninit_stage2_mmu(struct kvm *kvm);
+void kvm_uninit_stage2_mmu_locked(struct kvm *kvm);
void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu);
void kvm_free_stage2_pgd_locked(struct kvm_s2_mmu *mmu);
int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 4bab407d43bb..98ba8116676c 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1019,14 +1019,6 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
return err;
}
-void kvm_uninit_stage2_mmu(struct kvm *kvm)
-{
- lockdep_assert_held_write(&kvm->mmu_lock);
-
- kvm_free_stage2_pgd_locked(&kvm->arch.mmu);
- kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
-}
-
static void stage2_unmap_memslot(struct kvm *kvm,
struct kvm_memory_slot *memslot)
{
@@ -1134,6 +1126,24 @@ void kvm_free_stage2_pgd_locked(struct kvm_s2_mmu *mmu)
__kvm_free_stage2_pgd(mmu, true);
}
+static void __kvm_uninit_stage2_mmu(struct kvm *kvm, bool locked)
+{
+ __kvm_free_stage2_pgd(&kvm->arch.mmu, locked);
+ kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
+}
+
+void kvm_uninit_stage2_mmu(struct kvm *kvm)
+{
+ __kvm_uninit_stage2_mmu(kvm, false);
+}
+
+void kvm_uninit_stage2_mmu_locked(struct kvm *kvm)
+{
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ __kvm_uninit_stage2_mmu(kvm, true);
+}
+
static void hyp_mc_free_fn(void *addr, void *mc)
{
struct kvm_hyp_memcache *memcache = mc;
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 977598bff5e6..f61f0244f0fb 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -1201,7 +1201,7 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
kvfree(kvm->arch.nested_mmus);
kvm->arch.nested_mmus = NULL;
kvm->arch.nested_mmus_size = 0;
- kvm_uninit_stage2_mmu(kvm);
+ kvm_uninit_stage2_mmu_locked(kvm);
}
/*
* Re: [PATCH 1/5] KVM: arm64: Grab KVM MMU write lock in kvm_arch_flush_shadow_all()
2026-05-04 23:10 ` James Houghton
@ 2026-05-05 17:05 ` Sean Christopherson
2026-05-05 18:01 ` James Houghton
0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2026-05-05 17:05 UTC (permalink / raw)
To: James Houghton
Cc: chenhuacai, gshan, jhogan, joey.gouly, kvm, kvmarm,
linux-arm-kernel, linux-kernel, linux-mips, loongarch, maobibo,
maz, oupton, pbonzini, ricarkol, shahuang, stable, suzuki.poulose,
yuzenghui, zhaotianrui
On Mon, May 04, 2026, James Houghton wrote:
> On Mon, May 4, 2026 at 3:42 PM James Houghton <jthoughton@google.com> wrote:
> >
> > kvm_arch_flush_shadow_all() may sometimes be called on the same `kvm`
> > concurrently in the event that the KVM's `mm` is __mmput() at the
> > same time that last reference to the KVM is being dropped.
> >
> > T1                              T2
> > KVM_CREATE_VM
> >                                 Get VM file from T1
> > close VM
> > exit_mm()                       close VM
> >
> > T1: exit_mm() -> kvm_mmu_notifier_release() -> kvm_flush_shadow_all(),
> > with only the KVM srcu read lock held.
> >
> > T2: kvm_vm_release() ---> mmu_notifier_unregister() ->
> > kvm_mmu_notifier_release() -> kvm_flush_shadow_all(),
> > again, with only the KVM srcu read lock held.
> >
> > This leads to a potential double-free of
> > kvm->arch.kvm_mmu_free_memory_cache and now with NV
> > kvm->arch.nested_mmus.
...
> > void kvm_uninit_stage2_mmu(struct kvm *kvm)
> > {
> > - kvm_free_stage2_pgd(&kvm->arch.mmu);
> > + lockdep_assert_held_write(&kvm->mmu_lock);
>
> *facepalm*.... this doesn't account for the other callers of
> kvm_uninit_stage2_mmu(). They will get lockdep warnings.
>
> I've attached a diff to the bottom of this reply that *does* deal with them.
> :( Sorry.
...
> > diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
> > index 883b6c1008fb..977598bff5e6 100644
> > --- a/arch/arm64/kvm/nested.c
> > +++ b/arch/arm64/kvm/nested.c
> > @@ -1190,11 +1190,13 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
> > {
> > int i;
> >
> > + guard(write_lock)(&kvm->mmu_lock);
> > +
> > for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
> > struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
> >
> > if (!WARN_ON(atomic_read(&mmu->refcnt)))
> > - kvm_free_stage2_pgd(mmu);
> > + kvm_free_stage2_pgd_locked(mmu);
> > }
> > kvfree(kvm->arch.nested_mmus);
> > kvm->arch.nested_mmus = NULL;
> > --
> > 2.54.0.545.g6539524ca2-goog
>
> And here is the diff that should fix this patch. (Sorry!!)
There are more issues. kvm->arch.mmu.split_page_cache can be freed by
kvm_arch_commit_memory_region(), which holds slots_lock and slots_arch_lock,
but not mmu_lock.
IMO, the handling of kvm->arch.mmu.split_page_cache should be reworked. I don't
entirely get the motivation for aggressively freeing the cache. The cache will
only be filled if KVM actually does eager page splitting, so it's not like KVM is
burning pages for setups that will never use the cache.
Maybe I'm underestimating how many pages arm64 needs in the worst case scenario?
(I can't follow the math, too many macros). But if KVM is configuring the cache
with a capacity that's _so_ high that the "wasted" memory is problematic, then we
probably should we revisit the capacity and algorithm. E.g. if KVM is splitting
from 1GiB => 4KiB in a single pass (I can't tell if KVM does this on arm64), then
we could break that into a 1GiB => 2MiB => 4KiB sequence.
* Re: [PATCH 1/5] KVM: arm64: Grab KVM MMU write lock in kvm_arch_flush_shadow_all()
2026-05-05 17:05 ` Sean Christopherson
@ 2026-05-05 18:01 ` James Houghton
2026-05-05 18:16 ` Sean Christopherson
0 siblings, 1 reply; 12+ messages in thread
From: James Houghton @ 2026-05-05 18:01 UTC (permalink / raw)
To: Sean Christopherson
Cc: chenhuacai, gshan, jhogan, joey.gouly, kvm, kvmarm,
linux-arm-kernel, linux-kernel, linux-mips, loongarch, maobibo,
maz, oupton, pbonzini, ricarkol, shahuang, stable, suzuki.poulose,
yuzenghui, zhaotianrui
On Tue, May 5, 2026 at 10:05 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, May 04, 2026, James Houghton wrote:
> > On Mon, May 4, 2026 at 3:42 PM James Houghton <jthoughton@google.com> wrote:
> > >
> > > kvm_arch_flush_shadow_all() may sometimes be called on the same `kvm`
> > > concurrently in the event that the KVM's `mm` is __mmput() at the
> > > same time that last reference to the KVM is being dropped.
> > >
> > > T1                              T2
> > > KVM_CREATE_VM
> > >                                 Get VM file from T1
> > > close VM
> > > exit_mm()                       close VM
> > >
> > > T1: exit_mm() -> kvm_mmu_notifier_release() -> kvm_flush_shadow_all(),
> > > with only the KVM srcu read lock held.
> > >
> > > T2: kvm_vm_release() ---> mmu_notifier_unregister() ->
> > > kvm_mmu_notifier_release() -> kvm_flush_shadow_all(),
> > > again, with only the KVM srcu read lock held.
> > >
> > > This leads to a potential double-free of
> > > kvm->arch.kvm_mmu_free_memory_cache and now with NV
> > > kvm->arch.nested_mmus.
>
> ...
>
> > > void kvm_uninit_stage2_mmu(struct kvm *kvm)
> > > {
> > > - kvm_free_stage2_pgd(&kvm->arch.mmu);
> > > + lockdep_assert_held_write(&kvm->mmu_lock);
> >
> > *facepalm*.... this doesn't account for the other callers of
> > kvm_uninit_stage2_mmu(). They will get lockdep warnings.
> >
> > I've attached a diff to the bottom of this reply that *does* deal with them.
> > :( Sorry.
>
> ...
>
> > > diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
> > > index 883b6c1008fb..977598bff5e6 100644
> > > --- a/arch/arm64/kvm/nested.c
> > > +++ b/arch/arm64/kvm/nested.c
> > > @@ -1190,11 +1190,13 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
> > > {
> > > int i;
> > >
> > > + guard(write_lock)(&kvm->mmu_lock);
> > > +
> > > for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
> > > struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
> > >
> > > if (!WARN_ON(atomic_read(&mmu->refcnt)))
> > > - kvm_free_stage2_pgd(mmu);
> > > + kvm_free_stage2_pgd_locked(mmu);
> > > }
> > > kvfree(kvm->arch.nested_mmus);
> > > kvm->arch.nested_mmus = NULL;
> > > --
> > > 2.54.0.545.g6539524ca2-goog
> >
> > And here is the diff that should fix this patch. (Sorry!!)
>
> There are more issues. kvm->arch.mmu.split_page_cache can be freed by
> kvm_arch_commit_memory_region(), which holds slots_lock and slots_arch_lock,
> but not mmu_lock.
Thanks. I also noticed that kvm->arch.mmu.split_page_cache is
documented as being protected by kvm->slots_lock; we should be holding
it here. But we cannot take it here because we are already holding the
KVM srcu lock.
> IMO, the handling of kvm->arch.mmu.split_page_cache should be reworked. I don't
> entirely get the motivation for aggressively freeing the cache. The cache will
> only be filled if KVM actually does eager page splitting, so it's not like KVM is
> burning pages for setups that will never use the cache.
>
> Maybe I'm underestimating how many pages arm64 needs in the worst case scenario?
> (I can't follow the math, too many macros). But if KVM is configuring the cache
> with a capacity that's _so_ high that the "wasted" memory is problematic, then we
> probably should we revisit the capacity and algorithm. E.g. if KVM is splitting
> from 1GiB => 4KiB in a single pass (I can't tell if KVM does this on arm64), then
> we could break that into a 1GiB => 2MiB => 4KiB sequence.
I'm not sure I've fully understood the point you're making, but I
*think* we can just drop the
kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
line from kvm_uninit_stage2_mmu(). It will get freed when the VM is
destroyed anyway.
So I'm thinking of splitting this patch into two (unless someone tells
me otherwise):
1. Drop the kvm_mmu_free_memory_cache() from kvm_uninit_stage2_mmu()
Fixes: e7bf7a490c68 ("KVM: arm64: Split huge pages when dirty
logging is enabled")
2. Grab the MMU write lock around the kvfree(nested_mmus) bit in
kvm_arch_flush_shadow_all(); do the kvfree() without holding the
lock.
Fixes: 4f128f8e1aaac ("KVM: arm64: nv: Support multiple nested
Stage-2 mmu structures")
* Re: [PATCH 1/5] KVM: arm64: Grab KVM MMU write lock in kvm_arch_flush_shadow_all()
2026-05-05 18:01 ` James Houghton
@ 2026-05-05 18:16 ` Sean Christopherson
2026-05-05 20:14 ` Sean Christopherson
0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2026-05-05 18:16 UTC (permalink / raw)
To: James Houghton
Cc: chenhuacai, gshan, jhogan, joey.gouly, kvm, kvmarm,
linux-arm-kernel, linux-kernel, linux-mips, loongarch, maobibo,
maz, oupton, pbonzini, ricarkol, shahuang, stable, suzuki.poulose,
yuzenghui, zhaotianrui
On Tue, May 05, 2026, James Houghton wrote:
> On Tue, May 5, 2026 at 10:05 AM Sean Christopherson <seanjc@google.com> wrote:
> > There are more issues. kvm->arch.mmu.split_page_cache can be freed by
> > kvm_arch_commit_memory_region(), which holds slots_lock and slots_arch_lock,
> > but not mmu_lock.
>
> Thanks. I also noticed that kvm->arch.mmu.split_page_cache is
> documented as being protected by kvm->slots_lock; we should be holding
> it here. But we cannot take it here because we are already holding the
> KVM srcu lock.
>
> > IMO, the handling of kvm->arch.mmu.split_page_cache should be reworked. I don't
> > entirely get the motivation for aggressively freeing the cache. The cache will
> > only be filled if KVM actually does eager page splitting, so it's not like KVM is
> > burning pages for setups that will never use the cache.
> >
> > Maybe I'm underestimating how many pages arm64 needs in the worst case scenario?
> > (I can't follow the math, too many macros). But if KVM is configuring the cache
> > with a capacity that's _so_ high that the "wasted" memory is problematic, then we
> > probably should we revisit the capacity and algorithm. E.g. if KVM is splitting
> > from 1GiB => 4KiB in a single pass (I can't tell if KVM does this on arm64), then
> > we could break that into a 1GiB => 2MiB => 4KiB sequence.
>
> I'm not sure I've fully understood the point you're making, but I
> *think* we can just drop the
> kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
> line from kvm_uninit_stage2_mmu(). It will get freed when the VM is
> destroyed anyway.
It's not that simple. KVM arm64 allows userspace to reconfigure the capacity of
the cache via KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE. kvm_vm_ioctl_enable_cap()
currently allows userspace to do that so long as there are no memslots.
__kvm_mmu_topup_memory_cache() will (rightly) yell and fail if it's called with
the "wrong" capacity, so we'd need to sort that out.
The other issue is that it's not clear to me what happens for large "chunk" sizes.
If KVM is splitting from 1GiB (or whatever huge-hugepage sizes are supported on
arm64) all the way to 4KiB, e.g. to optimize against break-before-make, then the
capacity of the cache could be significant, e.g. MiB of memory or worse. My read
of things is that purging the cache when dirty logging is disabled is a guard
against consuming too much memory when the chunk size is large.
* Re: [PATCH 1/5] KVM: arm64: Grab KVM MMU write lock in kvm_arch_flush_shadow_all()
2026-05-05 18:16 ` Sean Christopherson
@ 2026-05-05 20:14 ` Sean Christopherson
0 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2026-05-05 20:14 UTC (permalink / raw)
To: James Houghton
Cc: chenhuacai, gshan, jhogan, joey.gouly, kvm, kvmarm,
linux-arm-kernel, linux-kernel, linux-mips, loongarch, maobibo,
maz, oupton, pbonzini, ricarkol, shahuang, stable, suzuki.poulose,
yuzenghui, zhaotianrui
On Tue, May 05, 2026, Sean Christopherson wrote:
> On Tue, May 05, 2026, James Houghton wrote:
> > On Tue, May 5, 2026 at 10:05 AM Sean Christopherson <seanjc@google.com> wrote:
> > > There are more issues. kvm->arch.mmu.split_page_cache can be freed by
> > > kvm_arch_commit_memory_region(), which holds slots_lock and slots_arch_lock,
> > > but not mmu_lock.
> >
> > Thanks. I also noticed that kvm->arch.mmu.split_page_cache is
> > documented as being protected by kvm->slots_lock; we should be holding
> > it here. But we cannot take it here because we are already holding the
> > KVM srcu lock.
> >
> > > IMO, the handling of kvm->arch.mmu.split_page_cache should be reworked. I don't
> > > entirely get the motivation for aggressively freeing the cache. The cache will
> > > only be filled if KVM actually does eager page splitting, so it's not like KVM is
> > > burning pages for setups that will never use the cache.
> > >
> > > Maybe I'm underestimating how many pages arm64 needs in the worst case scenario?
> > > (I can't follow the math, too many macros). But if KVM is configuring the cache
> > > with a capacity that's _so_ high that the "wasted" memory is problematic, then we
> > > probably should we revisit the capacity and algorithm. E.g. if KVM is splitting
> > > from 1GiB => 4KiB in a single pass (I can't tell if KVM does this on arm64), then
> > > we could break that into a 1GiB => 2MiB => 4KiB sequence.
> >
> > I'm not sure I've fully understood the point you're making, but I
> > *think* we can just drop the
> > kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
> > line from kvm_uninit_stage2_mmu(). It will get freed when the VM is
> > destroyed anyway.
>
> It's not that simple. KVM arm64 allows userspace to reconfigure the capacity of
> the cache via KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE. kvm_vm_ioctl_enable_cap()
> currently allows userspace to do that so long as there are no memslots.
> __kvm_mmu_topup_memory_cache() will (rightly) yell and fail if it's called with
> the "wrong" capacity, so we'd need to sort that out.
>
> The other issue is that it's not clear to me what happens for large "chunk" sizes.
> If KVM is splitting from 1GiB (or whatever huge-hugepage sizes are supported on
> arm64) all the way to 4KiB, e.g. to optimize against break-before-make, then the
> capacity of the cache could be significant, e.g. MiB of memory or worse. My read
> of things is that purging the cache when dirty logging is disabled is a guard
> against consuming too much memory when the chunk size is large.
If we go down the "let's simplify things" path, one way to simplify the handling
of KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE would be to disallow changing the chunk size
once the cache has been used. That would allow deferring teardown of the cache
until VM destruction.
diff --git arch/arm64/kvm/arm.c arch/arm64/kvm/arm.c
index 176cbe8baad3..671f7ac5e31d 100644
--- arch/arm64/kvm/arm.c
+++ arch/arm64/kvm/arm.c
@@ -165,7 +165,8 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
* To keep things simple, allow changing the chunk
* size only when no memory slots have been created.
*/
- if (kvm_are_all_memslots_empty(kvm)) {
+ if (kvm_are_all_memslots_empty(kvm) &&
+ !kvm->arch.mmu.split_page_cache.capacity) {
u64 new_cap = cap->args[0];
if (!new_cap || kvm_is_block_size_supported(new_cap)) {
* Re: [PATCH 1/5] KVM: arm64: Grab KVM MMU write lock in kvm_arch_flush_shadow_all()
2026-05-04 22:42 ` [PATCH 1/5] KVM: arm64: Grab KVM MMU write lock in kvm_arch_flush_shadow_all() James Houghton
2026-05-04 23:10 ` James Houghton
@ 2026-05-06 2:27 ` Bibo Mao
2026-05-06 13:55 ` Sean Christopherson
1 sibling, 1 reply; 12+ messages in thread
From: Bibo Mao @ 2026-05-06 2:27 UTC (permalink / raw)
To: James Houghton, Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Gavin Shan, Shaoqin Huang,
Ricardo Koller, Tianrui Zhao, Huacai Chen, James Hogan,
linux-arm-kernel, kvmarm, loongarch, linux-mips, kvm,
linux-kernel, stable
On 2026/5/5 6:42 AM, James Houghton wrote:
> kvm_arch_flush_shadow_all() may sometimes be called on the same `kvm`
> concurrently in the event that the KVM's `mm` is __mmput() at the
> same time that last reference to the KVM is being dropped.
>
> T1 T2
> KVM_CREATE_VM
> Get VM file from T1
> close VM
> exit_mm() close VM
>
> T1: exit_mm() -> kvm_mmu_notifier_release() -> kvm_flush_shadow_all(),
> with only the KVM srcu read lock held.
>
> T2: kvm_vm_release() ---> mmu_notifier_unregister() ->
> kvm_mmu_notifier_release() -> kvm_flush_shadow_all(),
> again, with only the KVM srcu read lock held.
Looking through the code, kvm_arch_destroy_vm() only frees the PGD page;
the page table walk is performed when deleting a memslot or in exit_mm().
With normal usage, the life cycle of a VM is something like this:
KVM_CREATE_VM
Create_VCPUs
Create memslots
Destroy_VCPUs
Destroy memslots
close VM
exit_mm()
And there are kvm_get_kvm()/kvm_put_kvm() calls when creating/destroying
vCPUs, but no such pairing for memslot operations. Is it possible that a
VM is destroyed without removing its memslots, as in the following
sequence?
KVM_CREATE_VM
Create memslots
close VM
exit_mm()
Regards
Bibo Mao
>
> This leads to a potential double-free of
> kvm->arch.kvm_mmu_free_memory_cache and now with NV
> kvm->arch.nested_mmus.
>
> Cc: stable@vger.kernel.org
> Fixes: e7bf7a490c68 ("KVM: arm64: Split huge pages when dirty logging is enabled")
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
> arch/arm64/include/asm/kvm_mmu.h | 1 +
> arch/arm64/kvm/mmu.c | 23 +++++++++++++++++++----
> arch/arm64/kvm/nested.c | 4 +++-
> 3 files changed, 23 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 01e9c72d6aa7..30d5c24fcebb 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -178,6 +178,7 @@ void stage2_unmap_vm(struct kvm *kvm);
> int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long type);
> void kvm_uninit_stage2_mmu(struct kvm *kvm);
> void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu);
> +void kvm_free_stage2_pgd_locked(struct kvm_s2_mmu *mmu);
> int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> phys_addr_t pa, unsigned long size, bool writable);
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index d089c107d9b7..4bab407d43bb 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1021,7 +1021,9 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>
> void kvm_uninit_stage2_mmu(struct kvm *kvm)
> {
> - kvm_free_stage2_pgd(&kvm->arch.mmu);
> + lockdep_assert_held_write(&kvm->mmu_lock);
> +
> + kvm_free_stage2_pgd_locked(&kvm->arch.mmu);
> kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
> }
>
> @@ -1095,12 +1097,14 @@ void stage2_unmap_vm(struct kvm *kvm)
> srcu_read_unlock(&kvm->srcu, idx);
> }
>
> -void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> +static void __kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu, bool locked)
> {
> struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
> struct kvm_pgtable *pgt = NULL;
>
> - write_lock(&kvm->mmu_lock);
> + if (!locked)
> + write_lock(&kvm->mmu_lock);
> +
> pgt = mmu->pgt;
> if (pgt) {
> mmu->pgd_phys = 0;
> @@ -1111,7 +1115,8 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> if (kvm_is_nested_s2_mmu(kvm, mmu))
> kvm_init_nested_s2_mmu(mmu);
>
> - write_unlock(&kvm->mmu_lock);
> + if (!locked)
> + write_unlock(&kvm->mmu_lock);
>
> if (pgt) {
> kvm_stage2_destroy(pgt);
> @@ -1119,6 +1124,16 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> }
> }
>
> +void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> +{
> + __kvm_free_stage2_pgd(mmu, false);
> +}
> +
> +void kvm_free_stage2_pgd_locked(struct kvm_s2_mmu *mmu)
> +{
> + __kvm_free_stage2_pgd(mmu, true);
> +}
> +
> static void hyp_mc_free_fn(void *addr, void *mc)
> {
> struct kvm_hyp_memcache *memcache = mc;
> diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
> index 883b6c1008fb..977598bff5e6 100644
> --- a/arch/arm64/kvm/nested.c
> +++ b/arch/arm64/kvm/nested.c
> @@ -1190,11 +1190,13 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
> {
> int i;
>
> + guard(write_lock)(&kvm->mmu_lock);
> +
> for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
> struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
>
> if (!WARN_ON(atomic_read(&mmu->refcnt)))
> - kvm_free_stage2_pgd(mmu);
> + kvm_free_stage2_pgd_locked(mmu);
> }
> kvfree(kvm->arch.nested_mmus);
> kvm->arch.nested_mmus = NULL;
>
* Re: [PATCH 1/5] KVM: arm64: Grab KVM MMU write lock in kvm_arch_flush_shadow_all()
2026-05-06 2:27 ` Bibo Mao
@ 2026-05-06 13:55 ` Sean Christopherson
2026-05-07 1:40 ` Bibo Mao
0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2026-05-06 13:55 UTC (permalink / raw)
To: Bibo Mao
Cc: James Houghton, Paolo Bonzini, Marc Zyngier, Oliver Upton,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Gavin Shan,
Shaoqin Huang, Ricardo Koller, Tianrui Zhao, Huacai Chen,
James Hogan, linux-arm-kernel, kvmarm, loongarch, linux-mips, kvm,
linux-kernel, stable
On Wed, May 06, 2026, Bibo Mao wrote:
> On 2026/5/5 6:42 AM, James Houghton wrote:
> > kvm_arch_flush_shadow_all() may sometimes be called on the same `kvm`
> > concurrently in the event that the KVM's `mm` is __mmput() at the
> > same time that last reference to the KVM is being dropped.
> >
> > T1                          T2
> > KVM_CREATE_VM
> >                             Get VM file from T1
> > close VM
> > exit_mm()                   close VM
> >
> > T1: exit_mm() -> kvm_mmu_notifier_release() -> kvm_flush_shadow_all(),
> > with only the KVM srcu read lock held.
> >
> > T2: kvm_vm_release() ---> mmu_notifier_unregister() ->
> > kvm_mmu_notifier_release() -> kvm_flush_shadow_all(),
> > again, with only the KVM srcu read lock held.
> Looking through the code, kvm_arch_destroy_vm() only frees the PGD page;
> the page table walk is performed when deleting a memslot or in exit_mm().
>
> With normal usage, the life cycle of a VM is something like this:
Not necessarily. Abruptly closing the VM, as described below, is also "normal"
(though likely uncommon).
> KVM_CREATE_VM
> Create_VCPUs
> Create memslots
> Destroy_VCPUs
This is incorrect. KVM doesn't provide any way for userspace to destroy vCPUs.
Userspace can fully release every vCPU fd, but the vCPU object within KVM stays
alive (and indirectly reachable) until the VM is destroyed.
> Destroy memslots
> close VM
> exit_mm()
Note, exit_mm() may or may not be called. E.g. there are VMMs that will destroy
a VM and start a new one (perhaps even the same conceptual virtual machine) in
the same process / mm_struct / address space.
> And there are kvm_get_kvm()/kvm_put_kvm() calls when creating/destroying
> vCPUs, but no such pairing for memslot operations. Is it possible that a
> VM is destroyed without removing its memslots, as in the following
> sequence?
> KVM_CREATE_VM
> Create memslots
> close VM
> exit_mm()
Yep. KVM cannot make any assumptions when it comes to userspace-initiated
operations. Even a VMM that super strictly follows the first approach may exit
abruptly, without destroying memslots, e.g. if it's OOM-killed.
* Re: [PATCH 1/5] KVM: arm64: Grab KVM MMU write lock in kvm_arch_flush_shadow_all()
2026-05-06 13:55 ` Sean Christopherson
@ 2026-05-07 1:40 ` Bibo Mao
0 siblings, 0 replies; 12+ messages in thread
From: Bibo Mao @ 2026-05-07 1:40 UTC (permalink / raw)
To: Sean Christopherson
Cc: James Houghton, Paolo Bonzini, Marc Zyngier, Oliver Upton,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Gavin Shan,
Shaoqin Huang, Ricardo Koller, Tianrui Zhao, Huacai Chen,
James Hogan, linux-arm-kernel, kvmarm, loongarch, linux-mips, kvm,
linux-kernel, stable
On 2026/5/6 9:55 PM, Sean Christopherson wrote:
> On Wed, May 06, 2026, Bibo Mao wrote:
>> On 2026/5/5 6:42 AM, James Houghton wrote:
>>> kvm_arch_flush_shadow_all() may sometimes be called on the same `kvm`
>>> concurrently in the event that the KVM's `mm` is __mmput() at the
>>> same time that last reference to the KVM is being dropped.
>>>
>>> T1                          T2
>>> KVM_CREATE_VM
>>>                             Get VM file from T1
>>> close VM
>>> exit_mm()                   close VM
>>>
>>> T1: exit_mm() -> kvm_mmu_notifier_release() -> kvm_flush_shadow_all(),
>>> with only the KVM srcu read lock held.
>>>
>>> T2: kvm_vm_release() ---> mmu_notifier_unregister() ->
>>> kvm_mmu_notifier_release() -> kvm_flush_shadow_all(),
>>> again, with only the KVM srcu read lock held.
>> Looking through the code, kvm_arch_destroy_vm() only frees the PGD page;
>> the page table walk is performed when deleting a memslot or in exit_mm().
>>
>> With normal usage, the life cycle of a VM is something like this:
>
> Not necessarily. Abruptly closing the VM, as described below, is also "normal"
> (though likely uncommon).
>
>> KVM_CREATE_VM
>> Create_VCPUs
>> Create memslots
>> Destroy_VCPUs
>
> This is incorrect. KVM doesn't provide any way for userspace to destroy vCPUs.
> Userspace can fully release every vCPU fd, but the vCPU object within KVM stays
> alive (and indirectly reachable) until the VM is destroyed.
Yes, that is so. For users, there is a vCPU fd release interface; the
release of the vCPU object itself is an internal implementation detail of KVM.
>
>> Destroy memslots
>> close VM
>> exit_mm()
>
> Note, exit_mm() may or may not be called. E.g. there are VMMs that will destroy
> a VM and start a new one (perhaps even the same conceptual virtual machine) in
> the same process / mm_struct / address space.
>
>> And there are kvm_get_kvm()/kvm_put_kvm() calls when creating/destroying
>> vCPUs, but no such pairing for memslot operations. Is it possible that a
>> VM is destroyed without removing its memslots, as in the following
>> sequence?
>> KVM_CREATE_VM
>> Create memslots
>> close VM
>> exit_mm()
>
> Yep. KVM cannot make any assumptions when it comes to userspace-initiated
> operations. Even a VMM that super strictly follows the first approach may exit
> abruptly, without destroying memslots, e.g. if it's OOM-killed.
Oh, I see. If so, the implementation of kvm_arch_destroy_vm() on
LoongArch is problematic; it needs to handle this kind of abnormal VM
destruction by walking the page tables and freeing them. I had thought
that memslot create/destroy should call kvm_get_kvm()/kvm_put_kvm() :(
Regards
Bibo Mao
* Re: [PATCH 2/5] KVM: loongarch: Grab MMU lock in kvm_arch_flush_shadow_all()
2026-05-04 22:42 ` [PATCH 2/5] KVM: loongarch: Grab MMU " James Houghton
@ 2026-05-08 7:21 ` Bibo Mao
0 siblings, 0 replies; 12+ messages in thread
From: Bibo Mao @ 2026-05-08 7:21 UTC (permalink / raw)
To: James Houghton, Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
Zenghui Yu, Sean Christopherson, Gavin Shan, Shaoqin Huang,
Ricardo Koller, Tianrui Zhao, Huacai Chen, James Hogan,
linux-arm-kernel, kvmarm, loongarch, linux-mips, kvm,
linux-kernel, stable
On 2026/5/5 6:42 AM, James Houghton wrote:
> kvm_arch_flush_shadow_all() may be called concurrently on the same
> `kvm`. This could at least result in accounting mistakes (e.g.
> underflows on `kvm->stat.*pages`).
>
> Cc: stable@vger.kernel.org
> Fixes: 752e2cd7b4fb ("LoongArch: KVM: Implement kvm mmu operations")
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
> Note: This is compile-tested only!
>
> arch/loongarch/kvm/mmu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c
> index a7fa458e3360..5dbce9b18e1c 100644
> --- a/arch/loongarch/kvm/mmu.c
> +++ b/arch/loongarch/kvm/mmu.c
> @@ -486,7 +486,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>
> void kvm_arch_flush_shadow_all(struct kvm *kvm)
> {
> - kvm_flush_range(kvm, 0, kvm->arch.gpa_size >> PAGE_SHIFT, 0);
> + kvm_flush_range(kvm, 0, kvm->arch.gpa_size >> PAGE_SHIFT, 1);
> }
>
> void kvm_arch_flush_shadow_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
>
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
end of thread, other threads: [~2026-05-08 7:24 UTC | newest]
Thread overview: 12+ messages
[not found] <20260504224213.1049426-1-jthoughton@google.com>
2026-05-04 22:42 ` [PATCH 1/5] KVM: arm64: Grab KVM MMU write lock in kvm_arch_flush_shadow_all() James Houghton
2026-05-04 23:10 ` James Houghton
2026-05-05 17:05 ` Sean Christopherson
2026-05-05 18:01 ` James Houghton
2026-05-05 18:16 ` Sean Christopherson
2026-05-05 20:14 ` Sean Christopherson
2026-05-06 2:27 ` Bibo Mao
2026-05-06 13:55 ` Sean Christopherson
2026-05-07 1:40 ` Bibo Mao
2026-05-04 22:42 ` [PATCH 2/5] KVM: loongarch: Grab MMU " James Houghton
2026-05-08 7:21 ` Bibo Mao
2026-05-04 22:42 ` [PATCH 3/5] KVM: mips: " James Houghton