public inbox for linux-kernel@vger.kernel.org
* [PATCH] KVM: x86: Drop "cache" from user return MSR setter that skips WRMSR
@ 2025-09-19 21:42 Sean Christopherson
  2025-09-23  3:20 ` Xiaoyao Li
  2025-09-30 12:22 ` Yan Zhao
  0 siblings, 2 replies; 16+ messages in thread
From: Sean Christopherson @ 2025-09-19 21:42 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Yan Zhao, Xiaoyao Li, Rick Edgecombe

Rename kvm_user_return_msr_update_cache() to __kvm_set_user_return_msr()
and use the helper kvm_set_user_return_msr() to make it obvious that the
double-underscores version is doing a subset of the work of the "full"
setter.

While the function does indeed update a cache, the nomenclature becomes
slightly misleading when adding a getter[1], as the current value isn't
_just_ the cached value, it's also the value that's currently loaded in
hardware.

Opportunistically rename "index" to "slot" in the prototypes.  The user-
return APIs deliberately use "slot" to try and make it more obvious that
they take the slot within the array, not the index of the MSR.

No functional change intended.

Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/all/aM2EvzLLmBi5-iQ5@google.com [1]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  4 ++--
 arch/x86/kvm/vmx/tdx.c          |  4 ++--
 arch/x86/kvm/x86.c              | 21 ++++++++++-----------
 3 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 17772513b9cc..b633d5c33f57 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2374,8 +2374,8 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
 
 int kvm_add_user_return_msr(u32 msr);
 int kvm_find_user_return_msr(u32 msr);
-int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
-void kvm_user_return_msr_update_cache(unsigned int index, u64 val);
+int kvm_set_user_return_msr(unsigned int slot, u64 val, u64 mask);
+void __kvm_set_user_return_msr(unsigned int slot, u64 val);
 
 static inline bool kvm_is_supported_user_return_msr(u32 msr)
 {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ff41d3d00380..b3cb39ae937d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -809,8 +809,8 @@ static void tdx_user_return_msr_update_cache(void)
 	int i;
 
 	for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
-		kvm_user_return_msr_update_cache(tdx_uret_msrs[i].slot,
-						 tdx_uret_msrs[i].defval);
+		__kvm_set_user_return_msr(tdx_uret_msrs[i].slot,
+					  tdx_uret_msrs[i].defval);
 }
 
 static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e07936efacd4..d975d0c60107 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -648,6 +648,15 @@ static void kvm_user_return_register_notifier(struct kvm_user_return_msrs *msrs)
 	}
 }
 
+void __kvm_set_user_return_msr(unsigned int slot, u64 value)
+{
+	struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
+
+	msrs->values[slot].curr = value;
+	kvm_user_return_register_notifier(msrs);
+}
+EXPORT_SYMBOL_GPL(__kvm_set_user_return_msr);
+
 int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
 {
 	struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
@@ -660,21 +669,11 @@ int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
 	if (err)
 		return 1;
 
-	msrs->values[slot].curr = value;
-	kvm_user_return_register_notifier(msrs);
+	__kvm_set_user_return_msr(slot, value);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(kvm_set_user_return_msr);
 
-void kvm_user_return_msr_update_cache(unsigned int slot, u64 value)
-{
-	struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
-
-	msrs->values[slot].curr = value;
-	kvm_user_return_register_notifier(msrs);
-}
-EXPORT_SYMBOL_GPL(kvm_user_return_msr_update_cache);
-
 static void drop_user_return_notifiers(void)
 {
 	struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);

base-commit: c8fbf7ceb2ae3f64b0c377c8c21f6df577a13eb4
-- 
2.51.0.470.ga7dc726c21-goog


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] KVM: x86: Drop "cache" from user return MSR setter that skips WRMSR
  2025-09-19 21:42 [PATCH] KVM: x86: Drop "cache" from user return MSR setter that skips WRMSR Sean Christopherson
@ 2025-09-23  3:20 ` Xiaoyao Li
  2025-09-30 12:22 ` Yan Zhao
  1 sibling, 0 replies; 16+ messages in thread
From: Xiaoyao Li @ 2025-09-23  3:20 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Yan Zhao, Rick Edgecombe

On 9/20/2025 5:42 AM, Sean Christopherson wrote:
> Rename kvm_user_return_msr_update_cache() to __kvm_set_user_return_msr()
> and use the helper kvm_set_user_return_msr() to make it obvious that the
                      ^
the helper *in* kvm_set_user_return_msr() ..

> double-underscores version is doing a subset of the work of the "full"
> setter.
> 
> While the function does indeed update a cache, the nomenclature becomes
> slightly misleading when adding a getter[1], as the current value isn't
> _just_ the cached value, it's also the value that's currently loaded in
> hardware.
> 
> Opportunistically rename "index" to "slot" in the prototypes.  The user-
> return APIs deliberately use "slot" to try and make it more obvious that
> they take the slot within the array, not the index of the MSR.
> 
> No functional change intended.
> 
> Cc: Yan Zhao <yan.y.zhao@intel.com>
> Cc: Xiaoyao Li <xiaoyao.li@intel.com>
> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Link: https://lore.kernel.org/all/aM2EvzLLmBi5-iQ5@google.com [1]
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>


* Re: [PATCH] KVM: x86: Drop "cache" from user return MSR setter that skips WRMSR
  2025-09-19 21:42 [PATCH] KVM: x86: Drop "cache" from user return MSR setter that skips WRMSR Sean Christopherson
  2025-09-23  3:20 ` Xiaoyao Li
@ 2025-09-30 12:22 ` Yan Zhao
  2025-09-30 12:58   ` Yan Zhao
  1 sibling, 1 reply; 16+ messages in thread
From: Yan Zhao @ 2025-09-30 12:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Xiaoyao Li, Rick Edgecombe

On Fri, Sep 19, 2025 at 02:42:59PM -0700, Sean Christopherson wrote:
> Rename kvm_user_return_msr_update_cache() to __kvm_set_user_return_msr()
> and use the helper kvm_set_user_return_msr() to make it obvious that the
> double-underscores version is doing a subset of the work of the "full"
> setter.
> 
> While the function does indeed update a cache, the nomenclature becomes
> slightly misleading when adding a getter[1], as the current value isn't
> _just_ the cached value, it's also the value that's currently loaded in
> hardware.
Nit:

For TDX, "it's also the value that's currently loaded in hardware" is not true.

> Opportunistically rename "index" to "slot" in the prototypes.  The user-
> return APIs deliberately use "slot" to try and make it more obvious that
> they take the slot within the array, not the index of the MSR.
> 
> No functional change intended.

Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>

> Cc: Yan Zhao <yan.y.zhao@intel.com>
> Cc: Xiaoyao Li <xiaoyao.li@intel.com>
> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Link: https://lore.kernel.org/all/aM2EvzLLmBi5-iQ5@google.com [1]
> Signed-off-by: Sean Christopherson <seanjc@google.com>


* Re: [PATCH] KVM: x86: Drop "cache" from user return MSR setter that skips WRMSR
  2025-09-30 12:22 ` Yan Zhao
@ 2025-09-30 12:58   ` Yan Zhao
  2025-09-30 16:29     ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: Yan Zhao @ 2025-09-30 12:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel, Xiaoyao Li,
	Rick Edgecombe

On Tue, Sep 30, 2025 at 08:22:41PM +0800, Yan Zhao wrote:
> On Fri, Sep 19, 2025 at 02:42:59PM -0700, Sean Christopherson wrote:
> > Rename kvm_user_return_msr_update_cache() to __kvm_set_user_return_msr()
> > and use the helper kvm_set_user_return_msr() to make it obvious that the
> > double-underscores version is doing a subset of the work of the "full"
> > setter.
> > 
> > While the function does indeed update a cache, the nomenclature becomes
> > slightly misleading when adding a getter[1], as the current value isn't
> > _just_ the cached value, it's also the value that's currently loaded in
> > hardware.
> Nit:
> 
> For TDX, "it's also the value that's currently loaded in hardware" is not true.
since tdx module invokes wrmsr()s before each exit to VMM, while KVM only
invokes __kvm_set_user_return_msr() in tdx_vcpu_put().

> > Opportunistically rename "index" to "slot" in the prototypes.  The user-
> > return APIs deliberately use "slot" to try and make it more obvious that
> > they take the slot within the array, not the index of the MSR.
> > 
> > No functional change intended.
> 
> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
> 
> > Cc: Yan Zhao <yan.y.zhao@intel.com>
> > Cc: Xiaoyao Li <xiaoyao.li@intel.com>
> > Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Link: https://lore.kernel.org/all/aM2EvzLLmBi5-iQ5@google.com [1]
> > Signed-off-by: Sean Christopherson <seanjc@google.com>


* Re: [PATCH] KVM: x86: Drop "cache" from user return MSR setter that skips WRMSR
  2025-09-30 12:58   ` Yan Zhao
@ 2025-09-30 16:29     ` Sean Christopherson
  2025-09-30 16:34       ` Sean Christopherson
  2025-10-03 13:53       ` Yan Zhao
  0 siblings, 2 replies; 16+ messages in thread
From: Sean Christopherson @ 2025-09-30 16:29 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Paolo Bonzini, kvm, linux-kernel, Xiaoyao Li, Rick Edgecombe

On Tue, Sep 30, 2025, Yan Zhao wrote:
> On Tue, Sep 30, 2025 at 08:22:41PM +0800, Yan Zhao wrote:
> > On Fri, Sep 19, 2025 at 02:42:59PM -0700, Sean Christopherson wrote:
> > > Rename kvm_user_return_msr_update_cache() to __kvm_set_user_return_msr()
> > > and use the helper kvm_set_user_return_msr() to make it obvious that the
> > > double-underscores version is doing a subset of the work of the "full"
> > > setter.
> > > 
> > > While the function does indeed update a cache, the nomenclature becomes
> > > slightly misleading when adding a getter[1], as the current value isn't
> > > _just_ the cached value, it's also the value that's currently loaded in
> > > hardware.
> > Nit:
> > 
> > For TDX, "it's also the value that's currently loaded in hardware" is not true.
> since tdx module invokes wrmsr()s before each exit to VMM, while KVM only
> invokes __kvm_set_user_return_msr() in tdx_vcpu_put().

No?  kvm_user_return_msr_update_cache() is passed the value that's currently
loaded in hardware, by way of the TDX-Module zeroing some MSRs on TD-Exit.

Ah, I suspect you're calling out that the cache can be stale.  Maybe this?

  While the function does indeed update a cache, the nomenclature becomes
  slightly misleading when adding a getter[1], as the current value isn't
  _just_ the cached value, it's also the value that's currently loaded in
  hardware (ignoring that the cache holds stale data until the vCPU is put,
  i.e. until KVM prepares to switch back to the host).

Actually, that's a bug waiting to happen when the getter comes along.  Rather
than document the potential pitfall, what about adding a prep patch to mimize
the window?  Then _this_ patch shouldn't need the caveat about the cache being
stale.

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ff41d3d00380..326fa81cb35f 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -789,6 +789,14 @@ void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
                vt->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
 
        vt->guest_state_loaded = true;
+
+       /*
+        * Several of KVM's user-return MSRs are clobbered by the TDX-Module if
+        * VP.ENTER succeeds, i.e. on TD-Exit.  Mark those MSRs as needing an
+        * update to synchronize the "current" value in KVM's cache with the
+        * value in hardware (loaded by the TDX-Module).
+        */
+       to_tdx(vcpu)->need_user_return_msr_update = true;
 }
 
 struct tdx_uret_msr {
@@ -816,7 +824,6 @@ static void tdx_user_return_msr_update_cache(void)
 static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
 {
        struct vcpu_vt *vt = to_vt(vcpu);
-       struct vcpu_tdx *tdx = to_tdx(vcpu);
 
        if (!vt->guest_state_loaded)
                return;
@@ -824,11 +831,6 @@ static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
        ++vcpu->stat.host_state_reload;
        wrmsrl(MSR_KERNEL_GS_BASE, vt->msr_host_kernel_gs_base);
 
-       if (tdx->guest_entered) {
-               tdx_user_return_msr_update_cache();
-               tdx->guest_entered = false;
-       }
-
        vt->guest_state_loaded = false;
 }
 
@@ -1067,7 +1069,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
                update_debugctlmsr(vcpu->arch.host_debugctl);
 
        tdx_load_host_xsave_state(vcpu);
-       tdx->guest_entered = true;
+
+       if (tdx->need_user_return_msr_update) {
+               tdx_user_return_msr_update_cache();
+               tdx->need_user_return_msr_update = false;
+       }
 
        vcpu->arch.regs_avail &= TDX_REGS_AVAIL_SET;
 
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index ca39a9391db1..fcac1627f71f 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -67,7 +67,7 @@ struct vcpu_tdx {
        u64 vp_enter_ret;
 
        enum vcpu_tdx_state state;
-       bool guest_entered;
+       bool need_user_return_msr_update;
 
        u64 map_gpa_next;
        u64 map_gpa_end;



* Re: [PATCH] KVM: x86: Drop "cache" from user return MSR setter that skips WRMSR
  2025-09-30 16:29     ` Sean Christopherson
@ 2025-09-30 16:34       ` Sean Christopherson
  2025-10-15  0:55         ` Yan Zhao
  2025-10-03 13:53       ` Yan Zhao
  1 sibling, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2025-09-30 16:34 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Paolo Bonzini, kvm, linux-kernel, Xiaoyao Li, Rick Edgecombe

On Tue, Sep 30, 2025, Sean Christopherson wrote:
> On Tue, Sep 30, 2025, Yan Zhao wrote:
> > On Tue, Sep 30, 2025 at 08:22:41PM +0800, Yan Zhao wrote:
> > > On Fri, Sep 19, 2025 at 02:42:59PM -0700, Sean Christopherson wrote:
> > > > Rename kvm_user_return_msr_update_cache() to __kvm_set_user_return_msr()
> > > > and use the helper kvm_set_user_return_msr() to make it obvious that the
> > > > double-underscores version is doing a subset of the work of the "full"
> > > > setter.
> > > > 
> > > > While the function does indeed update a cache, the nomenclature becomes
> > > > slightly misleading when adding a getter[1], as the current value isn't
> > > > _just_ the cached value, it's also the value that's currently loaded in
> > > > hardware.
> > > Nit:
> > > 
> > > For TDX, "it's also the value that's currently loaded in hardware" is not true.
> > since tdx module invokes wrmsr()s before each exit to VMM, while KVM only
> > invokes __kvm_set_user_return_msr() in tdx_vcpu_put().
> 
> No?  kvm_user_return_msr_update_cache() is passed the value that's currently
> loaded in hardware, by way of the TDX-Module zeroing some MSRs on TD-Exit.
> 
> Ah, I suspect you're calling out that the cache can be stale.  Maybe this?
> 
>   While the function does indeed update a cache, the nomenclature becomes
>   slightly misleading when adding a getter[1], as the current value isn't
>   _just_ the cached value, it's also the value that's currently loaded in
>   hardware (ignoring that the cache holds stale data until the vCPU is put,
>   i.e. until KVM prepares to switch back to the host).
> 
> Actually, that's a bug waiting to happen when the getter comes along.  Rather
> than document the potential pitfall, what about adding a prep patch to minimize
> the window?  Then _this_ patch shouldn't need the caveat about the cache being
> stale.

Ha!  It's technically a bug fix.  Because a forced shutdown will invoke
kvm_shutdown() without waiting for tasks to exit, and so the on_each_cpu() calls
to kvm_disable_virtualization_cpu() can call kvm_on_user_return() and thus
consume a stale values->curr.


* Re: [PATCH] KVM: x86: Drop "cache" from user return MSR setter that skips WRMSR
  2025-09-30 16:29     ` Sean Christopherson
  2025-09-30 16:34       ` Sean Christopherson
@ 2025-10-03 13:53       ` Yan Zhao
  2025-10-03 16:53         ` Sean Christopherson
  1 sibling, 1 reply; 16+ messages in thread
From: Yan Zhao @ 2025-10-03 13:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Xiaoyao Li, Rick Edgecombe

Sorry for the slow response due to the PRC holiday.

On Tue, Sep 30, 2025 at 09:29:00AM -0700, Sean Christopherson wrote:
> On Tue, Sep 30, 2025, Yan Zhao wrote:
> > On Tue, Sep 30, 2025 at 08:22:41PM +0800, Yan Zhao wrote:
> > > On Fri, Sep 19, 2025 at 02:42:59PM -0700, Sean Christopherson wrote:
> > > > Rename kvm_user_return_msr_update_cache() to __kvm_set_user_return_msr()
> > > > and use the helper kvm_set_user_return_msr() to make it obvious that the
> > > > double-underscores version is doing a subset of the work of the "full"
> > > > setter.
> > > > 
> > > > While the function does indeed update a cache, the nomenclature becomes
> > > > slightly misleading when adding a getter[1], as the current value isn't
> > > > _just_ the cached value, it's also the value that's currently loaded in
> > > > hardware.
> > > Nit:
> > > 
> > > For TDX, "it's also the value that's currently loaded in hardware" is not true.
> > since tdx module invokes wrmsr()s before each exit to VMM, while KVM only
> > invokes __kvm_set_user_return_msr() in tdx_vcpu_put().
> 
> No?  kvm_user_return_msr_update_cache() is passed the value that's currently
> loaded in hardware, by way of the TDX-Module zeroing some MSRs on TD-Exit.
> 
> Ah, I suspect you're calling out that the cache can be stale.  Maybe this?
Right. But not just that the cache can be stale. My previous reply was quite
misleading.

As shown in the tables below, where
CURR: msrs->values[slot].curr.
REAL: value that's currently loaded in hardware

For TDs,
                            CURR          REAL
   -----------------------------------------------------------------------
1. enable virtualization    host value    host value

2. TDH.VP.ENTER             host value    guest value (updated by tdx module)
3. TDH.VP.ENTER return      host value    defval (updated by tdx module)
4. tdx_vcpu_put             defval        defval
5. exit to user mode        host value    host value


For normal VMs,
                            CURR                 REAL
   -----------------------------------------------------------------------
1. enable virtualization    host value           host value

2. before vcpu_run          shadow guest value   shadow guest value
3. after vcpu_run           shadow guest value   shadow guest value
4. exit to user mode        host value           host value


Unlike normal VMs, where msrs->values[slot].curr always matches the value
that's currently loaded in hardware, for TDs msrs->values[slot].curr does not
contain the value that's currently loaded in hardware in stages 2-3.


>   While the function does indeed update a cache, the nomenclature becomes
>   slightly misleading when adding a getter[1], as the current value isn't
>   _just_ the cached value, it's also the value that's currently loaded in
>   hardware (ignoring that the cache holds stale data until the vCPU is put,
So, "stale data" is not accurate.
It just can't hold the value currently loaded in hardware while the guest is
running in a TD.

>   i.e. until KVM prepares to switch back to the host).
> 
> Actually, that's a bug waiting to happen when the getter comes along.  Rather
> than document the potential pitfall, what about adding a prep patch to mimize
> the window?  Then _this_ patch shouldn't need the caveat about the cache being
> stale.
With below patch,

For TDs,
                            CURR          REAL
   -----------------------------------------------------------------------
1. enable virtualization    host value    host value

2. before TDH.VP.ENTER      defval        host value or defval
3. TDH.VP.ENTER             defval        guest value (updated by tdx module)
4. TDH.VP.ENTER return      defval        defval (updated by tdx module)
5. exit to user mode        host value    host value

msrs->values[slot].curr is still not the current value loaded in hardware.

> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index ff41d3d00380..326fa81cb35f 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -789,6 +789,14 @@ void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
>                 vt->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
>  
>         vt->guest_state_loaded = true;
> +
> +       /*
> +        * Several of KVM's user-return MSRs are clobbered by the TDX-Module if
Hmm, my previous mail didn't mention that, besides saving the guest value and
clobbering the hardware value before exiting to the VMM, the TDX module also
loads the saved guest value into hardware on TDH.VP.ENTER.

> +        * VP.ENTER succeeds, i.e. on TD-Exit.  Mark those MSRs as needing an
> +        * update to synchronize the "current" value in KVM's cache with the
> +        * value in hardware (loaded by the TDX-Module).
> +        */
> +       to_tdx(vcpu)->need_user_return_msr_update = true;
>  }
>  
>  struct tdx_uret_msr {
> @@ -816,7 +824,6 @@ static void tdx_user_return_msr_update_cache(void)
>  static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
>  {
>         struct vcpu_vt *vt = to_vt(vcpu);
> -       struct vcpu_tdx *tdx = to_tdx(vcpu);
>  
>         if (!vt->guest_state_loaded)
>                 return;
> @@ -824,11 +831,6 @@ static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
>         ++vcpu->stat.host_state_reload;
>         wrmsrl(MSR_KERNEL_GS_BASE, vt->msr_host_kernel_gs_base);
>  
> -       if (tdx->guest_entered) {
> -               tdx_user_return_msr_update_cache();
> -               tdx->guest_entered = false;
> -       }
> -
>         vt->guest_state_loaded = false;
>  }
>  
> @@ -1067,7 +1069,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
>                 update_debugctlmsr(vcpu->arch.host_debugctl);
>  
>         tdx_load_host_xsave_state(vcpu);
> -       tdx->guest_entered = true;
> +
> +       if (tdx->need_user_return_msr_update) {
> +               tdx_user_return_msr_update_cache();
> +               tdx->need_user_return_msr_update = false;
> +       }
>  
>         vcpu->arch.regs_avail &= TDX_REGS_AVAIL_SET;
>  
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index ca39a9391db1..fcac1627f71f 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -67,7 +67,7 @@ struct vcpu_tdx {
>         u64 vp_enter_ret;
>  
>         enum vcpu_tdx_state state;
> -       bool guest_entered;
> +       bool need_user_return_msr_update;
>  
>         u64 map_gpa_next;
>         u64 map_gpa_end;
> 


* Re: [PATCH] KVM: x86: Drop "cache" from user return MSR setter that skips WRMSR
  2025-10-03 13:53       ` Yan Zhao
@ 2025-10-03 16:53         ` Sean Christopherson
  2025-10-13 13:47           ` Yan Zhao
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2025-10-03 16:53 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Paolo Bonzini, kvm, linux-kernel, Xiaoyao Li, Rick Edgecombe

On Fri, Oct 03, 2025, Yan Zhao wrote:
> Sorry for the slow response due to the PRC holiday.
> 
> On Tue, Sep 30, 2025 at 09:29:00AM -0700, Sean Christopherson wrote:
> > On Tue, Sep 30, 2025, Yan Zhao wrote:
> > > On Tue, Sep 30, 2025 at 08:22:41PM +0800, Yan Zhao wrote:
> > > > On Fri, Sep 19, 2025 at 02:42:59PM -0700, Sean Christopherson wrote:
> > > > > Rename kvm_user_return_msr_update_cache() to __kvm_set_user_return_msr()
> > > > > and use the helper kvm_set_user_return_msr() to make it obvious that the
> > > > > double-underscores version is doing a subset of the work of the "full"
> > > > > setter.
> > > > > 
> > > > > While the function does indeed update a cache, the nomenclature becomes
> > > > > slightly misleading when adding a getter[1], as the current value isn't
> > > > > _just_ the cached value, it's also the value that's currently loaded in
> > > > > hardware.
> > > > Nit:
> > > > 
> > > > For TDX, "it's also the value that's currently loaded in hardware" is not true.
> > > since tdx module invokes wrmsr()s before each exit to VMM, while KVM only
> > > invokes __kvm_set_user_return_msr() in tdx_vcpu_put().
> > 
> > No?  kvm_user_return_msr_update_cache() is passed the value that's currently
> > loaded in hardware, by way of the TDX-Module zeroing some MSRs on TD-Exit.
> > 
> > Ah, I suspect you're calling out that the cache can be stale.  Maybe this?
> Right. But not just that the cache can be stale. My previous reply was quite
> misleading.
> 
> As shown in the tables below, where
> CURR: msrs->values[slot].curr.
> REAL: value that's currently loaded in hardware
> 
> For TDs,
>                             CURR          REAL
>    -----------------------------------------------------------------------
> 1. enable virtualization    host value    host value
> 
> 2. TDH.VP.ENTER             host value    guest value (updated by tdx module)
> 3. TDH.VP.ENTER return      host value    defval (updated by tdx module)
> 4. tdx_vcpu_put             defval        defval
> 5. exit to user mode        host value    host value
> 
> 
> For normal VMs,
>                             CURR                 REAL
>    -----------------------------------------------------------------------
> 1. enable virtualization    host value           host value
> 2. before vcpu_run          shadow guest value   shadow guest value
> 3. after vcpu_run           shadow guest value   shadow guest value
> 4. exit to user mode        host value           host value
> 
> 
> > Unlike normal VMs, where msrs->values[slot].curr always matches the value
> that's currently loaded in hardware. 

That isn't actually true, see the bottom.

> For TDs, msrs->values[slot].curr does not contain the value that's currently
> loaded in hardware in stages 2-3.
> 
> >   While the function does indeed update a cache, the nomenclature becomes
> >   slightly misleading when adding a getter[1], as the current value isn't
> >   _just_ the cached value, it's also the value that's currently loaded in
> >   hardware (ignoring that the cache holds stale data until the vCPU is put,
> So, "stale data" is not accurate.
> It just can't hold the value currently loaded in hardware while the guest is
> running in a TD.

Eh, that's still "stale data" as far as KVM is concerned.  Though I'm splitting
hairs, I totally agree that as written the changelog is misleading.

> it's also the value that's currently loaded in hardware.

I just need to append "when KVM is actively running" (or probably something more
verbose).

> >   i.e. until KVM prepares to switch back to the host).
> > 
> > Actually, that's a bug waiting to happen when the getter comes along.  Rather
> > than document the potential pitfall, what about adding a prep patch to minimize
> > the window?  Then _this_ patch shouldn't need the caveat about the cache being
> > stale.
> With below patch,
> 
> For TDs,
>                             CURR          REAL
>    -----------------------------------------------------------------------
> 1. enable virtualization    host value    host value
> 2. before TDH.VP.ENTER      defval        host value or defval
> 3. TDH.VP.ENTER             defval        guest value (updated by tdx module)
> 4. TDH.VP.ENTER return      defval        defval (updated by tdx module)
> 5. exit to user mode        host value    host value
> 
> msrs->values[slot].curr is still not the current value loaded in hardware.

Right, this is where it becomes stale from my perspective.

> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index ff41d3d00380..326fa81cb35f 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -789,6 +789,14 @@ void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> >                 vt->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
> >  
> >         vt->guest_state_loaded = true;
> > +
> > +       /*
> > +        * Several of KVM's user-return MSRs are clobbered by the TDX-Module if
> Hmm, my previous mail didn't mention that, besides saving the guest value and
> clobbering the hardware value before exiting to the VMM, the TDX module also
> loads the saved guest value into hardware on TDH.VP.ENTER.

That's not actually unique to TDX.  EFER is set up as a user return MSR, but is
context switched on VM-Enter/VM-Exit except when running on ancient hardware
without VM_{ENTRY,EXIT}_LOAD_IA32_EFER (and even then, only when KVM doesn't
need to atomically switch to avoid toggling EFER.NX while in the host).

I.e. msrs->values[<EFER slot>].curr won't match hardware either while running
the guest.  But because EFER is atomically loaded on VM-Exit in those cases, the
curr value can't be stale while KVM is running.


* Re: [PATCH] KVM: x86: Drop "cache" from user return MSR setter that skips WRMSR
  2025-10-03 16:53         ` Sean Christopherson
@ 2025-10-13 13:47           ` Yan Zhao
  0 siblings, 0 replies; 16+ messages in thread
From: Yan Zhao @ 2025-10-13 13:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Xiaoyao Li, Rick Edgecombe

On Fri, Oct 03, 2025 at 09:53:28AM -0700, Sean Christopherson wrote:
> On Fri, Oct 03, 2025, Yan Zhao wrote:
> > Sorry for the slow response due to the PRC holiday.
> > 
> > On Tue, Sep 30, 2025 at 09:29:00AM -0700, Sean Christopherson wrote:
> > > On Tue, Sep 30, 2025, Yan Zhao wrote:
> > > > On Tue, Sep 30, 2025 at 08:22:41PM +0800, Yan Zhao wrote:
> > > > > On Fri, Sep 19, 2025 at 02:42:59PM -0700, Sean Christopherson wrote:
> > > > > > Rename kvm_user_return_msr_update_cache() to __kvm_set_user_return_msr()
> > > > > > and use the helper kvm_set_user_return_msr() to make it obvious that the
> > > > > > double-underscores version is doing a subset of the work of the "full"
> > > > > > setter.
> > > > > > 
> > > > > > While the function does indeed update a cache, the nomenclature becomes
> > > > > > slightly misleading when adding a getter[1], as the current value isn't
> > > > > > _just_ the cached value, it's also the value that's currently loaded in
> > > > > > hardware.
> > > > > Nit:
> > > > > 
> > > > > For TDX, "it's also the value that's currently loaded in hardware" is not true.
> > > > since tdx module invokes wrmsr()s before each exit to VMM, while KVM only
> > > > invokes __kvm_set_user_return_msr() in tdx_vcpu_put().
> > > 
> > > No?  kvm_user_return_msr_update_cache() is passed the value that's currently
> > > loaded in hardware, by way of the TDX-Module zeroing some MSRs on TD-Exit.
> > > 
> > > Ah, I suspect you're calling out that the cache can be stale.  Maybe this?
> > Right. But not just that the cache can be stale. My previous reply was quite
> > misleading.
> > 
> > As shown in the tables below, where
> > CURR: msrs->values[slot].curr.
> > REAL: value that's currently loaded in hardware
> > 
> > For TDs,
> >                             CURR          REAL
> >    -----------------------------------------------------------------------
> > 1. enable virtualization    host value    host value
> > 
> > 2. TDH.VP.ENTER             host value    guest value (updated by tdx module)
> > 3. TDH.VP.ENTER return      host value    defval (updated by tdx module)
> > 4. tdx_vcpu_put             defval        defval
> > 5. exit to user mode        host value    host value
> > 
> > 
> > For normal VMs,
> >                             CURR                 REAL
> >    -----------------------------------------------------------------------
> > 1. enable virtualization    host value           host value
> > 2. before vcpu_run          shadow guest value   shadow guest value
> > 3. after vcpu_run           shadow guest value   shadow guest value
> > 4. exit to user mode        host value           host value
> > 
> > 
> > Unlike normal VMs, where msrs->values[slot].curr always matches the value
> > that's currently loaded in hardware.
> 
> That isn't actually true, see the bottom.
> 
> > For TDs, msrs->values[slot].curr does not contain the value that's currently
> > loaded in hardware in stages 2-3.
> > 
> > >   While the function does indeed update a cache, the nomenclature becomes
> > >   slightly misleading when adding a getter[1], as the current value isn't
> > >   _just_ the cached value, it's also the value that's currently loaded in
> > >   hardware (ignoring that the cache holds stale data until the vCPU is put,
> > So, "stale data" is not accurate.
> > It just can't hold the current hardware loaded value when guest is running in
> > TD.
> 
> Eh, that's still "stale data" as far as KVM is concerned.  Though I'm splitting
> hairs, I totally agree that as written the changelog is misleading.
> 
> > it's also the value that's currently loaded in hardware.
> 
> I just need to append "when KVM is actively running" (or probably something more
> verbose).
> 
> > >   i.e. until KVM prepares to switch back to the host).
> > > 
> > > Actually, that's a bug waiting to happen when the getter comes along.  Rather
> > > than document the potential pitfall, what about adding a prep patch to minimize
> > > the window?  Then _this_ patch shouldn't need the caveat about the cache being
> > > stale.
> > With below patch,
> > 
> > For TDs,
> >                             CURR          REAL
> >    -----------------------------------------------------------------------
> > 1. enable virtualization    host value    host value
> > 2. before TDH.VP.ENTER      defval        host value or defval
> > 3. TDH.VP.ENTER             defval        guest value (updated by tdx module)
> > 4. TDH.VP.ENTER return      defval        defval (updated by tdx module)
> > 5. exit to user mode        host value    host value
> > 
> > msrs->values[slot].curr is still not the current value loaded in hardware.
> 
> Right, this where it becomes stale from my perspective.
Ok. Then even though after your fix msrs->values[slot].curr still doesn't match
the hardware value in stage 2, I think it's harmless.

> > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > index ff41d3d00380..326fa81cb35f 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -789,6 +789,14 @@ void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> > >                 vt->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
> > >  
> > >         vt->guest_state_loaded = true;
> > > +
> > > +       /*
> > > +        * Several of KVM's user-return MSRs are clobbered by the TDX-Module if
> > Hmm, my previous mail didn't mention that besides saving the guest value and
> > clobbering the hardware value before exit to the VMM, the TDX module also loads
> > the saved guest value into hardware on TDH.VP.ENTER.
> 
> That's not actually unique to TDX.  EFER is setup as a user return MSR, but is
> context switched on VM-Enter/VM-Exit except when running on ancient hardware
> without VM_{ENTRY,EXIT}_LOAD_IA32_EFER (and even then, only when KVM doesn't
> need to atomically switch to avoid toggling EFER.NX while in the host).
>
> I.e. msrs->values[<EFER slot>].curr won't match hardware either while running
> the guest.  But because EFER is atomically loaded on VM-Exit in those cases, the
> curr value can't be stale while KVM is running.
Oh, yes.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] KVM: x86: Drop "cache" from user return MSR setter that skips WRMSR
  2025-09-30 16:34       ` Sean Christopherson
@ 2025-10-15  0:55         ` Yan Zhao
  2025-10-15 16:19           ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: Yan Zhao @ 2025-10-15  0:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Xiaoyao Li, Rick Edgecombe

On Tue, Sep 30, 2025 at 09:34:20AM -0700, Sean Christopherson wrote:
> On Tue, Sep 30, 2025, Sean Christopherson wrote:
> > On Tue, Sep 30, 2025, Yan Zhao wrote:
> > > On Tue, Sep 30, 2025 at 08:22:41PM +0800, Yan Zhao wrote:
> > > > On Fri, Sep 19, 2025 at 02:42:59PM -0700, Sean Christopherson wrote:
> > > > > Rename kvm_user_return_msr_update_cache() to __kvm_set_user_return_msr()
> > > > > and use the helper kvm_set_user_return_msr() to make it obvious that the
> > > > > double-underscores version is doing a subset of the work of the "full"
> > > > > setter.
> > > > > 
> > > > > While the function does indeed update a cache, the nomenclature becomes
> > > > > slightly misleading when adding a getter[1], as the current value isn't
> > > > > _just_ the cached value, it's also the value that's currently loaded in
> > > > > hardware.
> > > > Nit:
> > > > 
> > > > For TDX, "it's also the value that's currently loaded in hardware" is not true.
> > > since tdx module invokes wrmsr()s before each exit to VMM, while KVM only
> > > invokes __kvm_set_user_return_msr() in tdx_vcpu_put().
> > 
> > No?  kvm_user_return_msr_update_cache() is passed the value that's currently
> > loaded in hardware, by way of the TDX-Module zeroing some MSRs on TD-Exit.
> > 
> > Ah, I suspect you're calling out that the cache can be stale.  Maybe this?
> > 
> >   While the function does indeed update a cache, the nomenclature becomes
> >   slightly misleading when adding a getter[1], as the current value isn't
> >   _just_ the cached value, it's also the value that's currently loaded in
> >   hardware (ignoring that the cache holds stale data until the vCPU is put,
> >   i.e. until KVM prepares to switch back to the host).
> > 
> > Actually, that's a bug waiting to happen when the getter comes along.  Rather
> > than document the potential pitfall, what about adding a prep patch to minimize
> > the window?  Then _this_ patch shouldn't need the caveat about the cache being
> > stale.
> 
> Ha!  It's technically a bug fix.  Because a forced shutdown will invoke
> kvm_shutdown() without waiting for tasks to exit, and so the on_each_cpu() calls
> to kvm_disable_virtualization_cpu() can call kvm_on_user_return() and thus
> consume a stale values->curr.
Looks like consuming a stale values->curr could also happen for normal VMs.

vmx_prepare_switch_to_guest
  |->kvm_set_user_return_msr //for all slots that load_into_hardware is true
       |->1) wrmsrq_safe(kvm_uret_msrs_list[slot], value);
       |  2) __kvm_set_user_return_msr(slot, value);
               |->msrs->values[slot].curr = value;
               |  kvm_user_return_register_notifier

As vmx_prepare_switch_to_guest() invokes kvm_set_user_return_msr() with local
irq enabled, there's a window where kvm_shutdown() may call
kvm_disable_virtualization_cpu() between steps 1) and 2). During this window,
the hardware contains the shadow guest value while values[slot].curr still holds
the host value.

In this scenario, if msrs->registered is true at step 1) (due to updating of a
previous slot), kvm_disable_virtualization_cpu() could call kvm_on_user_return()
and find "values->host == values->curr", which would leave the hardware value
set to the shadow guest value instead of restoring the host value.

Do you think it's a bug?
And do we need to fix it by disabling irq in kvm_set_user_return_msr() ? e.g.,

 int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
 {
        struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
+       unsigned long flags;
        int err;

        value = (value & mask) | (msrs->values[slot].host & ~mask);
        if (value == msrs->values[slot].curr)
                return 0;
+
+       local_irq_save(flags);
        err = wrmsrq_safe(kvm_uret_msrs_list[slot], value);
-       if (err)
+       if (err) {
+               local_irq_restore(flags);
                return 1;
+       }

        __kvm_set_user_return_msr(slot, value);
+       local_irq_restore(flags);
        return 0;
 }


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] KVM: x86: Drop "cache" from user return MSR setter that skips WRMSR
  2025-10-15  0:55         ` Yan Zhao
@ 2025-10-15 16:19           ` Sean Christopherson
  2025-10-16  8:58             ` Yan Zhao
  2025-10-16 13:05             ` Hou Wenlong
  0 siblings, 2 replies; 16+ messages in thread
From: Sean Christopherson @ 2025-10-15 16:19 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Paolo Bonzini, kvm, linux-kernel, Xiaoyao Li, Rick Edgecombe,
	Hou Wenlong

+Hou, who is trying to clean up the user-return registration code as well:

https://lore.kernel.org/all/15fa59ba7f6f849082fb36735e784071539d5ad2.1758002303.git.houwenlong.hwl@antgroup.com

On Wed, Oct 15, 2025, Yan Zhao wrote:
> On Tue, Sep 30, 2025 at 09:34:20AM -0700, Sean Christopherson wrote:
> > Ha!  It's technically a bug fix.  Because a forced shutdown will invoke
> > kvm_shutdown() without waiting for tasks to exit, and so the on_each_cpu() calls
> > to kvm_disable_virtualization_cpu() can call kvm_on_user_return() and thus
> > consume a stale values->curr.
> Looks like consuming a stale values->curr could also happen for normal VMs.
> 
> vmx_prepare_switch_to_guest
>   |->kvm_set_user_return_msr //for all slots that load_into_hardware is true
>        |->1) wrmsrq_safe(kvm_uret_msrs_list[slot], value);
>        |  2) __kvm_set_user_return_msr(slot, value);
>                |->msrs->values[slot].curr = value;
>                |  kvm_user_return_register_notifier
> 
> As vmx_prepare_switch_to_guest() invokes kvm_set_user_return_msr() with local
> irq enabled, there's a window where kvm_shutdown() may call
> kvm_disable_virtualization_cpu() between steps 1) and 2). During this window,
> the hardware contains the shadow guest value while values[slot].curr still holds
> the host value.
> 
> In this scenario, if msrs->registered is true at step 1) (due to updating of a
> previous slot), kvm_disable_virtualization_cpu() could call kvm_on_user_return()
> and find "values->host == values->curr", which would leave the hardware value
> set to the shadow guest value instead of restoring the host value.
> 
> Do you think it's a bug?
> And do we need to fix it by disabling irq in kvm_set_user_return_msr() ? e.g.,

Ugh.  It's technically a "bug" of sorts, but I really, really don't want to fix it
by disabling IRQs.

Back when commit 1650b4ebc99d ("KVM: Disable irq while unregistering user notifier")
disabled IRQs in kvm_on_user_return(), KVM blasted IPIs in the _normal_ flow, when
the last VM is destroyed (and also when enabling virtualization, which created
its own problems).

Now that KVM uses the cpuhp framework to enable/disable virtualization, the normal
case runs in task context, including kvm_suspend() and kvm_resume().  I.e. the only
path that can toggle virtualization via IPI callback is kvm_shutdown().  And on
reboot/shutdown, keeping the hook registered is ok as far as MSR state is concerned,
as the callback will run cleanly and restore host MSRs if the CPU manages to return
to userspace before the system goes down.

The only wrinkle is that if kvm.ko module unload manages to race with reboot, then
leaving the notifier registered could lead to use-after-free.  But that's only
possible on --forced reboot/shutdown, because otherwise userspace tasks would be
frozen before kvm_shutdown() is called, i.e. the CPU shouldn't return to userspace
after kvm_shutdown().  Furthermore, on a --forced reboot/shutdown, unregistering
the user-return hook from IRQ context is rather pointless, because KVM could immediately
re-register the hook, e.g. if the IRQ arrives before kvm_user_return_register_notifier()
is called.  I.e. the use-after-free isn't fully defended on --forced reboot/shutdown
anyways.

Given all of the above, my vote is to eliminate the IRQ disabling crud and simply
leave the user-return notifier registered on a reboot.  Then to defend against
a use-after-free due to kvm.ko unload racing against reboot, simply bump the module
refcount.  Trying to account for a rather absurd case in the normal paths adds a
ton of noise for almost no gain.

E.g.

---
 arch/x86/kvm/x86.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4b8138bd4857..f03f3ae836f8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -582,18 +582,12 @@ static void kvm_on_user_return(struct user_return_notifier *urn)
 	struct kvm_user_return_msrs *msrs
 		= container_of(urn, struct kvm_user_return_msrs, urn);
 	struct kvm_user_return_msr_values *values;
-	unsigned long flags;
 
-	/*
-	 * Disabling irqs at this point since the following code could be
-	 * interrupted and executed through kvm_arch_disable_virtualization_cpu()
-	 */
-	local_irq_save(flags);
 	if (msrs->registered) {
 		msrs->registered = false;
 		user_return_notifier_unregister(urn);
 	}
-	local_irq_restore(flags);
+
 	for (slot = 0; slot < kvm_nr_uret_msrs; ++slot) {
 		values = &msrs->values[slot];
 		if (values->host != values->curr) {
@@ -13079,7 +13073,21 @@ int kvm_arch_enable_virtualization_cpu(void)
 void kvm_arch_disable_virtualization_cpu(void)
 {
 	kvm_x86_call(disable_virtualization_cpu)();
-	drop_user_return_notifiers();
+
+	/*
+	 * Leave the user-return notifiers as-is when disabling virtualization
+	 * for reboot, i.e. when disabling via IPI function call, and instead
+	 * pin kvm.ko (if it's a module) to defend against use-after-free (in
+	 * the *very* unlikely scenario module unload is racing with reboot).
+	 * On a forced reboot, tasks aren't frozen before shutdown, and so KVM
+	 * could be actively modifying user-return MSR state when the IPI to
+	 * disable virtualization arrives.  Handle the extreme edge case here
+	 * instead of trying to account for it in the normal flows.
+	 */
+	if (in_task() || WARN_ON_ONCE(!kvm_rebooting))
+		drop_user_return_notifiers();
+	else
+		__module_get(THIS_MODULE);
 }
 
 bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
@@ -14363,6 +14371,11 @@ module_init(kvm_x86_init);
 
 static void __exit kvm_x86_exit(void)
 {
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		WARN_ON_ONCE(per_cpu_ptr(user_return_msrs, cpu)->registered);
+
 	WARN_ON_ONCE(static_branch_unlikely(&kvm_has_noapic_vcpu));
 }
 module_exit(kvm_x86_exit);

base-commit: fe57670bfaba66049529fe7a60a926d5f3397589
--

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] KVM: x86: Drop "cache" from user return MSR setter that skips WRMSR
  2025-10-15 16:19           ` Sean Christopherson
@ 2025-10-16  8:58             ` Yan Zhao
  2025-10-16 13:27               ` Hou Wenlong
  2025-10-16 13:05             ` Hou Wenlong
  1 sibling, 1 reply; 16+ messages in thread
From: Yan Zhao @ 2025-10-16  8:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Xiaoyao Li, Rick Edgecombe,
	Hou Wenlong

On Wed, Oct 15, 2025 at 09:19:00AM -0700, Sean Christopherson wrote:
> +Hou, who is trying to clean up the user-return registration code as well:
> 
> https://lore.kernel.org/all/15fa59ba7f6f849082fb36735e784071539d5ad2.1758002303.git.houwenlong.hwl@antgroup.com
> 
> On Wed, Oct 15, 2025, Yan Zhao wrote:
> > On Tue, Sep 30, 2025 at 09:34:20AM -0700, Sean Christopherson wrote:
> > > Ha!  It's technically a bug fix.  Because a forced shutdown will invoke
> > > kvm_shutdown() without waiting for tasks to exit, and so the on_each_cpu() calls
> > > to kvm_disable_virtualization_cpu() can call kvm_on_user_return() and thus
> > > consume a stale values->curr.
> > Looks like consuming a stale values->curr could also happen for normal VMs.
> > 
> > vmx_prepare_switch_to_guest
> >   |->kvm_set_user_return_msr //for all slots that load_into_hardware is true
> >        |->1) wrmsrq_safe(kvm_uret_msrs_list[slot], value);
> >        |  2) __kvm_set_user_return_msr(slot, value);
> >                |->msrs->values[slot].curr = value;
> >                |  kvm_user_return_register_notifier
> > 
> > As vmx_prepare_switch_to_guest() invokes kvm_set_user_return_msr() with local
> > irq enabled, there's a window where kvm_shutdown() may call
> > kvm_disable_virtualization_cpu() between steps 1) and 2). During this window,
> > the hardware contains the shadow guest value while values[slot].curr still holds
> > the host value.
> > 
> > In this scenario, if msrs->registered is true at step 1) (due to updating of a
> > previous slot), kvm_disable_virtualization_cpu() could call kvm_on_user_return()
> > and find "values->host == values->curr", which would leave the hardware value
> > set to the shadow guest value instead of restoring the host value.
> > 
> > Do you think it's a bug?
> > And do we need to fix it by disabling irq in kvm_set_user_return_msr() ? e.g.,
> 
> Ugh.  It's technically a "bug" of sorts, but I really, really don't want to fix it
> by disabling IRQs.
> 
> Back when commit 1650b4ebc99d ("KVM: Disable irq while unregistering user notifier")
> disabled IRQs in kvm_on_user_return(), KVM blasted IPIs in the _normal_ flow, when
> the last VM is destroyed (and also when enabling virtualization, which created
> its own problems).
> 
> Now that KVM uses the cpuhp framework to enable/disable virtualization, the normal
> case runs in task context, including kvm_suspend() and kvm_resume().  I.e. the only
> path that can toggle virtualization via IPI callback is kvm_shutdown().  And on
> reboot/shutdown, keeping the hook registered is ok as far as MSR state is concerned,
> as the callback will run cleanly and restore host MSRs if the CPU manages to return
> to userspace before the system goes down.
> 
> The only wrinkle is that if kvm.ko module unload manages to race with reboot, then
> leaving the notifier registered could lead to use-after-free.  But that's only
> possible on --forced reboot/shutdown, because otherwise userspace tasks would be
> frozen before kvm_shutdown() is called, i.e. the CPU shouldn't return to userspace
> after kvm_shutdown().  Furthermore, on a --forced reboot/shutdown, unregistering
> the user-return hook from IRQ context is rather pointless, because KVM could immediately
> re-register the hook, e.g. if the IRQ arrives before kvm_user_return_register_notifier()
> is called.  I.e. the use-after-free isn't fully defended on --forced reboot/shutdown
> anyways.
> 
> Given all of the above, my vote is to eliminate the IRQ disabling crud and simply
> leave the user-return notifier registered on a reboot.  Then to defend against
> a use-after-free due to kvm.ko unload racing against reboot, simply bump the module
> refcount.  Trying to account for a rather absurd case in the normal paths adds a
> ton of noise for almost no gain.
Thanks for the detailed explanation.

> E.g.
> 
> ---
>  arch/x86/kvm/x86.c | 29 +++++++++++++++++++++--------
>  1 file changed, 21 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4b8138bd4857..f03f3ae836f8 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -582,18 +582,12 @@ static void kvm_on_user_return(struct user_return_notifier *urn)
>  	struct kvm_user_return_msrs *msrs
>  		= container_of(urn, struct kvm_user_return_msrs, urn);
>  	struct kvm_user_return_msr_values *values;
> -	unsigned long flags;
>  
> -	/*
> -	 * Disabling irqs at this point since the following code could be
> -	 * interrupted and executed through kvm_arch_disable_virtualization_cpu()
> -	 */
> -	local_irq_save(flags);
>  	if (msrs->registered) {
>  		msrs->registered = false;
>  		user_return_notifier_unregister(urn);
>  	}
> -	local_irq_restore(flags);
> +
>  	for (slot = 0; slot < kvm_nr_uret_msrs; ++slot) {
>  		values = &msrs->values[slot];
>  		if (values->host != values->curr) {
> @@ -13079,7 +13073,21 @@ int kvm_arch_enable_virtualization_cpu(void)
>  void kvm_arch_disable_virtualization_cpu(void)
>  {
>  	kvm_x86_call(disable_virtualization_cpu)();
> -	drop_user_return_notifiers();
> +
> +	/*
> +	 * Leave the user-return notifiers as-is when disabling virtualization
> +	 * for reboot, i.e. when disabling via IPI function call, and instead
> +	 * pin kvm.ko (if it's a module) to defend against use-after-free (in
> +	 * the *very* unlikely scenario module unload is racing with reboot).
> +	 * On a forced reboot, tasks aren't frozen before shutdown, and so KVM
> +	 * could be actively modifying user-return MSR state when the IPI to
> +	 * disable virtualization arrives.  Handle the extreme edge case here
> +	 * instead of trying to account for it in the normal flows.
> +	 */
> +	if (in_task() || WARN_ON_ONCE(!kvm_rebooting))
kvm_offline_cpu() may be invoked with irqs enabled.
So does it depend on [1]?

[1] https://lore.kernel.org/kvm/aMirvo9Xly5fVmbY@google.com/

> +		drop_user_return_notifiers();
> +	else
> +		__module_get(THIS_MODULE);
Since kvm_vm_fops holds a ref on the kvm_intel module, and drop_user_return_notifiers()
is called in kvm_destroy_vm() or kvm_exit():

kvm_destroy_vm/kvm_exit
  kvm_disable_virtualization
    kvm_offline_cpu
      kvm_disable_virtualization_cpu
        drop_user_return_notifiers

Also, since fire_user_return_notifiers() executes with irqs disabled, is it
necessary to pin kvm.ko?

>  }
>  
>  bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
> @@ -14363,6 +14371,11 @@ module_init(kvm_x86_init);
>  
>  static void __exit kvm_x86_exit(void)
>  {
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		WARN_ON_ONCE(per_cpu_ptr(user_return_msrs, cpu)->registered);
> +
>  	WARN_ON_ONCE(static_branch_unlikely(&kvm_has_noapic_vcpu));
>  }
>  module_exit(kvm_x86_exit);
> 
> base-commit: fe57670bfaba66049529fe7a60a926d5f3397589
> --

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] KVM: x86: Drop "cache" from user return MSR setter that skips WRMSR
  2025-10-15 16:19           ` Sean Christopherson
  2025-10-16  8:58             ` Yan Zhao
@ 2025-10-16 13:05             ` Hou Wenlong
  1 sibling, 0 replies; 16+ messages in thread
From: Hou Wenlong @ 2025-10-16 13:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yan Zhao, Paolo Bonzini, kvm, linux-kernel, Xiaoyao Li,
	Rick Edgecombe

On Wed, Oct 15, 2025 at 09:19:00AM -0700, Sean Christopherson wrote:
> +Hou, who is trying to clean up the user-return registration code as well:
> 
> https://lore.kernel.org/all/15fa59ba7f6f849082fb36735e784071539d5ad2.1758002303.git.houwenlong.hwl@antgroup.com
>
Thanks for the CC.
 
> On Wed, Oct 15, 2025, Yan Zhao wrote:
> > On Tue, Sep 30, 2025 at 09:34:20AM -0700, Sean Christopherson wrote:
> > > Ha!  It's technically a bug fix.  Because a forced shutdown will invoke
> > > kvm_shutdown() without waiting for tasks to exit, and so the on_each_cpu() calls
> > > to kvm_disable_virtualization_cpu() can call kvm_on_user_return() and thus
> > > consume a stale values->curr.
> > Looks like consuming a stale values->curr could also happen for normal VMs.
> > 
> > vmx_prepare_switch_to_guest
> >   |->kvm_set_user_return_msr //for all slots that load_into_hardware is true
> >        |->1) wrmsrq_safe(kvm_uret_msrs_list[slot], value);
> >        |  2) __kvm_set_user_return_msr(slot, value);
> >                |->msrs->values[slot].curr = value;
> >                |  kvm_user_return_register_notifier
> > 
> > As vmx_prepare_switch_to_guest() invokes kvm_set_user_return_msr() with local
> > irq enabled, there's a window where kvm_shutdown() may call
> > kvm_disable_virtualization_cpu() between steps 1) and 2). During this window,
> > the hardware contains the shadow guest value while values[slot].curr still holds
> > the host value.
> > 
> > In this scenario, if msrs->registered is true at step 1) (due to updating of a
> > previous slot), kvm_disable_virtualization_cpu() could call kvm_on_user_return()
> > and find "values->host == values->curr", which would leave the hardware value
> > set to the shadow guest value instead of restoring the host value.
> > 
> > Do you think it's a bug?
> > And do we need to fix it by disabling irq in kvm_set_user_return_msr() ? e.g.,
> 
> Ugh.  It's technically a "bug" of sorts, but I really, really don't want to fix it
> by disabling IRQs.
> 
> Back when commit 1650b4ebc99d ("KVM: Disable irq while unregistering user notifier")
> disabled IRQs in kvm_on_user_return(), KVM blasted IPIs in the _normal_ flow, when
> when the last VM is destroyed (and also when enabling virtualization, which created
> its own problems).
>
Yes, when I looked up this commit, I was confused as well. It is a UAF
issue if fire_user_return_notifiers() runs in an IRQ-enabled state and
is interrupted by the unloading of the KVM module.  Both the data
(user_return_notifier) and the code could be subject to UAF, so simply
disabling IRQs in kvm_on_user_return(), as that commit does, is not
sufficient.

> Now that KVM uses the cpuhp framework to enable/disable virtualization, the normal
> case runs in task context, including kvm_suspend() and kvm_resume().  I.e. the only
> path that can toggle virtualization via IPI callback is kvm_shutdown().  And on
> reboot/shutdown, keeping the hook registered is ok as far as MSR state is concerned,
> as the callback will run cleanly and restore host MSRs if the CPU manages to return
> to userspace before the system goes down.
> 
> The only wrinkle is that if kvm.ko module unload manages to race with reboot, then
> leaving the notifier registered could lead to use-after-free.  But that's only
> possible on --forced reboot/shutdown, because otherwise userspace tasks would be
> frozen before kvm_shutdown() is called, i.e. the CPU shouldn't return to userspace
> after kvm_shutdown().  Furthermore, on a --forced reboot/shutdown, unregistering
> > the user-return hook from IRQ context is rather pointless, because KVM could immediately
> re-register the hook, e.g. if the IRQ arrives before kvm_user_return_register_notifier()
> is called.  I.e. the use-after-free isn't fully defended on --forced reboot/shutdown
> anyways.
> 

Thank you for providing the extra information about forced reboot/shutdown. I had
simply assumed that all userspace tasks are frozen before shutdown.

> Given all of the above, my vote is to eliminate the IRQ disabling crud and simply
> leave the user-return notifier registered on a reboot.  Then to defend against
> a use-after-free due to kvm.ko unload racing against reboot, simply bump the module
> refcount.  Trying to account for a rather absurd case in the normal paths adds a
> ton of noise for almost no gain.
> 
> E.g.
> 
> ---
>  arch/x86/kvm/x86.c | 29 +++++++++++++++++++++--------
>  1 file changed, 21 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4b8138bd4857..f03f3ae836f8 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -582,18 +582,12 @@ static void kvm_on_user_return(struct user_return_notifier *urn)
>  	struct kvm_user_return_msrs *msrs
>  		= container_of(urn, struct kvm_user_return_msrs, urn);
>  	struct kvm_user_return_msr_values *values;
> -	unsigned long flags;
>  
> -	/*
> -	 * Disabling irqs at this point since the following code could be
> -	 * interrupted and executed through kvm_arch_disable_virtualization_cpu()
> -	 */
> -	local_irq_save(flags);
>  	if (msrs->registered) {
>  		msrs->registered = false;
>  		user_return_notifier_unregister(urn);
>  	}
> -	local_irq_restore(flags);
> +
>  	for (slot = 0; slot < kvm_nr_uret_msrs; ++slot) {
>  		values = &msrs->values[slot];
>  		if (values->host != values->curr) {
> @@ -13079,7 +13073,21 @@ int kvm_arch_enable_virtualization_cpu(void)
>  void kvm_arch_disable_virtualization_cpu(void)
>  {
>  	kvm_x86_call(disable_virtualization_cpu)();
> -	drop_user_return_notifiers();
> +
> +	/*
> +	 * Leave the user-return notifiers as-is when disabling virtualization
> +	 * for reboot, i.e. when disabling via IPI function call, and instead
> +	 * pin kvm.ko (if it's a module) to defend against use-after-free (in
> +	 * the *very* unlikely scenario module unload is racing with reboot).
> +	 * On a forced reboot, tasks aren't frozen before shutdown, and so KVM
> +	 * could be actively modifying user-return MSR state when the IPI to
> +	 * disable virtualization arrives.  Handle the extreme edge case here
> +	 * instead of trying to account for it in the normal flows.
> +	 */
> +	if (in_task() || WARN_ON_ONCE(!kvm_rebooting))
> +		drop_user_return_notifiers();
> +	else
> +		__module_get(THIS_MODULE);
>  }
>  
>  bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
> @@ -14363,6 +14371,11 @@ module_init(kvm_x86_init);
>  
>  static void __exit kvm_x86_exit(void)
>  {
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		WARN_ON_ONCE(per_cpu_ptr(user_return_msrs, cpu)->registered);
> +
>  	WARN_ON_ONCE(static_branch_unlikely(&kvm_has_noapic_vcpu));
>  }
>  module_exit(kvm_x86_exit);
> 
> base-commit: fe57670bfaba66049529fe7a60a926d5f3397589
> --

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] KVM: x86: Drop "cache" from user return MSR setter that skips WRMSR
  2025-10-16  8:58             ` Yan Zhao
@ 2025-10-16 13:27               ` Hou Wenlong
  2025-10-16 13:41                 ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: Hou Wenlong @ 2025-10-16 13:27 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel, Xiaoyao Li,
	Rick Edgecombe

On Thu, Oct 16, 2025 at 04:58:17PM +0800, Yan Zhao wrote:
> On Wed, Oct 15, 2025 at 09:19:00AM -0700, Sean Christopherson wrote:
> > +Hou, who is trying to clean up the user-return registration code as well:
> > 
> > https://lore.kernel.org/all/15fa59ba7f6f849082fb36735e784071539d5ad2.1758002303.git.houwenlong.hwl@antgroup.com
> > 
> > On Wed, Oct 15, 2025, Yan Zhao wrote:
> > > On Tue, Sep 30, 2025 at 09:34:20AM -0700, Sean Christopherson wrote:
> > > > Ha!  It's technically a bug fix.  Because a forced shutdown will invoke
> > > > kvm_shutdown() without waiting for tasks to exit, and so the on_each_cpu() calls
> > > > to kvm_disable_virtualization_cpu() can call kvm_on_user_return() and thus
> > > > consume a stale values->curr.
> > > Looks like consuming a stale values->curr could also happen for normal VMs.
> > > 
> > > vmx_prepare_switch_to_guest
> > >   |->kvm_set_user_return_msr //for all slots that load_into_hardware is true
> > >        |->1) wrmsrq_safe(kvm_uret_msrs_list[slot], value);
> > >        |  2) __kvm_set_user_return_msr(slot, value);
> > >                |->msrs->values[slot].curr = value;
> > >                |  kvm_user_return_register_notifier
> > > 
> > > As vmx_prepare_switch_to_guest() invokes kvm_set_user_return_msr() with local
> > > irq enabled, there's a window where kvm_shutdown() may call
> > > kvm_disable_virtualization_cpu() between steps 1) and 2). During this window,
> > > the hardware contains the shadow guest value while values[slot].curr still holds
> > > the host value.
> > > 
> > > In this scenario, if msrs->registered is true at step 1) (due to updating of a
> > > previous slot), kvm_disable_virtualization_cpu() could call kvm_on_user_return()
> > > and find "values->host == values->curr", which would leave the hardware value
> > > set to the shadow guest value instead of restoring the host value.
> > > 
> > > Do you think it's a bug?
> > > And do we need to fix it by disabling irq in kvm_set_user_return_msr() ? e.g.,
> > 
> > Ugh.  It's technically a "bug" of sorts, but I really, really don't want to fix it
> > by disabling IRQs.
> > 
> > Back when commit 1650b4ebc99d ("KVM: Disable irq while unregistering user notifier")
> > disabled IRQs in kvm_on_user_return(), KVM blasted IPIs in the _normal_ flow, when
> > the last VM is destroyed (and also when enabling virtualization, which created
> > its own problems).
> > 
> > Now that KVM uses the cpuhp framework to enable/disable virtualization, the normal
> > case runs in task context, including kvm_suspend() and kvm_resume().  I.e. the only
> > path that can toggle virtualization via IPI callback is kvm_shutdown().  And on
> > reboot/shutdown, keeping the hook registered is ok as far as MSR state is concerned,
> > as the callback will run cleanly and restore host MSRs if the CPU manages to return
> > to userspace before the system goes down.
> > 
> > The only wrinkle is that if kvm.ko module unload manages to race with reboot, then
> > leaving the notifier registered could lead to use-after-free.  But that's only
> > possible on --forced reboot/shutdown, because otherwise userspace tasks would be
> > frozen before kvm_shutdown() is called, i.e. the CPU shouldn't return to userspace
> > after kvm_shutdown().  Furthermore, on a --forced reboot/shutdown, unregistering
> > the user-return hook from IRQ context is rather pointless, because KVM could immediately
> > re-register the hook, e.g. if the IRQ arrives before kvm_user_return_register_notifier()
> > is called.  I.e. the use-after-free isn't fully defended on --forced reboot/shutdown
> > anyways.
> > 
> > Given all of the above, my vote is to eliminate the IRQ disabling crud and simply
> > leave the user-return notifier registered on a reboot.  Then to defend against
> > a use-after-free due to kvm.ko unload racing against reboot, simply bump the module
> > refcount.  Trying to account for a rather absurd case in the normal paths adds a
> > ton of noise for almost no gain.
> Thanks for the detailed explanation.
> 
> > E.g.
> > 
> > ---
> >  arch/x86/kvm/x86.c | 29 +++++++++++++++++++++--------
> >  1 file changed, 21 insertions(+), 8 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 4b8138bd4857..f03f3ae836f8 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -582,18 +582,12 @@ static void kvm_on_user_return(struct user_return_notifier *urn)
> >  	struct kvm_user_return_msrs *msrs
> >  		= container_of(urn, struct kvm_user_return_msrs, urn);
> >  	struct kvm_user_return_msr_values *values;
> > -	unsigned long flags;
> >  
> > -	/*
> > -	 * Disabling irqs at this point since the following code could be
> > -	 * interrupted and executed through kvm_arch_disable_virtualization_cpu()
> > -	 */
> > -	local_irq_save(flags);
> >  	if (msrs->registered) {
> >  		msrs->registered = false;
> >  		user_return_notifier_unregister(urn);
> >  	}
> > -	local_irq_restore(flags);
> > +
> >  	for (slot = 0; slot < kvm_nr_uret_msrs; ++slot) {
> >  		values = &msrs->values[slot];
> >  		if (values->host != values->curr) {
> > @@ -13079,7 +13073,21 @@ int kvm_arch_enable_virtualization_cpu(void)
> >  void kvm_arch_disable_virtualization_cpu(void)
> >  {
> >  	kvm_x86_call(disable_virtualization_cpu)();
> > -	drop_user_return_notifiers();
> > +
> > +	/*
> > +	 * Leave the user-return notifiers as-is when disabling virtualization
> > +	 * for reboot, i.e. when disabling via IPI function call, and instead
> > +	 * pin kvm.ko (if it's a module) to defend against use-after-free (in
> > +	 * the *very* unlikely scenario module unload is racing with reboot).
> > +	 * On a forced reboot, tasks aren't frozen before shutdown, and so KVM
> > +	 * could be actively modifying user-return MSR state when the IPI to
> > +	 * disable virtualization arrives.  Handle the extreme edge case here
> > +	 * instead of trying to account for it in the normal flows.
> > +	 */
> > +	if (in_task() || WARN_ON_ONCE(!kvm_rebooting))
> kvm_offline_cpu() may be invoked when irq is enabled.
> So does it depend on [1]?
> 
> [1] https://lore.kernel.org/kvm/aMirvo9Xly5fVmbY@google.com/
>

Actually, kvm_offline_cpu() can't be interrupted by kvm_shutdown():
syscore_shutdown() is always called after migrate_to_reboot_cpu(), which
internally waits for any in-flight CPU hotplug operation to complete, as
described in [*].

[*] https://lore.kernel.org/kvm/dd4b8286774df98d58b5048e380b10d4de5836af.camel@intel.com


> > +		drop_user_return_notifiers();
> > +	else
> > +		__module_get(THIS_MODULE);
> Since kvm_vm_fops holds a ref on module kvm_intel, and drop_user_return_notifiers()
> is called in kvm_destroy_vm() or kvm_exit():
> 
> kvm_destroy_vm/kvm_exit
>   kvm_disable_virtualization
>     kvm_offline_cpu
>       kvm_disable_virtualization_cpu
>         drop_user_return_notifiers
> 
> also since fire_user_return_notifiers() executes with irq disabled, is it
> necessary to pin kvm.ko?
> 
> >  }
> >  
> >  bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
> > @@ -14363,6 +14371,11 @@ module_init(kvm_x86_init);
> >  
> >  static void __exit kvm_x86_exit(void)
> >  {
> > +	int cpu;
> > +
> > +	for_each_possible_cpu(cpu)
> > +		WARN_ON_ONCE(per_cpu_ptr(user_return_msrs, cpu)->registered);
> > +
> >  	WARN_ON_ONCE(static_branch_unlikely(&kvm_has_noapic_vcpu));
> >  }
> >  module_exit(kvm_x86_exit);
> > 
> > base-commit: fe57670bfaba66049529fe7a60a926d5f3397589
> > --


* Re: [PATCH] KVM: x86: Drop "cache" from user return MSR setter that skips WRMSR
  2025-10-16 13:27               ` Hou Wenlong
@ 2025-10-16 13:41                 ` Sean Christopherson
  2025-10-20  7:53                   ` Yan Zhao
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2025-10-16 13:41 UTC (permalink / raw)
  To: Hou Wenlong
  Cc: Yan Zhao, Paolo Bonzini, kvm, linux-kernel, Xiaoyao Li,
	Rick Edgecombe

On Thu, Oct 16, 2025, Hou Wenlong wrote:
> On Thu, Oct 16, 2025 at 04:58:17PM +0800, Yan Zhao wrote:
> > On Wed, Oct 15, 2025 at 09:19:00AM -0700, Sean Christopherson wrote:
> > > +	/*
> > > +	 * Leave the user-return notifiers as-is when disabling virtualization
> > > +	 * for reboot, i.e. when disabling via IPI function call, and instead
> > > +	 * pin kvm.ko (if it's a module) to defend against use-after-free (in
> > > +	 * the *very* unlikely scenario module unload is racing with reboot).
> > > +	 * On a forced reboot, tasks aren't frozen before shutdown, and so KVM
> > > +	 * could be actively modifying user-return MSR state when the IPI to
> > > +	 * disable virtualization arrives.  Handle the extreme edge case here
> > > +	 * instead of trying to account for it in the normal flows.
> > > +	 */
> > > +	if (in_task() || WARN_ON_ONCE(!kvm_rebooting))
> > kvm_offline_cpu() may be invoked when irq is enabled.
> > So does it depend on [1]?
> > 
> > [1] https://lore.kernel.org/kvm/aMirvo9Xly5fVmbY@google.com/
> >
> 
> Actually, kvm_offline_cpu() can't be interrupted by kvm_shutdown().
> syscore_shutdown() is always called after
> migrate_to_reboot_cpu(), which internally waits for currently running
> CPU hotplug to complete, as described in [*].
> 
> [*] https://lore.kernel.org/kvm/dd4b8286774df98d58b5048e380b10d4de5836af.camel@intel.com
> 
> 
> > > +		drop_user_return_notifiers();
> > > +	else
> > > +		__module_get(THIS_MODULE);
> > Since kvm_vm_fops holds a ref on module kvm_intel, and drop_user_return_notifiers()
> > is called in kvm_destroy_vm() or kvm_exit():
> > 
> > kvm_destroy_vm/kvm_exit
> >   kvm_disable_virtualization
> >     kvm_offline_cpu
> >       kvm_disable_virtualization_cpu
> >         drop_user_return_notifiers
> > 
> > also since fire_user_return_notifiers() executes with irq disabled, is it
> > necessary to pin kvm.ko?

Pinning kvm.ko is necessary because kvm_disable_virtualization_cpu() will bail
early due to virtualization_enabled being false (it will have been cleared by
the IPI call from kvm_shutdown()).  We could try figuring out a way around that,
but I don't see an easy solution, and in practice I can't think of any meaningful
downside to pinning kvm.ko.

I don't want to leave virtualization_enabled set because that's completely wrong
for everything except x86's user-return MSRs, which aren't even strictly related
to enabling virtualization.

I considered calling drop_user_return_notifiers() directly from kvm_exit(), but
that would require more special-case code, and it would mean blasting an IPI to
all CPUs, which seems like a bad idea when we know the system is trying to reboot.
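
For context, the ordering Hou cites comes from the reboot path; a rough
sketch (paraphrasing kernel/reboot.c, details vary by kernel version):

```
void kernel_restart(char *cmd)
{
	kernel_restart_prepare(cmd);	/* reboot notifiers, device shutdown */
	migrate_to_reboot_cpu();	/* waits for in-flight CPU hotplug */
	syscore_shutdown();		/* invokes kvm_shutdown() via syscore ops */
	...
	machine_restart(cmd);
}
```

So by the time kvm_shutdown() sends its IPIs, any kvm_offline_cpu() call
has already completed.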


* Re: [PATCH] KVM: x86: Drop "cache" from user return MSR setter that skips WRMSR
  2025-10-16 13:41                 ` Sean Christopherson
@ 2025-10-20  7:53                   ` Yan Zhao
  0 siblings, 0 replies; 16+ messages in thread
From: Yan Zhao @ 2025-10-20  7:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Hou Wenlong, Paolo Bonzini, kvm, linux-kernel, Xiaoyao Li,
	Rick Edgecombe

On Thu, Oct 16, 2025 at 06:41:03AM -0700, Sean Christopherson wrote:
> On Thu, Oct 16, 2025, Hou Wenlong wrote:
> > On Thu, Oct 16, 2025 at 04:58:17PM +0800, Yan Zhao wrote:
> > > On Wed, Oct 15, 2025 at 09:19:00AM -0700, Sean Christopherson wrote:
> > > > +	/*
> > > > +	 * Leave the user-return notifiers as-is when disabling virtualization
> > > > +	 * for reboot, i.e. when disabling via IPI function call, and instead
> > > > +	 * pin kvm.ko (if it's a module) to defend against use-after-free (in
> > > > +	 * the *very* unlikely scenario module unload is racing with reboot).
> > > > +	 * On a forced reboot, tasks aren't frozen before shutdown, and so KVM
> > > > +	 * could be actively modifying user-return MSR state when the IPI to
> > > > +	 * disable virtualization arrives.  Handle the extreme edge case here
> > > > +	 * instead of trying to account for it in the normal flows.
> > > > +	 */
> > > > +	if (in_task() || WARN_ON_ONCE(!kvm_rebooting))
> > > kvm_offline_cpu() may be invoked when irq is enabled.
> > > So does it depend on [1]?
> > > 
> > > [1] https://lore.kernel.org/kvm/aMirvo9Xly5fVmbY@google.com/
> > >
> > 
> > Actually, kvm_offline_cpu() can't be interrupted by kvm_shutdown().
> > syscore_shutdown() is always called after
> > migrate_to_reboot_cpu(), which internally waits for currently running
> > CPU hotplug to complete, as described in [*].
Got it. Thanks!

> > [*] https://lore.kernel.org/kvm/dd4b8286774df98d58b5048e380b10d4de5836af.camel@intel.com
> > 
> > 
> > > > +		drop_user_return_notifiers();
> > > > +	else
> > > > +		__module_get(THIS_MODULE);
> > > Since kvm_vm_fops holds a ref on module kvm_intel, and drop_user_return_notifiers()
> > > is called in kvm_destroy_vm() or kvm_exit():
> > > 
> > > kvm_destroy_vm/kvm_exit
> > >   kvm_disable_virtualization
> > >     kvm_offline_cpu
> > >       kvm_disable_virtualization_cpu
> > >         drop_user_return_notifiers
> > > 
> > > also since fire_user_return_notifiers() executes with irq disabled, is it
> > > necessary to pin kvm.ko?
> 
> Pinning kvm.ko is necessary because kvm_disable_virtualization_cpu() will bail
> early due to virtualization_enabled being false (it will have been cleared by
> the IPI call from kvm_shutdown()).  We could try figuring out a way around that,
> but I don't see an easy solution, and in practice I can't think of any meaningful
> downside to pinning kvm.ko.
You are right!

> I don't want to leave virtualization_enabled set because that's completely wrong
> for everything except x86's user-return MSRs, which aren't even strictly related
> to enabling virtualization.
> 
> I considered calling drop_user_return_notifiers() directly from kvm_exit(), but
> that would require more special-case code, and it would mean blasting an IPI to
> all CPUs, which seems like a bad idea when we know the system is trying to reboot.
Agreed. Thanks for the explanation!


end of thread, newest message: 2025-10-20  7:55 UTC

Thread overview: 16+ messages
2025-09-19 21:42 [PATCH] KVM: x86: Drop "cache" from user return MSR setter that skips WRMSR Sean Christopherson
2025-09-23  3:20 ` Xiaoyao Li
2025-09-30 12:22 ` Yan Zhao
2025-09-30 12:58   ` Yan Zhao
2025-09-30 16:29     ` Sean Christopherson
2025-09-30 16:34       ` Sean Christopherson
2025-10-15  0:55         ` Yan Zhao
2025-10-15 16:19           ` Sean Christopherson
2025-10-16  8:58             ` Yan Zhao
2025-10-16 13:27               ` Hou Wenlong
2025-10-16 13:41                 ` Sean Christopherson
2025-10-20  7:53                   ` Yan Zhao
2025-10-16 13:05             ` Hou Wenlong
2025-10-03 13:53       ` Yan Zhao
2025-10-03 16:53         ` Sean Christopherson
2025-10-13 13:47           ` Yan Zhao
