public inbox for stable@vger.kernel.org
* [PATCH] x86/virt/tdx: Fix lockdep assertion failure in cache flush for kexec
@ 2026-03-02 10:22 Kai Huang
  2026-03-02 10:22 ` [PATCH v2] " Kai Huang
  2026-03-09 16:38 ` [PATCH] " Edgecombe, Rick P
  0 siblings, 2 replies; 14+ messages in thread
From: Kai Huang @ 2026-03-02 10:22 UTC (permalink / raw)
  To: dave.hansen, pbonzini, seanjc, kas
  Cc: rick.p.edgecombe, tglx, bp, mingo, x86, hpa, linux-kernel,
	Kai Huang, stable, Vishal Verma

TDX can leave the cache in an incoherent state for the memory it uses.
During kexec the kernel does a WBINVD for each CPU before memory gets
reused in the second kernel.

There were two considerations for where this WBINVD should happen.  In
order to handle cases where the cache might get into an incoherent state
while the kexec is in the initial stages, it is needed to do this later
in the kexec path, when the kexecing CPU stops all remote CPUs.  However,
the later kexec process is sensitive to existing races.  So to avoid
perturbing that operation, it is better to do it earlier.

The existing solution is to track the need for the kexec time WBINVD
generically (i.e., not just for TDX) in a per-cpu var.  The late
invocation only happens if the earlier TDX specific logic in
tdx_cpu_flush_cache_for_kexec() didn’t take care of the work.  This
earlier WBINVD logic was built into KVM’s existing syscore ops shutdown()
handler, which is called earlier in the kexec path.

However, this accidentally added it to KVM’s unload path as well (also
the "error path" when bringing up TDX during KVM module load), which
uses the same internal functions.  This makes some sense too, though,
because if KVM is getting unloaded, TDX cache affecting operations will
likely cease.  So it is a good point to do the work before KVM is
unloaded and won't have a chance to handle the shutdown operation in the
future.

Unfortunately this KVM unload invocation triggers a lockdep warning in
tdx_cpu_flush_cache_for_kexec().  Since tdx_cpu_flush_cache_for_kexec()
is doing WBINVD on a specific CPU, it has an assert for preemption being
disabled.  This works fine for the kexec time invocation, but the KVM
unload path calls this as part of a CPUHP callback for which, despite
always executing on the target CPU, preemption is not disabled.

It might be better to add the earlier invocation logic to a dedicated
arch/x86 TDX syscore shutdown() handler, but to make the fix more
backport friendly just adjust the lockdep assert in the
tdx_cpu_flush_cache_for_kexec().

The real requirement is tdx_cpu_flush_cache_for_kexec() must be done on
the same CPU.  It's OK that it can be preempted in the middle as long as
it won't be rescheduled to another CPU.

Remove the too strong lockdep_assert_preemption_disabled(), and change
this_cpu_{read|write}() to __this_cpu_{read|write}() which provide the
more proper check (when CONFIG_DEBUG_PREEMPT is true), which checks all
conditions that the context cannot be moved to another CPU to run in the
middle.

Fixes: 61221d07e815 ("KVM/TDX: Explicitly do WBINVD when no more TDX SEAMCALLs")
Cc: stable@vger.kernel.org
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
---

Hi Dave, Paolo, Sean,

/facepalm.

This was recently reported by Vishal.  Sorry that I forgot to test the
module unloading (I was too focused on the kexec path, which doesn't have
this issue).  This wasn't caught by our CI because there is no such test
case.  We are adding one to the CI now so it will be covered in the
future.

---
 arch/x86/virt/vmx/tdx/tdx.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 8b8e165a2001..6f6be1df4b78 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1872,9 +1872,7 @@ EXPORT_SYMBOL_FOR_KVM(tdh_phymem_page_wbinvd_hkid);
 #ifdef CONFIG_KEXEC_CORE
 void tdx_cpu_flush_cache_for_kexec(void)
 {
-	lockdep_assert_preemption_disabled();
-
-	if (!this_cpu_read(cache_state_incoherent))
+	if (!__this_cpu_read(cache_state_incoherent))
 		return;
 
 	/*
@@ -1883,7 +1881,7 @@ void tdx_cpu_flush_cache_for_kexec(void)
 	 * there should be no more SEAMCALLs on this CPU.
 	 */
 	wbinvd();
-	this_cpu_write(cache_state_incoherent, false);
+	__this_cpu_write(cache_state_incoherent, false);
 }
 EXPORT_SYMBOL_FOR_KVM(tdx_cpu_flush_cache_for_kexec);
 #endif

base-commit: b5425f5406ee1b4bd84720f68020ef18ce380bab
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2] x86/virt/tdx: Fix lockdep assertion failure in cache flush for kexec
  2026-03-02 10:22 [PATCH] x86/virt/tdx: Fix lockdep assertion failure in cache flush for kexec Kai Huang
@ 2026-03-02 10:22 ` Kai Huang
  2026-03-02 10:26   ` Huang, Kai
                     ` (2 more replies)
  2026-03-09 16:38 ` [PATCH] " Edgecombe, Rick P
  1 sibling, 3 replies; 14+ messages in thread
From: Kai Huang @ 2026-03-02 10:22 UTC (permalink / raw)
  To: dave.hansen, pbonzini, seanjc, kas
  Cc: rick.p.edgecombe, tglx, bp, mingo, x86, hpa, linux-kernel,
	Kai Huang, stable, Vishal Verma

TDX can leave the cache in an incoherent state for the memory it uses.
During kexec the kernel does a WBINVD for each CPU before memory gets
reused in the second kernel.

There were two considerations for where this WBINVD should happen.  In
order to handle cases where the cache might get into an incoherent state
while the kexec is in the initial stages, it is needed to do this later
in the kexec path, when the kexecing CPU stops all remote CPUs.  However,
the later kexec process is sensitive to existing races.  So to avoid
perturbing that operation, it is better to do it earlier.

The existing solution is to track the need for the kexec time WBINVD
generically (i.e., not just for TDX) in a per-cpu var.  The late
invocation only happens if the earlier TDX specific logic in
tdx_cpu_flush_cache_for_kexec() didn’t take care of the work.  This
earlier WBINVD logic was built into KVM’s existing syscore ops shutdown()
handler, which is called earlier in the kexec path.

However, this accidentally added it to KVM’s unload path as well (also
the "error path" when bringing up TDX during KVM module load), which
uses the same internal functions.  This makes some sense too, though,
because if KVM is getting unloaded, TDX cache affecting operations will
likely cease.  So it is a good point to do the work before KVM is
unloaded and won't have a chance to handle the shutdown operation in the
future.

Unfortunately this KVM unload invocation triggers a lockdep warning in
tdx_cpu_flush_cache_for_kexec().  Since tdx_cpu_flush_cache_for_kexec()
is doing WBINVD on a specific CPU, it has an assert for preemption being
disabled.  This works fine for the kexec time invocation, but the KVM
unload path calls this as part of a CPUHP callback for which, despite
always executing on the target CPU, preemption is not disabled.

It might be better to add the earlier invocation logic to a dedicated
arch/x86 TDX syscore shutdown() handler, but to make the fix more
backport friendly just adjust the lockdep assert in the
tdx_cpu_flush_cache_for_kexec().

The real requirement is tdx_cpu_flush_cache_for_kexec() must be done on
the same CPU.  It's OK that it can be preempted in the middle as long as
it won't be rescheduled to another CPU.

Remove the too strong lockdep_assert_preemption_disabled(), and change
this_cpu_{read|write}() to __this_cpu_{read|write}() which provide the more
proper check (when CONFIG_DEBUG_PREEMPT is true), which checks all
conditions that the context cannot be moved to another CPU to run in the
middle.

Fixes: 61221d07e815 ("KVM/TDX: Explicitly do WBINVD when no more TDX SEAMCALLs")
Cc: stable@vger.kernel.org
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
---

v1 -> v2:
  - Improve changelog as discussed in v1.
  - Also mention this can be triggered in the "error path" in the changelog.

Hi Rick,

Are you OK with sending this patch out to public, or do you have more
comments?

-- below is for public --

Hi Dave, Paolo, Sean,

/facepalm.

This was recently reported by Vishal.  Sorry that I forgot to test the
module unloading (I was too focused on the kexec path, which doesn't have
this issue).  This wasn't caught by our CI because there is no such test
case.  We are adding one to the CI now so it will be covered in the
future.

---
 arch/x86/virt/vmx/tdx/tdx.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 8b8e165a2001..6f6be1df4b78 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1872,9 +1872,7 @@ EXPORT_SYMBOL_FOR_KVM(tdh_phymem_page_wbinvd_hkid);
 #ifdef CONFIG_KEXEC_CORE
 void tdx_cpu_flush_cache_for_kexec(void)
 {
-	lockdep_assert_preemption_disabled();
-
-	if (!this_cpu_read(cache_state_incoherent))
+	if (!__this_cpu_read(cache_state_incoherent))
 		return;
 
 	/*
@@ -1883,7 +1881,7 @@ void tdx_cpu_flush_cache_for_kexec(void)
 	 * there should be no more SEAMCALLs on this CPU.
 	 */
 	wbinvd();
-	this_cpu_write(cache_state_incoherent, false);
+	__this_cpu_write(cache_state_incoherent, false);
 }
 EXPORT_SYMBOL_FOR_KVM(tdx_cpu_flush_cache_for_kexec);
 #endif

base-commit: 7dff99b354601dd01829e1511711846e04340a69
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] x86/virt/tdx: Fix lockdep assertion failure in cache flush for kexec
  2026-03-02 10:22 ` [PATCH v2] " Kai Huang
@ 2026-03-02 10:26   ` Huang, Kai
  2026-03-05 18:33   ` Nikolay Borisov
  2026-03-10 13:43   ` Sean Christopherson
  2 siblings, 0 replies; 14+ messages in thread
From: Huang, Kai @ 2026-03-02 10:26 UTC (permalink / raw)
  To: pbonzini@redhat.com, kas@kernel.org, seanjc@google.com,
	dave.hansen@linux.intel.com
  Cc: Edgecombe, Rick P, bp@alien8.de, x86@kernel.org, hpa@zytor.com,
	mingo@redhat.com, linux-kernel@vger.kernel.org, Verma, Vishal L,
	tglx@kernel.org, stable@vger.kernel.org

> 
> v1 -> v2:
>   - Improve changelog as discussed in v1.
> 

Apologies, this is the internal version that I forgot to delete from my
tree; it was sent out by accident.

Please ignore this.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] x86/virt/tdx: Fix lockdep assertion failure in cache flush for kexec
  2026-03-02 10:22 ` [PATCH v2] " Kai Huang
  2026-03-02 10:26   ` Huang, Kai
@ 2026-03-05 18:33   ` Nikolay Borisov
  2026-03-05 21:35     ` Huang, Kai
  2026-03-10 13:43   ` Sean Christopherson
  2 siblings, 1 reply; 14+ messages in thread
From: Nikolay Borisov @ 2026-03-05 18:33 UTC (permalink / raw)
  To: Kai Huang, dave.hansen, pbonzini, seanjc, kas
  Cc: rick.p.edgecombe, tglx, bp, mingo, x86, hpa, linux-kernel, stable,
	Vishal Verma



On 2.03.26 г. 12:22 ч., Kai Huang wrote:
> TDX can leave the cache in an incoherent state for the memory it uses.
> During kexec the kernel does a WBINVD for each CPU before memory gets
> reused in the second kernel.
> 
> There were two considerations for where this WBINVD should happen.  In
> order to handle cases where the cache might get into an incoherent state
> while the kexec is in the initial stages, it is needed to do this later
> in the kexec path, when the kexecing CPU stops all remote CPUs.  However,
> the later kexec process is sensitive to existing races.  So to avoid
> perturbing that operation, it is better to do it earlier.
> 
> The existing solution is to track the need for the kexec time WBINVD
> generically (i.e., not just for TDX) in a per-cpu var.  The late
> invocation only happens if the earlier TDX specific logic in
> tdx_cpu_flush_cache_for_kexec() didn’t take care of the work.  This
> earlier WBINVD logic was built into KVM’s existing syscore ops shutdown()
> handler, which is called earlier in the kexec path.
> 
> However, this accidentally added it to KVM’s unload path as well (also
> the "error path" when bringing up TDX during KVM module load), which
> uses the same internal functions.  This makes some sense too, though,
> because if KVM is getting unloaded, TDX cache affecting operations will
> likely cease.  So it is a good point to do the work before KVM is
> unloaded and won't have a chance to handle the shutdown operation in the
> future.
> 
> Unfortunately this KVM unload invocation triggers a lockdep warning in
> tdx_cpu_flush_cache_for_kexec().  Since tdx_cpu_flush_cache_for_kexec()
> is doing WBINVD on a specific CPU, it has an assert for preemption being
> disabled.  This works fine for the kexec time invocation, but the KVM
> unload path calls this as part of a CPUHP callback for which, despite
> always executing on the target CPU, preemption is not disabled.
> 
> It might be better to add the earlier invocation logic to a dedicated
> arch/x86 TDX syscore shutdown() handler, but to make the fix more
> backport friendly just adjust the lockdep assert in the
> tdx_cpu_flush_cache_for_kexec().
> 
> The real requirement is tdx_cpu_flush_cache_for_kexec() must be done on
> the same CPU.  It's OK that it can be preempted in the middle as long as
> it won't be rescheduled to another CPU.

TLDR: It wants migration disabled.

> 
> Remove the too strong lockdep_assert_preemption_disabled(), and change
> this_cpu_{read|write}() to __this_cpu_{read|write}() which provide the more
> proper check (when CONFIG_DEBUG_PREEMPT is true), which checks all
> conditions that the context cannot be moved to another CPU to run in the
> middle.
> 
> Fixes: 61221d07e815 ("KVM/TDX: Explicitly do WBINVD when no more TDX SEAMCALLs")
> Cc: stable@vger.kernel.org
> Reported-by: Vishal Verma <vishal.l.verma@intel.com>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Tested-by: Vishal Verma <vishal.l.verma@intel.com>


So how exactly does this patch prevent the BUG: printk in 
check_preemption_disabled from triggering, if the lockdep assert was 
triggering?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] x86/virt/tdx: Fix lockdep assertion failure in cache flush for kexec
  2026-03-05 18:33   ` Nikolay Borisov
@ 2026-03-05 21:35     ` Huang, Kai
  2026-03-06  9:58       ` Nikolay Borisov
  0 siblings, 1 reply; 14+ messages in thread
From: Huang, Kai @ 2026-03-05 21:35 UTC (permalink / raw)
  To: kas@kernel.org, pbonzini@redhat.com, nik.borisov@suse.com,
	seanjc@google.com, dave.hansen@linux.intel.com
  Cc: Edgecombe, Rick P, bp@alien8.de, x86@kernel.org, hpa@zytor.com,
	mingo@redhat.com, linux-kernel@vger.kernel.org, Verma, Vishal L,
	tglx@kernel.org, stable@vger.kernel.org


> > 
> > The real requirement is tdx_cpu_flush_cache_for_kexec() must be done on
> > the same CPU.  It's OK that it can be preempted in the middle as long as
> > it won't be rescheduled to another CPU.
> 
> TLDR: It wants migration disabled.

Basically yes.

> 
> > 
> > Remove the too strong lockdep_assert_preemption_disabled(), and change
> > this_cpu_{read|write}() to __this_cpu_{read|write}() which provide the more
> > proper check (when CONFIG_DEBUG_PREEMPT is true), which checks all
> > conditions that the context cannot be moved to another CPU to run in the
> > middle.
> > 
> > Fixes: 61221d07e815 ("KVM/TDX: Explicitly do WBINVD when no more TDX SEAMCALLs")
> > Cc: stable@vger.kernel.org
> > Reported-by: Vishal Verma <vishal.l.verma@intel.com>
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > Tested-by: Vishal Verma <vishal.l.verma@intel.com>
> 
> 
> So how exactly does this patch prevent the BUG: printk in 
> check_preemption_disabled from triggering, if the lockdep assert was 
> triggering?

There's no real BUG here.  It's just the
lockdep_assert_preemption_disabled() is misused.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] x86/virt/tdx: Fix lockdep assertion failure in cache flush for kexec
  2026-03-05 21:35     ` Huang, Kai
@ 2026-03-06  9:58       ` Nikolay Borisov
  2026-03-08 10:12         ` Huang, Kai
  0 siblings, 1 reply; 14+ messages in thread
From: Nikolay Borisov @ 2026-03-06  9:58 UTC (permalink / raw)
  To: Huang, Kai, kas@kernel.org, pbonzini@redhat.com,
	nik.borisov@suse.com, seanjc@google.com,
	dave.hansen@linux.intel.com
  Cc: Edgecombe, Rick P, bp@alien8.de, x86@kernel.org, hpa@zytor.com,
	mingo@redhat.com, linux-kernel@vger.kernel.org, Verma, Vishal L,
	tglx@kernel.org, stable@vger.kernel.org



On 5.03.26 г. 23:35 ч., Huang, Kai wrote:
> 
>>>
>>> The real requirement is tdx_cpu_flush_cache_for_kexec() must be done on
>>> the same CPU.  It's OK that it can be preempted in the middle as long as
>>> it won't be rescheduled to another CPU.
>>
>> TLDR: It wants migration disabled.
> 
> Basically yes.
> 
>>
>>>
>>> Remove the too strong lockdep_assert_preemption_disabled(), and change
>>> this_cpu_{read|write}() to __this_cpu_{read|write}() which provide the more
>>> proper check (when CONFIG_DEBUG_PREEMPT is true), which checks all
>>> conditions that the context cannot be moved to another CPU to run in the
>>> middle.
>>>
>>> Fixes: 61221d07e815 ("KVM/TDX: Explicitly do WBINVD when no more TDX SEAMCALLs")
>>> Cc: stable@vger.kernel.org
>>> Reported-by: Vishal Verma <vishal.l.verma@intel.com>
>>> Signed-off-by: Kai Huang <kai.huang@intel.com>
>>> Tested-by: Vishal Verma <vishal.l.verma@intel.com>
>>
>>
>> So how exactly does this patch prevent the BUG: printk in
>> check_preemption_disabled from triggering, if the lockdep assert was
>> triggering?
> 
> There's no real BUG here.  It's just the
> lockdep_assert_preemption_disabled() is misused.

Essentially in check_preemption_disabled() the check is considered 
passed IF ANY of the preempt disable conditions is met, i.e. it's more 
lax. So yeah, makes sense!

Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] x86/virt/tdx: Fix lockdep assertion failure in cache flush for kexec
  2026-03-06  9:58       ` Nikolay Borisov
@ 2026-03-08 10:12         ` Huang, Kai
  0 siblings, 0 replies; 14+ messages in thread
From: Huang, Kai @ 2026-03-08 10:12 UTC (permalink / raw)
  To: pbonzini@redhat.com, nik.borisov@suse.com, kas@kernel.org,
	seanjc@google.com, dave.hansen@linux.intel.com
  Cc: Edgecombe, Rick P, bp@alien8.de, x86@kernel.org, hpa@zytor.com,
	mingo@redhat.com, Verma, Vishal L, tglx@kernel.org,
	stable@vger.kernel.org, linux-kernel@vger.kernel.org

> > > 
> > > So how exactly does this patch prevent the BUG: printk in
> > > check_preemption_disabled from triggering, if the lockdep assert was
> > > triggering?
> > 
> > There's no real BUG here.  It's just the
> > lockdep_assert_preemption_disabled() is misused.
> 
> Essentially in check_preemption_disabled() the check is considered 
> passed IF ANY of the preempt disable conditions is met, i.e it's more 
> laxed. So yeah, makes sense!
> 
> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>

Thanks!

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] x86/virt/tdx: Fix lockdep assertion failure in cache flush for kexec
  2026-03-02 10:22 [PATCH] x86/virt/tdx: Fix lockdep assertion failure in cache flush for kexec Kai Huang
  2026-03-02 10:22 ` [PATCH v2] " Kai Huang
@ 2026-03-09 16:38 ` Edgecombe, Rick P
  2026-03-10  7:19   ` Huang, Kai
  2026-03-10 16:42   ` Edgecombe, Rick P
  1 sibling, 2 replies; 14+ messages in thread
From: Edgecombe, Rick P @ 2026-03-09 16:38 UTC (permalink / raw)
  To: pbonzini@redhat.com, kas@kernel.org, seanjc@google.com,
	Huang, Kai, dave.hansen@linux.intel.com
  Cc: bp@alien8.de, x86@kernel.org, hpa@zytor.com, mingo@redhat.com,
	Verma, Vishal L, tglx@kernel.org, stable@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Mon, 2026-03-02 at 23:22 +1300, Kai Huang wrote:
> TDX can leave the cache in an incoherent state for the memory it
> uses. During kexec the kernel does a WBINVD for each CPU before
> memory gets reused in the second kernel.
> 
> There were two considerations for where this WBINVD should happen. 
> In order to handle cases where the cache might get into an incoherent
> state while the kexec is in the initial stages, it is needed to do
> this later in the kexec path, when the kexecing CPU stops all remote
> CPUs.  However, the later kexec process is sensitive to existing
> races.  So to avoid perturbing that operation, it is better to do it
> earlier.
> 
> The existing solution is to track the need for the kexec time WBINVD
> generically (i.e., not just for TDX) in a per-cpu var.  The late
> invocation only happens if the earlier TDX specific logic in
> tdx_cpu_flush_cache_for_kexec() didn’t take care of the work.  This
> earlier WBINVD logic was built into KVM’s existing syscore ops
> shutdown() handler, which is called earlier in the kexec path.
> 
> However, this accidentally added it to KVM’s unload path as well
> (also the "error path" when bringing up TDX during KVM module load),
> which uses the same internal functions.  This makes some sense too,
> though, because if KVM is getting unloaded, TDX cache affecting
> operations will likely cease.  So it is a good point to do the work
> before KVM is unloaded and won't have a chance to handle the shutdown
> operation in the future.
> 
> Unfortunately this KVM unload invocation triggers a lockdep warning
> in tdx_cpu_flush_cache_for_kexec().  Since
> tdx_cpu_flush_cache_for_kexec() is doing WBINVD on a specific CPU, it
> has an assert for preemption being disabled.  This works fine for the
> kexec time invocation, but the KVM unload path calls this as part of
> a CPUHP callback for which, despite always executing on the target
> CPU, preemption is not disabled.
> 
> It might be better to add the earlier invocation logic to a dedicated
> arch/x86 TDX syscore shutdown() handler, but to make the fix more
> backport friendly just adjust the lockdep assert in the
> tdx_cpu_flush_cache_for_kexec().
> 
> The real requirement is tdx_cpu_flush_cache_for_kexec() must be done
> on the same CPU.  It's OK that it can be preempted in the middle as
> long as it won't be rescheduled to another CPU.
> 
> Remove the too strong lockdep_assert_preemption_disabled(), and
> change this_cpu_{read|write}() to __this_cpu_{read|write}() which
> provide the more proper check (when CONFIG_DEBUG_PREEMPT is true),
> which checks all conditions that the context cannot be moved to
> another CPU to run in the middle.
> 
> Fixes: 61221d07e815 ("KVM/TDX: Explicitly do WBINVD when no more TDX
> SEAMCALLs")
> Cc: stable@vger.kernel.org
> Reported-by: Vishal Verma <vishal.l.verma@intel.com>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Tested-by: Vishal Verma <vishal.l.verma@intel.com>

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

But this issue is also solved by:
https://lore.kernel.org/kvm/20260307010358.819645-3-rick.p.edgecombe@intel.com/

I guess that these changes are correct in either case. There is no need
for the stricter asserts. But depending on the merge order, the history
would read confusingly where it talks about lockdep warnings. So we'll
have to keep an eye on things. If this goes first, then it's fine.

You know, it might have helped to include the splat if you end up with
a v2.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] x86/virt/tdx: Fix lockdep assertion failure in cache flush for kexec
  2026-03-09 16:38 ` [PATCH] " Edgecombe, Rick P
@ 2026-03-10  7:19   ` Huang, Kai
  2026-03-10 13:50     ` Sean Christopherson
  2026-03-10 16:42   ` Edgecombe, Rick P
  1 sibling, 1 reply; 14+ messages in thread
From: Huang, Kai @ 2026-03-10  7:19 UTC (permalink / raw)
  To: pbonzini@redhat.com, kas@kernel.org, seanjc@google.com,
	Edgecombe, Rick P, dave.hansen@linux.intel.com
  Cc: bp@alien8.de, x86@kernel.org, hpa@zytor.com,
	linux-kernel@vger.kernel.org, Verma, Vishal L, tglx@kernel.org,
	stable@vger.kernel.org, mingo@redhat.com

On Mon, 2026-03-09 at 16:38 +0000, Edgecombe, Rick P wrote:
> On Mon, 2026-03-02 at 23:22 +1300, Kai Huang wrote:
> > TDX can leave the cache in an incoherent state for the memory it
> > uses. During kexec the kernel does a WBINVD for each CPU before
> > memory gets reused in the second kernel.
> > 
> > There were two considerations for where this WBINVD should happen. 
> > In order to handle cases where the cache might get into an incoherent
> > state while the kexec is in the initial stages, it is needed to do
> > this later in the kexec path, when the kexecing CPU stops all remote
> > CPUs.  However, the later kexec process is sensitive to existing
> > races.  So to avoid perturbing that operation, it is better to do it
> > earlier.
> > 
> > The existing solution is to track the need for the kexec time WBINVD
> > generically (i.e., not just for TDX) in a per-cpu var.  The late
> > invocation only happens if the earlier TDX specific logic in
> > tdx_cpu_flush_cache_for_kexec() didn’t take care of the work.  This
> > earlier WBINVD logic was built into KVM’s existing syscore ops
> > shutdown() handler, which is called earlier in the kexec path.
> > 
> > However, this accidentally added it to KVM’s unload path as well
> > (also the "error path" when bringing up TDX during KVM module load),
> > which uses the same internal functions.  This makes some sense too,
> > though, because if KVM is getting unloaded, TDX cache affecting
> > operations will likely cease.  So it is a good point to do the work
> > before KVM is unloaded and won't have a chance to handle the shutdown
> > operation in the future.
> > 
> > Unfortunately this KVM unload invocation triggers a lockdep warning
> > in tdx_cpu_flush_cache_for_kexec().  Since
> > tdx_cpu_flush_cache_for_kexec() is doing WBINVD on a specific CPU, it
> > has an assert for preemption being disabled.  This works fine for the
> > kexec time invocation, but the KVM unload path calls this as part of
> > a CPUHP callback for which, despite always executing on the target
> > CPU, preemption is not disabled.
> > 
> > It might be better to add the earlier invocation logic to a dedicated
> > arch/x86 TDX syscore shutdown() handler, but to make the fix more
> > backport friendly just adjust the lockdep assert in the
> > tdx_cpu_flush_cache_for_kexec().
> > 
> > The real requirement is tdx_cpu_flush_cache_for_kexec() must be done
> > on the same CPU.  It's OK that it can be preempted in the middle as
> > long as it won't be rescheduled to another CPU.
> > 
> > Remove the too strong lockdep_assert_preemption_disabled(), and
> > change this_cpu_{read|write}() to __this_cpu_{read|write}() which
> > provide the more proper check (when CONFIG_DEBUG_PREEMPT is true),
> > which checks all conditions that the context cannot be moved to
> > another CPU to run in the middle.
> > 
> > Fixes: 61221d07e815 ("KVM/TDX: Explicitly do WBINVD when no more TDX
> > SEAMCALLs")
> > Cc: stable@vger.kernel.org
> > Reported-by: Vishal Verma <vishal.l.verma@intel.com>
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > Tested-by: Vishal Verma <vishal.l.verma@intel.com>
> 
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> But this issue is also solved by:
> https://lore.kernel.org/kvm/20260307010358.819645-3-rick.p.edgecombe@intel.com/

This depends on Sean's series to move VMXON to x86 core, so it's not stable
friendly.

> 
> I guess that these changes are correct in either case. There is no need
> for the stricter asserts. But depending on the order the log would be
> confusing in the history when it talks about lockdep warnings. So we'll
> have to keep an eye on things. If this goes first, then it's fine.

I see.  Will keep this in mind.

> 
> You know, it might have helped to include the splat if you end up with
> a v2.

I thought the lockdep warning would be obvious even w/o the actual splat,
but sure, I can include the splat if a v2 is needed.

Hi Sean, Paolo, Kirill,

It would be good to merge this upstream and backport it to stable.  I'd
appreciate your ack if it looks good to you.  Thanks.

 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] x86/virt/tdx: Fix lockdep assertion failure in cache flush for kexec
  2026-03-02 10:22 ` [PATCH v2] " Kai Huang
  2026-03-02 10:26   ` Huang, Kai
  2026-03-05 18:33   ` Nikolay Borisov
@ 2026-03-10 13:43   ` Sean Christopherson
  2 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2026-03-10 13:43 UTC (permalink / raw)
  To: Kai Huang
  Cc: dave.hansen, pbonzini, kas, rick.p.edgecombe, tglx, bp, mingo,
	x86, hpa, linux-kernel, stable, Vishal Verma

On Mon, Mar 02, 2026, Kai Huang wrote:
> TDX can leave the cache in an incoherent state for the memory it uses.
> During kexec the kernel does a WBINVD for each CPU before memory gets
> reused in the second kernel.
> 
> There were two considerations for where this WBINVD should happen.  In
> order to handle cases where the cache might get into an incoherent state
> while the kexec is in the initial stages, it is needed to do this later
> in the kexec path, when the kexecing CPU stops all remote CPUs.  However,
> the later kexec process is sensitive to existing races.  So to avoid
> perturbing that operation, it is better to do it earlier.
> 
> The existing solution is to track the need for the kexec time WBINVD
> generically (i.e., not just for TDX) in a per-cpu var.  The late
> invocation only happens if the earlier TDX specific logic in
> tdx_cpu_flush_cache_for_kexec() didn’t take care of the work.  This
> earlier WBINVD logic was built into KVM’s existing syscore ops shutdown()
> handler, which is called earlier in the kexec path.
> 
> However, this accidentally added it to KVM’s unload path as well (also
> the "error path" when bringing up TDX during KVM module load), which
> uses the same internal functions.  This makes some sense too, though,
> because if KVM is getting unloaded, TDX cache affecting operations will
> likely cease.  So it is a good point to do the work before KVM is
> unloaded and won't have a chance to handle the shutdown operation in the
> future.
> 
> Unfortunately this KVM unload invocation triggers a lockdep warning in
> tdx_cpu_flush_cache_for_kexec().  Since tdx_cpu_flush_cache_for_kexec()
> is doing WBINVD on a specific CPU, it has an assert for preemption being
> disabled.  This works fine for the kexec time invocation, but the KVM
> unload path calls this as part of a CPUHP callback for which, despite
> always executing on the target CPU, preemption is not disabled.
> 
> It might be better to add the earlier invocation logic to a dedicated
> arch/x86 TDX syscore shutdown() handler, but to make the fix more
> backport friendly just adjust the lockdep assert in the
> tdx_cpu_flush_cache_for_kexec().
> 
> The real requirement is tdx_cpu_flush_cache_for_kexec() must be done on
> the same CPU.  It's OK that it can be preempted in the middle as long as
> it won't be rescheduled to another CPU.
> 
> Remove the too strong lockdep_assert_preemption_disabled(), and change
> this_cpu_{read|write}() to __this_cpu_{read|write}() which provide the more
> proper check (when CONFIG_DEBUG_PREEMPT is true), which checks all
> conditions that the context cannot be moved to another CPU to run in the
> middle.
> 
> Fixes: 61221d07e815 ("KVM/TDX: Explicitly do WBINVD when no more TDX SEAMCALLs")
> Cc: stable@vger.kernel.org
> Reported-by: Vishal Verma <vishal.l.verma@intel.com>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Tested-by: Vishal Verma <vishal.l.verma@intel.com>
> ---

Acked-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] x86/virt/tdx: Fix lockdep assertion failure in cache flush for kexec
  2026-03-10  7:19   ` Huang, Kai
@ 2026-03-10 13:50     ` Sean Christopherson
  2026-03-10 16:36       ` Edgecombe, Rick P
  2026-03-10 21:15       ` Huang, Kai
  0 siblings, 2 replies; 14+ messages in thread
From: Sean Christopherson @ 2026-03-10 13:50 UTC (permalink / raw)
  To: Kai Huang
  Cc: pbonzini@redhat.com, kas@kernel.org, Rick P Edgecombe,
	dave.hansen@linux.intel.com, bp@alien8.de, x86@kernel.org,
	hpa@zytor.com, linux-kernel@vger.kernel.org, Vishal L Verma,
	tglx@kernel.org, stable@vger.kernel.org, mingo@redhat.com

On Tue, Mar 10, 2026, Kai Huang wrote:
> On Mon, 2026-03-09 at 16:38 +0000, Edgecombe, Rick P wrote:
> > On Mon, 2026-03-02 at 23:22 +1300, Kai Huang wrote:
> > > Remove the overly strict lockdep_assert_preemption_disabled(), and
> > > change this_cpu_{read|write}() to __this_cpu_{read|write}(), which
> > > (when CONFIG_DEBUG_PREEMPT is enabled) provide the more appropriate
> > > check: they verify that the context cannot be migrated to another
> > > CPU in the middle of the operation.
> > > 
> > > Fixes: 61221d07e815 ("KVM/TDX: Explicitly do WBINVD when no more TDX
> > > SEAMCALLs")
> > > Cc: stable@vger.kernel.org
> > > Reported-by: Vishal Verma <vishal.l.verma@intel.com>
> > > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > > Tested-by: Vishal Verma <vishal.l.verma@intel.com>
> > 
> > Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > 
> > But this issue is also solved by:
> > https://lore.kernel.org/kvm/20260307010358.819645-3-rick.p.edgecombe@intel.com/

Even when that series comes along, I would rather have __this_cpu_{read|write}()
instead of the explicit lockdep_assert_preemption_disabled().  Similar to the WARN
about IRQs being disabled that got removed, explicitly requiring that preemption
be disabled feels like a description of the current code, not an actual requirement.

Asserting that preemption is disabled gives the false impression that the
current task must not be scheduled out between reading and writing
cache_state_incoherent, which then raises the question of why scheduling
out the current task would be bad.

> This depends on Sean's series to move VMXON to x86 core, so it's not stable
> friendly.
> 
> > 
> > I guess that these changes are correct in either case. There is no need
> > for the stricter asserts. But depending on the order the log would be
> > confusing in the history when it talks about lockdep warnings. So we'll
> > have to keep an eye on things. If this goes first, then it's fine.
> 
> I see.  Will keep this in mind.
> 
> > 
> > You know, it might have helped to include the splat if you end up with
> > a v2.

+1.  I can read a backtrace about 10x faster than a full sentence describing the
backtrace.

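The "more proper check" discussed here comes from CONFIG_DEBUG_PREEMPT:
with it enabled, __this_cpu_read()/__this_cpu_write() go through
check_preemption_disabled() in lib/smp_processor_id.c, which accepts any
context that cannot migrate to another CPU.  A simplified sketch of that
logic (details and ordering abbreviated, not the verbatim kernel source):

```c
/*
 * Simplified sketch of check_preemption_disabled() from
 * lib/smp_processor_id.c; error reporting abbreviated.
 */
unsigned int check_preemption_disabled(const char *what1, const char *what2)
{
	int this_cpu = raw_smp_processor_id();

	if (likely(preempt_count()))	/* preemption disabled */
		goto out;

	if (irqs_disabled())		/* interrupts disabled */
		goto out;

	/*
	 * A task affined to a single CPU -- such as the CPUHP callback
	 * context discussed above -- cannot migrate even though it is
	 * preemptible.
	 */
	if (is_percpu_thread())
		goto out;

	/* ... otherwise print a "BUG: using __this_cpu_*" splat ... */
out:
	return this_cpu;
}
```

This is why the CPUHP callback path passes the __this_cpu check while it
would trip lockdep_assert_preemption_disabled().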

* Re: [PATCH] x86/virt/tdx: Fix lockdep assertion failure in cache flush for kexec
  2026-03-10 13:50     ` Sean Christopherson
@ 2026-03-10 16:36       ` Edgecombe, Rick P
  2026-03-10 21:15       ` Huang, Kai
  1 sibling, 0 replies; 14+ messages in thread
From: Edgecombe, Rick P @ 2026-03-10 16:36 UTC (permalink / raw)
  To: seanjc@google.com, Huang, Kai
  Cc: mingo@redhat.com, dave.hansen@linux.intel.com, kas@kernel.org,
	hpa@zytor.com, linux-kernel@vger.kernel.org, Verma, Vishal L,
	bp@alien8.de, pbonzini@redhat.com, tglx@kernel.org,
	stable@vger.kernel.org, x86@kernel.org

On Tue, 2026-03-10 at 06:50 -0700, Sean Christopherson wrote:
> Even when that series comes along, I would rather have __this_cpu_{read|write}()
> instead of the explicit lockdep_assert_preemption_disabled().  Similar to the WARN
> about IRQs being disabled that got removed, explicitly requiring that preemption
> be disabled feels like a description of the current code, not an actual requirement.

Agreed. It's just going to be confusing if the log talks about a KVM unload
path that is not there. The simpler answer is to just merge this first.


* Re: [PATCH] x86/virt/tdx: Fix lockdep assertion failure in cache flush for kexec
  2026-03-09 16:38 ` [PATCH] " Edgecombe, Rick P
  2026-03-10  7:19   ` Huang, Kai
@ 2026-03-10 16:42   ` Edgecombe, Rick P
  1 sibling, 0 replies; 14+ messages in thread
From: Edgecombe, Rick P @ 2026-03-10 16:42 UTC (permalink / raw)
  To: pbonzini@redhat.com, kas@kernel.org, seanjc@google.com,
	Huang, Kai, dave.hansen@linux.intel.com
  Cc: bp@alien8.de, x86@kernel.org, hpa@zytor.com, mingo@redhat.com,
	Verma, Vishal L, tglx@kernel.org, stable@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Mon, 2026-03-09 at 09:39 -0700, Edgecombe, Rick P wrote:
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Oops this was supposed to go on the v2.


* Re: [PATCH] x86/virt/tdx: Fix lockdep assertion failure in cache flush for kexec
  2026-03-10 13:50     ` Sean Christopherson
  2026-03-10 16:36       ` Edgecombe, Rick P
@ 2026-03-10 21:15       ` Huang, Kai
  1 sibling, 0 replies; 14+ messages in thread
From: Huang, Kai @ 2026-03-10 21:15 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: Edgecombe, Rick P, dave.hansen@linux.intel.com, kas@kernel.org,
	hpa@zytor.com, linux-kernel@vger.kernel.org, Verma, Vishal L,
	bp@alien8.de, pbonzini@redhat.com, tglx@kernel.org,
	stable@vger.kernel.org, x86@kernel.org, mingo@redhat.com

On Tue, 2026-03-10 at 06:50 -0700, Sean Christopherson wrote:
> On Tue, Mar 10, 2026, Kai Huang wrote:
> > On Mon, 2026-03-09 at 16:38 +0000, Edgecombe, Rick P wrote:
> > > On Mon, 2026-03-02 at 23:22 +1300, Kai Huang wrote:
> > > > Remove the overly strict lockdep_assert_preemption_disabled(), and
> > > > change this_cpu_{read|write}() to __this_cpu_{read|write}(), which
> > > > (when CONFIG_DEBUG_PREEMPT is enabled) provide the more appropriate
> > > > check: they verify that the context cannot be migrated to another
> > > > CPU in the middle of the operation.
> > > > 
> > > > Fixes: 61221d07e815 ("KVM/TDX: Explicitly do WBINVD when no more TDX
> > > > SEAMCALLs")
> > > > Cc: stable@vger.kernel.org
> > > > Reported-by: Vishal Verma <vishal.l.verma@intel.com>
> > > > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > > > Tested-by: Vishal Verma <vishal.l.verma@intel.com>
> > > 
> > > Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > 
> > > But this issue is also solved by:
> > > https://lore.kernel.org/kvm/20260307010358.819645-3-rick.p.edgecombe@intel.com/
> 
> Even when that series comes along, I would rather have __this_cpu_{read|write}()
> instead of the explicit lockdep_assert_preemption_disabled().  Similar to the WARN
> about IRQs being disabled that got removed, explicitly requiring that preemption
> be disabled feels like a description of the current code, not an actual requirement.
> 
> Asserting that preemption is disabled gives the false impression that the
> current task must not be scheduled out between reading and writing
> cache_state_incoherent, which then raises the question of why scheduling
> out the current task would be bad.

Agreed.

> 
> > This depends on Sean's series to move VMXON to x86 core, so it's not stable
> > friendly.
> > 
> > > 
> > > I guess that these changes are correct in either case. There is no need
> > > for the stricter asserts. But depending on the order the log would be
> > > confusing in the history when it talks about lockdep warnings. So we'll
> > > have to keep an eye on things. If this goes first, then it's fine.
> > 
> > I see.  Will keep this in mind.
> > 
> > > 
> > > You know, it might have helped to include the splat if you end up with
> > > a v2.
> 
> +1.  I can read a backtrace about 10x faster than a full sentence describing the
> backtrace.

I'll include the actual WARN splat in v2.

Thanks for the ack!


end of thread, other threads:[~2026-03-10 21:15 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-02 10:22 [PATCH] x86/virt/tdx: Fix lockdep assertion failure in cache flush for kexec Kai Huang
2026-03-02 10:22 ` [PATCH v2] " Kai Huang
2026-03-02 10:26   ` Huang, Kai
2026-03-05 18:33   ` Nikolay Borisov
2026-03-05 21:35     ` Huang, Kai
2026-03-06  9:58       ` Nikolay Borisov
2026-03-08 10:12         ` Huang, Kai
2026-03-10 13:43   ` Sean Christopherson
2026-03-09 16:38 ` [PATCH] " Edgecombe, Rick P
2026-03-10  7:19   ` Huang, Kai
2026-03-10 13:50     ` Sean Christopherson
2026-03-10 16:36       ` Edgecombe, Rick P
2026-03-10 21:15       ` Huang, Kai
2026-03-10 16:42   ` Edgecombe, Rick P
