* [BUG REPORT] x86/apic: CPU Hang in x86 VM During Kdump
@ 2025-06-04 8:33 Yipeng Zou
2025-07-26 9:50 ` Yipeng Zou
2025-07-27 20:01 ` Thomas Gleixner
0 siblings, 2 replies; 8+ messages in thread
From: Yipeng Zou @ 2025-06-04 8:33 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, sohil.mehta,
rui.zhang, arnd, yuntao.wang, linux-kernel
Cc: zouyipeng
Recently, an issue has been reported where a CPU hangs in an x86 VM.
The CPU halted during Kdump, likely due to an IPI problem when one CPU was
rebooting and another was entering Kdump:
CPU0                           CPU2
========================       ======================
reboot                         Panic
machine shutdown               Kdump
                               machine shutdown
stop other cpus
                               stop other cpus
...                            ...
local_irq_disable              local_irq_disable
send_IPIs(REBOOT)              [critical regions]
[critical regions]             1) send_IPIs(REBOOT)
                               wait timeout
                               2) send_IPIs(NMI);
Halt, NMI context
                               3) lapic_shutdown [IPI is pending]
                               ...
                               second kernel start
                               4) init_bsp_APIC [IPI is pending]
                               ...
                               local irq enable
                               Halt, IPI context
In simple terms, when Kdump jumps to the second kernel, the IPI that
was pending in the first kernel remains pending and is handled by the
second kernel.
I was thinking we may need to mask the IPI in clear_local_APIC() to solve
this problem. That way, the pending IPI would be cleared at both 3) and 4).
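For reference, clear_local_APIC() already masks the local vector table
entries in roughly the pattern below (a simplified sketch of the existing
code, not the proposed change); the idea would be to add an equivalent step
for the IPI vector, if such a mask existed:

#include <linux/types.h>
#include <asm/apic.h>

/* Sketch: the existing LVT masking pattern in clear_local_APIC() */
static void mask_lvt_entries_sketch(void)
{
	u32 v;

	v = apic_read(APIC_LVTT);
	apic_write(APIC_LVTT, v | APIC_LVT_MASKED);

	v = apic_read(APIC_LVT0);
	apic_write(APIC_LVT0, v | APIC_LVT_MASKED);

	v = apic_read(APIC_LVT1);
	apic_write(APIC_LVT1, v | APIC_LVT_MASKED);
}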
I can't find anything about this in the Intel SDM. Is this approach
feasible, or are there other ways to fix the issue?
Signed-off-by: Yipeng Zou <zouyipeng@huawei.com>
---
arch/x86/kernel/apic/apic.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index d73ba5a7b623..68c41d579303 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1117,6 +1117,8 @@ void clear_local_APIC(void)
 	}
 #endif
 
+	// Mask IPI here
+
 	/*
 	 * Clean APIC state for other OSs:
 	 */
--
2.34.1
* Re: [BUG REPORT] x86/apic: CPU Hang in x86 VM During Kdump
2025-06-04 8:33 [BUG REPORT] x86/apic: CPU Hang in x86 VM During Kdump Yipeng Zou
@ 2025-07-26 9:50 ` Yipeng Zou
2025-07-27 12:39 ` Thomas Gleixner
2025-07-27 20:01 ` Thomas Gleixner
1 sibling, 1 reply; 8+ messages in thread
From: Yipeng Zou @ 2025-07-26 9:50 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, peterz, sohil.mehta,
rui.zhang, arnd, yuntao.wang, linux-kernel
Hi Thomas:
I skipped sending the NMI in native_stop_other_cpus(), and the test
passed.
However, this change reverts the fix introduced by commit [1],
which was intended to handle cases where the reboot IPI is not properly
handled by all CPUs.
Given this, is there an alternative way to resolve the issue, or
can we simply mask the IPI directly at that point?
[1] 747d5a1bf293 ("x86/reboot: Always use NMI fallback when
shutdown via reboot vector IPI fails")
On 2025/6/4 16:33, Yipeng Zou wrote:
> Recently, an issue has been reported where a CPU hangs in an x86 VM.
>
> The CPU halted during Kdump, likely due to an IPI problem when one CPU was
> rebooting and another was entering Kdump:
>
>  CPU0                           CPU2
>  ========================       ======================
>  reboot                         Panic
>  machine shutdown               Kdump
>                                 machine shutdown
>  stop other cpus
>                                 stop other cpus
>  ...                            ...
>  local_irq_disable              local_irq_disable
>  send_IPIs(REBOOT)              [critical regions]
>  [critical regions]             1) send_IPIs(REBOOT)
>                                 wait timeout
>                                 2) send_IPIs(NMI);
>  Halt, NMI context
>                                 3) lapic_shutdown [IPI is pending]
>                                 ...
>                                 second kernel start
>                                 4) init_bsp_APIC [IPI is pending]
>                                 ...
>                                 local irq enable
>                                 Halt, IPI context
>
> In simple terms, when Kdump jumps to the second kernel, the IPI that
> was pending in the first kernel remains pending and is handled by the
> second kernel.
>
> I was thinking we may need to mask the IPI in clear_local_APIC() to solve
> this problem. That way, the pending IPI would be cleared at both 3) and 4).
>
> I can't find anything about this in the Intel SDM. Is this approach
> feasible, or are there other ways to fix the issue?
>
> Signed-off-by: Yipeng Zou <zouyipeng@huawei.com>
> ---
> arch/x86/kernel/apic/apic.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index d73ba5a7b623..68c41d579303 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -1117,6 +1117,8 @@ void clear_local_APIC(void)
>  	}
>  #endif
>  
> +	// Mask IPI here
> +
>  	/*
>  	 * Clean APIC state for other OSs:
>  	 */
--
Regards,
Yipeng Zou
* Re: [BUG REPORT] x86/apic: CPU Hang in x86 VM During Kdump
2025-07-26 9:50 ` Yipeng Zou
@ 2025-07-27 12:39 ` Thomas Gleixner
0 siblings, 0 replies; 8+ messages in thread
From: Thomas Gleixner @ 2025-07-27 12:39 UTC (permalink / raw)
To: Yipeng Zou, mingo, bp, dave.hansen, x86, hpa, peterz, sohil.mehta,
rui.zhang, arnd, yuntao.wang, linux-kernel
On Sat, Jul 26 2025 at 17:50, Yipeng Zou wrote:
Please do not top-post and trim your replies.
> I skipped sending the NMI in native_stop_other_cpus(), and the test
> passed.
I don't see how that would result in anything meaningful. The reboot vector
IRR bit on that second CPU will still be set.
> Given this, is there an alternative way to resolve the issue, or
> can we simply mask the IPI directly at that point?
Good luck finding a mask register in the local APIC.
Even if there were a mask register, the IRR bit would still be there and
would be delivered on unmask. There is no way to clear IRR bits other
than a full reset (power on or INIT/SIPI sequence) of the local APIC.
In theory the APIC can be reset by clearing the enable bit in the
APIC_BASE MSR, but that's a can of worms in itself.
The Intel SDM is very blurry about the behaviour:
When IA32_APIC_BASE[11] is set to 0, prior initialization to the APIC
may be lost and the APIC may return to the state described in Section
11.4.7.1, “Local APIC State After Power-Up or Reset.”
"may" means there is no guarantee.
Aside from that, this cannot be done for the original 3-wire APIC bus based
APICs (32-bit museum pieces). Not that I care much about them, but
that's just going to add more complexity to the existing horrors.
The other problem is that with the bit disabled, the APIC might not
respond to INIT/SIPI anymore, but that's equally unclear from the
documentation; both Intel and AMD manuals are pretty useless when it
comes to the gory details of the APIC and from past experience I know
that there are quite some subtle differences in the APIC behaviour
across CPU generations...
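For illustration, that reset via the APIC base MSR would be something along
these lines (a sketch only, not a proposal; helper names may differ slightly
between kernel versions):

#include <linux/types.h>
#include <asm/msr.h>
#include <asm/msr-index.h>

/*
 * Sketch only: bounce the local APIC through the globally-disabled state
 * via IA32_APIC_BASE[11]. Per the SDM this *may* return the APIC to its
 * power-up state (and thus clear the IRR), but there is no guarantee, and
 * re-enabling may require an INIT/SIPI sequence on some parts.
 */
static void apic_hard_reset_sketch(void)
{
	u64 msr;

	rdmsrl(MSR_IA32_APICBASE, msr);
	wrmsrl(MSR_IA32_APICBASE, msr & ~(u64)MSR_IA32_APICBASE_ENABLE);
	wrmsrl(MSR_IA32_APICBASE, msr);
}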
The stale reboot vector IRR problem is pretty straightforward to
mitigate. See patch below.
That needs a full audit of the various vectors, though at a quick
inspection most of them should be fine.
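For that audit, checking whether a given vector is stale in the IRR boils
down to a helper like this (a sketch; a similar helper may already exist in
apic.c):

#include <linux/types.h>
#include <asm/apic.h>

/* Sketch: test whether @vector is currently set in the local APIC IRR */
static bool vector_pending_in_irr(unsigned int vector)
{
	u32 irr = apic_read(APIC_IRR + (vector / 32) * 0x10);

	return irr & (1U << (vector % 32));
}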
Aside from that, there is quite some bogosity in the APIC setup path, which
I need to look deeper into.
Thanks,
tglx
---
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -136,6 +136,28 @@ static int smp_stop_nmi_callback(unsigne
 DEFINE_IDTENTRY_SYSVEC(sysvec_reboot)
 {
 	apic_eoi();
+
+	/*
+	 * Handle the case where a reboot IPI is stale in the IRR. This
+	 * happens when:
+	 *
+	 * a CPU crashes with interrupts disabled before handling the
+	 * reboot IPI and jumps into a crash kernel. The reboot IPI
+	 * vector is kept set in the APIC IRR across the APIC soft
+	 * disabled phase and as there is no way to clear a pending IRR
+	 * bit, it is delivered to the crash kernel immediately when
+	 * interrupts are enabled.
+	 *
+	 * As the reboot IPI can only be sent after acquiring @stopping_cpu
+	 * by storing the CPU number, this case can be detected when
+	 * @stopping_cpu contains the bootup value -1. Just return and
+	 * ignore it.
+	 */
+	if (atomic_read(&stopping_cpu) == -1) {
+		pr_info("Ignoring stale reboot IPI\n");
+		return;
+	}
+
 	cpu_emergency_disable_virtualization();
 	stop_this_cpu(NULL);
 }
* Re: [BUG REPORT] x86/apic: CPU Hang in x86 VM During Kdump
2025-06-04 8:33 [BUG REPORT] x86/apic: CPU Hang in x86 VM During Kdump Yipeng Zou
2025-07-26 9:50 ` Yipeng Zou
@ 2025-07-27 20:01 ` Thomas Gleixner
2025-07-29 8:53 ` Thomas Gleixner
1 sibling, 1 reply; 8+ messages in thread
From: Thomas Gleixner @ 2025-07-27 20:01 UTC (permalink / raw)
To: Yipeng Zou, mingo, bp, dave.hansen, x86, hpa, peterz, sohil.mehta,
rui.zhang, arnd, yuntao.wang, linux-kernel
Cc: zouyipeng
On Wed, Jun 04 2025 at 08:33, Yipeng Zou wrote:
> Recently, an issue has been reported where a CPU hangs in an x86 VM.
>
> The CPU halted during Kdump, likely due to an IPI problem when one CPU was
> rebooting and another was entering Kdump:
>
>  CPU0                           CPU2
>  ========================       ======================
>  reboot                         Panic
>  machine shutdown               Kdump
>                                 machine shutdown
>  stop other cpus
>                                 stop other cpus
>  ...                            ...
>  local_irq_disable              local_irq_disable
>  send_IPIs(REBOOT)              [critical regions]
>  [critical regions]             1) send_IPIs(REBOOT)
After staring more at it, this makes absolutely no sense at all.
stop_other_cpus() does:
	/* Only proceed if this is the first CPU to reach this code */
	old_cpu = -1;
	this_cpu = smp_processor_id();
	if (!atomic_try_cmpxchg(&stopping_cpu, &old_cpu, this_cpu))
		return;
So CPU2 _cannot_ reach the code, which issues the reboot IPIs, because
at that point @stopping_cpu == 0 ergo the cmpxchg() fails.
So what actually happens in this case is:
CPU0                           CPU2
========================       ======================
reboot                         Panic
machine shutdown               Kdump
                               machine_crash_shutdown()
stop other cpus                local_irq_disable()
try_cmpxchg() succeeds         stop other cpus
...                            try_cmpxchg() fails
send_IPIs(REBOOT)  -->         REBOOT vector becomes pending in IRR
wait timeout
And from there on everything becomes a lottery as CPU0 continues to
execute and CPU2 proceeds and jumps into the crash kernel...
This whole logic is broken...
Nevertheless the patch I sent earlier is definitely making things more
robust, but it won't solve your particular problem.
Thanks,
tglx
* Re: [BUG REPORT] x86/apic: CPU Hang in x86 VM During Kdump
2025-07-27 20:01 ` Thomas Gleixner
@ 2025-07-29 8:53 ` Thomas Gleixner
2025-07-29 13:35 ` Yipeng Zou
0 siblings, 1 reply; 8+ messages in thread
From: Thomas Gleixner @ 2025-07-29 8:53 UTC (permalink / raw)
To: Yipeng Zou, mingo, bp, dave.hansen, x86, hpa, peterz, sohil.mehta,
rui.zhang, arnd, yuntao.wang, linux-kernel
Cc: zouyipeng
On Sun, Jul 27 2025 at 22:01, Thomas Gleixner wrote:
> On Wed, Jun 04 2025 at 08:33, Yipeng Zou wrote:
>> Recently, an issue has been reported where a CPU hangs in an x86 VM.
>>
>> The CPU halted during Kdump, likely due to an IPI problem when one CPU was
>> rebooting and another was entering Kdump:
>>
>>  CPU0                           CPU2
>>  ========================       ======================
>>  reboot                         Panic
>>  machine shutdown               Kdump
>>                                 machine shutdown
>>  stop other cpus
>>                                 stop other cpus
>>  ...                            ...
>>  local_irq_disable              local_irq_disable
>>  send_IPIs(REBOOT)              [critical regions]
>>  [critical regions]             1) send_IPIs(REBOOT)
>
> After staring more at it, this makes absolutely no sense at all.
>
> stop_other_cpus() does:
>
> 	/* Only proceed if this is the first CPU to reach this code */
> 	old_cpu = -1;
> 	this_cpu = smp_processor_id();
> 	if (!atomic_try_cmpxchg(&stopping_cpu, &old_cpu, this_cpu))
> 		return;
>
> So CPU2 _cannot_ reach the code, which issues the reboot IPIs, because
> at that point @stopping_cpu == 0 ergo the cmpxchg() fails.
>
> So what actually happens in this case is:
>
>  CPU0                           CPU2
>  ========================       ======================
>  reboot                         Panic
>  machine shutdown               Kdump
>                                 machine_crash_shutdown()
>  stop other cpus                local_irq_disable()
>  try_cmpxchg() succeeds         stop other cpus
>  ...                            try_cmpxchg() fails
>  send_IPIs(REBOOT)  -->         REBOOT vector becomes pending in IRR
>  wait timeout
But looking even deeper: machine_crash_shutdown() does not end up in
stop_other_cpus() at all. It immediately uses the NMI shutdown. There
are still a few inconsistencies in that code, but they are not really
critical.
So the actual scenario is:
CPU0                           CPU2
========================       ======================
reboot                         Panic
machine shutdown               Kdump
                               machine_crash_shutdown()
stop other cpus
send_IPIs(REBOOT)  -->         REBOOT vector becomes pending in IRR
wait timeout
                               send NMI stop
NMI -> CPU stop
                               jump to crash kernel
So the patch I gave you should handle the reboot vector pending in IRR
gracefully. Can you please give it a try?
Thanks,
tglx
* Re: [BUG REPORT] x86/apic: CPU Hang in x86 VM During Kdump
2025-07-29 8:53 ` Thomas Gleixner
@ 2025-07-29 13:35 ` Yipeng Zou
2025-07-29 19:48 ` Thomas Gleixner
0 siblings, 1 reply; 8+ messages in thread
From: Yipeng Zou @ 2025-07-29 13:35 UTC (permalink / raw)
To: Thomas Gleixner, mingo, bp, dave.hansen, x86, hpa, peterz,
sohil.mehta, rui.zhang, arnd, yuntao.wang, linux-kernel
On 2025/7/29 16:53, Thomas Gleixner wrote:
> On Sun, Jul 27 2025 at 22:01, Thomas Gleixner wrote:
>
> But looking even deeper: machine_crash_shutdown() does not end up in
> stop_other_cpus() at all. It immediately uses the NMI shutdown. There
> are still a few inconsistencies in that code, but they are not really
> critical.
>
> So the actual scenario is:
>
>  CPU0                           CPU2
>  ========================       ======================
>  reboot                         Panic
>  machine shutdown               Kdump
>                                 machine_crash_shutdown()
>  stop other cpus
>  send_IPIs(REBOOT)  -->         REBOOT vector becomes pending in IRR
>  wait timeout
>                                 send NMI stop
>  NMI -> CPU stop
>                                 jump to crash kernel
>
> So the patch I gave you should handle the reboot vector pending in IRR
> gracefully. Can you please give it a try?
Hi Thomas:
Thanks for your time!
Indeed, it invokes kdump_nmi_shootdown_cpus() and uses the NMI
shutdown.
I started the test run today, but this path is hit with low probability,
so it might take a while.
--
Regards,
Yipeng Zou
* Re: [BUG REPORT] x86/apic: CPU Hang in x86 VM During Kdump
2025-07-29 13:35 ` Yipeng Zou
@ 2025-07-29 19:48 ` Thomas Gleixner
2025-08-11 12:51 ` Yipeng Zou
0 siblings, 1 reply; 8+ messages in thread
From: Thomas Gleixner @ 2025-07-29 19:48 UTC (permalink / raw)
To: Yipeng Zou, mingo, bp, dave.hansen, x86, hpa, peterz, sohil.mehta,
rui.zhang, arnd, yuntao.wang, linux-kernel
On Tue, Jul 29 2025 at 21:35, Yipeng Zou wrote:
> On 2025/7/29 16:53, Thomas Gleixner wrote:
>> On Sun, Jul 27 2025 at 22:01, Thomas Gleixner wrote:
>> So the patch I gave you should handle the reboot vector pending in IRR
>> gracefully. Can you please give it a try?
>
> Hi Thomas:
>
> Thanks for your time!
>
> Indeed, it invokes kdump_nmi_shootdown_cpus() and uses the NMI
> shutdown.
>
> I started the test run today, but this path is hit with low probability,
> so it might take a while.
It's trivial enough to enforce that, no?
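One way to enforce it would be a throwaway test hack along these lines
(a sketch only; placing it early in the crash shutdown path is an
assumption, any point that runs with interrupts disabled before the jump
to the crash kernel would do):

#include <linux/delay.h>
#include <linux/irqflags.h>

/*
 * Test hack sketch: widen the race window so a concurrent reboot on
 * another CPU has time to send its REBOOT IPI while this CPU already
 * runs with interrupts disabled in the crash path. The IPI then sits
 * in the IRR when the crash kernel is entered.
 */
static void kdump_widen_race_window(void)
{
	local_irq_disable();
	mdelay(3000);	/* time for stop_other_cpus() on the rebooting CPU to fire */
}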
* Re: [BUG REPORT] x86/apic: CPU Hang in x86 VM During Kdump
2025-07-29 19:48 ` Thomas Gleixner
@ 2025-08-11 12:51 ` Yipeng Zou
0 siblings, 0 replies; 8+ messages in thread
From: Yipeng Zou @ 2025-08-11 12:51 UTC (permalink / raw)
To: Thomas Gleixner, mingo, bp, dave.hansen, x86, hpa, peterz,
sohil.mehta, rui.zhang, arnd, yuntao.wang, linux-kernel
On 2025/7/30 3:48, Thomas Gleixner wrote:
> On Tue, Jul 29 2025 at 21:35, Yipeng Zou wrote:
>> On 2025/7/29 16:53, Thomas Gleixner wrote:
>>> On Sun, Jul 27 2025 at 22:01, Thomas Gleixner wrote:
>>> So the patch I gave you should handle the reboot vector pending in IRR
>>> gracefully. Can you please give it a try?
>> Hi Thomas:
>>
>> Thanks for your time!
>>
>> Indeed, it invokes kdump_nmi_shootdown_cpus() and uses the NMI
>> shutdown.
>>
>> I started the test run today, but this path is hit with low probability,
>> so it might take a while.
> It's trivial enough to enforce that, no?
Hi Thomas:
Sorry for the delay - after resolving a few issues with the test
environment, the patch works: all previously failing test cases now pass.
I also think this solution is reasonable.
--
Regards,
Yipeng Zou