* [PATCH v2 1/4] x86/boot: print CPU and APIC ID in bring up failure
2025-05-22 7:54 [PATCH v2 0/4] x86/boot: provide better diagnostics in AP boot failure Roger Pau Monne
@ 2025-05-22 7:54 ` Roger Pau Monne
2025-05-22 9:10 ` Jan Beulich
2025-05-22 7:54 ` [PATCH v2 2/4] x86/traps: remove smp_mb() ahead of IPI dispatch Roger Pau Monne
` (2 subsequent siblings)
3 siblings, 1 reply; 13+ messages in thread
From: Roger Pau Monne @ 2025-05-22 7:54 UTC (permalink / raw)
To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper
Print the CPU and APIC ID that fails to respond to the init sequence, or
that didn't manage to reach the "callin" state. Expand a bit the printed
error messages. Otherwise the "Not responding." message is not easy to
understand by users.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
- Also print APIC ID.
---
xen/arch/x86/smpboot.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 0189d6c332a4..dbc2f2f1d411 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -618,10 +618,12 @@ static int do_boot_cpu(int apicid, int cpu)
smp_mb();
if ( bootsym(trampoline_cpu_started) == 0xA5 )
/* trampoline started but...? */
- printk("Stuck ??\n");
+ printk("CPU%u/APICID%u: Didn't finish startup sequence\n",
+ cpu, apicid);
else
/* trampoline code not run */
- printk("Not responding.\n");
+ printk("CPU%u/APICID%u: Not responding to startup\n",
+ cpu, apicid);
}
}
--
2.49.0
^ permalink raw reply related [flat|nested] 13+ messages in thread* Re: [PATCH v2 1/4] x86/boot: print CPU and APIC ID in bring up failure
2025-05-22 7:54 ` [PATCH v2 1/4] x86/boot: print CPU and APIC ID in bring up failure Roger Pau Monne
@ 2025-05-22 9:10 ` Jan Beulich
2025-05-22 14:39 ` Andrew Cooper
0 siblings, 1 reply; 13+ messages in thread
From: Jan Beulich @ 2025-05-22 9:10 UTC (permalink / raw)
To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel
On 22.05.2025 09:54, Roger Pau Monne wrote:
> Print the CPU and APIC ID that fails to respond to the init sequence, or
> that didn't manage to reach the "callin" state. Expand a bit the printed
> error messages. Otherwise the "Not responding." message is not easy to
> understand by users.
>
> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Changes since v1:
> - Also print APIC ID.
> ---
> xen/arch/x86/smpboot.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
> index 0189d6c332a4..dbc2f2f1d411 100644
> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -618,10 +618,12 @@ static int do_boot_cpu(int apicid, int cpu)
> smp_mb();
> if ( bootsym(trampoline_cpu_started) == 0xA5 )
> /* trampoline started but...? */
> - printk("Stuck ??\n");
> + printk("CPU%u/APICID%u: Didn't finish startup sequence\n",
> + cpu, apicid);
> else
> /* trampoline code not run */
> - printk("Not responding.\n");
> + printk("CPU%u/APICID%u: Not responding to startup\n",
> + cpu, apicid);
> }
> }
>
Elsewhere I think we print AIC IDs in hex; may be better to do so here, too.
That may then want some text re-arrangement though, e.g.
"CPU%u: APICID %#x not responding to startup\n"
Thoughts?
Jan
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH v2 1/4] x86/boot: print CPU and APIC ID in bring up failure
2025-05-22 9:10 ` Jan Beulich
@ 2025-05-22 14:39 ` Andrew Cooper
2025-05-22 16:50 ` Roger Pau Monné
0 siblings, 1 reply; 13+ messages in thread
From: Andrew Cooper @ 2025-05-22 14:39 UTC (permalink / raw)
To: Jan Beulich, Roger Pau Monne; +Cc: xen-devel
On 22/05/2025 10:10 am, Jan Beulich wrote:
> On 22.05.2025 09:54, Roger Pau Monne wrote:
>> Print the CPU and APIC ID that fails to respond to the init sequence, or
>> that didn't manage to reach the "callin" state. Expand a bit the printed
>> error messages. Otherwise the "Not responding." message is not easy to
>> understand by users.
>>
>> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>> ---
>> Changes since v1:
>> - Also print APIC ID.
>> ---
>> xen/arch/x86/smpboot.c | 6 ++++--
>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
>> index 0189d6c332a4..dbc2f2f1d411 100644
>> --- a/xen/arch/x86/smpboot.c
>> +++ b/xen/arch/x86/smpboot.c
>> @@ -618,10 +618,12 @@ static int do_boot_cpu(int apicid, int cpu)
>> smp_mb();
>> if ( bootsym(trampoline_cpu_started) == 0xA5 )
>> /* trampoline started but...? */
>> - printk("Stuck ??\n");
>> + printk("CPU%u/APICID%u: Didn't finish startup sequence\n",
>> + cpu, apicid);
>> else
>> /* trampoline code not run */
>> - printk("Not responding.\n");
>> + printk("CPU%u/APICID%u: Not responding to startup\n",
>> + cpu, apicid);
>> }
>> }
>>
> Elsewhere I think we print AIC IDs in hex; may be better to do so here, too.
> That may then want some text re-arrangement though, e.g.
>
> "CPU%u: APICID %#x not responding to startup\n"
>
> Thoughts?
Definitely hex. Elsewhere APIC_ID always has an underscore.
I'd suggest:
"APIC_ID %#x (CPU%u) didn't respond to SIPI\n"
The APIC_ID is the critical detail, and the CPU number is fairly incidental.
Also as we're changing things, lets not retain the STARTUP name seeing
as it only exists on pre-Pentium APICs. SIPI is what we use these days.
~Andrew
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH v2 1/4] x86/boot: print CPU and APIC ID in bring up failure
2025-05-22 14:39 ` Andrew Cooper
@ 2025-05-22 16:50 ` Roger Pau Monné
2025-05-22 17:22 ` Andrew Cooper
0 siblings, 1 reply; 13+ messages in thread
From: Roger Pau Monné @ 2025-05-22 16:50 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Jan Beulich, xen-devel
On Thu, May 22, 2025 at 03:39:57PM +0100, Andrew Cooper wrote:
> On 22/05/2025 10:10 am, Jan Beulich wrote:
> > On 22.05.2025 09:54, Roger Pau Monne wrote:
> >> Print the CPU and APIC ID that fails to respond to the init sequence, or
> >> that didn't manage to reach the "callin" state. Expand a bit the printed
> >> error messages. Otherwise the "Not responding." message is not easy to
> >> understand by users.
> >>
> >> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
> >> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> >> ---
> >> Changes since v1:
> >> - Also print APIC ID.
> >> ---
> >> xen/arch/x86/smpboot.c | 6 ++++--
> >> 1 file changed, 4 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
> >> index 0189d6c332a4..dbc2f2f1d411 100644
> >> --- a/xen/arch/x86/smpboot.c
> >> +++ b/xen/arch/x86/smpboot.c
> >> @@ -618,10 +618,12 @@ static int do_boot_cpu(int apicid, int cpu)
> >> smp_mb();
> >> if ( bootsym(trampoline_cpu_started) == 0xA5 )
> >> /* trampoline started but...? */
> >> - printk("Stuck ??\n");
> >> + printk("CPU%u/APICID%u: Didn't finish startup sequence\n",
> >> + cpu, apicid);
> >> else
> >> /* trampoline code not run */
> >> - printk("Not responding.\n");
> >> + printk("CPU%u/APICID%u: Not responding to startup\n",
> >> + cpu, apicid);
> >> }
> >> }
> >>
> > Elsewhere I think we print AIC IDs in hex; may be better to do so here, too.
> > That may then want some text re-arrangement though, e.g.
> >
> > "CPU%u: APICID %#x not responding to startup\n"
> >
> > Thoughts?
>
> Definitely hex. Elsewhere APIC_ID always has an underscore.
Maybe I'm confused, but I don't think Xen uses an underscore, it's
always 'APIC ID' when printed. I don't mind adding it here, I assume
what you mean with elsewhere is other projects like Linux?
Thanks, Roger.
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH v2 1/4] x86/boot: print CPU and APIC ID in bring up failure
2025-05-22 16:50 ` Roger Pau Monné
@ 2025-05-22 17:22 ` Andrew Cooper
2025-05-23 7:19 ` Roger Pau Monné
0 siblings, 1 reply; 13+ messages in thread
From: Andrew Cooper @ 2025-05-22 17:22 UTC (permalink / raw)
To: Roger Pau Monné; +Cc: Jan Beulich, xen-devel
On 22/05/2025 5:50 pm, Roger Pau Monné wrote:
> On Thu, May 22, 2025 at 03:39:57PM +0100, Andrew Cooper wrote:
>> On 22/05/2025 10:10 am, Jan Beulich wrote:
>>> On 22.05.2025 09:54, Roger Pau Monne wrote:
>>>> Print the CPU and APIC ID that fails to respond to the init sequence, or
>>>> that didn't manage to reach the "callin" state. Expand a bit the printed
>>>> error messages. Otherwise the "Not responding." message is not easy to
>>>> understand by users.
>>>>
>>>> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>>> ---
>>>> Changes since v1:
>>>> - Also print APIC ID.
>>>> ---
>>>> xen/arch/x86/smpboot.c | 6 ++++--
>>>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
>>>> index 0189d6c332a4..dbc2f2f1d411 100644
>>>> --- a/xen/arch/x86/smpboot.c
>>>> +++ b/xen/arch/x86/smpboot.c
>>>> @@ -618,10 +618,12 @@ static int do_boot_cpu(int apicid, int cpu)
>>>> smp_mb();
>>>> if ( bootsym(trampoline_cpu_started) == 0xA5 )
>>>> /* trampoline started but...? */
>>>> - printk("Stuck ??\n");
>>>> + printk("CPU%u/APICID%u: Didn't finish startup sequence\n",
>>>> + cpu, apicid);
>>>> else
>>>> /* trampoline code not run */
>>>> - printk("Not responding.\n");
>>>> + printk("CPU%u/APICID%u: Not responding to startup\n",
>>>> + cpu, apicid);
>>>> }
>>>> }
>>>>
>>> Elsewhere I think we print AIC IDs in hex; may be better to do so here, too.
>>> That may then want some text re-arrangement though, e.g.
>>>
>>> "CPU%u: APICID %#x not responding to startup\n"
>>>
>>> Thoughts?
>> Definitely hex. Elsewhere APIC_ID always has an underscore.
> Maybe I'm confused, but I don't think Xen uses an underscore, it's
> always 'APIC ID' when printed. I don't mind adding it here, I assume
> what you mean with elsewhere is other projects like Linux?
It's apic_id in plenty of smpboot.c, but fine - lets do it with a space.
We do need to reduce from $N ways of rendering this down to 1.
~Andrew
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH v2 1/4] x86/boot: print CPU and APIC ID in bring up failure
2025-05-22 17:22 ` Andrew Cooper
@ 2025-05-23 7:19 ` Roger Pau Monné
0 siblings, 0 replies; 13+ messages in thread
From: Roger Pau Monné @ 2025-05-23 7:19 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Jan Beulich, xen-devel
On Thu, May 22, 2025 at 06:22:19PM +0100, Andrew Cooper wrote:
> On 22/05/2025 5:50 pm, Roger Pau Monné wrote:
> > On Thu, May 22, 2025 at 03:39:57PM +0100, Andrew Cooper wrote:
> >> On 22/05/2025 10:10 am, Jan Beulich wrote:
> >>> On 22.05.2025 09:54, Roger Pau Monne wrote:
> >>>> Print the CPU and APIC ID that fails to respond to the init sequence, or
> >>>> that didn't manage to reach the "callin" state. Expand a bit the printed
> >>>> error messages. Otherwise the "Not responding." message is not easy to
> >>>> understand by users.
> >>>>
> >>>> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
> >>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> >>>> ---
> >>>> Changes since v1:
> >>>> - Also print APIC ID.
> >>>> ---
> >>>> xen/arch/x86/smpboot.c | 6 ++++--
> >>>> 1 file changed, 4 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
> >>>> index 0189d6c332a4..dbc2f2f1d411 100644
> >>>> --- a/xen/arch/x86/smpboot.c
> >>>> +++ b/xen/arch/x86/smpboot.c
> >>>> @@ -618,10 +618,12 @@ static int do_boot_cpu(int apicid, int cpu)
> >>>> smp_mb();
> >>>> if ( bootsym(trampoline_cpu_started) == 0xA5 )
> >>>> /* trampoline started but...? */
> >>>> - printk("Stuck ??\n");
> >>>> + printk("CPU%u/APICID%u: Didn't finish startup sequence\n",
> >>>> + cpu, apicid);
> >>>> else
> >>>> /* trampoline code not run */
> >>>> - printk("Not responding.\n");
> >>>> + printk("CPU%u/APICID%u: Not responding to startup\n",
> >>>> + cpu, apicid);
> >>>> }
> >>>> }
> >>>>
> >>> Elsewhere I think we print AIC IDs in hex; may be better to do so here, too.
> >>> That may then want some text re-arrangement though, e.g.
> >>>
> >>> "CPU%u: APICID %#x not responding to startup\n"
> >>>
> >>> Thoughts?
> >> Definitely hex. Elsewhere APIC_ID always has an underscore.
> > Maybe I'm confused, but I don't think Xen uses an underscore, it's
> > always 'APIC ID' when printed. I don't mind adding it here, I assume
> > what you mean with elsewhere is other projects like Linux?
>
> It's apic_id in plenty of smpboot.c, but fine - lets do it with a space.
>
> We do need to reduce from $N ways of rendering this down to 1.
Oh, so it's lowercase with underscore. Anyway, will use uppercase and
space as agreed.
Thanks, Roger.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v2 2/4] x86/traps: remove smp_mb() ahead of IPI dispatch
2025-05-22 7:54 [PATCH v2 0/4] x86/boot: provide better diagnostics in AP boot failure Roger Pau Monne
2025-05-22 7:54 ` [PATCH v2 1/4] x86/boot: print CPU and APIC ID in bring up failure Roger Pau Monne
@ 2025-05-22 7:54 ` Roger Pau Monne
2025-05-22 9:16 ` Jan Beulich
2025-05-22 7:54 ` [PATCH v2 3/4] x86/traps: split code to dump execution state to a separate helper Roger Pau Monne
2025-05-22 7:54 ` [PATCH v2 4/4] x86/boot: attempt to print trace and panic on AP bring up stall Roger Pau Monne
3 siblings, 1 reply; 13+ messages in thread
From: Roger Pau Monne @ 2025-05-22 7:54 UTC (permalink / raw)
To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper
The IPI dispatch functions should already have the required barriers to
ensure correct memory ordering.
Note other callers of send_IPI_mask() don't use any barriers.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
- New in this version.
---
xen/arch/x86/traps.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index c94779b4ad4f..22f20629327d 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -785,8 +785,6 @@ void fatal_trap(const struct cpu_user_regs *regs, bool show_remote)
cpumask_andnot(&show_state_mask, &cpu_online_map,
cpumask_of(smp_processor_id()));
set_nmi_callback(nmi_show_execution_state);
- /* Ensure new callback is set before sending out the NMI. */
- smp_wmb();
smp_send_nmi_allbutself();
/* Wait at most 10ms for some other CPU to respond. */
--
2.49.0
^ permalink raw reply related [flat|nested] 13+ messages in thread* Re: [PATCH v2 2/4] x86/traps: remove smp_mb() ahead of IPI dispatch
2025-05-22 7:54 ` [PATCH v2 2/4] x86/traps: remove smp_mb() ahead of IPI dispatch Roger Pau Monne
@ 2025-05-22 9:16 ` Jan Beulich
0 siblings, 0 replies; 13+ messages in thread
From: Jan Beulich @ 2025-05-22 9:16 UTC (permalink / raw)
To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel
On 22.05.2025 09:54, Roger Pau Monne wrote:
> The IPI dispatch functions should already have the required barriers to
> ensure correct memory ordering.
To be quite honest, "should" isn't sufficient here for my taste. Either
they are there or they aren't. According to my check they are, so ...
> Note other callers of send_IPI_mask() don't use any barriers.
>
> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
with the word dropped.
Jan
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v2 3/4] x86/traps: split code to dump execution state to a separate helper
2025-05-22 7:54 [PATCH v2 0/4] x86/boot: provide better diagnostics in AP boot failure Roger Pau Monne
2025-05-22 7:54 ` [PATCH v2 1/4] x86/boot: print CPU and APIC ID in bring up failure Roger Pau Monne
2025-05-22 7:54 ` [PATCH v2 2/4] x86/traps: remove smp_mb() ahead of IPI dispatch Roger Pau Monne
@ 2025-05-22 7:54 ` Roger Pau Monne
2025-05-22 13:41 ` Jan Beulich
2025-05-22 7:54 ` [PATCH v2 4/4] x86/boot: attempt to print trace and panic on AP bring up stall Roger Pau Monne
3 siblings, 1 reply; 13+ messages in thread
From: Roger Pau Monne @ 2025-05-22 7:54 UTC (permalink / raw)
To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper
Split the code that triggers remote CPUs to dump stacks into a separate
function. Also introduce a parameter that can be set by the caller of the
newly introduced function to force CPUs to dump the full stack, rather than
just dumping the current function name.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/include/asm/processor.h | 1 +
xen/arch/x86/traps.c | 62 +++++++++++++++++-----------
2 files changed, 39 insertions(+), 24 deletions(-)
diff --git a/xen/arch/x86/include/asm/processor.h b/xen/arch/x86/include/asm/processor.h
index eacd425c5350..10d8078cc1ca 100644
--- a/xen/arch/x86/include/asm/processor.h
+++ b/xen/arch/x86/include/asm/processor.h
@@ -371,6 +371,7 @@ void show_registers(const struct cpu_user_regs *regs);
#define dump_execution_state() run_in_exception_handler(show_execution_state)
void show_page_walk(unsigned long addr);
void noreturn fatal_trap(const struct cpu_user_regs *regs, bool show_remote);
+void show_execution_state_nmi(const cpumask_t *mask, bool show_all);
extern void mtrr_ap_init(void);
extern void mtrr_bp_init(void);
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 22f20629327d..f6646d505644 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -714,13 +714,15 @@ static cpumask_t show_state_mask;
static bool opt_show_all;
boolean_param("async-show-all", opt_show_all);
+static bool force_show_all;
+
static int cf_check nmi_show_execution_state(
const struct cpu_user_regs *regs, int cpu)
{
if ( !cpumask_test_cpu(cpu, &show_state_mask) )
return 0;
- if ( opt_show_all )
+ if ( opt_show_all || force_show_all )
show_execution_state(regs);
else if ( guest_mode(regs) )
printk(XENLOG_ERR "CPU%d\t%pv\t%04x:%p in guest\n",
@@ -734,6 +736,38 @@ static int cf_check nmi_show_execution_state(
return 1;
}
+void show_execution_state_nmi(const cpumask_t *mask, bool show_all)
+{
+ unsigned int msecs, pending;
+
+ force_show_all = show_all;
+
+ watchdog_disable();
+ console_start_sync();
+
+ cpumask_copy(&show_state_mask, mask);
+ set_nmi_callback(nmi_show_execution_state);
+ send_IPI_mask(mask, APIC_DM_NMI);
+
+ /* Wait at most 10ms for some other CPU to respond. */
+ msecs = 10;
+ pending = cpumask_weight(&show_state_mask);
+ while ( pending && msecs-- )
+ {
+ unsigned int left;
+
+ mdelay(1);
+ left = cpumask_weight(&show_state_mask);
+ if ( left < pending )
+ {
+ pending = left;
+ msecs = 10;
+ }
+ }
+ if ( pending )
+ printk("Non-responding CPUs: {%*pbl}\n", CPUMASK_PR(&show_state_mask));
+}
+
const char *vector_name(unsigned int vec)
{
static const char names[][4] = {
@@ -780,31 +814,11 @@ void fatal_trap(const struct cpu_user_regs *regs, bool show_remote)
if ( show_remote )
{
- unsigned int msecs, pending;
+ cpumask_t *scratch = this_cpu(scratch_cpumask);
- cpumask_andnot(&show_state_mask, &cpu_online_map,
+ cpumask_andnot(scratch, &cpu_online_map,
cpumask_of(smp_processor_id()));
- set_nmi_callback(nmi_show_execution_state);
- smp_send_nmi_allbutself();
-
- /* Wait at most 10ms for some other CPU to respond. */
- msecs = 10;
- pending = cpumask_weight(&show_state_mask);
- while ( pending && msecs-- )
- {
- unsigned int left;
-
- mdelay(1);
- left = cpumask_weight(&show_state_mask);
- if ( left < pending )
- {
- pending = left;
- msecs = 10;
- }
- }
- if ( pending )
- printk("Non-responding CPUs: {%*pbl}\n",
- CPUMASK_PR(&show_state_mask));
+ show_execution_state_nmi(scratch, false);
}
}
--
2.49.0
^ permalink raw reply related [flat|nested] 13+ messages in thread* Re: [PATCH v2 3/4] x86/traps: split code to dump execution state to a separate helper
2025-05-22 7:54 ` [PATCH v2 3/4] x86/traps: split code to dump execution state to a separate helper Roger Pau Monne
@ 2025-05-22 13:41 ` Jan Beulich
0 siblings, 0 replies; 13+ messages in thread
From: Jan Beulich @ 2025-05-22 13:41 UTC (permalink / raw)
To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel
On 22.05.2025 09:54, Roger Pau Monne wrote:
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -714,13 +714,15 @@ static cpumask_t show_state_mask;
> static bool opt_show_all;
> boolean_param("async-show-all", opt_show_all);
>
> +static bool force_show_all;
> +
> static int cf_check nmi_show_execution_state(
> const struct cpu_user_regs *regs, int cpu)
> {
> if ( !cpumask_test_cpu(cpu, &show_state_mask) )
> return 0;
>
> - if ( opt_show_all )
> + if ( opt_show_all || force_show_all )
> show_execution_state(regs);
> else if ( guest_mode(regs) )
> printk(XENLOG_ERR "CPU%d\t%pv\t%04x:%p in guest\n",
> @@ -734,6 +736,38 @@ static int cf_check nmi_show_execution_state(
> return 1;
> }
>
> +void show_execution_state_nmi(const cpumask_t *mask, bool show_all)
> +{
> + unsigned int msecs, pending;
> +
> + force_show_all = show_all;
Prior (v1) comments here still apply.
Jan
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v2 4/4] x86/boot: attempt to print trace and panic on AP bring up stall
2025-05-22 7:54 [PATCH v2 0/4] x86/boot: provide better diagnostics in AP boot failure Roger Pau Monne
` (2 preceding siblings ...)
2025-05-22 7:54 ` [PATCH v2 3/4] x86/traps: split code to dump execution state to a separate helper Roger Pau Monne
@ 2025-05-22 7:54 ` Roger Pau Monne
2025-05-22 14:42 ` Andrew Cooper
3 siblings, 1 reply; 13+ messages in thread
From: Roger Pau Monne @ 2025-05-22 7:54 UTC (permalink / raw)
To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper
With the current AP bring up code, Xen can get stuck indefinitely if an AP
freezes during boot after the 'callin' step. Introduce a 5s timeout while
waiting for APs to finish startup.
On failure of an AP to complete startup, send an NMI to trigger the
printing of a stack backtrace on the stuck AP and panic on the BSP.
This patch was done while investigating the issue caused by Intel erratum
ICX143. It wasn't helpful in that case, but it's still and improvement
when debugging AP bring up related issues.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
- Use 5s timeout.
- Print APICID.
- Split NMI dispatch code movement to a pre-patch.
- Reorder timeout check condition.
---
xen/arch/x86/smpboot.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index dbc2f2f1d411..50c5674555e4 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -1372,6 +1372,7 @@ int cpu_add(uint32_t apic_id, uint32_t acpi_id, uint32_t pxm)
int __cpu_up(unsigned int cpu)
{
int apicid, ret;
+ s_time_t start;
if ( (apicid = x86_cpu_to_apicid[cpu]) == BAD_APICID )
return -ENODEV;
@@ -1390,10 +1391,17 @@ int __cpu_up(unsigned int cpu)
time_latch_stamps();
set_cpu_state(CPU_STATE_ONLINE);
+ start = NOW();
while ( !cpu_online(cpu) )
{
cpu_relax();
process_pending_softirqs();
+ if ( (NOW() - start) > SECONDS(5) )
+ {
+ /* AP is stuck, send NMI and panic. */
+ show_execution_state_nmi(cpumask_of(cpu), true);
+ panic("CPU%u/APICID%u: Stuck while starting up\n", cpu, apicid);
+ }
}
return 0;
--
2.49.0
^ permalink raw reply related [flat|nested] 13+ messages in thread* Re: [PATCH v2 4/4] x86/boot: attempt to print trace and panic on AP bring up stall
2025-05-22 7:54 ` [PATCH v2 4/4] x86/boot: attempt to print trace and panic on AP bring up stall Roger Pau Monne
@ 2025-05-22 14:42 ` Andrew Cooper
0 siblings, 0 replies; 13+ messages in thread
From: Andrew Cooper @ 2025-05-22 14:42 UTC (permalink / raw)
To: Roger Pau Monne, xen-devel; +Cc: Jan Beulich
On 22/05/2025 8:54 am, Roger Pau Monne wrote:
> With the current AP bring up code, Xen can get stuck indefinitely if an AP
> freezes during boot after the 'callin' step. Introduce a 5s timeout while
> waiting for APs to finish startup.
>
> On failure of an AP to complete startup, send an NMI to trigger the
> printing of a stack backtrace on the stuck AP and panic on the BSP.
>
> This patch was done while investigating the issue caused by Intel erratum
> ICX143. It wasn't helpful in that case, but it's still and improvement
> when debugging AP bring up related issues.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
^ permalink raw reply [flat|nested] 13+ messages in thread