All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Roger Pau Monné" <roger.pau@citrix.com>
To: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: xen-devel@lists.xenproject.org, Jan Beulich <jbeulich@suse.com>
Subject: Re: [PATCH 2/2] x86/boot: attempt to print trace and panic on AP bring up stall
Date: Thu, 22 May 2025 09:26:47 +0200	[thread overview]
Message-ID: <aC7Rt8tyoF09aYl3@macbook.local> (raw)
In-Reply-To: <b66645c8-0e3e-4414-acd3-a0acc6731a14@citrix.com>

On Wed, May 21, 2025 at 07:16:18PM +0100, Andrew Cooper wrote:
> On 21/05/2025 5:55 pm, Roger Pau Monne wrote:
> > With the current AP bring up code Xen can get stuck indefinitely if an AP
> 
> You want a comma between "code, Xen" to make the sentence easier to parse.
> 
> > freezes during boot after the 'callin' step.  Introduce a 10s timeout while
> > waiting for APs to finish startup.
> 
> 5s is the timeout used in other parts of AP bringup.  I'd suggest being
> consistent here.
> 
> 
> > On failure of an AP to complete startup send an NMI to trigger the printing
> 
> Again, a comma between "startup, send" would go a long way.
> 
> > of a stack backtrace on the stuck AP and panic on the BSP.
> >
> > The sending of the NMI re-uses the code already present in fatal_trap(), by
> > moving it to a separate function.
> 
> I'd be tempted to split the patch in two.
> 
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> 
> It may be worth nothing that this came from the ICX143 investigation,
> even if it wasn't relevant in the end?
> 
> > diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
> > index 48ce996ba414..77dce3e3e22b 100644
> > --- a/xen/arch/x86/smpboot.c
> > +++ b/xen/arch/x86/smpboot.c
> > @@ -1388,10 +1389,17 @@ int __cpu_up(unsigned int cpu)
> >      time_latch_stamps();
> >  
> >      set_cpu_state(CPU_STATE_ONLINE);
> > +    start = NOW();
> >      while ( !cpu_online(cpu) )
> >      {
> >          cpu_relax();
> >          process_pending_softirqs();
> > +        if ( NOW() > start + SECONDS(10) )
> 
> (NOW() - start) > SECONDS(10)
> 
> It has one fewer boundary conditions, even if it is rather unlikely that
> start + 10s will overflow.
> 
> > diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> > index c94779b4ad4f..9b9e3726e2fb 100644
> > --- a/xen/arch/x86/traps.c
> > +++ b/xen/arch/x86/traps.c
> > @@ -734,6 +736,40 @@ static int cf_check nmi_show_execution_state(
> >      return 1;
> >  }
> >  
> > +void show_execution_state_nmi(const cpumask_t *mask, bool show_all)
> > +{
> > +    unsigned int msecs, pending;
> > +
> > +    force_show_all = show_all;
> > +
> > +    watchdog_disable();
> > +    console_start_sync();
> > +
> > +    cpumask_copy(&show_state_mask, mask);
> > +    set_nmi_callback(nmi_show_execution_state);
> > +    /* Ensure new callback is set before sending out the NMI. */
> > +    smp_wmb();
> 
> I know this is only moving code, but this is wrong.  So is the smp_mb()
> in the x2apic drivers.
> 
> It would only be correct in principle for xAPIC (which is an MMIO
> store), except it's UC and is strongly ordered anyway.  Furthermore, the
> sequence point for the send_IPI_mask() prevents the compiler from
> reordering this unsafely.
> 
> The x2APIC drivers need LFENCE;MFENCE on Intel, and just MFENCE on AMD,
> and this (critically) is not smp_mb(), which is now just a locked operation.
> 
> I bet these aren't the only examples of incorrect barriers WRT IPIs.  I
> guess we should fix those separately.

Thanks, I will remove the smp_wmb() ahead of moving the code, but
other instances in the APIC drivers I will leave for a different
series, I don't want to delay the work here on those fixes.

Regards, Roger.


  reply	other threads:[~2025-05-22  7:26 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-21 16:55 [PATCH 0/2] x86/boot: provide better diagnostics in AP boot failure Roger Pau Monne
2025-05-21 16:55 ` [PATCH 1/2] x86/boot: print CPU number in bring up failure Roger Pau Monne
2025-05-21 16:58   ` Andrew Cooper
2025-05-21 16:55 ` [PATCH 2/2] x86/boot: attempt to print trace and panic on AP bring up stall Roger Pau Monne
2025-05-21 18:16   ` Andrew Cooper
2025-05-22  7:26     ` Roger Pau Monné [this message]
2025-05-22  7:18   ` Jan Beulich
2025-05-22 14:11     ` Roger Pau Monné
2025-05-22 14:53       ` Jan Beulich

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aC7Rt8tyoF09aYl3@macbook.local \
    --to=roger.pau@citrix.com \
    --cc=andrew.cooper3@citrix.com \
    --cc=jbeulich@suse.com \
    --cc=xen-devel@lists.xenproject.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.