public inbox for linux-kernel@vger.kernel.org
From: Thomas Gleixner <tglx@linutronix.de>
To: Ashok Raj <ashok.raj@intel.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
	x86@kernel.org, Mario Limonciello <mario.limonciello@amd.com>,
	Tom Lendacky <thomas.lendacky@amd.com>,
	Tony Battersby <tonyb@cybernetics.com>,
	Ashok Raj <ashok.raj@linux.intel.com>,
	Tony Luck <tony.luck@intel.com>,
	Arjan van de Veen <arjan@linux.intel.com>,
	Eric Biederman <ebiederm@xmission.com>,
	Ashok Raj <ashok.raj@intel.com>
Subject: Re: [patch V2 1/8] x86/smp: Make stop_other_cpus() more robust
Date: Thu, 15 Jun 2023 00:40:10 +0200
Message-ID: <87fs6t8y7p.ffs@tglx>
In-Reply-To: <ZIonT7+sxAx8IWEE@a4bf019067fa.jf.intel.com>

On Wed, Jun 14 2023 at 13:47, Ashok Raj wrote:
> On Wed, Jun 14, 2023 at 09:53:21PM +0200, Thomas Gleixner wrote:
>> 
>> Now let me look into this NMI cruft.
>> 
>
> Maybe if each CPU going down can set their mask, we can simply hit NMI to
> only the problematic ones?
>
> The simple count doesn't capture the CPUs in trouble.

Even a mask is not cutting it. If CPUs did not react to the reboot
vector, then there is no guarantee that they will not react to it
concurrently with the NMI IPI:

CPU0                          CPU1
                
IPI(BROADCAST, REBOOT);
wait() // timeout
                              stop_this_cpu()
if (!all_stopped()) {
  for_each_cpu(cpu, mask) {
                                mark_stopped(); <- all_stopped() == true now
       IPI(cpu, NMI);
  }                            --> NMI()

  // no wait() because all_stopped() == true

proceed_and_hope() ....

On bare metal this is likely to "work" by chance, but in a guest all
bets are off.
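
The interleaving in the diagram can be replayed as a small single-threaded
C model. This is only a sketch under stated assumptions: struct model,
mark_stopped(), send_nmi() and replay_race() are hypothetical names made up
for illustration, not kernel code, and the real race happens between
concurrently running CPUs rather than between sequential statements.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 2u

/* Hypothetical model state, not a kernel data structure. */
struct model {
    unsigned int stopped_mask;  /* bit n set: CPU n marked itself stopped */
    unsigned int nmi_sent_mask; /* bit n set: CPU0 sent an NMI to CPU n */
    unsigned int late_nmi_mask; /* bit n set: NMI sent after CPU n stopped */
    bool waited_after_nmi;      /* CPU0 waited for the NMI to take effect */
};

/* CPU0 is the initiator, so only the other CPUs have to be stopped. */
static bool all_stopped(const struct model *m)
{
    unsigned int others = ((1u << NR_CPUS) - 1u) & ~1u;

    return (m->stopped_mask & others) == others;
}

static void mark_stopped(struct model *m, unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    m->stopped_mask |= 1u << cpu;
}

/* Record the NMI and note whether the target had already stopped. */
static void send_nmi(struct model *m, unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    m->nmi_sent_mask |= 1u << cpu;
    if (m->stopped_mask & (1u << cpu))
        m->late_nmi_mask |= 1u << cpu;
}

/* Replay the interleaving from the diagram, one statement per step. */
static struct model replay_race(void)
{
    struct model m = { 0 };

    /* CPU0: reboot IPI was broadcast, wait() timed out. */
    if (!all_stopped(&m)) {
        /* CPU1: reacts late to the reboot vector and stops now. */
        mark_stopped(&m, 1);

        /* CPU0: CPU1 is still in the mask, so it gets the NMI anyway. */
        send_nmi(&m, 1);

        /* CPU0: would wait here, but the condition already holds. */
        if (!all_stopped(&m))
            m.waited_after_nmi = true;
    }
    return m;
}
```

In this model replay_race() ends with CPU1's bit set in late_nmi_mask and
waited_after_nmi still false: the NMI targets a CPU which already reported
itself stopped, and CPU0 proceeds without waiting for the NMI to take
effect. A per-CPU mask does not change that outcome, because the mask is
sampled before CPU1 updates it.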

I'm not surprised at all.

The approach of piling hardware and firmware legacy on top of hardware
and firmware legacy in the hope that we can "fix" that in software was
wrong from the very beginning.

What's surprising is that this worked for a really long time. Though
with increasing complexity, the debris produced by that approach is
starting to rear its ugly head.

I'm sure the marketing departments of _all_ x86 vendors will come up
with a brilliant slogan for that. Something like:

  "We are committed to ensuring that you are able to experience the
   failures of the past forever with increasingly improved performance
   and new exciting features which are fully backwards failure
   compatible."

TBH, the (OS) software industry has proliferated that by joining the
'features first' choir without much thought or pushback. See
arch/x86/kernel/cpu/* for prime examples.

Ranted enough. I'm going to sleep now and look at this mess tomorrow
morning with brain awake again. Though that will not change the
underlying problem, which is unfixable.

Thanks,

        tglx



Thread overview: 16+ messages
2023-06-13 12:17 [patch V2 0/8] x86/smp: Cure stop_other_cpus() and kexec() troubles Thomas Gleixner
2023-06-13 12:17 ` [patch V2 1/8] x86/smp: Make stop_other_cpus() more robust Thomas Gleixner
2023-06-14 19:42   ` Ashok Raj
2023-06-14 19:53     ` Thomas Gleixner
2023-06-14 20:47       ` Ashok Raj
2023-06-14 22:40         ` Thomas Gleixner [this message]
2023-06-13 12:17 ` [patch V2 2/8] x86/smp: Dont access non-existing CPUID leaf Thomas Gleixner
2023-06-13 12:17 ` [patch V2 3/8] x86/smp: Remove pointless wmb() from native_stop_other_cpus() Thomas Gleixner
2023-06-15  8:58   ` Peter Zijlstra
2023-06-15 10:57     ` Thomas Gleixner
2023-06-13 12:17 ` [patch V2 4/8] x86/smp: Acquire stopping_cpu unconditionally Thomas Gleixner
2023-06-15  9:02   ` Peter Zijlstra
2023-06-13 12:18 ` [patch V2 5/8] x86/smp: Use dedicated cache-line for mwait_play_dead() Thomas Gleixner
2023-06-13 12:18 ` [patch V2 6/8] x86/smp: Cure kexec() vs. mwait_play_dead() breakage Thomas Gleixner
2023-06-13 12:18 ` [patch V2 7/8] x86/smp: Split sending INIT IPI out into a helper function Thomas Gleixner
2023-06-13 12:18 ` [patch V2 8/8] x86/smp: Put CPUs into INIT on shutdown if possible Thomas Gleixner
