public inbox for linux-kernel@vger.kernel.org
From: Thomas Gleixner <tglx@linutronix.de>
To: Dave Hansen <dave.hansen@intel.com>,
	Tony Battersby <tonyb@cybernetics.com>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>,
	Mario Limonciello <mario.limonciello@amd.com>,
	Tom Lendacky <thomas.lendacky@amd.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Andi Kleen <ak@linux.intel.com>
Subject: Re: [PATCH RFC] x86/cpu: fix intermittent lockup on poweroff
Date: Wed, 26 Apr 2023 01:00:31 +0200
Message-ID: <87leifzhww.ffs@tglx>
In-Reply-To: <ecdea7a8-a748-6ecb-5fc1-93d7eda3c54d@intel.com>

On Tue, Apr 25 2023 at 15:29, Dave Hansen wrote:
> On 4/25/23 14:05, Thomas Gleixner wrote:
>> The only consequence of looking at bit 0 of some random other leaf is
>> that all CPUs which run stop_this_cpu() issue WBINVD in parallel, which
>> is slow but should not be a fatal issue.
>> 
>> Tony observed this is a 50% chance to hang, which means this is a timing
>> issue.
>
> I _think_ the system in question is a dual-socket Westmere.  I don't see
> any obvious errata that we could pin this on:
>
>> https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-5600-specification-update.pdf
>
> Andi Kleen had an interesting theory.  WBINVD is a pretty expensive
> operation.  It's possible that it has some degenerative behavior when
> it's called on a *bunch* of CPUs all at once (which this path can do).
> If the instruction takes too long, it could trigger one of the CPU's
> internal lockup detectors and trigger a machine check.  At that point,
> all hell breaks loose.
>
> I don't know the cache coherency protocol well enough to say for sure,
> but I wonder if there's a storm of cache coherency traffic as all those
> lines get written back.  One of the CPUs gets starved from making enough
> forward progress and trips a CPU-internal watchdog.
>
> Andi also says that it _should_ log something in the machine check banks
> when this happens so there should be at least some kind of breadcrumb.
>
> Either way, I'm hoping this hand waving satiates tglx's morbid curiosity
> about hardware that came out from before I even worked at Intel. ;)

No, it does not. :)

There is no reason to believe that this is just a problem of CPUs which
were released a long time ago.

If there is an issue with concurrent WBINVD then this needs to be
addressed independently of Tony's observations.

Aside from that, the early clearing of the CPU online bit, which allows
the control CPU to make progress, is still a possible explanation for
the wreckage based purely on timing.

The reason why I insist on a proper analysis is definitely not morbid
curiosity. The real reason is that I fundamentally hate problems being
handwaved away.

It's a matter of fact that all problems which are not root-caused keep
coming back, and not necessarily in debuggable ways. Tony's 50% case is
golden compared to the once-in-a-blue-moon issues.

I outlined the debug options already. So just throw them at the problem
instead of indulging in handwaving theories.

Thanks,

        tglx


Thread overview: 23+ messages
2023-04-25 19:26 [PATCH RFC] x86/cpu: fix intermittent lockup on poweroff Tony Battersby
2023-04-25 19:39 ` Borislav Petkov
2023-04-25 19:58   ` Tony Battersby
2023-04-25 20:03 ` Dave Hansen
2023-04-25 20:34   ` Dave Hansen
2023-04-25 21:06     ` Borislav Petkov
2023-04-25 21:05   ` Thomas Gleixner
2023-04-25 22:29     ` Dave Hansen
2023-04-25 23:00       ` Thomas Gleixner [this message]
2023-04-26  0:10       ` H. Peter Anvin
2023-04-26 14:45     ` Tony Battersby
2023-04-26 16:37       ` Thomas Gleixner
2023-04-26 17:37         ` Tony Battersby
2023-04-26 17:41           ` [PATCH v2] x86/cpu: fix SME test in stop_this_cpu() Tony Battersby
2023-05-22 14:07             ` [PATCH v2 RESEND] " Tony Battersby
2023-04-26 17:51           ` [PATCH RFC] x86/cpu: fix intermittent lockup on poweroff Tom Lendacky
2023-04-26 18:15             ` Dave Hansen
2023-04-26 19:18               ` Tom Lendacky
2023-04-26 22:02                 ` Andi Kleen
2023-04-26 23:20                   ` Thomas Gleixner
2023-04-26 20:00             ` Thomas Gleixner
2023-06-20 13:00 ` [tip: x86/core] x86/smp: Make stop_other_cpus() more robust tip-bot2 for Thomas Gleixner
2023-06-20 13:00 ` [tip: x86/core] x86/smp: Dont access non-existing CPUID leaf tip-bot2 for Tony Battersby
