* Re: [PATCH] x86/hpet: Read HPET directly if panic in progress
2024-05-28 22:12 ` Thomas Gleixner
@ 2024-05-28 23:22 ` Linus Torvalds
2024-05-29 7:42 ` Thomas Gleixner
2024-05-29 4:39 ` Tony W Wang-oc
2024-07-01 11:09 ` Tony W Wang-oc
2 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2024-05-28 23:22 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Dave Hansen, Tony W Wang-oc, mingo, bp, dave.hansen, x86, hpa,
keescook, tony.luck, gpiccoli, mat.jonczyk, rdunlap,
alexandre.belloni, mario.limonciello, yaolu, bhelgaas,
justinstitt, linux-kernel, linux-hardening, CobeChen, TimGuo,
LeoLiu-oc
On Tue, 28 May 2024 at 15:12, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> I principle it applies to any clocksource which needs a spinlock to
> serialize access. HPET is not the only insanity here.
HPET may be the main / only one we care about.
Because:
> Think about i8253 :)
I see the smiley, but yeah, I don't think we really care about it.
> 1) Should we provide a panic mode read callback for clocksources which
> are affected by this?
The current patch under discussion may be ugly, but looks workable.
Local ugliness isn't necessarily a show-stopper.
So if the HPET is the *only* case which has this situation, I vote for
just doing the ugly thing.
Now, if *other* cases exist, and can't be worked around in similar
ways, then that argues for a more "proper" fix.
And no, I don't think i8253 is a strong enough argument. I don't
actually believe you can realistically find a machine that doesn't
have HPET or the TSC and really falls back on the i8253 any more. And
if you *do* find hw like that, is it SMP-capable? And can you find
somebody who cares?
> 2) Is it correct to claim that a MCE which hits user space and ends up in
> mce_panic() is still just a regular exception or should we upgrade to
> NMI class context when we enter mce_panic() or even go as far to
> upgrade to NMI class context for any panic() invocation?
I do think that an NMI in user space should be considered mostly just
a normal exception. From a kernel perspective, the NMI'ness just
doesn't matter.
That said, I find your suggestion of making 'panic()' just basically
act as an NMI context intriguing. And cleaner than the
atomic_read(&panic_cpu) thing.
Are there any other situations than this odd HPET thing where that
would change semantics?
Linus
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH] x86/hpet: Read HPET directly if panic in progress
2024-05-28 23:22 ` Linus Torvalds
@ 2024-05-29 7:42 ` Thomas Gleixner
2024-06-05 6:23 ` Tony W Wang-oc
0 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2024-05-29 7:42 UTC (permalink / raw)
To: Linus Torvalds
Cc: Dave Hansen, Tony W Wang-oc, mingo, bp, dave.hansen, x86, hpa,
keescook, tony.luck, gpiccoli, mat.jonczyk, rdunlap,
alexandre.belloni, mario.limonciello, yaolu, bhelgaas,
justinstitt, linux-kernel, linux-hardening, CobeChen, TimGuo,
LeoLiu-oc
Linus!
On Tue, May 28 2024 at 16:22, Linus Torvalds wrote:
> On Tue, 28 May 2024 at 15:12, Thomas Gleixner <tglx@linutronix.de> wrote:
> I see the smiley, but yeah, I don't think we really care about it.
Indeed. But the same problem exists on other architectures as
well. drivers/clocksource alone has 4 examples aside of i8253
>> 1) Should we provide a panic mode read callback for clocksources which
>> are affected by this?
>
> The current patch under discussion may be ugly, but looks workable.
> Local ugliness isn't necessarily a show-stopper.
>
> So if the HPET is the *only* case which has this situation, I vote for
> just doing the ugly thing.
>
> Now, if *other* cases exist, and can't be worked around in similar
> ways, then that argues for a more "proper" fix.
>
> And no, I don't think i8253 is a strong enough argument. I don't
> actually believe you can realistically find a machine that doesn't
> have HPET or the TSC and really falls back on the i8253 any more. And
> if you *do* find hw like that, is it SMP-capable? And can you find
> somebody who cares?
Probably not.
>> 2) Is it correct to claim that a MCE which hits user space and ends up in
>> mce_panic() is still just a regular exception or should we upgrade to
>> NMI class context when we enter mce_panic() or even go as far to
>> upgrade to NMI class context for any panic() invocation?
>
> I do think that an NMI in user space should be considered mostly just
> a normal exception. From a kernel perspective, the NMI'ness just
> doesn't matter.
That's correct. I don't want to change that at all especially not for
recoverable MCEs.
> That said, I find your suggestion of making 'panic()' just basically
> act as an NMI context intriguing. And cleaner than the
> atomic_read(&panic_cpu) thing.
>
> Are there any other situations than this odd HPET thing where that
> would change semantics?
I need to go and stare at this some more.
Thanks,
tglx
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH] x86/hpet: Read HPET directly if panic in progress
2024-05-29 7:42 ` Thomas Gleixner
@ 2024-06-05 6:23 ` Tony W Wang-oc
2024-06-05 9:36 ` Borislav Petkov
0 siblings, 1 reply; 19+ messages in thread
From: Tony W Wang-oc @ 2024-06-05 6:23 UTC (permalink / raw)
To: Thomas Gleixner, Linus Torvalds
Cc: Dave Hansen, mingo, bp, dave.hansen, x86, hpa, keescook,
tony.luck, gpiccoli, mat.jonczyk, rdunlap, alexandre.belloni,
mario.limonciello, yaolu, bhelgaas, justinstitt, linux-kernel,
linux-hardening, CobeChen, TimGuo, LeoLiu-oc
On 2024/5/29 15:42, Thomas Gleixner wrote:
>
>
> [这封邮件来自外部发件人 谨防风险]
>
> Linus!
>
> On Tue, May 28 2024 at 16:22, Linus Torvalds wrote:
>> On Tue, 28 May 2024 at 15:12, Thomas Gleixner <tglx@linutronix.de> wrote:
>> I see the smiley, but yeah, I don't think we really care about it.
>
> Indeed. But the same problem exists on other architectures as
> well. drivers/clocksource alone has 4 examples aside of i8253
>
>>> 1) Should we provide a panic mode read callback for clocksources which
>>> are affected by this?
>>
>> The current patch under discussion may be ugly, but looks workable.
>> Local ugliness isn't necessarily a show-stopper.
>>
>> So if the HPET is the *only* case which has this situation, I vote for
>> just doing the ugly thing.
>>
>> Now, if *other* cases exist, and can't be worked around in similar
>> ways, then that argues for a more "proper" fix.
>>
>> And no, I don't think i8253 is a strong enough argument. I don't
>> actually believe you can realistically find a machine that doesn't
>> have HPET or the TSC and really falls back on the i8253 any more. And
>> if you *do* find hw like that, is it SMP-capable? And can you find
>> somebody who cares?
>
> Probably not.
>
>>> 2) Is it correct to claim that a MCE which hits user space and ends up in
>>> mce_panic() is still just a regular exception or should we upgrade to
>>> NMI class context when we enter mce_panic() or even go as far to
>>> upgrade to NMI class context for any panic() invocation?
>>
After MCE has occurred, it is possible for the MCE handler to execute
the add_taint() function without panic. For example, the fake_panic is
configured.
So the above patch method does not seem to be able to cover the printk
deadlock caused by the add_taint() function in the MCE handler when a
MCE occurs in user space.
Sincerely
TonyWWang-oc
>> I do think that an NMI in user space should be considered mostly just
>> a normal exception. From a kernel perspective, the NMI'ness just
>> doesn't matter.
>
> That's correct. I don't want to change that at all especially not for
> recoverable MCEs.
>
>> That said, I find your suggestion of making 'panic()' just basically
>> act as an NMI context intriguing. And cleaner than the
>> atomic_read(&panic_cpu) thing.
>>
>> Are there any other situations than this odd HPET thing where that
>> would change semantics?
>
> I need to go and stare at this some more.
>
> Thanks,
>
> tglx
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] x86/hpet: Read HPET directly if panic in progress
2024-06-05 6:23 ` Tony W Wang-oc
@ 2024-06-05 9:36 ` Borislav Petkov
2024-06-05 10:10 ` Tony W Wang-oc
0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2024-06-05 9:36 UTC (permalink / raw)
To: Tony W Wang-oc
Cc: Thomas Gleixner, Linus Torvalds, Dave Hansen, mingo, dave.hansen,
x86, hpa, keescook, tony.luck, gpiccoli, mat.jonczyk, rdunlap,
alexandre.belloni, mario.limonciello, yaolu, bhelgaas,
justinstitt, linux-kernel, linux-hardening, CobeChen, TimGuo,
LeoLiu-oc
On Wed, Jun 05, 2024 at 02:23:32PM +0800, Tony W Wang-oc wrote:
> After MCE has occurred, it is possible for the MCE handler to execute the
> add_taint() function without panic. For example, the fake_panic is
> configured.
fake_panic is an ancient debugging leftover. If you set it, you get what
you deserve.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH] x86/hpet: Read HPET directly if panic in progress
2024-06-05 9:36 ` Borislav Petkov
@ 2024-06-05 10:10 ` Tony W Wang-oc
2024-06-05 11:33 ` Borislav Petkov
0 siblings, 1 reply; 19+ messages in thread
From: Tony W Wang-oc @ 2024-06-05 10:10 UTC (permalink / raw)
To: Borislav Petkov
Cc: Thomas Gleixner, Linus Torvalds, Dave Hansen, mingo, dave.hansen,
x86, hpa, keescook, tony.luck, gpiccoli, mat.jonczyk, rdunlap,
alexandre.belloni, mario.limonciello, yaolu, bhelgaas,
justinstitt, linux-kernel, linux-hardening, CobeChen, TimGuo,
LeoLiu-oc
On 2024/6/5 17:36, Borislav Petkov wrote:
>
>
> [这封邮件来自外部发件人 谨防风险]
>
> On Wed, Jun 05, 2024 at 02:23:32PM +0800, Tony W Wang-oc wrote:
>> After MCE has occurred, it is possible for the MCE handler to execute the
>> add_taint() function without panic. For example, the fake_panic is
>> configured.
>
> fake_panic is an ancient debugging leftover. If you set it, you get what
> you deserve.
>
It may also happen without setting fake_panic, such as an MCE error of
the UCNA/SRAO type occurred?
Sincerely
TonyWWang-oc
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] x86/hpet: Read HPET directly if panic in progress
2024-06-05 10:10 ` Tony W Wang-oc
@ 2024-06-05 11:33 ` Borislav Petkov
2024-06-05 12:02 ` Tony W Wang-oc
0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2024-06-05 11:33 UTC (permalink / raw)
To: Tony W Wang-oc
Cc: Thomas Gleixner, Linus Torvalds, Dave Hansen, mingo, dave.hansen,
x86, hpa, keescook, tony.luck, gpiccoli, mat.jonczyk, rdunlap,
alexandre.belloni, mario.limonciello, yaolu, bhelgaas,
justinstitt, linux-kernel, linux-hardening, CobeChen, TimGuo,
LeoLiu-oc
On Wed, Jun 05, 2024 at 06:10:07PM +0800, Tony W Wang-oc wrote:
> It may also happen without setting fake_panic, such as an MCE error of the
> UCNA/SRAO type occurred?
Which types exactly do you mean when you're looking at the severities[]
array in severity.c?
And what scenario are you talking about?
To get an #MC exception and detect only UCNA/SRAO errors? Can that even
happen on any hardware?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH] x86/hpet: Read HPET directly if panic in progress
2024-06-05 11:33 ` Borislav Petkov
@ 2024-06-05 12:02 ` Tony W Wang-oc
2024-06-05 15:51 ` Luck, Tony
0 siblings, 1 reply; 19+ messages in thread
From: Tony W Wang-oc @ 2024-06-05 12:02 UTC (permalink / raw)
To: Borislav Petkov
Cc: Thomas Gleixner, Linus Torvalds, Dave Hansen, mingo, dave.hansen,
x86, hpa, keescook, tony.luck, gpiccoli, mat.jonczyk, rdunlap,
alexandre.belloni, mario.limonciello, yaolu, bhelgaas,
justinstitt, linux-kernel, linux-hardening, CobeChen, TimGuo,
LeoLiu-oc
On 2024/6/5 19:33, Borislav Petkov wrote:
>
>
> [这封邮件来自外部发件人 谨防风险]
>
> On Wed, Jun 05, 2024 at 06:10:07PM +0800, Tony W Wang-oc wrote:
>> It may also happen without setting fake_panic, such as an MCE error of the
>> UCNA/SRAO type occurred?
>
> Which types exactly do you mean when you're looking at the severities[]
> array in severity.c?
>
> And what scenario are you talking about?
>
> To get an #MC exception and detect only UCNA/SRAO errors? Can that even
> happen on any hardware?
>
Yes, I mean an #MC exception happened and detect only like SRAO errors
like below:
MCESEV(
AO, "Action optional: memory scrubbing error",
SER, MASK(MCI_UC_AR|MCACOD_SCRUBMSK,
MCI_STATUS_UC|MCACOD_SCRUB)
),
MCESEV(
AO, "Action optional: last level cache writeback error",
SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
),
I think these errors are actually encountered on some platforms that
support these type of errors report to the #MC.
Sincerely
TonyWWang-oc
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 19+ messages in thread* RE: [PATCH] x86/hpet: Read HPET directly if panic in progress
2024-06-05 12:02 ` Tony W Wang-oc
@ 2024-06-05 15:51 ` Luck, Tony
2024-06-06 8:44 ` Tony W Wang-oc
0 siblings, 1 reply; 19+ messages in thread
From: Luck, Tony @ 2024-06-05 15:51 UTC (permalink / raw)
To: Tony W Wang-oc, Borislav Petkov
Cc: Thomas Gleixner, Linus Torvalds, Hansen, Dave, mingo@redhat.com,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
keescook@chromium.org, gpiccoli@igalia.com, mat.jonczyk@o2.pl,
rdunlap@infradead.org, alexandre.belloni@bootlin.com,
mario.limonciello@amd.com, yaolu@kylinos.cn, bhelgaas@google.com,
justinstitt@google.com, linux-kernel@vger.kernel.org,
linux-hardening@vger.kernel.org, CobeChen@zhaoxin.com,
TimGuo@zhaoxin.com, LeoLiu-oc@zhaoxin.com
> > Which types exactly do you mean when you're looking at the severities[]
> > array in severity.c?
> >
> > And what scenario are you talking about?
> >
> > To get an #MC exception and detect only UCNA/SRAO errors? Can that even
> > happen on any hardware?
> >
>
> Yes, I mean an #MC exception happened and detect only like SRAO errors
> like below:
>
> MCESEV(
> AO, "Action optional: memory scrubbing error",
> SER, MASK(MCI_UC_AR|MCACOD_SCRUBMSK,
> MCI_STATUS_UC|MCACOD_SCRUB)
> ),
> MCESEV(
> AO, "Action optional: last level cache writeback error",
> SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
> ),
>
> I think these errors are actually encountered on some platforms that
> support these type of errors report to the #MC.
Intel servers from Nehalem through Cascade Lake reported memory controller
patrol scrub uncorrected error with #MC and SRAO signature.
Icelake and newer use CMCI with a UCNA signature.
-Tony
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] x86/hpet: Read HPET directly if panic in progress
2024-06-05 15:51 ` Luck, Tony
@ 2024-06-06 8:44 ` Tony W Wang-oc
2024-06-06 17:00 ` Luck, Tony
0 siblings, 1 reply; 19+ messages in thread
From: Tony W Wang-oc @ 2024-06-06 8:44 UTC (permalink / raw)
To: Luck, Tony, Borislav Petkov
Cc: Thomas Gleixner, Linus Torvalds, Hansen, Dave, mingo@redhat.com,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
keescook@chromium.org, gpiccoli@igalia.com, mat.jonczyk@o2.pl,
rdunlap@infradead.org, alexandre.belloni@bootlin.com,
mario.limonciello@amd.com, yaolu@kylinos.cn, bhelgaas@google.com,
justinstitt@google.com, linux-kernel@vger.kernel.org,
linux-hardening@vger.kernel.org, CobeChen@zhaoxin.com,
TimGuo@zhaoxin.com, LeoLiu-oc@zhaoxin.com
On 2024/6/5 23:51, Luck, Tony wrote:
>
>
> [这封邮件来自外部发件人 谨防风险]
>
>>> Which types exactly do you mean when you're looking at the severities[]
>>> array in severity.c?
>>>
>>> And what scenario are you talking about?
>>>
>>> To get an #MC exception and detect only UCNA/SRAO errors? Can that even
>>> happen on any hardware?
>>>
>>
>> Yes, I mean an #MC exception happened and detect only like SRAO errors
>> like below:
>>
>> MCESEV(
>> AO, "Action optional: memory scrubbing error",
>> SER, MASK(MCI_UC_AR|MCACOD_SCRUBMSK,
>> MCI_STATUS_UC|MCACOD_SCRUB)
>> ),
>> MCESEV(
>> AO, "Action optional: last level cache writeback error",
>> SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
>> ),
>>
>> I think these errors are actually encountered on some platforms that
>> support these type of errors report to the #MC.
>
> Intel servers from Nehalem through Cascade Lake reported memory controller
> patrol scrub uncorrected error with #MC and SRAO signature.
>
> Icelake and newer use CMCI with a UCNA signature.
>
I have a question, does Intel use #MC to report UCNA errors?
Sincerely
TonyWWang-oc
^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: [PATCH] x86/hpet: Read HPET directly if panic in progress
2024-06-06 8:44 ` Tony W Wang-oc
@ 2024-06-06 17:00 ` Luck, Tony
0 siblings, 0 replies; 19+ messages in thread
From: Luck, Tony @ 2024-06-06 17:00 UTC (permalink / raw)
To: Tony W Wang-oc, Borislav Petkov
Cc: Thomas Gleixner, Linus Torvalds, Hansen, Dave, mingo@redhat.com,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
keescook@chromium.org, gpiccoli@igalia.com, mat.jonczyk@o2.pl,
rdunlap@infradead.org, alexandre.belloni@bootlin.com,
mario.limonciello@amd.com, yaolu@kylinos.cn, bhelgaas@google.com,
justinstitt@google.com, linux-kernel@vger.kernel.org,
linux-hardening@vger.kernel.org, CobeChen@zhaoxin.com,
TimGuo@zhaoxin.com, LeoLiu-oc@zhaoxin.com
>> Icelake and newer use CMCI with a UCNA signature.
>>
>
> I have a question, does Intel use #MC to report UCNA errors?
No. They are reported with CMCI[1] (assuming it is enabled by IA32_MCi_CTL2 bit 30).
-Tony
[1] Usage evolved and naming did not keep up. An "Uncorrected" error is being signaled
using "Corrected Machine Check Interrupt".
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] x86/hpet: Read HPET directly if panic in progress
2024-05-28 22:12 ` Thomas Gleixner
2024-05-28 23:22 ` Linus Torvalds
@ 2024-05-29 4:39 ` Tony W Wang-oc
2024-05-29 6:44 ` Tony W Wang-oc
2024-05-29 7:44 ` Thomas Gleixner
2024-07-01 11:09 ` Tony W Wang-oc
2 siblings, 2 replies; 19+ messages in thread
From: Tony W Wang-oc @ 2024-05-29 4:39 UTC (permalink / raw)
To: Thomas Gleixner, Dave Hansen, mingo, bp, dave.hansen, x86, hpa,
keescook, tony.luck, gpiccoli, mat.jonczyk, rdunlap,
alexandre.belloni, mario.limonciello, yaolu, bhelgaas,
justinstitt, linux-kernel, linux-hardening
Cc: CobeChen, TimGuo, LeoLiu-oc, Linus Torvalds
On 2024/5/29 06:12, Thomas Gleixner wrote:
>
>
> [这封邮件来自外部发件人 谨防风险]
>
> On Tue, May 28 2024 at 07:18, Dave Hansen wrote:
>> On 5/27/24 23:38, Tony W Wang-oc wrote:
>> ...> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>>> index c96ae8fee95e..ecadd0698d6a 100644
>>> --- a/arch/x86/kernel/hpet.c
>>> +++ b/arch/x86/kernel/hpet.c
>>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>>> if (in_nmi())
>>> return (u64)hpet_readl(HPET_COUNTER);
>>>
>>> + /*
>>> + * Read HPET directly if panic in progress.
>>> + */
>>> + if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
>>> + return (u64)hpet_readl(HPET_COUNTER);
>>> +
>>
>> There is literally one other piece of the code in the kernel doing
>> something similar: the printk() implementation. There's no other
>> clocksource or timekeeping code that does this on any architecture.
>>
>> Why doesn't this problem apply to any other clock sources?
>
> I principle it applies to any clocksource which needs a spinlock to
> serialize access. HPET is not the only insanity here.
>
> Think about i8253 :)
>
> Most real clocksources, like TSC and the majority of the preferred clock
> sources on other architectures are perfectly fine. They just read and be
> done.
>
>> Why should the problem be fixed in the clock sources themselves? Why
>> doesn't printk() deadlock on systems using the HPET?
>
> Because regular printk()s are deferred to irq work when in NMI and
> similar contexts, but that obviously does not apply to panic
> situations. Also NMI is treated special even in the HPET code. See
> below.
>
>> In other words, I think we should fix pstore to be more like printk
>> rather than hacking around this in each clock source.
>
> pstore is perfectly fine. It uses a NMI safe time accessor function
> which is then tripping over the HPET lock. That's really a HPET specific
> problem.
>
> Though what I read out of the changelog is that the MCE hits the same
> CPU 'x' which holds the lock. But that's fairy tale material as you can
> see in the patch above:
>
> if (in_nmi())
> return (u64)hpet_readl(HPET_COUNTER);
>
> For that particular case the dead lock, which would actually be a live
> lock, cannot happen because in kernel MCEs are NMI class exceptions and
> therefore in_nmi() evaluates to true and that new voodoo can't be
> reached at all.
>
> Now there are two other scenarios which really can make that happen:
>
> 1) A non-NMI class exception within the lock held region
>
> CPU A
> acquire(hpet_lock);
> ... <- #PF, #GP, #DE, ... -> panic()
>
> If any of that happens within that lock held section then the live
> lock on the hpet_lock is the least of your worries. Seriously, I
> don't care about this at all.
>
> 2) The actual scenario is:
>
> CPU A CPU B
> lock(hpet_lock)
> MCE hits user space
> ...
> exc_machine_check_user()
> irqentry_enter_from_user_mode(regs);
>
> irqentry_enter_from_user_mode() obviously does not mark the
> exception as NMI class, so in_nmi() evaluates to false. That would
> actually dead lock if CPU A is not making progress and releases
> hpet_lock.
>
> Sounds unlikely to happen, right? But in reality it can because of
> MCE broadcast. Assume that both CPUs go into MCE:
>
> CPU A CPU B
> lock(hpet_lock)
> exc_machine_check_user()
> irqentry_enter_from_user_mode();
> exc_machine_check_kernel() do_machine_check()
> irqentry_nmi_enter(); mce_panic()
> do_machine_check() if (atomic_inc_return(&mce_panicked) > 1)
> mce_panic() wait_for_panic(); <- Not taken
>
> if (atomic_inc_return(&mce_panicked) > 1)
> wait_for_panic(); <- Taken
>
> ....
> hpet_read()
>
> -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
> obviously can't release the lock.
>
Because MCE handler will call printk() before call the panic, so
printk() deadlock may happen in this scenario:
CPU A CPU B
printk()
lock(console_owner_lock)
exc_machine_check_user()
irqentry_enter_from_user_mode()
exc_machine_check_kernel() do_machine_check()
irqentry_nmi_enter() mce_panic()
do_machine_check() printk_mce() #A
mce_panic() ...
wait_for_panic() panic()
printk deadlock will happened at #A because in_nmi() evaluates to false
on CPU B and CPU B do not enter the panic() AT #A.
Update user space MCE handler to NMI class context is preferred?
Sincerely
TonyWWang-oc
> So the proposed patch makes sense to some extent. But it only cures the
> symptom. The real underlying questions are:
>
> 1) Should we provide a panic mode read callback for clocksources which
> are affected by this?
>
> 2) Is it correct to claim that a MCE which hits user space and ends up in
> mce_panic() is still just a regular exception or should we upgrade to
> NMI class context when we enter mce_panic() or even go as far to
> upgrade to NMI class context for any panic() invocation?
>
> #1 Solves it at the clocksource level. It still needs HPET specific
> changes.
>
> #2 Solves a whole class of issues
>
> ... while potentially introducing new ones :)
>
> To me upgrading any panic() invocation to NMI class context makes a
> lot of sense because in that case all bets are off.
>
> in_nmi() is used in quite some places to avoid such problems. IOW,
> that would kill a whole class of issues instead of "curing" the HPET
> problem locally for the price of an extra conditional. Not that the
> extra conditional matters much if HPET is the clocksource as that's
> awfully slow anyway and I really don't care about that.
>
> But I very much care about avoiding to sprinkle panic_cpu checks all
> over the place.
>
> Thanks,
>
> tglx
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH] x86/hpet: Read HPET directly if panic in progress
2024-05-29 4:39 ` Tony W Wang-oc
@ 2024-05-29 6:44 ` Tony W Wang-oc
2024-05-29 7:45 ` Thomas Gleixner
2024-05-29 7:44 ` Thomas Gleixner
1 sibling, 1 reply; 19+ messages in thread
From: Tony W Wang-oc @ 2024-05-29 6:44 UTC (permalink / raw)
To: Thomas Gleixner, Dave Hansen, mingo, bp, dave.hansen, x86, hpa,
keescook, tony.luck, gpiccoli, mat.jonczyk, rdunlap,
alexandre.belloni, mario.limonciello, yaolu, bhelgaas,
justinstitt, linux-kernel, linux-hardening
Cc: CobeChen, TimGuo, LeoLiu-oc, Linus Torvalds
On 2024/5/29 12:39, Tony W Wang-oc wrote:
>
>
> On 2024/5/29 06:12, Thomas Gleixner wrote:
>>
>>
>> [这封邮件来自外部发件人 谨防风险]
>>
>> On Tue, May 28 2024 at 07:18, Dave Hansen wrote:
>>> On 5/27/24 23:38, Tony W Wang-oc wrote:
>>> ...> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>>>> index c96ae8fee95e..ecadd0698d6a 100644
>>>> --- a/arch/x86/kernel/hpet.c
>>>> +++ b/arch/x86/kernel/hpet.c
>>>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>>>> if (in_nmi())
>>>> return (u64)hpet_readl(HPET_COUNTER);
>>>>
>>>> + /*
>>>> + * Read HPET directly if panic in progress.
>>>> + */
>>>> + if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
>>>> + return (u64)hpet_readl(HPET_COUNTER);
>>>> +
>>>
>>> There is literally one other piece of the code in the kernel doing
>>> something similar: the printk() implementation. There's no other
>>> clocksource or timekeeping code that does this on any architecture.
>>>
>>> Why doesn't this problem apply to any other clock sources?
>>
>> I principle it applies to any clocksource which needs a spinlock to
>> serialize access. HPET is not the only insanity here.
>>
>> Think about i8253 :)
>>
>> Most real clocksources, like TSC and the majority of the preferred clock
>> sources on other architectures are perfectly fine. They just read and be
>> done.
>>
>>> Why should the problem be fixed in the clock sources themselves? Why
>>> doesn't printk() deadlock on systems using the HPET?
>>
>> Because regular printk()s are deferred to irq work when in NMI and
>> similar contexts, but that obviously does not apply to panic
>> situations. Also NMI is treated special even in the HPET code. See
>> below.
>>
>>> In other words, I think we should fix pstore to be more like printk
>>> rather than hacking around this in each clock source.
>>
>> pstore is perfectly fine. It uses a NMI safe time accessor function
>> which is then tripping over the HPET lock. That's really a HPET specific
>> problem.
>>
>> Though what I read out of the changelog is that the MCE hits the same
>> CPU 'x' which holds the lock. But that's fairy tale material as you can
>> see in the patch above:
>>
>> if (in_nmi())
>> return (u64)hpet_readl(HPET_COUNTER);
>>
>> For that particular case the dead lock, which would actually be a live
>> lock, cannot happen because in kernel MCEs are NMI class exceptions and
>> therefore in_nmi() evaluates to true and that new voodoo can't be
>> reached at all.
>>
>> Now there are two other scenarios which really can make that happen:
>>
>> 1) A non-NMI class exception within the lock held region
>>
>> CPU A
>> acquire(hpet_lock);
>> ... <- #PF, #GP, #DE, ... -> panic()
>>
>> If any of that happens within that lock held section then the live
>> lock on the hpet_lock is the least of your worries. Seriously, I
>> don't care about this at all.
>>
Actually, this scenario is what this patch is trying to solve.
We encountered hpet_lock deadlock from the call path of the MCE handler,
and this hpet_lock deadlock scenario may happen when others exceptions'
handler like #PF/#GP... to call the panic. So read_hpet should avoid
deadlock if panic in progress.
Sincerely
TonyWWang-oc
>> 2) The actual scenario is:
>>
>> CPU A CPU B
>> lock(hpet_lock)
>> MCE hits user space
>> ...
>> exc_machine_check_user()
>> irqentry_enter_from_user_mode(regs);
>>
>> irqentry_enter_from_user_mode() obviously does not mark the
>> exception as NMI class, so in_nmi() evaluates to false. That would
>> actually dead lock if CPU A is not making progress and releases
>> hpet_lock.
>>
>> Sounds unlikely to happen, right? But in reality it can because of
>> MCE broadcast. Assume that both CPUs go into MCE:
>>
>> CPU A CPU B
>> lock(hpet_lock)
>> exc_machine_check_user()
>> irqentry_enter_from_user_mode();
>> exc_machine_check_kernel() do_machine_check()
>> irqentry_nmi_enter(); mce_panic()
>> do_machine_check() if
>> (atomic_inc_return(&mce_panicked) > 1)
>> mce_panic() wait_for_panic(); <- Not taken
>>
>> if (atomic_inc_return(&mce_panicked) > 1)
>> wait_for_panic(); <- Taken
>>
>> ....
>> hpet_read()
>>
>> -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
>> obviously can't release the lock.
>>
>
> Because MCE handler will call printk() before call the panic, so
> printk() deadlock may happen in this scenario:
>
> CPU A CPU B
> printk()
> lock(console_owner_lock)
> exc_machine_check_user()
> irqentry_enter_from_user_mode()
> exc_machine_check_kernel() do_machine_check()
> irqentry_nmi_enter() mce_panic()
> do_machine_check() printk_mce() #A
> mce_panic() ...
> wait_for_panic() panic()
>
> printk deadlock will happened at #A because in_nmi() evaluates to false
> on CPU B and CPU B do not enter the panic() AT #A.
>
> Update user space MCE handler to NMI class context is preferred?
>
> Sincerely
> TonyWWang-oc
>
>> So the proposed patch makes sense to some extent. But it only cures the
>> symptom. The real underlying questions are:
>>
>> 1) Should we provide a panic mode read callback for clocksources which
>> are affected by this?
>>
>> 2) Is it correct to claim that a MCE which hits user space and ends
>> up in
>> mce_panic() is still just a regular exception or should we
>> upgrade to
>> NMI class context when we enter mce_panic() or even go as far to
>> upgrade to NMI class context for any panic() invocation?
>>
>> #1 Solves it at the clocksource level. It still needs HPET specific
>> changes.
>>
>> #2 Solves a whole class of issues
>>
>> ... while potentially introducing new ones :)
>>
>> To me upgrading any panic() invocation to NMI class context makes a
>> lot of sense because in that case all bets are off.
>>
>> in_nmi() is used in quite some places to avoid such problems. IOW,
>> that would kill a whole class of issues instead of "curing" the HPET
>> problem locally for the price of an extra conditional. Not that the
>> extra conditional matters much if HPET is the clocksource as that's
>> awfully slow anyway and I really don't care about that.
>>
>> But I very much care about avoiding to sprinkle panic_cpu checks all
>> over the place.
>>
>> Thanks,
>>
>> tglx
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] x86/hpet: Read HPET directly if panic in progress
2024-05-29 6:44 ` Tony W Wang-oc
@ 2024-05-29 7:45 ` Thomas Gleixner
2024-05-29 8:15 ` Tony W Wang-oc
0 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2024-05-29 7:45 UTC (permalink / raw)
To: Tony W Wang-oc, Dave Hansen, mingo, bp, dave.hansen, x86, hpa,
keescook, tony.luck, gpiccoli, mat.jonczyk, rdunlap,
alexandre.belloni, mario.limonciello, yaolu, bhelgaas,
justinstitt, linux-kernel, linux-hardening
Cc: CobeChen, TimGuo, LeoLiu-oc, Linus Torvalds
On Wed, May 29 2024 at 14:44, Tony W Wang-oc wrote:
> Actually, this scenario is what this patch is trying to solve.
>
> We encountered hpet_lock deadlock from the call path of the MCE handler,
> and this hpet_lock deadlock scenario may happen when others exceptions'
> handler like #PF/#GP... to call the panic. So read_hpet should avoid
> deadlock if panic in progress.
That's not what your changelog says.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] x86/hpet: Read HPET directly if panic in progress
2024-05-29 7:45 ` Thomas Gleixner
@ 2024-05-29 8:15 ` Tony W Wang-oc
0 siblings, 0 replies; 19+ messages in thread
From: Tony W Wang-oc @ 2024-05-29 8:15 UTC (permalink / raw)
To: Thomas Gleixner, Dave Hansen, mingo, bp, dave.hansen, x86, hpa,
keescook, tony.luck, gpiccoli, mat.jonczyk, rdunlap,
alexandre.belloni, mario.limonciello, yaolu, bhelgaas,
justinstitt, linux-kernel, linux-hardening
Cc: CobeChen, TimGuo, LeoLiu-oc, Linus Torvalds
On 2024/5/29 15:45, Thomas Gleixner wrote:
>
>
> [这封邮件来自外部发件人 谨防风险]
>
> On Wed, May 29 2024 at 14:44, Tony W Wang-oc wrote:
>> Actually, this scenario is what this patch is trying to solve.
>>
>> We encountered hpet_lock deadlock from the call path of the MCE handler,
>> and this hpet_lock deadlock scenario may happen when others exceptions'
>> handler like #PF/#GP... to call the panic. So read_hpet should avoid
>> deadlock if panic in progress.
>
> That's not what your changelog says.
Yes.
The example flow I gave with the MCE handler in my changelog is misleading.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] x86/hpet: Read HPET directly if panic in progress
2024-05-29 4:39 ` Tony W Wang-oc
2024-05-29 6:44 ` Tony W Wang-oc
@ 2024-05-29 7:44 ` Thomas Gleixner
1 sibling, 0 replies; 19+ messages in thread
From: Thomas Gleixner @ 2024-05-29 7:44 UTC (permalink / raw)
To: Tony W Wang-oc, Dave Hansen, mingo, bp, dave.hansen, x86, hpa,
keescook, tony.luck, gpiccoli, mat.jonczyk, rdunlap,
alexandre.belloni, mario.limonciello, yaolu, bhelgaas,
justinstitt, linux-kernel, linux-hardening
Cc: CobeChen, TimGuo, LeoLiu-oc, Linus Torvalds
On Wed, May 29 2024 at 12:39, Tony W Wang-oc wrote:
> printk deadlock will happened at #A because in_nmi() evaluates to false
> on CPU B and CPU B do not enter the panic() AT #A.
>
> Update user space MCE handler to NMI class context is preferred?
No.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] x86/hpet: Read HPET directly if panic in progress
2024-05-28 22:12 ` Thomas Gleixner
2024-05-28 23:22 ` Linus Torvalds
2024-05-29 4:39 ` Tony W Wang-oc
@ 2024-07-01 11:09 ` Tony W Wang-oc
2 siblings, 0 replies; 19+ messages in thread
From: Tony W Wang-oc @ 2024-07-01 11:09 UTC (permalink / raw)
To: Thomas Gleixner, Dave Hansen, mingo, bp, dave.hansen, x86, hpa,
keescook, tony.luck, gpiccoli, mat.jonczyk, rdunlap,
alexandre.belloni, mario.limonciello, yaolu, bhelgaas,
justinstitt, linux-kernel, linux-hardening
Cc: CobeChen, TimGuo, LeoLiu-oc, Linus Torvalds
On 2024/5/29 06:12, Thomas Gleixner wrote:
>
>
> [这封邮件来自外部发件人 谨防风险]
>
> On Tue, May 28 2024 at 07:18, Dave Hansen wrote:
>> On 5/27/24 23:38, Tony W Wang-oc wrote:
>> ...> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>>> index c96ae8fee95e..ecadd0698d6a 100644
>>> --- a/arch/x86/kernel/hpet.c
>>> +++ b/arch/x86/kernel/hpet.c
>>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>>> if (in_nmi())
>>> return (u64)hpet_readl(HPET_COUNTER);
>>>
>>> + /*
>>> + * Read HPET directly if panic in progress.
>>> + */
>>> + if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
>>> + return (u64)hpet_readl(HPET_COUNTER);
>>> +
>>
>> There is literally one other piece of the code in the kernel doing
>> something similar: the printk() implementation. There's no other
>> clocksource or timekeeping code that does this on any architecture.
>>
>> Why doesn't this problem apply to any other clock sources?
>
> I principle it applies to any clocksource which needs a spinlock to
> serialize access. HPET is not the only insanity here.
>
> Think about i8253 :)
>
> Most real clocksources, like TSC and the majority of the preferred clock
> sources on other architectures are perfectly fine. They just read and be
> done.
>
>> Why should the problem be fixed in the clock sources themselves? Why
>> doesn't printk() deadlock on systems using the HPET?
>
> Because regular printk()s are deferred to irq work when in NMI and
> similar contexts, but that obviously does not apply to panic
> situations. Also NMI is treated special even in the HPET code. See
> below.
>
>> In other words, I think we should fix pstore to be more like printk
>> rather than hacking around this in each clock source.
>
> pstore is perfectly fine. It uses a NMI safe time accessor function
> which is then tripping over the HPET lock. That's really a HPET specific
> problem.
>
> Though what I read out of the changelog is that the MCE hits the same
> CPU 'x' which holds the lock. But that's fairy tale material as you can
> see in the patch above:
>
> if (in_nmi())
> return (u64)hpet_readl(HPET_COUNTER);
>
> For that particular case the dead lock, which would actually be a live
> lock, cannot happen because in kernel MCEs are NMI class exceptions and
> therefore in_nmi() evaluates to true and that new voodoo can't be
> reached at all.
>
> Now there are two other scenarios which really can make that happen:
>
> 1) A non-NMI class exception within the lock held region
>
> CPU A
> acquire(hpet_lock);
> ... <- #PF, #GP, #DE, ... -> panic()
>
> If any of that happens within that lock held section then the live
> lock on the hpet_lock is the least of your worries. Seriously, I
> don't care about this at all.
>
> 2) The actual scenario is:
>
> CPU A CPU B
> lock(hpet_lock)
> MCE hits user space
> ...
> exc_machine_check_user()
> irqentry_enter_from_user_mode(regs);
>
> irqentry_enter_from_user_mode() obviously does not mark the
> exception as NMI class, so in_nmi() evaluates to false. That would
> actually dead lock if CPU A is not making progress and releases
> hpet_lock.
>
> Sounds unlikely to happen, right? But in reality it can because of
> MCE broadcast. Assume that both CPUs go into MCE:
>
> CPU A CPU B
> lock(hpet_lock)
> exc_machine_check_user()
> irqentry_enter_from_user_mode();
> exc_machine_check_kernel() do_machine_check()
> irqentry_nmi_enter(); mce_panic()
> do_machine_check() if (atomic_inc_return(&mce_panicked) > 1)
> mce_panic() wait_for_panic(); <- Not taken
>
> if (atomic_inc_return(&mce_panicked) > 1)
> wait_for_panic(); <- Taken
>
> ....
> hpet_read()
>
> -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
> obviously can't release the lock.
>
For this scenario, an experiment was designed for the printk:
a, Install a driver module that repeatedly sending IPIs to multiple
cores to executes printk.
b, Run a user-level testing tool like stream on all cores.
c, Trigger a MCE hardware error.
During burnin tests a-c, reproduce the following case:
CPU A CPU B
printk()
console_owner
exc_machine_check_user()
irqentry_enter_from_user_mode()
exc_machine_check_kernel() do_machine_check()
irqentry_nmi_enter() mce_panic()
do_machine_check() print_mce()
...
... while(console_waiter)
cpu_relax(); <- deadloop
mce_timed_out() <-timeout
wait_for_panic()
panic("Panicing machine check CPU died");
In this case CPU B is the monarch CPU in MCE handler, CPU B waiting to
be the console_owner and CPU A can't release the console_owner.
So the monarch CPU B deadloop happened, as a result other CPU witch
waiting the monarch CPU timeout will call the panic function.
This problem is caused by the use of printk in the MCE handler.
Actually, I found the comments for the MCE handler like:
* This is executed in #MC context not subject to normal locking rules.
* This implies that most kernel services cannot be safely used. Don't even
* think about putting a printk in there!
Should consider not using printk in the MCE handler?
Sincerely!
TonyWWang-oc
> So the proposed patch makes sense to some extent. But it only cures the
> symptom. The real underlying questions are:
>
> 1) Should we provide a panic mode read callback for clocksources which
> are affected by this?
>
> 2) Is it correct to claim that a MCE which hits user space and ends up in
> mce_panic() is still just a regular exception or should we upgrade to
> NMI class context when we enter mce_panic() or even go as far to
> upgrade to NMI class context for any panic() invocation?
>
> #1 Solves it at the clocksource level. It still needs HPET specific
> changes.
>
> #2 Solves a whole class of issues
>
> ... while potentially introducing new ones :)
>
> To me upgrading any panic() invocation to NMI class context makes a
> lot of sense because in that case all bets are off.
>
> in_nmi() is used in quite some places to avoid such problems. IOW,
> that would kill a whole class of issues instead of "curing" the HPET
> problem locally for the price of an extra conditional. Not that the
> extra conditional matters much if HPET is the clocksource as that's
> awfully slow anyway and I really don't care about that.
>
> But I very much care about avoiding to sprinkle panic_cpu checks all
> over the place.
>
> Thanks,
>
> tglx
^ permalink raw reply [flat|nested] 19+ messages in thread