public inbox for linux-edac@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce()
@ 2025-05-29  3:32 Xiongfeng Wang
  2025-05-29  9:45 ` Borislav Petkov
  0 siblings, 1 reply; 8+ messages in thread
From: Xiongfeng Wang @ 2025-05-29  3:32 UTC (permalink / raw)
  To: tony.luck, bp; +Cc: x86, linux-edac, wanghai38, bobo.shaobowang, wangxiongfeng2

syzbot reported the following kernel panic:

[  306.335489][ T3298] mce: CPUs not responding to MCE broadcast (may include false positives): 1-3
[  306.336332][ T3298] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
[  306.337786][ T3298] Kernel Offset: 0x17400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

The syzkaller test didn't set 'MCJ_IRQ_BROADCAST | MCJ_NMI_BROADCAST' in
'inject_flags', so the MCE will only be injected to the local CPU. But
'MCG_STATUS_LMCES' is not set in 'mcgstatus', so the local CPU will wait
others CPUs to enter MCE in mce_start(). But other CPUs were not
injected, so the above panic happened. We add sanity check in
inject_mce() to fix this issue.

Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
---
 arch/x86/kernel/cpu/mce/inject.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/inject.c b/arch/x86/kernel/cpu/mce/inject.c
index 06e3cf7229ce..4af98f0e191e 100644
--- a/arch/x86/kernel/cpu/mce/inject.c
+++ b/arch/x86/kernel/cpu/mce/inject.c
@@ -127,6 +127,13 @@ static void inject_mce(struct mce *m)
 {
 	struct mce *i = &per_cpu(injectm, m->extcpu);
 
+	/* do some sanity checks */
+	if (!(m->inject_flags & (MCJ_IRQ_BROADCAST | MCJ_NMI_BROADCAST))) {
+		if (m->cpuvendor == X86_VENDOR_INTEL ||
+		    m->cpuvendor == X86_VENDOR_ZHAOXIN)
+			m->mcgstatus |= MCG_STATUS_LMCES;
+	}
+
 	/* Make sure no one reads partially written injectm */
 	i->finished = 0;
 	mb();
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce()
  2025-05-29  3:32 [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce() Xiongfeng Wang
@ 2025-05-29  9:45 ` Borislav Petkov
  2025-05-31  8:14   ` Ingo Molnar
  0 siblings, 1 reply; 8+ messages in thread
From: Borislav Petkov @ 2025-05-29  9:45 UTC (permalink / raw)
  To: Xiongfeng Wang; +Cc: tony.luck, x86, linux-edac, wanghai38, bobo.shaobowang

On Thu, May 29, 2025 at 11:32:56AM +0800, Xiongfeng Wang wrote:
> syzbot reported the following kernel panic:
> 
> [  306.335489][ T3298] mce: CPUs not responding to MCE broadcast (may include false positives): 1-3
> [  306.336332][ T3298] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
> [  306.337786][ T3298] Kernel Offset: 0x17400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> 
> The syzkaller test didn't set 'MCJ_IRQ_BROADCAST | MCJ_NMI_BROADCAST' in

That's because you're not supposed to inject MCEs with syzkaller doing rando
poking at the interface but someone with a brain should use it.

And someone with a brain might realize that injecting a wrong error is also
a test.

Please point your syzkaller somewhere else.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce()
  2025-05-29  9:45 ` Borislav Petkov
@ 2025-05-31  8:14   ` Ingo Molnar
  2025-05-31  9:17     ` Borislav Petkov
  0 siblings, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2025-05-31  8:14 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Xiongfeng Wang, tony.luck, x86, linux-edac, wanghai38,
	bobo.shaobowang


* Borislav Petkov <bp@alien8.de> wrote:

> On Thu, May 29, 2025 at 11:32:56AM +0800, Xiongfeng Wang wrote:
> > syzbot reported the following kernel panic:
> > 
> > [  306.335489][ T3298] mce: CPUs not responding to MCE broadcast (may include false positives): 1-3
> > [  306.336332][ T3298] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
> > [  306.337786][ T3298] Kernel Offset: 0x17400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > 
> > The syzkaller test didn't set 'MCJ_IRQ_BROADCAST | MCJ_NMI_BROADCAST' in
> 
> That's because you're not supposed to inject MCEs with syzkaller 
> doing rando poking at the interface but someone with a brain should 
> use it.

Uhm, avoiding a hard kernel crash in case of a tooling or human error:

  [  306.336332][ T3298] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler

... is in fact a proper interface robustness fix, especially if it's so 
incredibly simple as in this case. (The comment might need a bit of 
work but I digress.)

Injection can disrupt the kernel, but it shouldn't be able to trivially 
crash it in this case.

Improving MCE injection robustness might also help syzbot uncover 
*other* kernel bugs, because it can continue testing via the injection 
framework. But your proposed solution:

> Please point your syzkaller somewhere else.

... is how real kernel bugs might be hidden ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce()
  2025-05-31  8:14   ` Ingo Molnar
@ 2025-05-31  9:17     ` Borislav Petkov
  2025-05-31 18:04       ` Ingo Molnar
  0 siblings, 1 reply; 8+ messages in thread
From: Borislav Petkov @ 2025-05-31  9:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Xiongfeng Wang, tony.luck, x86, linux-edac, wanghai38,
	bobo.shaobowang

On Sat, May 31, 2025 at 10:14:41AM +0200, Ingo Molnar wrote:
> Uhm, avoiding a hard kernel crash in case of a tooling or human error:

Do you know what this tool is about?

It is supposed to cause crashes, among others.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce()
  2025-05-31  9:17     ` Borislav Petkov
@ 2025-05-31 18:04       ` Ingo Molnar
  2025-05-31 19:07         ` Borislav Petkov
  0 siblings, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2025-05-31 18:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Xiongfeng Wang, tony.luck, x86, linux-edac, wanghai38,
	bobo.shaobowang


* Borislav Petkov <bp@alien8.de> wrote:

> On Sat, May 31, 2025 at 10:14:41AM +0200, Ingo Molnar wrote:
> > Uhm, avoiding a hard kernel crash in case of a tooling or human error:
> 
> Do you know what this tool is about?
> 
> It is supposed to cause crashes, among others.

What tool? If you mean syzkaller, then it is very much not supposed to 
crash on a correctly functioning kernel. If it causes a crash, then the 
proper response is to fix the crash.

Or if you mean the MCE-injection interface, that's not supposed to 
trigger avoidable crashes either, it's an injection facility for 
testing purposes:

          Provide support for injecting machine checks for testing purposes.

It's really simple really:

- If the kernel unnecessarily locks up on the receipt of a 
  hardware-generated MCE then that's a kernel bug that
  should be fixed.

- If the kernel unnecessarily locks up on the receipt of a 
  software-generated MCE then that's a kernel bug that
  should be fixed.

TL;DR, this is not an acceptable kernel response:

> > [  306.335489][ T3298] mce: CPUs not responding to MCE broadcast (may include false positives): 1-3
> > [  306.336332][ T3298] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
> > [  306.337786][ T3298] Kernel Offset: 0x17400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce()
  2025-05-31 18:04       ` Ingo Molnar
@ 2025-05-31 19:07         ` Borislav Petkov
  2025-05-31 22:00           ` Luck, Tony
  0 siblings, 1 reply; 8+ messages in thread
From: Borislav Petkov @ 2025-05-31 19:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Xiongfeng Wang, tony.luck, x86, linux-edac, wanghai38,
	bobo.shaobowang

On Sat, May 31, 2025 at 08:04:58PM +0200, Ingo Molnar wrote:
> What tool?

mce-inject. The module.

> Or if you mean the MCE-injection interface, that's not supposed to 
> trigger avoidable crashes either, it's an injection facility for 
> testing purposes:

That's an injection facility to TEST THE MCE PANIC PATHS too. YES, IT VERY MUCH
SHOULD TRIGGER CRASHES! That is actually a feature!

I have fixed bugs in the MCE handler exactly because an MCE signature gets
injected unfettered.

> It's really simple really:
> 
> - If the kernel unnecessarily locks up on the receipt of a 
>   hardware-generated MCE then that's a kernel bug that
>   should be fixed.

This is not a hw generated MCE - this is a user-generated MCE signature.
I think you're confusing things here.

> - If the kernel unnecessarily locks up on the receipt of a 
>   software-generated MCE then that's a kernel bug that
>   should be fixed.

No, it isn't. I can inject a MCE which will lock up the whole machine. And
that's a valid MCE which can also be raised by hw.

> TL;DR, this is not an acceptable kernel response:

It is very much a valid kernel response. The MCE injection module allows for
testing the MCE panic path with all that is involved in it, including the
machine dying.

Yes, because MCE is special and it can and *should* cause panics.

So for patches like this one which is masssaging the MCE - and which is also
wrong for other reasons - not every Intel uarch supports local MCEs:

NAK!

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce()
  2025-05-31 19:07         ` Borislav Petkov
@ 2025-05-31 22:00           ` Luck, Tony
  2025-06-11 17:41             ` Yazen Ghannam
  0 siblings, 1 reply; 8+ messages in thread
From: Luck, Tony @ 2025-05-31 22:00 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ingo Molnar, Xiongfeng Wang, x86, linux-edac, wanghai38,
	bobo.shaobowang

On Sat, May 31, 2025 at 09:07:24PM +0200, Borislav Petkov wrote:
> On Sat, May 31, 2025 at 08:04:58PM +0200, Ingo Molnar wrote:
> > What tool?
> 
> mce-inject. The module.
> 
> > Or if you mean the MCE-injection interface, that's not supposed to 
> > trigger avoidable crashes either, it's an injection facility for 
> > testing purposes:
> 
> That's an injection facility to TEST THE MCE PANIC PATHS too. YES, IT VERY MUCH
> SHOULD TRIGGER CRASHES! That is actually a feature!
> 
> I have fixed bugs in the MCE handler exactly because an MCE signature gets
> injected unfettered.
> 
> > It's really simple really:
> > 
> > - If the kernel unnecessarily locks up on the receipt of a 
> >   hardware-generated MCE then that's a kernel bug that
> >   should be fixed.
> 
> This is not a hw generated MCE - this is a user-generated MCE signature.
> I think you're confusing things here.
> 
> > - If the kernel unnecessarily locks up on the receipt of a 
> >   software-generated MCE then that's a kernel bug that
> >   should be fixed.
> 
> No, it isn't. I can inject a MCE which will lock up the whole machine. And
> that's a valid MCE which can also be raised by hw.
> 
> > TL;DR, this is not an acceptable kernel response:
> 
> It is very much a valid kernel response. The MCE injection module allows for
> testing the MCE panic path with all that is involved in it, including the
> machine dying.
> 
> Yes, because MCE is special and it can and *should* cause panics.
> 
> So for patches like this one which is masssaging the MCE - and which is also
> wrong for other reasons - not every Intel uarch supports local MCEs:
> 
> NAK!

I'm solidly with Boris on this one. I don't want mce-inject to sanity
check, or second guess and "fix" the signature that I requested. I want
it to do what I asked it to do.

1) Sometimes I want to force a crash from a particular signature because
I want to check that Linux accurately logs the signature of the error
that I injected.

2) Sometimes I want to inject "impossible" signatures that can't happen
on any CPU today, but I know will be a recoverable error on a future
processor. That's how much of the early machine check recovery code was
developed and tested.

-Tony

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce()
  2025-05-31 22:00           ` Luck, Tony
@ 2025-06-11 17:41             ` Yazen Ghannam
  0 siblings, 0 replies; 8+ messages in thread
From: Yazen Ghannam @ 2025-06-11 17:41 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, Ingo Molnar, Xiongfeng Wang, x86, linux-edac,
	wanghai38, bobo.shaobowang

On Sat, May 31, 2025 at 03:00:14PM -0700, Luck, Tony wrote:
> On Sat, May 31, 2025 at 09:07:24PM +0200, Borislav Petkov wrote:
> > On Sat, May 31, 2025 at 08:04:58PM +0200, Ingo Molnar wrote:
> > > What tool?
> > 
> > mce-inject. The module.
> > 
> > > Or if you mean the MCE-injection interface, that's not supposed to 
> > > trigger avoidable crashes either, it's an injection facility for 
> > > testing purposes:
> > 
> > That's an injection facility to TEST THE MCE PANIC PATHS too. YES, IT VERY MUCH
> > SHOULD TRIGGER CRASHES! That is actually a feature!
> > 
> > I have fixed bugs in the MCE handler exactly because an MCE signature gets
> > injected unfettered.
> > 
> > > It's really simple really:
> > > 
> > > - If the kernel unnecessarily locks up on the receipt of a 
> > >   hardware-generated MCE then that's a kernel bug that
> > >   should be fixed.
> > 
> > This is not a hw generated MCE - this is a user-generated MCE signature.
> > I think you're confusing things here.
> > 
> > > - If the kernel unnecessarily locks up on the receipt of a 
> > >   software-generated MCE then that's a kernel bug that
> > >   should be fixed.
> > 
> > No, it isn't. I can inject a MCE which will lock up the whole machine. And
> > that's a valid MCE which can also be raised by hw.
> > 
> > > TL;DR, this is not an acceptable kernel response:
> > 
> > It is very much a valid kernel response. The MCE injection module allows for
> > testing the MCE panic path with all that is involved in it, including the
> > machine dying.
> > 
> > Yes, because MCE is special and it can and *should* cause panics.
> > 
> > So for patches like this one which is masssaging the MCE - and which is also
> > wrong for other reasons - not every Intel uarch supports local MCEs:
> > 
> > NAK!
> 
> I'm solidly with Boris on this one. I don't want mce-inject to sanity
> check, or second guess and "fix" the signature that I requested. I want
> it to do what I asked it to do.
> 
> 1) Sometimes I want to force a crash from a particular signature because
> I want to check that Linux accurately logs the signature of the error
> that I injected.
> 
> 2) Sometimes I want to inject "impossible" signatures that can't happen
> on any CPU today, but I know will be a recoverable error on a future
> processor. That's how much of the early machine check recovery code was
> developed and tested.
> 

Additionally, we have a "fake_panic" parameter/debugfs entry that lets
us test the panic path without truly bringing down the system.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-06-11 17:41 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-29  3:32 [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce() Xiongfeng Wang
2025-05-29  9:45 ` Borislav Petkov
2025-05-31  8:14   ` Ingo Molnar
2025-05-31  9:17     ` Borislav Petkov
2025-05-31 18:04       ` Ingo Molnar
2025-05-31 19:07         ` Borislav Petkov
2025-05-31 22:00           ` Luck, Tony
2025-06-11 17:41             ` Yazen Ghannam

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox