* [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce() @ 2025-05-29 3:32 Xiongfeng Wang 2025-05-29 9:45 ` Borislav Petkov 0 siblings, 1 reply; 8+ messages in thread From: Xiongfeng Wang @ 2025-05-29 3:32 UTC (permalink / raw) To: tony.luck, bp; +Cc: x86, linux-edac, wanghai38, bobo.shaobowang, wangxiongfeng2 syzbot reported the following kernel panic: [ 306.335489][ T3298] mce: CPUs not responding to MCE broadcast (may include false positives): 1-3 [ 306.336332][ T3298] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler [ 306.337786][ T3298] Kernel Offset: 0x17400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) The syzkaller test didn't set 'MCJ_IRQ_BROADCAST | MCJ_NMI_BROADCAST' in 'inject_flags', so the MCE will only be injected to the local CPU. But 'MCG_STATUS_LMCES' is not set in 'mcgstatus', so the local CPU will wait others CPUs to enter MCE in mce_start(). But other CPUs were not injected, so the above panic happened. We add sanity check in inject_mce() to fix this issue. Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> --- arch/x86/kernel/cpu/mce/inject.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/arch/x86/kernel/cpu/mce/inject.c b/arch/x86/kernel/cpu/mce/inject.c index 06e3cf7229ce..4af98f0e191e 100644 --- a/arch/x86/kernel/cpu/mce/inject.c +++ b/arch/x86/kernel/cpu/mce/inject.c @@ -127,6 +127,13 @@ static void inject_mce(struct mce *m) { struct mce *i = &per_cpu(injectm, m->extcpu); + /* do some sanity checks */ + if (!(m->inject_flags & (MCJ_IRQ_BROADCAST | MCJ_NMI_BROADCAST))) { + if (m->cpuvendor == X86_VENDOR_INTEL || + m->cpuvendor == X86_VENDOR_ZHAOXIN) + m->mcgstatus |= MCG_STATUS_LMCES; + } + /* Make sure no one reads partially written injectm */ i->finished = 0; mb(); -- 2.20.1 ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce() 2025-05-29 3:32 [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce() Xiongfeng Wang @ 2025-05-29 9:45 ` Borislav Petkov 2025-05-31 8:14 ` Ingo Molnar 0 siblings, 1 reply; 8+ messages in thread From: Borislav Petkov @ 2025-05-29 9:45 UTC (permalink / raw) To: Xiongfeng Wang; +Cc: tony.luck, x86, linux-edac, wanghai38, bobo.shaobowang On Thu, May 29, 2025 at 11:32:56AM +0800, Xiongfeng Wang wrote: > syzbot reported the following kernel panic: > > [ 306.335489][ T3298] mce: CPUs not responding to MCE broadcast (may include false positives): 1-3 > [ 306.336332][ T3298] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler > [ 306.337786][ T3298] Kernel Offset: 0x17400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) > > The syzkaller test didn't set 'MCJ_IRQ_BROADCAST | MCJ_NMI_BROADCAST' in That's because you're not supposed to inject MCEs with syzkaller doing rando poking at the interface but someone with a brain should use it. And someone with a brain might realize that injecting a wrong error is also a test. Please point your syzkaller somewhere else. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce() 2025-05-29 9:45 ` Borislav Petkov @ 2025-05-31 8:14 ` Ingo Molnar 2025-05-31 9:17 ` Borislav Petkov 0 siblings, 1 reply; 8+ messages in thread From: Ingo Molnar @ 2025-05-31 8:14 UTC (permalink / raw) To: Borislav Petkov Cc: Xiongfeng Wang, tony.luck, x86, linux-edac, wanghai38, bobo.shaobowang * Borislav Petkov <bp@alien8.de> wrote: > On Thu, May 29, 2025 at 11:32:56AM +0800, Xiongfeng Wang wrote: > > syzbot reported the following kernel panic: > > > > [ 306.335489][ T3298] mce: CPUs not responding to MCE broadcast (may include false positives): 1-3 > > [ 306.336332][ T3298] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler > > [ 306.337786][ T3298] Kernel Offset: 0x17400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) > > > > The syzkaller test didn't set 'MCJ_IRQ_BROADCAST | MCJ_NMI_BROADCAST' in > > That's because you're not supposed to inject MCEs with syzkaller > doing rando poking at the interface but someone with a brain should > use it. Uhm, avoiding a hard kernel crash in case of a tooling or human error: [ 306.336332][ T3298] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler ... is in fact a proper interface robustness fix, especially if it's so incredibly simple as in this case. (The comment might need a bit of work but I digress.) Injection can disrupt the kernel, but it shouldn't be able to trivially crash it in this case. Improving MCE injection robustness might also help syzbot uncover *other* kernel bugs, because it can continue testing via the injection framework. But your proposed solution: > Please point your syzkaller somewhere else. ... is how real kernel bugs might be hidden ... Thanks, Ingo ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce() 2025-05-31 8:14 ` Ingo Molnar @ 2025-05-31 9:17 ` Borislav Petkov 2025-05-31 18:04 ` Ingo Molnar 0 siblings, 1 reply; 8+ messages in thread From: Borislav Petkov @ 2025-05-31 9:17 UTC (permalink / raw) To: Ingo Molnar Cc: Xiongfeng Wang, tony.luck, x86, linux-edac, wanghai38, bobo.shaobowang On Sat, May 31, 2025 at 10:14:41AM +0200, Ingo Molnar wrote: > Uhm, avoiding a hard kernel crash in case of a tooling or human error: Do you know what this tool is about? It is supposed to cause crashes, among others. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce() 2025-05-31 9:17 ` Borislav Petkov @ 2025-05-31 18:04 ` Ingo Molnar 2025-05-31 19:07 ` Borislav Petkov 0 siblings, 1 reply; 8+ messages in thread From: Ingo Molnar @ 2025-05-31 18:04 UTC (permalink / raw) To: Borislav Petkov Cc: Xiongfeng Wang, tony.luck, x86, linux-edac, wanghai38, bobo.shaobowang * Borislav Petkov <bp@alien8.de> wrote: > On Sat, May 31, 2025 at 10:14:41AM +0200, Ingo Molnar wrote: > > Uhm, avoiding a hard kernel crash in case of a tooling or human error: > > Do you know what this tool is about? > > It is supposed to cause crashes, among others. What tool? If you mean syzkaller, then it is very much not supposed to crash on a correctly functioning kernel. If it causes a crash, then the proper response is to fix the crash. Or if you mean the MCE-injection interface, that's not supposed to trigger avoidable crashes either, it's an injection facility for testing purposes: Provide support for injecting machine checks for testing purposes. It's really simple really: - If the kernel unnecessarily locks up on the receipt of a hardware-generated MCE then that's a kernel bug that should be fixed. - If the kernel unnecessarily locks up on the receipt of a software-generated MCE then that's a kernel bug that should be fixed. TL;DR, this is not an acceptable kernel response: > > [ 306.335489][ T3298] mce: CPUs not responding to MCE broadcast (may include false positives): 1-3 > > [ 306.336332][ T3298] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler > > [ 306.337786][ T3298] Kernel Offset: 0x17400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) Thanks, Ingo ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce() 2025-05-31 18:04 ` Ingo Molnar @ 2025-05-31 19:07 ` Borislav Petkov 2025-05-31 22:00 ` Luck, Tony 0 siblings, 1 reply; 8+ messages in thread From: Borislav Petkov @ 2025-05-31 19:07 UTC (permalink / raw) To: Ingo Molnar Cc: Xiongfeng Wang, tony.luck, x86, linux-edac, wanghai38, bobo.shaobowang On Sat, May 31, 2025 at 08:04:58PM +0200, Ingo Molnar wrote: > What tool? mce-inject. The module. > Or if you mean the MCE-injection interface, that's not supposed to > trigger avoidable crashes either, it's an injection facility for > testing purposes: That's an injection facility to TEST THE MCE PANIC PATHS too. YES, IT VERY MUCH SHOULD TRIGGER CRASHES! That is actually a feature! I have fixed bugs in the MCE handler exactly because an MCE signature gets injected unfettered. > It's really simple really: > > - If the kernel unnecessarily locks up on the receipt of a > hardware-generated MCE then that's a kernel bug that > should be fixed. This is not a hw generated MCE - this is a user-generated MCE signature. I think you're confusing things here. > - If the kernel unnecessarily locks up on the receipt of a > software-generated MCE then that's a kernel bug that > should be fixed. No, it isn't. I can inject a MCE which will lock up the whole machine. And that's a valid MCE which can also be raised by hw. > TL;DR, this is not an acceptable kernel response: It is very much a valid kernel response. The MCE injection module allows for testing the MCE panic path with all that is involved in it, including the machine dying. Yes, because MCE is special and it can and *should* cause panics. So for patches like this one which is masssaging the MCE - and which is also wrong for other reasons - not every Intel uarch supports local MCEs: NAK! -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce() 2025-05-31 19:07 ` Borislav Petkov @ 2025-05-31 22:00 ` Luck, Tony 2025-06-11 17:41 ` Yazen Ghannam 0 siblings, 1 reply; 8+ messages in thread From: Luck, Tony @ 2025-05-31 22:00 UTC (permalink / raw) To: Borislav Petkov Cc: Ingo Molnar, Xiongfeng Wang, x86, linux-edac, wanghai38, bobo.shaobowang On Sat, May 31, 2025 at 09:07:24PM +0200, Borislav Petkov wrote: > On Sat, May 31, 2025 at 08:04:58PM +0200, Ingo Molnar wrote: > > What tool? > > mce-inject. The module. > > > Or if you mean the MCE-injection interface, that's not supposed to > > trigger avoidable crashes either, it's an injection facility for > > testing purposes: > > That's an injection facility to TEST THE MCE PANIC PATHS too. YES, IT VERY MUCH > SHOULD TRIGGER CRASHES! That is actually a feature! > > I have fixed bugs in the MCE handler exactly because an MCE signature gets > injected unfettered. > > > It's really simple really: > > > > - If the kernel unnecessarily locks up on the receipt of a > > hardware-generated MCE then that's a kernel bug that > > should be fixed. > > This is not a hw generated MCE - this is a user-generated MCE signature. > I think you're confusing things here. > > > - If the kernel unnecessarily locks up on the receipt of a > > software-generated MCE then that's a kernel bug that > > should be fixed. > > No, it isn't. I can inject a MCE which will lock up the whole machine. And > that's a valid MCE which can also be raised by hw. > > > TL;DR, this is not an acceptable kernel response: > > It is very much a valid kernel response. The MCE injection module allows for > testing the MCE panic path with all that is involved in it, including the > machine dying. > > Yes, because MCE is special and it can and *should* cause panics. > > So for patches like this one which is masssaging the MCE - and which is also > wrong for other reasons - not every Intel uarch supports local MCEs: > > NAK! I'm solidly with Boris on this one. I don't want mce-inject to sanity check, or second guess and "fix" the signature that I requested. I want it to do what I asked it to do. 1) Sometimes I want to force a crash from a particular signature because I want to check that Linux accurately logs the signature of the error that I injected. 2) Sometimes I want to inject "impossible" signatures that can't happen on any CPU today, but I know will be a recoverable error on a future processor. That's how much of the early machine check recovery code was developed and tested. -Tony ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce() 2025-05-31 22:00 ` Luck, Tony @ 2025-06-11 17:41 ` Yazen Ghannam 0 siblings, 0 replies; 8+ messages in thread From: Yazen Ghannam @ 2025-06-11 17:41 UTC (permalink / raw) To: Luck, Tony Cc: Borislav Petkov, Ingo Molnar, Xiongfeng Wang, x86, linux-edac, wanghai38, bobo.shaobowang On Sat, May 31, 2025 at 03:00:14PM -0700, Luck, Tony wrote: > On Sat, May 31, 2025 at 09:07:24PM +0200, Borislav Petkov wrote: > > On Sat, May 31, 2025 at 08:04:58PM +0200, Ingo Molnar wrote: > > > What tool? > > > > mce-inject. The module. > > > > > Or if you mean the MCE-injection interface, that's not supposed to > > > trigger avoidable crashes either, it's an injection facility for > > > testing purposes: > > > > That's an injection facility to TEST THE MCE PANIC PATHS too. YES, IT VERY MUCH > > SHOULD TRIGGER CRASHES! That is actually a feature! > > > > I have fixed bugs in the MCE handler exactly because an MCE signature gets > > injected unfettered. > > > > > It's really simple really: > > > > > > - If the kernel unnecessarily locks up on the receipt of a > > > hardware-generated MCE then that's a kernel bug that > > > should be fixed. > > > > This is not a hw generated MCE - this is a user-generated MCE signature. > > I think you're confusing things here. > > > > > - If the kernel unnecessarily locks up on the receipt of a > > > software-generated MCE then that's a kernel bug that > > > should be fixed. > > > > No, it isn't. I can inject a MCE which will lock up the whole machine. And > > that's a valid MCE which can also be raised by hw. > > > > > TL;DR, this is not an acceptable kernel response: > > > > It is very much a valid kernel response. The MCE injection module allows for > > testing the MCE panic path with all that is involved in it, including the > > machine dying. > > > > Yes, because MCE is special and it can and *should* cause panics. > > > > So for patches like this one which is masssaging the MCE - and which is also > > wrong for other reasons - not every Intel uarch supports local MCEs: > > > > NAK! > > I'm solidly with Boris on this one. I don't want mce-inject to sanity > check, or second guess and "fix" the signature that I requested. I want > it to do what I asked it to do. > > 1) Sometimes I want to force a crash from a particular signature because > I want to check that Linux accurately logs the signature of the error > that I injected. > > 2) Sometimes I want to inject "impossible" signatures that can't happen > on any CPU today, but I know will be a recoverable error on a future > processor. That's how much of the early machine check recovery code was > developed and tested. > Additionally, we have a "fake_panic" parameter/debugfs entry that lets us test the panic path without truly bringing down the system. Thanks, Yazen ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2025-06-11 17:41 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-05-29 3:32 [RFC PATCH] x86/mce/inject: Add sanity check in inject_mce() Xiongfeng Wang 2025-05-29 9:45 ` Borislav Petkov 2025-05-31 8:14 ` Ingo Molnar 2025-05-31 9:17 ` Borislav Petkov 2025-05-31 18:04 ` Ingo Molnar 2025-05-31 19:07 ` Borislav Petkov 2025-05-31 22:00 ` Luck, Tony 2025-06-11 17:41 ` Yazen Ghannam
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox