public inbox for linux-kernel@vger.kernel.org
* WARNING in lmce_supported() during reboot.
@ 2024-10-25 23:13 Kuniyuki Iwashima
  2024-10-25 23:26 ` Benjamin Herrenschmidt
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Kuniyuki Iwashima @ 2024-10-25 23:13 UTC
  To: x86, linux-edac, linux-kernel
  Cc: Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, H. Peter Anvin, Benjamin Herrenschmidt,
	Kuniyuki Iwashima

Hello x86 maintainers,

We have seen the splat below a few times when just rebooting hosts.

It happens rarely and seems timing-related, so we don't have a
reproducer.

The kernel source for the splat is here:
https://github.com/amazonlinux/linux/tree/kernel-6.1.61-85.141.amzn2023

and the WARN_ON_ONCE() that triggered in lmce_supported() is here:
https://github.com/amazonlinux/linux/blob/kernel-6.1.61-85.141.amzn2023/arch/x86/kernel/cpu/mce/intel.c#L124

Do you have any hints?

Thanks in advance.


ACPI: PM: Preparing to enter system sleep state S5
reboot: Restarting system
reboot: machine restart
------------[ cut here ]------------
WARNING: CPU: 1 PID: 0 at arch/x86/kernel/cpu/mce/intel.c:124 lmce_supported (arch/x86/kernel/cpu/mce/intel.c:124 arch/x86/kernel/cpu/mce/intel.c:99) 
Modules linked in: ib_core binfmt_misc ext4 crc16 mbcache jbd2 sunrpc mousedev atkbd psmouse ghash_clmulni_intel vivaldi_fmap libps2 aesni_intel crypto_simd cryptd i8042 serio ena button sch_fq_codel dm_mod fuse configfs dax loop dmi_sysfs simpledrm drm_shmem_helper drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm i2c_core drm_panel_orientation_quirks backlight fb crc32_pclmul crc32c_intel fbdev efivarfs
Hardware name: Amazon EC2 c6i.4xlarge/, BIOS 1.0 10/16/2017
RIP: 0010:lmce_supported (arch/x86/kernel/cpu/mce/intel.c:124 arch/x86/kernel/cpu/mce/intel.c:99) 
Code: 81 fb 00 00 00 09 75 da b9 3a 00 00 00 0f 32 48 c1 e2 20 48 09 c2 48 89 d3 66 90 48 89 d8 48 c1 e8 14 83 e0 01 83 e3 01 75 ba <0f> 0b 31 c0 eb b4 31 d2 48 89 de bf 3a 00 00 00 e8 6b e6 57 00 eb
All code
========
   0:	81 fb 00 00 00 09    	cmp    $0x9000000,%ebx
   6:	75 da                	jne    0xffffffffffffffe2
   8:	b9 3a 00 00 00       	mov    $0x3a,%ecx
   d:	0f 32                	rdmsr
   f:	48 c1 e2 20          	shl    $0x20,%rdx
  13:	48 09 c2             	or     %rax,%rdx
  16:	48 89 d3             	mov    %rdx,%rbx
  19:	66 90                	xchg   %ax,%ax
  1b:	48 89 d8             	mov    %rbx,%rax
  1e:	48 c1 e8 14          	shr    $0x14,%rax
  22:	83 e0 01             	and    $0x1,%eax
  25:	83 e3 01             	and    $0x1,%ebx
  28:	75 ba                	jne    0xffffffffffffffe4
  2a:*	0f 0b                	ud2		<-- trapping instruction
  2c:	31 c0                	xor    %eax,%eax
  2e:	eb b4                	jmp    0xffffffffffffffe4
  30:	31 d2                	xor    %edx,%edx
  32:	48 89 de             	mov    %rbx,%rsi
  35:	bf 3a 00 00 00       	mov    $0x3a,%edi
  3a:	e8 6b e6 57 00       	call   0x57e6aa
  3f:	eb                   	.byte 0xeb

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	31 c0                	xor    %eax,%eax
   4:	eb b4                	jmp    0xffffffffffffffba
   6:	31 d2                	xor    %edx,%edx
   8:	48 89 de             	mov    %rbx,%rsi
   b:	bf 3a 00 00 00       	mov    $0x3a,%edi
  10:	e8 6b e6 57 00       	call   0x57e680
  15:	eb                   	.byte 0xeb
RSP: 0018:ffffa18f00154fb8 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000003a
RDX: 0000000000000000 RSI: 00000000000000ff RDI: ffff965cfe2599c0
RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: ffffa18f00154ff8 R12: 0000000000000001
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff965cfe240000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8485dfba30 CR3: 0000000389a10003 CR4: 00000000007706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<IRQ>
? show_trace_log_lvl (arch/x86/kernel/dumpstack.c:259) 
? show_trace_log_lvl (arch/x86/kernel/dumpstack.c:259) 
? mce_intel_feature_clear (arch/x86/kernel/cpu/mce/intel.c:465 arch/x86/kernel/cpu/mce/intel.c:502) 
? lmce_supported (arch/x86/kernel/cpu/mce/intel.c:124 arch/x86/kernel/cpu/mce/intel.c:99) 
? __warn (kernel/panic.c:672) 
? lmce_supported (arch/x86/kernel/cpu/mce/intel.c:124 arch/x86/kernel/cpu/mce/intel.c:99) 
? report_bug (lib/bug.c:201 lib/bug.c:219) 
? handle_bug (arch/x86/kernel/traps.c:324) 
? exc_invalid_op (arch/x86/kernel/traps.c:345 (discriminator 1)) 
? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:568) 
? lmce_supported (arch/x86/kernel/cpu/mce/intel.c:124 arch/x86/kernel/cpu/mce/intel.c:99) 
? clear_local_APIC (./arch/x86/include/asm/apic.h:393 arch/x86/kernel/apic/apic.c:1192) 
mce_intel_feature_clear (arch/x86/kernel/cpu/mce/intel.c:465 arch/x86/kernel/cpu/mce/intel.c:502) 
stop_this_cpu (arch/x86/kernel/process.c:780) 
__sysvec_reboot (arch/x86/kernel/smp.c:140) 
sysvec_reboot (arch/x86/kernel/smp.c:136 (discriminator 14)) 
</IRQ>
<TASK>
asm_sysvec_reboot (./arch/x86/include/asm/idtentry.h:656) 
RIP: 0010:acpi_idle_do_entry (./arch/x86/include/asm/irqflags.h:40 ./arch/x86/include/asm/irqflags.h:75 drivers/acpi/processor_idle.c:113 drivers/acpi/processor_idle.c:572) 
Code: 75 08 48 8b 15 b1 81 df 02 ed c3 cc cc cc cc 65 48 8b 04 25 00 ff 01 00 48 8b 00 a8 08 75 eb 66 90 0f 00 2d 58 c8 6a 00 fb f4 <fa> c3 cc cc cc cc e9 01 fc ff ff 90 0f 1f 44 00 00 41 56 41 55 41
All code
========
   0:	75 08                	jne    0xa
   2:	48 8b 15 b1 81 df 02 	mov    0x2df81b1(%rip),%rdx        # 0x2df81ba
   9:	ed                   	in     (%dx),%eax
   a:	c3                   	ret
   b:	cc                   	int3
   c:	cc                   	int3
   d:	cc                   	int3
   e:	cc                   	int3
   f:	65 48 8b 04 25 00 ff 	mov    %gs:0x1ff00,%rax
  16:	01 00 
  18:	48 8b 00             	mov    (%rax),%rax
  1b:	a8 08                	test   $0x8,%al
  1d:	75 eb                	jne    0xa
  1f:	66 90                	xchg   %ax,%ax
  21:	0f 00 2d 58 c8 6a 00 	verw   0x6ac858(%rip)        # 0x6ac880
  28:	fb                   	sti
  29:	f4                   	hlt
  2a:*	fa                   	cli		<-- trapping instruction
  2b:	c3                   	ret
  2c:	cc                   	int3
  2d:	cc                   	int3
  2e:	cc                   	int3
  2f:	cc                   	int3
  30:	e9 01 fc ff ff       	jmp    0xfffffffffffffc36
  35:	90                   	nop
  36:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  3b:	41 56                	push   %r14
  3d:	41 55                	push   %r13
  3f:	41                   	rex.B

Code starting with the faulting instruction
===========================================
   0:	fa                   	cli
   1:	c3                   	ret
   2:	cc                   	int3
   3:	cc                   	int3
   4:	cc                   	int3
   5:	cc                   	int3
   6:	e9 01 fc ff ff       	jmp    0xfffffffffffffc0c
   b:	90                   	nop
   c:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  11:	41 56                	push   %r14
  13:	41 55                	push   %r13
  15:	41                   	rex.B
RSP: 0018:ffffa18f000afe70 EFLAGS: 00000246
RAX: 0000000000004000 RBX: ffff965603d92400 RCX: 4000000000000000
RDX: ffff965cfe240000 RSI: ffff965601478800 RDI: ffff965601478864
RBP: 0000000000000001 R08: ffffffffb62182c0 R09: 0000000000000000
R10: 0000000000002703 R11: 000000000001993d R12: 0000000000000001
R13: ffffffffb6218340 R14: 0000000000000001 R15: 0000000000000000
acpi_idle_enter (drivers/acpi/processor_idle.c:711 (discriminator 3)) 
cpuidle_enter_state (drivers/cpuidle/cpuidle.c:239) 
cpuidle_enter (drivers/cpuidle/cpuidle.c:358) 
cpuidle_idle_call (kernel/sched/idle.c:240) 
do_idle (kernel/sched/idle.c:305) 
cpu_startup_entry (kernel/sched/idle.c:400 (discriminator 1)) 
start_secondary (arch/x86/kernel/smpboot.c:215 arch/x86/kernel/smpboot.c:249) 
secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:358) 
</TASK>
---[ end trace 0000000000000000 ]---


* Re: WARNING in lmce_supported() during reboot.
  2024-10-25 23:13 WARNING in lmce_supported() during reboot Kuniyuki Iwashima
@ 2024-10-25 23:26 ` Benjamin Herrenschmidt
  2024-10-25 23:57 ` Luck, Tony
  2024-10-25 23:58 ` Dave Hansen
  2 siblings, 0 replies; 7+ messages in thread
From: Benjamin Herrenschmidt @ 2024-10-25 23:26 UTC
  To: Kuniyuki Iwashima, x86, linux-edac, linux-kernel
  Cc: Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, H. Peter Anvin

On Fri, 2024-10-25 at 16:13 -0700, Kuniyuki Iwashima wrote:
> Hello x86 maintainers,
> 
> We have seen the splat below a few times when just rebooting hosts.
> 
> It happens rarely and seems timing-related, so we don't have a
> reproducer.
> 
> The kernel source for the splat is here:
> https://github.com/amazonlinux/linux/tree/kernel-6.1.61-85.141.amzn2023
> 
> and the WARN_ON_ONCE() that triggered in lmce_supported() is here:
> https://github.com/amazonlinux/linux/blob/kernel-6.1.61-85.141.amzn2023/arch/x86/kernel/cpu/mce/intel.c#L124

(switching to my lkml/spam friendly email)

I also hit it with 6.1.112-122.189.amzn2023.x86_64

Cheers,
Ben.

> Do you have any hints?
> 
> Thanks in advance.
> 
> [...]



* RE: WARNING in lmce_supported() during reboot.
  2024-10-25 23:13 WARNING in lmce_supported() during reboot Kuniyuki Iwashima
  2024-10-25 23:26 ` Benjamin Herrenschmidt
@ 2024-10-25 23:57 ` Luck, Tony
  2024-10-25 23:58 ` Dave Hansen
  2 siblings, 0 replies; 7+ messages in thread
From: Luck, Tony @ 2024-10-25 23:57 UTC
  To: Kuniyuki Iwashima, x86@kernel.org, linux-edac@vger.kernel.org,
	linux-kernel@vger.kernel.org
  Cc: Borislav Petkov, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	H. Peter Anvin, Benjamin Herrenschmidt

> and the WARN_ON_ONCE() that triggered in lmce_supported() is here:
> https://github.com/amazonlinux/linux/blob/kernel-6.1.61-85.141.amzn2023/arch/x86/kernel/cpu/mce/intel.c#L124

So the warning is this one:

	if (WARN_ON_ONCE(!(tmp & FEAT_CTL_LOCKED)))

It checks that MSR_IA32_FEAT_CTL (MSR 0x3a) has been correctly set and
locked by the BIOS, i.e. that LMCE mode can't be snatched away by
someone rewriting this MSR.
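
As an illustration of that check, here is a minimal user-space sketch in C.
It is not the kernel's code: the bit positions are the ones named in this
mail (bit 0 is the lock bit, bit 20 is the LMCE bit), and the helper names
msr_bit_set(), feat_ctl_locked(), feat_ctl_lmce_enabled() and
lmce_check_would_warn() are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as described in this thread: bit 0 = lock, bit 20 = LMCE. */
#define FEAT_CTL_LOCKED_BIT	0
#define FEAT_CTL_LMCE_BIT	20

/* Hypothetical helper: test a single bit of a 64-bit MSR value. */
static inline bool msr_bit_set(uint64_t val, unsigned int bit)
{
	assert(bit < 64);	/* shifting by 64 or more is undefined */
	return (val >> bit) & 1;
}

/* True if the lock bit is set, i.e. the MSR can no longer be rewritten. */
static inline bool feat_ctl_locked(uint64_t val)
{
	return msr_bit_set(val, FEAT_CTL_LOCKED_BIT);
}

/* True if the LMCE bit is set. */
static inline bool feat_ctl_lmce_enabled(uint64_t val)
{
	return msr_bit_set(val, FEAT_CTL_LMCE_BIT);
}

/* Mirrors the condition of the quoted WARN_ON_ONCE(): warn when unlocked. */
static inline bool lmce_check_would_warn(uint64_t val)
{
	return !feat_ctl_locked(val);
}
```

With this sketch, a value of 0 would trip the warning, while a value such as
0x100001 (lock and LMCE bits both set) would not. The register dump in the
splat shows RAX, RBX and RDX all zero at the trapping ud2, which appears
consistent with the MSR having read back as 0, although that is only an
inference from the disassembly.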

That said, you ought to either hit it all the time, or never. So this "sometimes"
state is weird.

Which CPU model do you see this on?

Can you please try using the rdmsr/wrmsr commands from msr-tools to:

a) read this MSR on all CPUs to check it is set to the same value and that
bit 0 is set to 1.

b) try writing to this MSR (maybe try clearing the lock bit (bit 0) or the
LMCE bit (bit 20)) and see if that succeeds.
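
For anyone who prefers a small program to the msr-tools commands, step (a)
could be sketched in C roughly as follows. It assumes the Linux msr driver
is loaded and the program runs as root, so that /dev/cpu/<n>/msr exists. The
function names read_msr_file() and feat_ctl_consistent(), and the idea of
passing in the CPU count, are assumptions for the example and are not part
of msr-tools.

```c
#define _XOPEN_SOURCE 700	/* for pread() under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_IA32_FEAT_CTL 0x3a

/*
 * Read one 64-bit MSR through the msr driver's character device.
 * The driver interprets the file offset as the MSR number.
 * Returns 0 on success, -1 on any failure.
 */
static inline int read_msr_file(const char *path, uint32_t msr, uint64_t *out)
{
	ssize_t n;
	int fd;

	assert(out != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	n = pread(fd, out, sizeof(*out), (off_t)msr);
	close(fd);
	return n == (ssize_t)sizeof(*out) ? 0 : -1;
}

/*
 * Hypothetical helper for step (a): returns 1 if every CPU reports the
 * same value with bit 0 (lock) set, 0 if not, -1 if a read failed.
 */
static inline int feat_ctl_consistent(int ncpus)
{
	uint64_t first = 0, val = 0;
	char path[64];

	for (int cpu = 0; cpu < ncpus; cpu++) {
		snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
		if (read_msr_file(path, MSR_IA32_FEAT_CTL, &val))
			return -1;
		printf("cpu%d: MSR 0x3a = 0x%llx\n", cpu,
		       (unsigned long long)val);
		if (cpu == 0)
			first = val;
		if (val != first || !(val & 1))
			return 0;
	}
	return 1;
}
```

Step (b), the write test, is deliberately left out of the sketch: it changes
machine state and is probably better done by hand with wrmsr.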

-Tony


* Re: WARNING in lmce_supported() during reboot.
  2024-10-25 23:13 WARNING in lmce_supported() during reboot Kuniyuki Iwashima
  2024-10-25 23:26 ` Benjamin Herrenschmidt
  2024-10-25 23:57 ` Luck, Tony
@ 2024-10-25 23:58 ` Dave Hansen
  2024-10-26  2:33   ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 7+ messages in thread
From: Dave Hansen @ 2024-10-25 23:58 UTC
  To: Kuniyuki Iwashima, x86, linux-edac, linux-kernel
  Cc: Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, H. Peter Anvin, Benjamin Herrenschmidt

On 10/25/24 16:13, Kuniyuki Iwashima wrote:
> We have seen the splat below a few times when just rebooting hosts.
> 
> It happens rarely and seems timing-related, so we don't have a
> reproducer.
> 
> The kernel source for the splat is here:
> https://github.com/amazonlinux/linux/tree/kernel-6.1.61-85.141.amzn2023

Hi Folks,

We really do need it to be reproduced on mainline.  At the very least,
it would be greatly appreciated if you could summarize what your fork is
doing and why you don't think it is responsible.

But I don't see how this could be timing related.  That MSR gets locked
early from what I can tell, long before the system would be rebooting.

Your best bet is going to be getting a handle on what
MSR_IA32_FEAT_CTL's value was after the CPU was brought up and when this
reboot was attempted.  If those values differ, the next question is when
it got changed.
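
One way to compare such values, sketched in C under the assumption that a
per-CPU snapshot of the MSR has already been collected at each of the two
points (how they are collected is left open here). The function name
msr_snapshot_diff() and the array layout are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compare two per-CPU snapshots of an MSR, for example one taken after
 * bring-up and one taken just before reboot. For each CPU the XOR of the
 * two values is stored in diff[], so every changed bit shows up as a 1.
 * Returns the number of CPUs whose value changed.
 */
static inline size_t msr_snapshot_diff(const uint64_t *boot,
				       const uint64_t *later,
				       uint64_t *diff, size_t ncpus)
{
	size_t changed = 0;

	assert(boot != NULL && later != NULL && diff != NULL);
	for (size_t cpu = 0; cpu < ncpus; cpu++) {
		diff[cpu] = boot[cpu] ^ later[cpu];
		if (diff[cpu])
			changed++;
	}
	return changed;
}
```

A nonzero diff[] entry only says that a CPU's value changed between the two
samples. Narrowing down when it changed would need additional samples taken
in between.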

I'd _suspect_ some kind of BIOS sleep/wakeup wonkiness where something
forgot to re-lock the MSR.


* Re: WARNING in lmce_supported() during reboot.
  2024-10-25 23:58 ` Dave Hansen
@ 2024-10-26  2:33   ` Benjamin Herrenschmidt
  2024-10-28 15:46     ` Dave Hansen
  2024-10-30  2:26     ` Zhuo, Qiuxu
  0 siblings, 2 replies; 7+ messages in thread
From: Benjamin Herrenschmidt @ 2024-10-26  2:33 UTC
  To: Dave Hansen, Kuniyuki Iwashima, x86, linux-edac, linux-kernel
  Cc: Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, H. Peter Anvin, Alex Graf

On Fri, 2024-10-25 at 16:58 -0700, Dave Hansen wrote:
> 
> Hi Folks,
> 
> We really do need it to be reproduced on mainline.  At the very least,
> it would be greatly appreciated if you could summarize what your fork is
> doing and why you don't think it is responsible.
> 
> But I don't see how this could be timing related.  That MSR gets locked
> early from what I can tell, long before the system would be rebooting.
> 
> Your best bet is going to be getting a handle on what
> MSR_IA32_FEAT_CTL's value was after the CPU was brought up and when this
> reboot was attempted.  If those values differ, the next question is when
> it got changed.
> 
> I'd _suspect_ some kind of BIOS sleep/wakeup wonkiness where something
> forgot to re-lock the MSR.

So far we have only happened to notice it on the serial console while
doing other things; I told Kuniyuki to forward it to you in case it rings
a bell. We can definitely make some more systematic attempts at
reproducing it, but that might take a while.

For me it happened once while rebooting a c5d.4xlarge instance, and not
since (I tried a few reboots), so it could well be something BIOS-related
(CC'ing Alex).

This is a KVM/Nitro guest, so the CPU is somewhat virtualized, but
/proc/cpuinfo says: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz

Cheers,
Ben.


* Re: WARNING in lmce_supported() during reboot.
  2024-10-26  2:33   ` Benjamin Herrenschmidt
@ 2024-10-28 15:46     ` Dave Hansen
  2024-10-30  2:26     ` Zhuo, Qiuxu
  1 sibling, 0 replies; 7+ messages in thread
From: Dave Hansen @ 2024-10-28 15:46 UTC
  To: Benjamin Herrenschmidt, Kuniyuki Iwashima, x86, linux-edac,
	linux-kernel
  Cc: Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, H. Peter Anvin, Alex Graf

On 10/25/24 19:33, Benjamin Herrenschmidt wrote:
> This is a KVM/Nitro guest, so the CPU is somewhat virtualized, but
> /proc/cpuinfo says: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz

Oh, if it's a guest, I'd be 100% pointing fingers at some piece of the
VMM.  While it's possible that the guest kernel screwed this up, it's
not immediately obvious how it might do that.




* RE: WARNING in lmce_supported() during reboot.
  2024-10-26  2:33   ` Benjamin Herrenschmidt
  2024-10-28 15:46     ` Dave Hansen
@ 2024-10-30  2:26     ` Zhuo, Qiuxu
  1 sibling, 0 replies; 7+ messages in thread
From: Zhuo, Qiuxu @ 2024-10-30  2:26 UTC
  To: Benjamin Herrenschmidt, Hansen, Dave, Kuniyuki Iwashima,
	x86@kernel.org, linux-edac@vger.kernel.org,
	linux-kernel@vger.kernel.org
  Cc: Luck, Tony, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, H. Peter Anvin, Graf, Alexander

> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> [...]
> Subject: Re: WARNING in lmce_supported() during reboot.
> [...]
> This is a KVM/Nitro guest, so the CPU is somewhat virtualized, but
> /proc/cpuinfo says: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
> 

Did you see the same splat on this host? 
It may also be worth paying attention to whether this splat occurs when
you reboot the host, if possible.

-Qiuxu
