[BUG] ipmi_si: watchdog: Watchdog detected hard LOCKUP

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [BUG] ipmi_si: watchdog: Watchdog detected hard LOCKUP
@ 2025-08-06 20:14 Frederick Lawler
  2025-08-06 20:19 ` Fred Lawler
  0 siblings, 1 reply; 8+ messages in thread
From: Frederick Lawler @ 2025-08-06 20:14 UTC (permalink / raw)
  To: Corey Minyard; +Cc: openipmi-developer, linux-kernel, kernel-team

Hi Corey,

In kernel 6.12.y, while resetting the BMC, we can sometimes hit a hard LOCKUP
watchdog event, especially so while querying the BMC for basic device
information via sysfs.

I havn't been able to create a consistent reproducer yet, but I suspect
that these occur during high traffic, BMC is resetting, and reading
from the sysfs files in parallel. We're also using KCS to interface
with the BMC.

I can consistently reproduce hung tasks trivially with the following,
during a BMC reset:

while true; do cat aux_firmware_revision &>/dev/null; done &

I tried also tried to load the CPUs with stress-ng, but the best I can do
are the hung tasks.

I identified that sni_send()[1] could be locked behind the
spin_lock_irqsave() and within the KCS send handler, there's another irq
save lock. I suspect this is where we're getting hung up. Below is a
sample stack trace + log output.

I'm happy to provide traces and additional information, let me know.

Links:
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/char/ipmi/ipmi_msghandler.c?h=linux-6.12.y#n1899

[  499.564572] [  T27255] ip6_tunnel: pni_gre_814 xmit: Local address not yet configured!
[  499.588176] [  T27255] ip6_tunnel: pni_gre_868 xmit: Local address not yet configured!
[  499.605284] [  T27255] ip6_tunnel: pni_gre_871 xmit: Local address not yet configured!
[  805.906999] [  T12765] usb 1-1: USB disconnect, device number 2
[  845.346020] [  T12765] usb 1-1: new high-speed USB device number 3 using xhci_hcd
[  845.485453] [  T12765] usb 1-1: New USB device found, idVendor=1d6b, idProduct=0107, bcdDevice= 1.00
[  845.496823] [  T12765] usb 1-1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[  845.507242] [  T12765] usb 1-1: Product: USB Virtual Hub
[  845.514946] [  T12765] usb 1-1: Manufacturer: Aspeed
[  845.522363] [  T12765] usb 1-1: SerialNumber: 00000000
[  845.530454] [  T12765] usb 1-1: Device is not authorized for usage
[  853.774910] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
[  853.783794] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
[  853.792649] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
[  853.801461] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
[  853.810291] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
[  853.819069] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
[  853.827816] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
[  853.836581] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
[  853.845326] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
[  853.854074] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
[  853.862813] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
[  863.934436] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
[  863.943420] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
[  863.952363] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
[  863.961296] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
[  878.616336] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
[  878.624905] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
[  878.633427] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
[  878.641954] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
[  880.310112] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
[  880.318682] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
[  880.327083] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
[  880.335483] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
[  904.196122] [     C33] watchdog: Watchdog detected hard LOCKUP on cpu 33
[  904.196127] [     C97] Uhhuh. NMI received for unknown reason 3d on CPU 97.
[  904.196126] [      C6] Uhhuh. NMI received for unknown reason 3d on CPU 6.
[  904.196130] [     C33] Modules linked in:
[  904.196129] [    C101] Uhhuh. NMI received for unknown reason 3d on CPU 101.
[  904.196131] [     C97] Dazed and confused, but trying to continue
[  904.196131] [     C33]  nft_fwd_netdev
[  904.196131] [     C99] Uhhuh. NMI received for unknown reason 2d on CPU 99.
[  904.196133] [      C6] Dazed and confused, but trying to continue
[  904.196133] [    C102] Uhhuh. NMI received for unknown reason 2d on CPU 102.
[  904.196134] [     C33]  nf_dup_netdev
[  904.196134] [     C35] Uhhuh. NMI received for unknown reason 2d on CPU 35.
[  904.196135] [    C101] Dazed and confused, but trying to continue
[  904.196137] [     C99] Dazed and confused, but trying to continue
[  904.196137] [     C33]  xfrm_interface
[  904.196136] [     C69] Uhhuh. NMI received for unknown reason 2d on CPU 69.
[  904.196140] [    C102] Dazed and confused, but trying to continue
[  904.196140] [     C33]  xfrm6_tunnel
[  904.196138] [    C121] Uhhuh. NMI received for unknown reason 2d on CPU 121.
[  904.196140] [    C123] Uhhuh. NMI received for unknown reason 2d on CPU 123.
[  904.196142] [     C35] Dazed and confused, but trying to continue
[  904.196143] [     C69] Dazed and confused, but trying to continue
[  904.196143] [     C33]  nft_numgen
[  904.196143] [     C61] Uhhuh. NMI received for unknown reason 2d on CPU 61.
[  904.196144] [     C62] Uhhuh. NMI received for unknown reason 3d on CPU 62.
[  904.196146] [    C123] Dazed and confused, but trying to continue
[  904.196147] [    C121] Dazed and confused, but trying to continue
[  904.196148] [     C58] Dazed and confused, but trying to continue
[  904.196150] [     C33]  nft_log nft_limit sit dummy ipip tunnel4 ip_gre gre xfrm_user xfrm_algo tls mpls_iptunnel mpls_router nft_ct nf_tables iptable_raw iptable_nat iptable_mangle ipt_REJECT nf_reject_ipv4 ip6table_security xt_CT ip6table_raw xt_nat ip6table_nat nf_nat xt_TCPMSS xt_owner xt_DSCP xt_NFLOG xt_connbytes xt_connlabel xt_statistic xt_connmark ip6table_mangle xt_limit xt_LOG nf_log_syslog xt_mark xt_conntrack ip6t_REJECT nf_reject_ipv6 xt_multiport xt_set xt_tcpmss xt_comment xt_tcpudp ip6table_filter ip6_tables nfnetlink_log udp_diag dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio iptable_filter veth tcp_diag inet_diag mpls_gso act_mpls cls_flower cls_bpf sch_ingress ip_set_hash_ip ip_set_hash_net ip_set tcp_bbr sch_fq tun xt_bpf nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 fou6 fou ip_tunnel ip6_udp_tunnel udp_tunnel ip6_tunnel tunnel6 nvme_fabrics raid0 md_mod essiv dm_crypt trusted asn1_encoder tee dm_mod dax 8021q garp mrp stp llc ipmi_ssif amd64_edac kvm_amd kvm irqbypass crc32_pclmul crc32c_intel
[  904.196247] [     C33]  sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd xhci_pci binfmt_misc acpi_ipmi cryptd ipmi_si nvme rapl ipmi_devintf i2c_piix4 tiny_power_button bnxt_en xhci_hcd nvme_core ccp i2c_smbus ipmi_msghandler button fuse configfs nfnetlink efivarfs ip_tables x_tables bcmcrypt(O)
[  904.196281] [     C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G           O       6.12.34-cloudflare-2025.6.9 #1
[  904.196286] [     C33] Tainted: [O]=OOT_MODULE
[  904.196287] [     C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
[  904.196290] [     C33] RIP: 0010:io_idle+0x3/0x30
[  904.196298] [     C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
[  904.196301] [     C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
[  904.196304] [     C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
[  904.196306] [     C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
[  904.196308] [     C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
[  904.196309] [     C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
[  904.196311] [     C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
[  904.196313] [     C33] FS:  0000000000000000(0000) GS:ffff8aadcf680000(0000) knlGS:0000000000000000
[  904.196316] [     C33] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  904.196318] [     C33] CR2: 00005632ce239000 CR3: 0000003944f72004 CR4: 0000000000770ef0
[  904.196320] [     C33] PKRU: 55555554
[  904.196322] [     C33] Call Trace:
[  904.196324] [     C33]  <TASK>
[  904.196326] [     C33]  acpi_idle_do_entry+0x22/0x50
[  904.196336] [     C33]  acpi_idle_enter+0x7b/0xd0
[  904.196340] [     C33]  cpuidle_enter_state+0x79/0x420
[  904.196345] [     C33]  cpuidle_enter+0x2d/0x40
[  904.196352] [     C33]  do_idle+0x176/0x1c0
[  904.196358] [     C33]  cpu_startup_entry+0x29/0x30
[  904.196362] [     C33]  start_secondary+0xf7/0x100
[  904.196366] [     C33]  common_startup_64+0x13e/0x141
[  904.196374] [     C33]  </TASK>
[  904.196377] [     C33] Kernel panic - not syncing: Hard LOCKUP
[  904.196379] [     C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G           O       6.12.34-cloudflare-2025.6.9 #1
[  904.196383] [     C33] Tainted: [O]=OOT_MODULE
[  904.196384] [     C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
[  904.196385] [     C33] Call Trace:
[  904.196387] [     C33]  <NMI>
[  904.196389] [     C33]  dump_stack_lvl+0x4b/0x70
[  904.196394] [     C33]  panic+0x106/0x2c4
[  904.196401] [     C33]  nmi_panic.cold+0xc/0xc
[  904.196404] [     C33]  watchdog_hardlockup_check.cold+0xc6/0xe8
[  904.196409] [     C33]  __perf_event_overflow+0x15a/0x450
[  904.196416] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
[  904.196421] [     C33]  x86_pmu_handle_irq+0x18a/0x1c0
[  904.196436] [     C33]  ? set_pte_vaddr+0x40/0x50
[  904.196439] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
[  904.196442] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
[  904.196445] [     C33]  ? native_set_fixmap+0x63/0xb0
[  904.196448] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
[  904.196451] [     C33]  ? ghes_copy_tofrom_phys+0x7a/0x100
[  904.196457] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
[  904.196460] [     C33]  ? __ghes_peek_estatus.isra.0+0x49/0xa0
[  904.196465] [     C33]  amd_pmu_handle_irq+0x4b/0xc0
[  904.196469] [     C33]  perf_event_nmi_handler+0x2a/0x50
[  904.196473] [     C33]  nmi_handle.part.0+0x59/0x110
[  904.196479] [     C33]  default_do_nmi+0x127/0x180
[  904.196483] [     C33]  exc_nmi+0x103/0x180
[  904.196486] [     C33]  end_repeat_nmi+0xf/0x53
[  904.196489] [     C33] RIP: 0010:io_idle+0x3/0x30
[  904.196493] [     C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
[  904.196495] [     C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
[  904.196497] [     C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
[  904.196499] [     C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
[  904.196501] [     C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
[  904.196502] [     C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
[  904.196504] [     C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
[  904.196510] [     C33]  ? io_idle+0x3/0x30
[  904.196515] [     C33]  ? io_idle+0x3/0x30
[  904.196519] [     C33]  </NMI>
[  904.196520] [     C33]  <TASK>
[  904.196521] [     C33]  acpi_idle_do_entry+0x22/0x50
[  904.196526] [     C33]  acpi_idle_enter+0x7b/0xd0
[  904.196529] [     C33]  cpuidle_enter_state+0x79/0x420
[  904.196535] [     C33]  cpuidle_enter+0x2d/0x40
[  904.196539] [     C33]  do_idle+0x176/0x1c0
[  904.196544] [     C33]  cpu_startup_entry+0x29/0x30
[  904.196548] [     C33]  start_secondary+0xf7/0x100
[  904.196552] [     C33]  common_startup_64+0x13e/0x141
[  904.196559] [     C33]  </TASK>

Best, Fred

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] ipmi_si: watchdog: Watchdog detected hard LOCKUP
  2025-08-06 20:14 [BUG] ipmi_si: watchdog: Watchdog detected hard LOCKUP Frederick Lawler
@ 2025-08-06 20:19 ` Fred Lawler
  2025-08-06 20:39   ` Corey Minyard
  2025-08-06 21:16   ` Corey Minyard
  0 siblings, 2 replies; 8+ messages in thread
From: Fred Lawler @ 2025-08-06 20:19 UTC (permalink / raw)
  To: Corey Minyard; +Cc: openipmi-developer, linux-kernel, kernel-team

+ CC: Corey Minyard <corey@minyard.net>

Misspelled the email address.

On Wed, Aug 6, 2025 at 3:14 PM Frederick Lawler <fred@cloudflare.com> wrote:
>
> Hi Corey,
>
> In kernel 6.12.y, while resetting the BMC, we can sometimes hit a hard LOCKUP
> watchdog event, especially so while querying the BMC for basic device
> information via sysfs.
>
> I havn't been able to create a consistent reproducer yet, but I suspect
> that these occur during high traffic, BMC is resetting, and reading
> from the sysfs files in parallel. We're also using KCS to interface
> with the BMC.
>
> I can consistently reproduce hung tasks trivially with the following,
> during a BMC reset:
>
> while true; do cat aux_firmware_revision &>/dev/null; done &
>
> I tried also tried to load the CPUs with stress-ng, but the best I can do
> are the hung tasks.
>
> I identified that sni_send()[1] could be locked behind the
> spin_lock_irqsave() and within the KCS send handler, there's another irq
> save lock. I suspect this is where we're getting hung up. Below is a
> sample stack trace + log output.
>
> I'm happy to provide traces and additional information, let me know.
>
> Links:
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/char/ipmi/ipmi_msghandler.c?h=linux-6.12.y#n1899
>
> [  499.564572] [  T27255] ip6_tunnel: pni_gre_814 xmit: Local address not yet configured!
> [  499.588176] [  T27255] ip6_tunnel: pni_gre_868 xmit: Local address not yet configured!
> [  499.605284] [  T27255] ip6_tunnel: pni_gre_871 xmit: Local address not yet configured!
> [  805.906999] [  T12765] usb 1-1: USB disconnect, device number 2
> [  845.346020] [  T12765] usb 1-1: new high-speed USB device number 3 using xhci_hcd
> [  845.485453] [  T12765] usb 1-1: New USB device found, idVendor=1d6b, idProduct=0107, bcdDevice= 1.00
> [  845.496823] [  T12765] usb 1-1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> [  845.507242] [  T12765] usb 1-1: Product: USB Virtual Hub
> [  845.514946] [  T12765] usb 1-1: Manufacturer: Aspeed
> [  845.522363] [  T12765] usb 1-1: SerialNumber: 00000000
> [  845.530454] [  T12765] usb 1-1: Device is not authorized for usage
> [  853.774910] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.783794] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.792649] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.801461] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.810291] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.819069] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.827816] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.836581] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.845326] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.854074] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  853.862813] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> [  863.934436] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> [  863.943420] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> [  863.952363] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> [  863.961296] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> [  878.616336] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> [  878.624905] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> [  878.633427] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> [  878.641954] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> [  880.310112] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> [  880.318682] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> [  880.327083] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> [  880.335483] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> [  904.196122] [     C33] watchdog: Watchdog detected hard LOCKUP on cpu 33
> [  904.196127] [     C97] Uhhuh. NMI received for unknown reason 3d on CPU 97.
> [  904.196126] [      C6] Uhhuh. NMI received for unknown reason 3d on CPU 6.
> [  904.196130] [     C33] Modules linked in:
> [  904.196129] [    C101] Uhhuh. NMI received for unknown reason 3d on CPU 101.
> [  904.196131] [     C97] Dazed and confused, but trying to continue
> [  904.196131] [     C33]  nft_fwd_netdev
> [  904.196131] [     C99] Uhhuh. NMI received for unknown reason 2d on CPU 99.
> [  904.196133] [      C6] Dazed and confused, but trying to continue
> [  904.196133] [    C102] Uhhuh. NMI received for unknown reason 2d on CPU 102.
> [  904.196134] [     C33]  nf_dup_netdev
> [  904.196134] [     C35] Uhhuh. NMI received for unknown reason 2d on CPU 35.
> [  904.196135] [    C101] Dazed and confused, but trying to continue
> [  904.196137] [     C99] Dazed and confused, but trying to continue
> [  904.196137] [     C33]  xfrm_interface
> [  904.196136] [     C69] Uhhuh. NMI received for unknown reason 2d on CPU 69.
> [  904.196140] [    C102] Dazed and confused, but trying to continue
> [  904.196140] [     C33]  xfrm6_tunnel
> [  904.196138] [    C121] Uhhuh. NMI received for unknown reason 2d on CPU 121.
> [  904.196140] [    C123] Uhhuh. NMI received for unknown reason 2d on CPU 123.
> [  904.196142] [     C35] Dazed and confused, but trying to continue
> [  904.196143] [     C69] Dazed and confused, but trying to continue
> [  904.196143] [     C33]  nft_numgen
> [  904.196143] [     C61] Uhhuh. NMI received for unknown reason 2d on CPU 61.
> [  904.196144] [     C62] Uhhuh. NMI received for unknown reason 3d on CPU 62.
> [  904.196146] [    C123] Dazed and confused, but trying to continue
> [  904.196147] [    C121] Dazed and confused, but trying to continue
> [  904.196148] [     C58] Dazed and confused, but trying to continue
> [  904.196150] [     C33]  nft_log nft_limit sit dummy ipip tunnel4 ip_gre gre xfrm_user xfrm_algo tls mpls_iptunnel mpls_router nft_ct nf_tables iptable_raw iptable_nat iptable_mangle ipt_REJECT nf_reject_ipv4 ip6table_security xt_CT ip6table_raw xt_nat ip6table_nat nf_nat xt_TCPMSS xt_owner xt_DSCP xt_NFLOG xt_connbytes xt_connlabel xt_statistic xt_connmark ip6table_mangle xt_limit xt_LOG nf_log_syslog xt_mark xt_conntrack ip6t_REJECT nf_reject_ipv6 xt_multiport xt_set xt_tcpmss xt_comment xt_tcpudp ip6table_filter ip6_tables nfnetlink_log udp_diag dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio iptable_filter veth tcp_diag inet_diag mpls_gso act_mpls cls_flower cls_bpf sch_ingress ip_set_hash_ip ip_set_hash_net ip_set tcp_bbr sch_fq tun xt_bpf nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 fou6 fou ip_tunnel ip6_udp_tunnel udp_tunnel ip6_tunnel tunnel6 nvme_fabrics raid0 md_mod essiv dm_crypt trusted asn1_encoder tee dm_mod dax 8021q garp mrp stp llc ipmi_ssif amd64_edac kvm_amd kvm irqbypass crc32_pclmul crc32c_intel
> [  904.196247] [     C33]  sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd xhci_pci binfmt_misc acpi_ipmi cryptd ipmi_si nvme rapl ipmi_devintf i2c_piix4 tiny_power_button bnxt_en xhci_hcd nvme_core ccp i2c_smbus ipmi_msghandler button fuse configfs nfnetlink efivarfs ip_tables x_tables bcmcrypt(O)
> [  904.196281] [     C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G           O       6.12.34-cloudflare-2025.6.9 #1
> [  904.196286] [     C33] Tainted: [O]=OOT_MODULE
> [  904.196287] [     C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
> [  904.196290] [     C33] RIP: 0010:io_idle+0x3/0x30
> [  904.196298] [     C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
> [  904.196301] [     C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
> [  904.196304] [     C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
> [  904.196306] [     C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
> [  904.196308] [     C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
> [  904.196309] [     C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
> [  904.196311] [     C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
> [  904.196313] [     C33] FS:  0000000000000000(0000) GS:ffff8aadcf680000(0000) knlGS:0000000000000000
> [  904.196316] [     C33] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  904.196318] [     C33] CR2: 00005632ce239000 CR3: 0000003944f72004 CR4: 0000000000770ef0
> [  904.196320] [     C33] PKRU: 55555554
> [  904.196322] [     C33] Call Trace:
> [  904.196324] [     C33]  <TASK>
> [  904.196326] [     C33]  acpi_idle_do_entry+0x22/0x50
> [  904.196336] [     C33]  acpi_idle_enter+0x7b/0xd0
> [  904.196340] [     C33]  cpuidle_enter_state+0x79/0x420
> [  904.196345] [     C33]  cpuidle_enter+0x2d/0x40
> [  904.196352] [     C33]  do_idle+0x176/0x1c0
> [  904.196358] [     C33]  cpu_startup_entry+0x29/0x30
> [  904.196362] [     C33]  start_secondary+0xf7/0x100
> [  904.196366] [     C33]  common_startup_64+0x13e/0x141
> [  904.196374] [     C33]  </TASK>
> [  904.196377] [     C33] Kernel panic - not syncing: Hard LOCKUP
> [  904.196379] [     C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G           O       6.12.34-cloudflare-2025.6.9 #1
> [  904.196383] [     C33] Tainted: [O]=OOT_MODULE
> [  904.196384] [     C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
> [  904.196385] [     C33] Call Trace:
> [  904.196387] [     C33]  <NMI>
> [  904.196389] [     C33]  dump_stack_lvl+0x4b/0x70
> [  904.196394] [     C33]  panic+0x106/0x2c4
> [  904.196401] [     C33]  nmi_panic.cold+0xc/0xc
> [  904.196404] [     C33]  watchdog_hardlockup_check.cold+0xc6/0xe8
> [  904.196409] [     C33]  __perf_event_overflow+0x15a/0x450
> [  904.196416] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  904.196421] [     C33]  x86_pmu_handle_irq+0x18a/0x1c0
> [  904.196436] [     C33]  ? set_pte_vaddr+0x40/0x50
> [  904.196439] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  904.196442] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  904.196445] [     C33]  ? native_set_fixmap+0x63/0xb0
> [  904.196448] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  904.196451] [     C33]  ? ghes_copy_tofrom_phys+0x7a/0x100
> [  904.196457] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  904.196460] [     C33]  ? __ghes_peek_estatus.isra.0+0x49/0xa0
> [  904.196465] [     C33]  amd_pmu_handle_irq+0x4b/0xc0
> [  904.196469] [     C33]  perf_event_nmi_handler+0x2a/0x50
> [  904.196473] [     C33]  nmi_handle.part.0+0x59/0x110
> [  904.196479] [     C33]  default_do_nmi+0x127/0x180
> [  904.196483] [     C33]  exc_nmi+0x103/0x180
> [  904.196486] [     C33]  end_repeat_nmi+0xf/0x53
> [  904.196489] [     C33] RIP: 0010:io_idle+0x3/0x30
> [  904.196493] [     C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
> [  904.196495] [     C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
> [  904.196497] [     C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
> [  904.196499] [     C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
> [  904.196501] [     C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
> [  904.196502] [     C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
> [  904.196504] [     C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
> [  904.196510] [     C33]  ? io_idle+0x3/0x30
> [  904.196515] [     C33]  ? io_idle+0x3/0x30
> [  904.196519] [     C33]  </NMI>
> [  904.196520] [     C33]  <TASK>
> [  904.196521] [     C33]  acpi_idle_do_entry+0x22/0x50
> [  904.196526] [     C33]  acpi_idle_enter+0x7b/0xd0
> [  904.196529] [     C33]  cpuidle_enter_state+0x79/0x420
> [  904.196535] [     C33]  cpuidle_enter+0x2d/0x40
> [  904.196539] [     C33]  do_idle+0x176/0x1c0
> [  904.196544] [     C33]  cpu_startup_entry+0x29/0x30
> [  904.196548] [     C33]  start_secondary+0xf7/0x100
> [  904.196552] [     C33]  common_startup_64+0x13e/0x141
> [  904.196559] [     C33]  </TASK>
>
> Best, Fred

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] ipmi_si: watchdog: Watchdog detected hard LOCKUP
  2025-08-06 20:19 ` Fred Lawler
@ 2025-08-06 20:39   ` Corey Minyard
  2025-08-06 21:16   ` Corey Minyard
  1 sibling, 0 replies; 8+ messages in thread
From: Corey Minyard @ 2025-08-06 20:39 UTC (permalink / raw)
  To: Fred Lawler; +Cc: openipmi-developer, linux-kernel, kernel-team

Can you try the following patch?

https://github.com/cminyard/linux-ipmi.git ec50ec378e3fd83bde9b3d622ceac3509a60b6b5

I don't think it will fix it, but maybe it's just a bunch of logs
causing this.

I'll look a bit more.

-corey


On Wed, Aug 06, 2025 at 03:19:02PM -0500, Fred Lawler wrote:
> + CC: Corey Minyard <corey@minyard.net>
> 
> Misspelled the email address.
> 
> On Wed, Aug 6, 2025 at 3:14 PM Frederick Lawler <fred@cloudflare.com> wrote:
> >
> > Hi Corey,
> >
> > In kernel 6.12.y, while resetting the BMC, we can sometimes hit a hard LOCKUP
> > watchdog event, especially so while querying the BMC for basic device
> > information via sysfs.
> >
> > I havn't been able to create a consistent reproducer yet, but I suspect
> > that these occur during high traffic, BMC is resetting, and reading
> > from the sysfs files in parallel. We're also using KCS to interface
> > with the BMC.
> >
> > I can consistently reproduce hung tasks trivially with the following,
> > during a BMC reset:
> >
> > while true; do cat aux_firmware_revision &>/dev/null; done &
> >
> > I tried also tried to load the CPUs with stress-ng, but the best I can do
> > are the hung tasks.
> >
> > I identified that sni_send()[1] could be locked behind the
> > spin_lock_irqsave() and within the KCS send handler, there's another irq
> > save lock. I suspect this is where we're getting hung up. Below is a
> > sample stack trace + log output.
> >
> > I'm happy to provide traces and additional information, let me know.
> >
> > Links:
> > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/char/ipmi/ipmi_msghandler.c?h=linux-6.12.y#n1899
> >
> > [  499.564572] [  T27255] ip6_tunnel: pni_gre_814 xmit: Local address not yet configured!
> > [  499.588176] [  T27255] ip6_tunnel: pni_gre_868 xmit: Local address not yet configured!
> > [  499.605284] [  T27255] ip6_tunnel: pni_gre_871 xmit: Local address not yet configured!
> > [  805.906999] [  T12765] usb 1-1: USB disconnect, device number 2
> > [  845.346020] [  T12765] usb 1-1: new high-speed USB device number 3 using xhci_hcd
> > [  845.485453] [  T12765] usb 1-1: New USB device found, idVendor=1d6b, idProduct=0107, bcdDevice= 1.00
> > [  845.496823] [  T12765] usb 1-1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> > [  845.507242] [  T12765] usb 1-1: Product: USB Virtual Hub
> > [  845.514946] [  T12765] usb 1-1: Manufacturer: Aspeed
> > [  845.522363] [  T12765] usb 1-1: SerialNumber: 00000000
> > [  845.530454] [  T12765] usb 1-1: Device is not authorized for usage
> > [  853.774910] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.783794] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.792649] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.801461] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.810291] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.819069] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.827816] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.836581] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.845326] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.854074] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.862813] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  863.934436] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  863.943420] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  863.952363] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  863.961296] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  878.616336] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  878.624905] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  878.633427] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  878.641954] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  880.310112] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  880.318682] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  880.327083] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  880.335483] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  904.196122] [     C33] watchdog: Watchdog detected hard LOCKUP on cpu 33
> > [  904.196127] [     C97] Uhhuh. NMI received for unknown reason 3d on CPU 97.
> > [  904.196126] [      C6] Uhhuh. NMI received for unknown reason 3d on CPU 6.
> > [  904.196130] [     C33] Modules linked in:
> > [  904.196129] [    C101] Uhhuh. NMI received for unknown reason 3d on CPU 101.
> > [  904.196131] [     C97] Dazed and confused, but trying to continue
> > [  904.196131] [     C33]  nft_fwd_netdev
> > [  904.196131] [     C99] Uhhuh. NMI received for unknown reason 2d on CPU 99.
> > [  904.196133] [      C6] Dazed and confused, but trying to continue
> > [  904.196133] [    C102] Uhhuh. NMI received for unknown reason 2d on CPU 102.
> > [  904.196134] [     C33]  nf_dup_netdev
> > [  904.196134] [     C35] Uhhuh. NMI received for unknown reason 2d on CPU 35.
> > [  904.196135] [    C101] Dazed and confused, but trying to continue
> > [  904.196137] [     C99] Dazed and confused, but trying to continue
> > [  904.196137] [     C33]  xfrm_interface
> > [  904.196136] [     C69] Uhhuh. NMI received for unknown reason 2d on CPU 69.
> > [  904.196140] [    C102] Dazed and confused, but trying to continue
> > [  904.196140] [     C33]  xfrm6_tunnel
> > [  904.196138] [    C121] Uhhuh. NMI received for unknown reason 2d on CPU 121.
> > [  904.196140] [    C123] Uhhuh. NMI received for unknown reason 2d on CPU 123.
> > [  904.196142] [     C35] Dazed and confused, but trying to continue
> > [  904.196143] [     C69] Dazed and confused, but trying to continue
> > [  904.196143] [     C33]  nft_numgen
> > [  904.196143] [     C61] Uhhuh. NMI received for unknown reason 2d on CPU 61.
> > [  904.196144] [     C62] Uhhuh. NMI received for unknown reason 3d on CPU 62.
> > [  904.196146] [    C123] Dazed and confused, but trying to continue
> > [  904.196147] [    C121] Dazed and confused, but trying to continue
> > [  904.196148] [     C58] Dazed and confused, but trying to continue
> > [  904.196150] [     C33]  nft_log nft_limit sit dummy ipip tunnel4 ip_gre gre xfrm_user xfrm_algo tls mpls_iptunnel mpls_router nft_ct nf_tables iptable_raw iptable_nat iptable_mangle ipt_REJECT nf_reject_ipv4 ip6table_security xt_CT ip6table_raw xt_nat ip6table_nat nf_nat xt_TCPMSS xt_owner xt_DSCP xt_NFLOG xt_connbytes xt_connlabel xt_statistic xt_connmark ip6table_mangle xt_limit xt_LOG nf_log_syslog xt_mark xt_conntrack ip6t_REJECT nf_reject_ipv6 xt_multiport xt_set xt_tcpmss xt_comment xt_tcpudp ip6table_filter ip6_tables nfnetlink_log udp_diag dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio iptable_filter veth tcp_diag inet_diag mpls_gso act_mpls cls_flower cls_bpf sch_ingress ip_set_hash_ip ip_set_hash_net ip_set tcp_bbr sch_fq tun xt_bpf nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 fou6 fou ip_tunnel ip6_udp_tunnel udp_tunnel ip6_tunnel tunnel6 nvme_fabrics raid0 md_mod essiv dm_crypt trusted asn1_encoder tee dm_mod dax 8021q garp mrp stp llc ipmi_ssif amd64_edac kvm_amd kvm irqbypass crc32_pclmul crc32c_intel
> > [  904.196247] [     C33]  sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd xhci_pci binfmt_misc acpi_ipmi cryptd ipmi_si nvme rapl ipmi_devintf i2c_piix4 tiny_power_button bnxt_en xhci_hcd nvme_core ccp i2c_smbus ipmi_msghandler button fuse configfs nfnetlink efivarfs ip_tables x_tables bcmcrypt(O)
> > [  904.196281] [     C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G           O       6.12.34-cloudflare-2025.6.9 #1
> > [  904.196286] [     C33] Tainted: [O]=OOT_MODULE
> > [  904.196287] [     C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
> > [  904.196290] [     C33] RIP: 0010:io_idle+0x3/0x30
> > [  904.196298] [     C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
> > [  904.196301] [     C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
> > [  904.196304] [     C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
> > [  904.196306] [     C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
> > [  904.196308] [     C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
> > [  904.196309] [     C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
> > [  904.196311] [     C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
> > [  904.196313] [     C33] FS:  0000000000000000(0000) GS:ffff8aadcf680000(0000) knlGS:0000000000000000
> > [  904.196316] [     C33] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  904.196318] [     C33] CR2: 00005632ce239000 CR3: 0000003944f72004 CR4: 0000000000770ef0
> > [  904.196320] [     C33] PKRU: 55555554
> > [  904.196322] [     C33] Call Trace:
> > [  904.196324] [     C33]  <TASK>
> > [  904.196326] [     C33]  acpi_idle_do_entry+0x22/0x50
> > [  904.196336] [     C33]  acpi_idle_enter+0x7b/0xd0
> > [  904.196340] [     C33]  cpuidle_enter_state+0x79/0x420
> > [  904.196345] [     C33]  cpuidle_enter+0x2d/0x40
> > [  904.196352] [     C33]  do_idle+0x176/0x1c0
> > [  904.196358] [     C33]  cpu_startup_entry+0x29/0x30
> > [  904.196362] [     C33]  start_secondary+0xf7/0x100
> > [  904.196366] [     C33]  common_startup_64+0x13e/0x141
> > [  904.196374] [     C33]  </TASK>
> > [  904.196377] [     C33] Kernel panic - not syncing: Hard LOCKUP
> > [  904.196379] [     C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G           O       6.12.34-cloudflare-2025.6.9 #1
> > [  904.196383] [     C33] Tainted: [O]=OOT_MODULE
> > [  904.196384] [     C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
> > [  904.196385] [     C33] Call Trace:
> > [  904.196387] [     C33]  <NMI>
> > [  904.196389] [     C33]  dump_stack_lvl+0x4b/0x70
> > [  904.196394] [     C33]  panic+0x106/0x2c4
> > [  904.196401] [     C33]  nmi_panic.cold+0xc/0xc
> > [  904.196404] [     C33]  watchdog_hardlockup_check.cold+0xc6/0xe8
> > [  904.196409] [     C33]  __perf_event_overflow+0x15a/0x450
> > [  904.196416] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > [  904.196421] [     C33]  x86_pmu_handle_irq+0x18a/0x1c0
> > [  904.196436] [     C33]  ? set_pte_vaddr+0x40/0x50
> > [  904.196439] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > [  904.196442] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > [  904.196445] [     C33]  ? native_set_fixmap+0x63/0xb0
> > [  904.196448] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > [  904.196451] [     C33]  ? ghes_copy_tofrom_phys+0x7a/0x100
> > [  904.196457] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > [  904.196460] [     C33]  ? __ghes_peek_estatus.isra.0+0x49/0xa0
> > [  904.196465] [     C33]  amd_pmu_handle_irq+0x4b/0xc0
> > [  904.196469] [     C33]  perf_event_nmi_handler+0x2a/0x50
> > [  904.196473] [     C33]  nmi_handle.part.0+0x59/0x110
> > [  904.196479] [     C33]  default_do_nmi+0x127/0x180
> > [  904.196483] [     C33]  exc_nmi+0x103/0x180
> > [  904.196486] [     C33]  end_repeat_nmi+0xf/0x53
> > [  904.196489] [     C33] RIP: 0010:io_idle+0x3/0x30
> > [  904.196493] [     C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
> > [  904.196495] [     C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
> > [  904.196497] [     C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
> > [  904.196499] [     C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
> > [  904.196501] [     C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
> > [  904.196502] [     C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
> > [  904.196504] [     C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
> > [  904.196510] [     C33]  ? io_idle+0x3/0x30
> > [  904.196515] [     C33]  ? io_idle+0x3/0x30
> > [  904.196519] [     C33]  </NMI>
> > [  904.196520] [     C33]  <TASK>
> > [  904.196521] [     C33]  acpi_idle_do_entry+0x22/0x50
> > [  904.196526] [     C33]  acpi_idle_enter+0x7b/0xd0
> > [  904.196529] [     C33]  cpuidle_enter_state+0x79/0x420
> > [  904.196535] [     C33]  cpuidle_enter+0x2d/0x40
> > [  904.196539] [     C33]  do_idle+0x176/0x1c0
> > [  904.196544] [     C33]  cpu_startup_entry+0x29/0x30
> > [  904.196548] [     C33]  start_secondary+0xf7/0x100
> > [  904.196552] [     C33]  common_startup_64+0x13e/0x141
> > [  904.196559] [     C33]  </TASK>
> >
> > Best, Fred

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] ipmi_si: watchdog: Watchdog detected hard LOCKUP
  2025-08-06 20:19 ` Fred Lawler
  2025-08-06 20:39   ` Corey Minyard
@ 2025-08-06 21:16   ` Corey Minyard
  2025-08-06 21:36     ` Frederick Lawler
  1 sibling, 1 reply; 8+ messages in thread
From: Corey Minyard @ 2025-08-06 21:16 UTC (permalink / raw)
  To: Fred Lawler; +Cc: openipmi-developer, linux-kernel, kernel-team

On Wed, Aug 06, 2025 at 03:19:02PM -0500, Fred Lawler wrote:
> + CC: Corey Minyard <corey@minyard.net>
> 
> Misspelled the email address.
> 
> On Wed, Aug 6, 2025 at 3:14 PM Frederick Lawler <fred@cloudflare.com> wrote:
> >
> > Hi Corey,
> >
> > In kernel 6.12.y, while resetting the BMC, we can sometimes hit a hard LOCKUP
> > watchdog event, especially so while querying the BMC for basic device
> > information via sysfs.
> >
> > I havn't been able to create a consistent reproducer yet, but I suspect
> > that these occur during high traffic, BMC is resetting, and reading
> > from the sysfs files in parallel. We're also using KCS to interface
> > with the BMC.
> >
> > I can consistently reproduce hung tasks trivially with the following,
> > during a BMC reset:
> >
> > while true; do cat aux_firmware_revision &>/dev/null; done &

I looked through the code around this and didn't see anything obvious.

Looking at the logs a bit more:

[  904.196122] [     C33] watchdog: Watchdog detected hard LOCKUP on cpu 33
[  904.196127] [     C97] Uhhuh. NMI received for unknown reason 3d on CPU 97.

So there's a lockup on CPU 33, but the backtrace shows it's in the ACPI
idle function.  That's wierd.  For that to really happen, it would have
to lock up in ACPI.

The unknown NMIs are also wierd.  Are you using the IPMI NMI watchdog?

I'm wondering if something is happening with the BMC resetting and
interactions with ACPI involved in that.  Adding the extra part of
trying to talk to the BMC while it's being reset could cause the BMC to
get confused and do bad things?

> >
> > I tried also tried to load the CPUs with stress-ng, but the best I can do
> > are the hung tasks.
> >
> > I identified that sni_send()[1] could be locked behind the
> > spin_lock_irqsave() and within the KCS send handler, there's another irq
> > save lock. I suspect this is where we're getting hung up. Below is a
> > sample stack trace + log output.

Yeah, I don't see that in the traceback.  There is a lock in the KCS
sender, but I don't see how that could do anything.

Maybe you could try changing the cpuidle handler?  That would be at
least something to try.

-corey

> >
> > I'm happy to provide traces and additional information, let me know.
> >
> > Links:
> > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/char/ipmi/ipmi_msghandler.c?h=linux-6.12.y#n1899
> >
> > [  499.564572] [  T27255] ip6_tunnel: pni_gre_814 xmit: Local address not yet configured!
> > [  499.588176] [  T27255] ip6_tunnel: pni_gre_868 xmit: Local address not yet configured!
> > [  499.605284] [  T27255] ip6_tunnel: pni_gre_871 xmit: Local address not yet configured!
> > [  805.906999] [  T12765] usb 1-1: USB disconnect, device number 2
> > [  845.346020] [  T12765] usb 1-1: new high-speed USB device number 3 using xhci_hcd
> > [  845.485453] [  T12765] usb 1-1: New USB device found, idVendor=1d6b, idProduct=0107, bcdDevice= 1.00
> > [  845.496823] [  T12765] usb 1-1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> > [  845.507242] [  T12765] usb 1-1: Product: USB Virtual Hub
> > [  845.514946] [  T12765] usb 1-1: Manufacturer: Aspeed
> > [  845.522363] [  T12765] usb 1-1: SerialNumber: 00000000
> > [  845.530454] [  T12765] usb 1-1: Device is not authorized for usage
> > [  853.774910] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.783794] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.792649] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.801461] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.810291] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.819069] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.827816] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.836581] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.845326] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.854074] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  853.862813] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > [  863.934436] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  863.943420] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  863.952363] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  863.961296] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  878.616336] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  878.624905] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  878.633427] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  878.641954] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  880.310112] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  880.318682] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  880.327083] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  880.335483] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > [  904.196122] [     C33] watchdog: Watchdog detected hard LOCKUP on cpu 33
> > [  904.196127] [     C97] Uhhuh. NMI received for unknown reason 3d on CPU 97.
> > [  904.196126] [      C6] Uhhuh. NMI received for unknown reason 3d on CPU 6.
> > [  904.196130] [     C33] Modules linked in:
> > [  904.196129] [    C101] Uhhuh. NMI received for unknown reason 3d on CPU 101.
> > [  904.196131] [     C97] Dazed and confused, but trying to continue
> > [  904.196131] [     C33]  nft_fwd_netdev
> > [  904.196131] [     C99] Uhhuh. NMI received for unknown reason 2d on CPU 99.
> > [  904.196133] [      C6] Dazed and confused, but trying to continue
> > [  904.196133] [    C102] Uhhuh. NMI received for unknown reason 2d on CPU 102.
> > [  904.196134] [     C33]  nf_dup_netdev
> > [  904.196134] [     C35] Uhhuh. NMI received for unknown reason 2d on CPU 35.
> > [  904.196135] [    C101] Dazed and confused, but trying to continue
> > [  904.196137] [     C99] Dazed and confused, but trying to continue
> > [  904.196137] [     C33]  xfrm_interface
> > [  904.196136] [     C69] Uhhuh. NMI received for unknown reason 2d on CPU 69.
> > [  904.196140] [    C102] Dazed and confused, but trying to continue
> > [  904.196140] [     C33]  xfrm6_tunnel
> > [  904.196138] [    C121] Uhhuh. NMI received for unknown reason 2d on CPU 121.
> > [  904.196140] [    C123] Uhhuh. NMI received for unknown reason 2d on CPU 123.
> > [  904.196142] [     C35] Dazed and confused, but trying to continue
> > [  904.196143] [     C69] Dazed and confused, but trying to continue
> > [  904.196143] [     C33]  nft_numgen
> > [  904.196143] [     C61] Uhhuh. NMI received for unknown reason 2d on CPU 61.
> > [  904.196144] [     C62] Uhhuh. NMI received for unknown reason 3d on CPU 62.
> > [  904.196146] [    C123] Dazed and confused, but trying to continue
> > [  904.196147] [    C121] Dazed and confused, but trying to continue
> > [  904.196148] [     C58] Dazed and confused, but trying to continue
> > [  904.196150] [     C33]  nft_log nft_limit sit dummy ipip tunnel4 ip_gre gre xfrm_user xfrm_algo tls mpls_iptunnel mpls_router nft_ct nf_tables iptable_raw iptable_nat iptable_mangle ipt_REJECT nf_reject_ipv4 ip6table_security xt_CT ip6table_raw xt_nat ip6table_nat nf_nat xt_TCPMSS xt_owner xt_DSCP xt_NFLOG xt_connbytes xt_connlabel xt_statistic xt_connmark ip6table_mangle xt_limit xt_LOG nf_log_syslog xt_mark xt_conntrack ip6t_REJECT nf_reject_ipv6 xt_multiport xt_set xt_tcpmss xt_comment xt_tcpudp ip6table_filter ip6_tables nfnetlink_log udp_diag dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio iptable_filter veth tcp_diag inet_diag mpls_gso act_mpls cls_flower cls_bpf sch_ingress ip_set_hash_ip ip_set_hash_net ip_set tcp_bbr sch_fq tun xt_bpf nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 fou6 fou ip_tunnel ip6_udp_tunnel udp_tunnel ip6_tunnel tunnel6 nvme_fabrics raid0 md_mod essiv dm_crypt trusted asn1_encoder tee dm_mod dax 8021q garp mrp stp llc ipmi_ssif amd64_edac kvm_amd kvm irqbypass crc32_pclmul crc32c_intel
> > [  904.196247] [     C33]  sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd xhci_pci binfmt_misc acpi_ipmi cryptd ipmi_si nvme rapl ipmi_devintf i2c_piix4 tiny_power_button bnxt_en xhci_hcd nvme_core ccp i2c_smbus ipmi_msghandler button fuse configfs nfnetlink efivarfs ip_tables x_tables bcmcrypt(O)
> > [  904.196281] [     C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G           O       6.12.34-cloudflare-2025.6.9 #1
> > [  904.196286] [     C33] Tainted: [O]=OOT_MODULE
> > [  904.196287] [     C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
> > [  904.196290] [     C33] RIP: 0010:io_idle+0x3/0x30
> > [  904.196298] [     C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
> > [  904.196301] [     C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
> > [  904.196304] [     C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
> > [  904.196306] [     C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
> > [  904.196308] [     C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
> > [  904.196309] [     C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
> > [  904.196311] [     C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
> > [  904.196313] [     C33] FS:  0000000000000000(0000) GS:ffff8aadcf680000(0000) knlGS:0000000000000000
> > [  904.196316] [     C33] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  904.196318] [     C33] CR2: 00005632ce239000 CR3: 0000003944f72004 CR4: 0000000000770ef0
> > [  904.196320] [     C33] PKRU: 55555554
> > [  904.196322] [     C33] Call Trace:
> > [  904.196324] [     C33]  <TASK>
> > [  904.196326] [     C33]  acpi_idle_do_entry+0x22/0x50
> > [  904.196336] [     C33]  acpi_idle_enter+0x7b/0xd0
> > [  904.196340] [     C33]  cpuidle_enter_state+0x79/0x420
> > [  904.196345] [     C33]  cpuidle_enter+0x2d/0x40
> > [  904.196352] [     C33]  do_idle+0x176/0x1c0
> > [  904.196358] [     C33]  cpu_startup_entry+0x29/0x30
> > [  904.196362] [     C33]  start_secondary+0xf7/0x100
> > [  904.196366] [     C33]  common_startup_64+0x13e/0x141
> > [  904.196374] [     C33]  </TASK>
> > [  904.196377] [     C33] Kernel panic - not syncing: Hard LOCKUP
> > [  904.196379] [     C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G           O       6.12.34-cloudflare-2025.6.9 #1
> > [  904.196383] [     C33] Tainted: [O]=OOT_MODULE
> > [  904.196384] [     C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
> > [  904.196385] [     C33] Call Trace:
> > [  904.196387] [     C33]  <NMI>
> > [  904.196389] [     C33]  dump_stack_lvl+0x4b/0x70
> > [  904.196394] [     C33]  panic+0x106/0x2c4
> > [  904.196401] [     C33]  nmi_panic.cold+0xc/0xc
> > [  904.196404] [     C33]  watchdog_hardlockup_check.cold+0xc6/0xe8
> > [  904.196409] [     C33]  __perf_event_overflow+0x15a/0x450
> > [  904.196416] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > [  904.196421] [     C33]  x86_pmu_handle_irq+0x18a/0x1c0
> > [  904.196436] [     C33]  ? set_pte_vaddr+0x40/0x50
> > [  904.196439] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > [  904.196442] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > [  904.196445] [     C33]  ? native_set_fixmap+0x63/0xb0
> > [  904.196448] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > [  904.196451] [     C33]  ? ghes_copy_tofrom_phys+0x7a/0x100
> > [  904.196457] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > [  904.196460] [     C33]  ? __ghes_peek_estatus.isra.0+0x49/0xa0
> > [  904.196465] [     C33]  amd_pmu_handle_irq+0x4b/0xc0
> > [  904.196469] [     C33]  perf_event_nmi_handler+0x2a/0x50
> > [  904.196473] [     C33]  nmi_handle.part.0+0x59/0x110
> > [  904.196479] [     C33]  default_do_nmi+0x127/0x180
> > [  904.196483] [     C33]  exc_nmi+0x103/0x180
> > [  904.196486] [     C33]  end_repeat_nmi+0xf/0x53
> > [  904.196489] [     C33] RIP: 0010:io_idle+0x3/0x30
> > [  904.196493] [     C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
> > [  904.196495] [     C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
> > [  904.196497] [     C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
> > [  904.196499] [     C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
> > [  904.196501] [     C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
> > [  904.196502] [     C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
> > [  904.196504] [     C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
> > [  904.196510] [     C33]  ? io_idle+0x3/0x30
> > [  904.196515] [     C33]  ? io_idle+0x3/0x30
> > [  904.196519] [     C33]  </NMI>
> > [  904.196520] [     C33]  <TASK>
> > [  904.196521] [     C33]  acpi_idle_do_entry+0x22/0x50
> > [  904.196526] [     C33]  acpi_idle_enter+0x7b/0xd0
> > [  904.196529] [     C33]  cpuidle_enter_state+0x79/0x420
> > [  904.196535] [     C33]  cpuidle_enter+0x2d/0x40
> > [  904.196539] [     C33]  do_idle+0x176/0x1c0
> > [  904.196544] [     C33]  cpu_startup_entry+0x29/0x30
> > [  904.196548] [     C33]  start_secondary+0xf7/0x100
> > [  904.196552] [     C33]  common_startup_64+0x13e/0x141
> > [  904.196559] [     C33]  </TASK>
> >
> > Best, Fred

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] ipmi_si: watchdog: Watchdog detected hard LOCKUP
  2025-08-06 21:16   ` Corey Minyard
@ 2025-08-06 21:36     ` Frederick Lawler
  2025-08-06 22:51       ` Corey Minyard
  0 siblings, 1 reply; 8+ messages in thread
From: Frederick Lawler @ 2025-08-06 21:36 UTC (permalink / raw)
  To: Corey Minyard; +Cc: openipmi-developer, linux-kernel, kernel-team

On Wed, Aug 06, 2025 at 04:16:18PM -0500, Corey Minyard wrote:
> On Wed, Aug 06, 2025 at 03:19:02PM -0500, Fred Lawler wrote:
> > + CC: Corey Minyard <corey@minyard.net>
> > 
> > Misspelled the email address.
> > 
> > On Wed, Aug 6, 2025 at 3:14 PM Frederick Lawler <fred@cloudflare.com> wrote:
> > >
> > > Hi Corey,
> > >
> > > In kernel 6.12.y, while resetting the BMC, we can sometimes hit a hard LOCKUP
> > > watchdog event, especially so while querying the BMC for basic device
> > > information via sysfs.
> > >
> > > I havn't been able to create a consistent reproducer yet, but I suspect
> > > that these occur during high traffic, BMC is resetting, and reading
> > > from the sysfs files in parallel. We're also using KCS to interface
> > > with the BMC.
> > >
> > > I can consistently reproduce hung tasks trivially with the following,
> > > during a BMC reset:
> > >
> > > while true; do cat aux_firmware_revision &>/dev/null; done &
> 
> I looked through the code around this and didn't see anything obvious.
> 
> Looking at the logs a bit more:
> 
> [  904.196122] [     C33] watchdog: Watchdog detected hard LOCKUP on cpu 33
> [  904.196127] [     C97] Uhhuh. NMI received for unknown reason 3d on CPU 97.
> 
> So there's a lockup on CPU 33, but the backtrace shows it's in the ACPI
> idle function.  That's wierd.  For that to really happen, it would have
> to lock up in ACPI.
> 
> The unknown NMIs are also wierd.  Are you using the IPMI NMI watchdog?
>

It looks like we do have the module enabled in our config, but we aren't
loading it:

fred@251m103:~$ lsmod | grep -i ipmi
ipmi_ssif              40960  0
acpi_ipmi              16384  0
ipmi_si                81920  1
ipmi_devintf           16384  0
ipmi_msghandler        81920  4 ipmi_devintf,ipmi_si,acpi_ipmi,ipmi_ssif

> I'm wondering if something is happening with the BMC resetting and
> interactions with ACPI involved in that.  Adding the extra part of
> trying to talk to the BMC while it's being reset could cause the BMC to
> get confused and do bad things?
> 

Sure, it's a possibility we explored. We have a lot of automation.
Predominately of which is a prometheus module exporting IPMI information
from the sysfs files. And we also have config management that's querying
sysfs files to regulate updates etc... Sometimes, the config management
automation will attempt to reset the BMC.

> > >
> > > I tried also tried to load the CPUs with stress-ng, but the best I can do
> > > are the hung tasks.
> > >
> > > I identified that sni_send()[1] could be locked behind the
> > > spin_lock_irqsave() and within the KCS send handler, there's another irq
> > > save lock. I suspect this is where we're getting hung up. Below is a
> > > sample stack trace + log output.
> 
> Yeah, I don't see that in the traceback.  There is a lock in the KCS
> sender, but I don't see how that could do anything.
> 
> Maybe you could try changing the cpuidle handler?  That would be at
> least something to try.
> 

Would that help in forming a reproducer? I'd need to deploy any kernel
modifications fleet wide to cast a wide enough net. The lockups arn't
extremely consistent. We may get a couple or more a week.

Lastly, I have the rate limit patch backported. I'll be able to start
testing with that tomorrow, and same with loading the IPMI watchdog
module.

> -corey
> 
> > >
> > > I'm happy to provide traces and additional information, let me know.
> > >
> > > Links:
> > > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/char/ipmi/ipmi_msghandler.c?h=linux-6.12.y#n1899
> > >
> > > [  499.564572] [  T27255] ip6_tunnel: pni_gre_814 xmit: Local address not yet configured!
> > > [  499.588176] [  T27255] ip6_tunnel: pni_gre_868 xmit: Local address not yet configured!
> > > [  499.605284] [  T27255] ip6_tunnel: pni_gre_871 xmit: Local address not yet configured!
> > > [  805.906999] [  T12765] usb 1-1: USB disconnect, device number 2
> > > [  845.346020] [  T12765] usb 1-1: new high-speed USB device number 3 using xhci_hcd
> > > [  845.485453] [  T12765] usb 1-1: New USB device found, idVendor=1d6b, idProduct=0107, bcdDevice= 1.00
> > > [  845.496823] [  T12765] usb 1-1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> > > [  845.507242] [  T12765] usb 1-1: Product: USB Virtual Hub
> > > [  845.514946] [  T12765] usb 1-1: Manufacturer: Aspeed
> > > [  845.522363] [  T12765] usb 1-1: SerialNumber: 00000000
> > > [  845.530454] [  T12765] usb 1-1: Device is not authorized for usage
> > > [  853.774910] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > [  853.783794] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > [  853.792649] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > [  853.801461] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > [  853.810291] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > [  853.819069] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > [  853.827816] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > [  853.836581] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > [  853.845326] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > [  853.854074] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > [  853.862813] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > [  863.934436] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > > [  863.943420] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > > [  863.952363] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > > [  863.961296] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > > [  878.616336] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > > [  878.624905] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > > [  878.633427] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > > [  878.641954] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > > [  880.310112] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > > [  880.318682] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > > [  880.327083] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > > [  880.335483] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > > [  904.196122] [     C33] watchdog: Watchdog detected hard LOCKUP on cpu 33
> > > [  904.196127] [     C97] Uhhuh. NMI received for unknown reason 3d on CPU 97.
> > > [  904.196126] [      C6] Uhhuh. NMI received for unknown reason 3d on CPU 6.
> > > [  904.196130] [     C33] Modules linked in:
> > > [  904.196129] [    C101] Uhhuh. NMI received for unknown reason 3d on CPU 101.
> > > [  904.196131] [     C97] Dazed and confused, but trying to continue
> > > [  904.196131] [     C33]  nft_fwd_netdev
> > > [  904.196131] [     C99] Uhhuh. NMI received for unknown reason 2d on CPU 99.
> > > [  904.196133] [      C6] Dazed and confused, but trying to continue
> > > [  904.196133] [    C102] Uhhuh. NMI received for unknown reason 2d on CPU 102.
> > > [  904.196134] [     C33]  nf_dup_netdev
> > > [  904.196134] [     C35] Uhhuh. NMI received for unknown reason 2d on CPU 35.
> > > [  904.196135] [    C101] Dazed and confused, but trying to continue
> > > [  904.196137] [     C99] Dazed and confused, but trying to continue
> > > [  904.196137] [     C33]  xfrm_interface
> > > [  904.196136] [     C69] Uhhuh. NMI received for unknown reason 2d on CPU 69.
> > > [  904.196140] [    C102] Dazed and confused, but trying to continue
> > > [  904.196140] [     C33]  xfrm6_tunnel
> > > [  904.196138] [    C121] Uhhuh. NMI received for unknown reason 2d on CPU 121.
> > > [  904.196140] [    C123] Uhhuh. NMI received for unknown reason 2d on CPU 123.
> > > [  904.196142] [     C35] Dazed and confused, but trying to continue
> > > [  904.196143] [     C69] Dazed and confused, but trying to continue
> > > [  904.196143] [     C33]  nft_numgen
> > > [  904.196143] [     C61] Uhhuh. NMI received for unknown reason 2d on CPU 61.
> > > [  904.196144] [     C62] Uhhuh. NMI received for unknown reason 3d on CPU 62.
> > > [  904.196146] [    C123] Dazed and confused, but trying to continue
> > > [  904.196147] [    C121] Dazed and confused, but trying to continue
> > > [  904.196148] [     C58] Dazed and confused, but trying to continue
> > > [  904.196150] [     C33]  nft_log nft_limit sit dummy ipip tunnel4 ip_gre gre xfrm_user xfrm_algo tls mpls_iptunnel mpls_router nft_ct nf_tables iptable_raw iptable_nat iptable_mangle ipt_REJECT nf_reject_ipv4 ip6table_security xt_CT ip6table_raw xt_nat ip6table_nat nf_nat xt_TCPMSS xt_owner xt_DSCP xt_NFLOG xt_connbytes xt_connlabel xt_statistic xt_connmark ip6table_mangle xt_limit xt_LOG nf_log_syslog xt_mark xt_conntrack ip6t_REJECT nf_reject_ipv6 xt_multiport xt_set xt_tcpmss xt_comment xt_tcpudp ip6table_filter ip6_tables nfnetlink_log udp_diag dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio iptable_filter veth tcp_diag inet_diag mpls_gso act_mpls cls_flower cls_bpf sch_ingress ip_set_hash_ip ip_set_hash_net ip_set tcp_bbr sch_fq tun xt_bpf nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 fou6 fou ip_tunnel ip6_udp_tunnel udp_tunnel ip6_tunnel tunnel6 nvme_fabrics raid0 md_mod essiv dm_crypt trusted asn1_encoder tee dm_mod dax 8021q garp mrp stp llc ipmi_ssif amd64_edac kvm_amd kvm irqbypass crc32_pclmul crc32c_intel
> > > [  904.196247] [     C33]  sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd xhci_pci binfmt_misc acpi_ipmi cryptd ipmi_si nvme rapl ipmi_devintf i2c_piix4 tiny_power_button bnxt_en xhci_hcd nvme_core ccp i2c_smbus ipmi_msghandler button fuse configfs nfnetlink efivarfs ip_tables x_tables bcmcrypt(O)
> > > [  904.196281] [     C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G           O       6.12.34-cloudflare-2025.6.9 #1
> > > [  904.196286] [     C33] Tainted: [O]=OOT_MODULE
> > > [  904.196287] [     C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
> > > [  904.196290] [     C33] RIP: 0010:io_idle+0x3/0x30
> > > [  904.196298] [     C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
> > > [  904.196301] [     C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
> > > [  904.196304] [     C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
> > > [  904.196306] [     C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
> > > [  904.196308] [     C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
> > > [  904.196309] [     C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
> > > [  904.196311] [     C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
> > > [  904.196313] [     C33] FS:  0000000000000000(0000) GS:ffff8aadcf680000(0000) knlGS:0000000000000000
> > > [  904.196316] [     C33] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [  904.196318] [     C33] CR2: 00005632ce239000 CR3: 0000003944f72004 CR4: 0000000000770ef0
> > > [  904.196320] [     C33] PKRU: 55555554
> > > [  904.196322] [     C33] Call Trace:
> > > [  904.196324] [     C33]  <TASK>
> > > [  904.196326] [     C33]  acpi_idle_do_entry+0x22/0x50
> > > [  904.196336] [     C33]  acpi_idle_enter+0x7b/0xd0
> > > [  904.196340] [     C33]  cpuidle_enter_state+0x79/0x420
> > > [  904.196345] [     C33]  cpuidle_enter+0x2d/0x40
> > > [  904.196352] [     C33]  do_idle+0x176/0x1c0
> > > [  904.196358] [     C33]  cpu_startup_entry+0x29/0x30
> > > [  904.196362] [     C33]  start_secondary+0xf7/0x100
> > > [  904.196366] [     C33]  common_startup_64+0x13e/0x141
> > > [  904.196374] [     C33]  </TASK>
> > > [  904.196377] [     C33] Kernel panic - not syncing: Hard LOCKUP
> > > [  904.196379] [     C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G           O       6.12.34-cloudflare-2025.6.9 #1
> > > [  904.196383] [     C33] Tainted: [O]=OOT_MODULE
> > > [  904.196384] [     C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
> > > [  904.196385] [     C33] Call Trace:
> > > [  904.196387] [     C33]  <NMI>
> > > [  904.196389] [     C33]  dump_stack_lvl+0x4b/0x70
> > > [  904.196394] [     C33]  panic+0x106/0x2c4
> > > [  904.196401] [     C33]  nmi_panic.cold+0xc/0xc
> > > [  904.196404] [     C33]  watchdog_hardlockup_check.cold+0xc6/0xe8
> > > [  904.196409] [     C33]  __perf_event_overflow+0x15a/0x450
> > > [  904.196416] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > > [  904.196421] [     C33]  x86_pmu_handle_irq+0x18a/0x1c0
> > > [  904.196436] [     C33]  ? set_pte_vaddr+0x40/0x50
> > > [  904.196439] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > > [  904.196442] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > > [  904.196445] [     C33]  ? native_set_fixmap+0x63/0xb0
> > > [  904.196448] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > > [  904.196451] [     C33]  ? ghes_copy_tofrom_phys+0x7a/0x100
> > > [  904.196457] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > > [  904.196460] [     C33]  ? __ghes_peek_estatus.isra.0+0x49/0xa0
> > > [  904.196465] [     C33]  amd_pmu_handle_irq+0x4b/0xc0
> > > [  904.196469] [     C33]  perf_event_nmi_handler+0x2a/0x50
> > > [  904.196473] [     C33]  nmi_handle.part.0+0x59/0x110
> > > [  904.196479] [     C33]  default_do_nmi+0x127/0x180
> > > [  904.196483] [     C33]  exc_nmi+0x103/0x180
> > > [  904.196486] [     C33]  end_repeat_nmi+0xf/0x53
> > > [  904.196489] [     C33] RIP: 0010:io_idle+0x3/0x30
> > > [  904.196493] [     C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
> > > [  904.196495] [     C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
> > > [  904.196497] [     C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
> > > [  904.196499] [     C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
> > > [  904.196501] [     C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
> > > [  904.196502] [     C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
> > > [  904.196504] [     C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
> > > [  904.196510] [     C33]  ? io_idle+0x3/0x30
> > > [  904.196515] [     C33]  ? io_idle+0x3/0x30
> > > [  904.196519] [     C33]  </NMI>
> > > [  904.196520] [     C33]  <TASK>
> > > [  904.196521] [     C33]  acpi_idle_do_entry+0x22/0x50
> > > [  904.196526] [     C33]  acpi_idle_enter+0x7b/0xd0
> > > [  904.196529] [     C33]  cpuidle_enter_state+0x79/0x420
> > > [  904.196535] [     C33]  cpuidle_enter+0x2d/0x40
> > > [  904.196539] [     C33]  do_idle+0x176/0x1c0
> > > [  904.196544] [     C33]  cpu_startup_entry+0x29/0x30
> > > [  904.196548] [     C33]  start_secondary+0xf7/0x100
> > > [  904.196552] [     C33]  common_startup_64+0x13e/0x141
> > > [  904.196559] [     C33]  </TASK>
> > >
> > > Best, Fred

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] ipmi_si: watchdog: Watchdog detected hard LOCKUP
  2025-08-06 21:36     ` Frederick Lawler
@ 2025-08-06 22:51       ` Corey Minyard
  2025-08-07 19:43         ` Frederick Lawler
  0 siblings, 1 reply; 8+ messages in thread
From: Corey Minyard @ 2025-08-06 22:51 UTC (permalink / raw)
  To: Frederick Lawler; +Cc: openipmi-developer, linux-kernel, kernel-team

On Wed, Aug 06, 2025 at 04:36:41PM -0500, Frederick Lawler wrote:
> On Wed, Aug 06, 2025 at 04:16:18PM -0500, Corey Minyard wrote:
> > On Wed, Aug 06, 2025 at 03:19:02PM -0500, Fred Lawler wrote:
> > > + CC: Corey Minyard <corey@minyard.net>
> > > 
> 
> > I'm wondering if something is happening with the BMC resetting and
> > interactions with ACPI involved in that.  Adding the extra part of
> > trying to talk to the BMC while it's being reset could cause the BMC to
> > get confused and do bad things?
> > 
> 
> Sure, it's a possibility we explored. We have a lot of automation.
> Predominately of which is a prometheus module exporting IPMI information
> from the sysfs files. And we also have config management that's querying
> sysfs files to regulate updates etc... Sometimes, the config management
> automation will attempt to reset the BMC.

Ok.  I have tests that do BMC resets, but I can't run at the scale you
do, and I'm running in a simulator so it's not going to be have the
same.

The other possibility is the processor goes into the idle code while
interrupts are off, but I think the kernel has checks all around that.
I can't think of how else a processor would get stuck in idle.

> 
> > > >
> > > > I tried also tried to load the CPUs with stress-ng, but the best I can do
> > > > are the hung tasks.
> > > >
> > > > I identified that sni_send()[1] could be locked behind the
> > > > spin_lock_irqsave() and within the KCS send handler, there's another irq
> > > > save lock. I suspect this is where we're getting hung up. Below is a
> > > > sample stack trace + log output.
> > 
> > Yeah, I don't see that in the traceback.  There is a lock in the KCS
> > sender, but I don't see how that could do anything.
> > 
> > Maybe you could try changing the cpuidle handler?  That would be at
> > least something to try.
> > 
> 
> Would that help in forming a reproducer? I'd need to deploy any kernel
> modifications fleet wide to cast a wide enough net. The lockups arn't
> extremely consistent. We may get a couple or more a week.

Ah, so this isn't readily reproducable.  Bummer.

If the problem goes away if you change the cpuidle handler to something
non-ACPI, that would be a big clue that it's an ACPI issue.

> 
> Lastly, I have the rate limit patch backported. I'll be able to start
> testing with that tomorrow, and same with loading the IPMI watchdog
> module.

Ok.  I don't have much hope for it making much difference, but it's safe
and will be coming in the next kernel release.

-corey

> 
> > -corey
> > 
> > > >
> > > > I'm happy to provide traces and additional information, let me know.
> > > >
> > > > Links:
> > > > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/char/ipmi/ipmi_msghandler.c?h=linux-6.12.y#n1899
> > > >
> > > > [  499.564572] [  T27255] ip6_tunnel: pni_gre_814 xmit: Local address not yet configured!
> > > > [  499.588176] [  T27255] ip6_tunnel: pni_gre_868 xmit: Local address not yet configured!
> > > > [  499.605284] [  T27255] ip6_tunnel: pni_gre_871 xmit: Local address not yet configured!
> > > > [  805.906999] [  T12765] usb 1-1: USB disconnect, device number 2
> > > > [  845.346020] [  T12765] usb 1-1: new high-speed USB device number 3 using xhci_hcd
> > > > [  845.485453] [  T12765] usb 1-1: New USB device found, idVendor=1d6b, idProduct=0107, bcdDevice= 1.00
> > > > [  845.496823] [  T12765] usb 1-1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> > > > [  845.507242] [  T12765] usb 1-1: Product: USB Virtual Hub
> > > > [  845.514946] [  T12765] usb 1-1: Manufacturer: Aspeed
> > > > [  845.522363] [  T12765] usb 1-1: SerialNumber: 00000000
> > > > [  845.530454] [  T12765] usb 1-1: Device is not authorized for usage
> > > > [  853.774910] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [  853.783794] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [  853.792649] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [  853.801461] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [  853.810291] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [  853.819069] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [  853.827816] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [  853.836581] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [  853.845326] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [  853.854074] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [  853.862813] [    C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [  863.934436] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [  863.943420] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [  863.952363] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [  863.961296] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [  878.616336] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [  878.624905] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [  878.633427] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [  878.641954] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [  880.310112] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [  880.318682] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [  880.327083] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [  880.335483] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [  904.196122] [     C33] watchdog: Watchdog detected hard LOCKUP on cpu 33
> > > > [  904.196127] [     C97] Uhhuh. NMI received for unknown reason 3d on CPU 97.
> > > > [  904.196126] [      C6] Uhhuh. NMI received for unknown reason 3d on CPU 6.
> > > > [  904.196130] [     C33] Modules linked in:
> > > > [  904.196129] [    C101] Uhhuh. NMI received for unknown reason 3d on CPU 101.
> > > > [  904.196131] [     C97] Dazed and confused, but trying to continue
> > > > [  904.196131] [     C33]  nft_fwd_netdev
> > > > [  904.196131] [     C99] Uhhuh. NMI received for unknown reason 2d on CPU 99.
> > > > [  904.196133] [      C6] Dazed and confused, but trying to continue
> > > > [  904.196133] [    C102] Uhhuh. NMI received for unknown reason 2d on CPU 102.
> > > > [  904.196134] [     C33]  nf_dup_netdev
> > > > [  904.196134] [     C35] Uhhuh. NMI received for unknown reason 2d on CPU 35.
> > > > [  904.196135] [    C101] Dazed and confused, but trying to continue
> > > > [  904.196137] [     C99] Dazed and confused, but trying to continue
> > > > [  904.196137] [     C33]  xfrm_interface
> > > > [  904.196136] [     C69] Uhhuh. NMI received for unknown reason 2d on CPU 69.
> > > > [  904.196140] [    C102] Dazed and confused, but trying to continue
> > > > [  904.196140] [     C33]  xfrm6_tunnel
> > > > [  904.196138] [    C121] Uhhuh. NMI received for unknown reason 2d on CPU 121.
> > > > [  904.196140] [    C123] Uhhuh. NMI received for unknown reason 2d on CPU 123.
> > > > [  904.196142] [     C35] Dazed and confused, but trying to continue
> > > > [  904.196143] [     C69] Dazed and confused, but trying to continue
> > > > [  904.196143] [     C33]  nft_numgen
> > > > [  904.196143] [     C61] Uhhuh. NMI received for unknown reason 2d on CPU 61.
> > > > [  904.196144] [     C62] Uhhuh. NMI received for unknown reason 3d on CPU 62.
> > > > [  904.196146] [    C123] Dazed and confused, but trying to continue
> > > > [  904.196147] [    C121] Dazed and confused, but trying to continue
> > > > [  904.196148] [     C58] Dazed and confused, but trying to continue
> > > > [  904.196150] [     C33]  nft_log nft_limit sit dummy ipip tunnel4 ip_gre gre xfrm_user xfrm_algo tls mpls_iptunnel mpls_router nft_ct nf_tables iptable_raw iptable_nat iptable_mangle ipt_REJECT nf_reject_ipv4 ip6table_security xt_CT ip6table_raw xt_nat ip6table_nat nf_nat xt_TCPMSS xt_owner xt_DSCP xt_NFLOG xt_connbytes xt_connlabel xt_statistic xt_connmark ip6table_mangle xt_limit xt_LOG nf_log_syslog xt_mark xt_conntrack ip6t_REJECT nf_reject_ipv6 xt_multiport xt_set xt_tcpmss xt_comment xt_tcpudp ip6table_filter ip6_tables nfnetlink_log udp_diag dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio iptable_filter veth tcp_diag inet_diag mpls_gso act_mpls cls_flower cls_bpf sch_ingress ip_set_hash_ip ip_set_hash_net ip_set tcp_bbr sch_fq tun xt_bpf nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 fou6 fou ip_tunnel ip6_udp_tunnel udp_tunnel ip6_tunnel tunnel6 nvme_fabrics raid0 md_mod essiv dm_crypt trusted asn1_encoder tee dm_mod dax 8021q garp mrp stp llc ipmi_ssif amd64_edac kvm_amd kvm irqbypass crc32_pclmul crc32c_intel
> > > > [  904.196247] [     C33]  sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd xhci_pci binfmt_misc acpi_ipmi cryptd ipmi_si nvme rapl ipmi_devintf i2c_piix4 tiny_power_button bnxt_en xhci_hcd nvme_core ccp i2c_smbus ipmi_msghandler button fuse configfs nfnetlink efivarfs ip_tables x_tables bcmcrypt(O)
> > > > [  904.196281] [     C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G           O       6.12.34-cloudflare-2025.6.9 #1
> > > > [  904.196286] [     C33] Tainted: [O]=OOT_MODULE
> > > > [  904.196287] [     C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
> > > > [  904.196290] [     C33] RIP: 0010:io_idle+0x3/0x30
> > > > [  904.196298] [     C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
> > > > [  904.196301] [     C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
> > > > [  904.196304] [     C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
> > > > [  904.196306] [     C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
> > > > [  904.196308] [     C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
> > > > [  904.196309] [     C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
> > > > [  904.196311] [     C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
> > > > [  904.196313] [     C33] FS:  0000000000000000(0000) GS:ffff8aadcf680000(0000) knlGS:0000000000000000
> > > > [  904.196316] [     C33] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [  904.196318] [     C33] CR2: 00005632ce239000 CR3: 0000003944f72004 CR4: 0000000000770ef0
> > > > [  904.196320] [     C33] PKRU: 55555554
> > > > [  904.196322] [     C33] Call Trace:
> > > > [  904.196324] [     C33]  <TASK>
> > > > [  904.196326] [     C33]  acpi_idle_do_entry+0x22/0x50
> > > > [  904.196336] [     C33]  acpi_idle_enter+0x7b/0xd0
> > > > [  904.196340] [     C33]  cpuidle_enter_state+0x79/0x420
> > > > [  904.196345] [     C33]  cpuidle_enter+0x2d/0x40
> > > > [  904.196352] [     C33]  do_idle+0x176/0x1c0
> > > > [  904.196358] [     C33]  cpu_startup_entry+0x29/0x30
> > > > [  904.196362] [     C33]  start_secondary+0xf7/0x100
> > > > [  904.196366] [     C33]  common_startup_64+0x13e/0x141
> > > > [  904.196374] [     C33]  </TASK>
> > > > [  904.196377] [     C33] Kernel panic - not syncing: Hard LOCKUP
> > > > [  904.196379] [     C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G           O       6.12.34-cloudflare-2025.6.9 #1
> > > > [  904.196383] [     C33] Tainted: [O]=OOT_MODULE
> > > > [  904.196384] [     C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
> > > > [  904.196385] [     C33] Call Trace:
> > > > [  904.196387] [     C33]  <NMI>
> > > > [  904.196389] [     C33]  dump_stack_lvl+0x4b/0x70
> > > > [  904.196394] [     C33]  panic+0x106/0x2c4
> > > > [  904.196401] [     C33]  nmi_panic.cold+0xc/0xc
> > > > [  904.196404] [     C33]  watchdog_hardlockup_check.cold+0xc6/0xe8
> > > > [  904.196409] [     C33]  __perf_event_overflow+0x15a/0x450
> > > > [  904.196416] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > > > [  904.196421] [     C33]  x86_pmu_handle_irq+0x18a/0x1c0
> > > > [  904.196436] [     C33]  ? set_pte_vaddr+0x40/0x50
> > > > [  904.196439] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > > > [  904.196442] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > > > [  904.196445] [     C33]  ? native_set_fixmap+0x63/0xb0
> > > > [  904.196448] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > > > [  904.196451] [     C33]  ? ghes_copy_tofrom_phys+0x7a/0x100
> > > > [  904.196457] [     C33]  ? srso_alias_return_thunk+0x5/0xfbef5
> > > > [  904.196460] [     C33]  ? __ghes_peek_estatus.isra.0+0x49/0xa0
> > > > [  904.196465] [     C33]  amd_pmu_handle_irq+0x4b/0xc0
> > > > [  904.196469] [     C33]  perf_event_nmi_handler+0x2a/0x50
> > > > [  904.196473] [     C33]  nmi_handle.part.0+0x59/0x110
> > > > [  904.196479] [     C33]  default_do_nmi+0x127/0x180
> > > > [  904.196483] [     C33]  exc_nmi+0x103/0x180
> > > > [  904.196486] [     C33]  end_repeat_nmi+0xf/0x53
> > > > [  904.196489] [     C33] RIP: 0010:io_idle+0x3/0x30
> > > > [  904.196493] [     C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
> > > > [  904.196495] [     C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
> > > > [  904.196497] [     C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
> > > > [  904.196499] [     C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
> > > > [  904.196501] [     C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
> > > > [  904.196502] [     C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
> > > > [  904.196504] [     C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
> > > > [  904.196510] [     C33]  ? io_idle+0x3/0x30
> > > > [  904.196515] [     C33]  ? io_idle+0x3/0x30
> > > > [  904.196519] [     C33]  </NMI>
> > > > [  904.196520] [     C33]  <TASK>
> > > > [  904.196521] [     C33]  acpi_idle_do_entry+0x22/0x50
> > > > [  904.196526] [     C33]  acpi_idle_enter+0x7b/0xd0
> > > > [  904.196529] [     C33]  cpuidle_enter_state+0x79/0x420
> > > > [  904.196535] [     C33]  cpuidle_enter+0x2d/0x40
> > > > [  904.196539] [     C33]  do_idle+0x176/0x1c0
> > > > [  904.196544] [     C33]  cpu_startup_entry+0x29/0x30
> > > > [  904.196548] [     C33]  start_secondary+0xf7/0x100
> > > > [  904.196552] [     C33]  common_startup_64+0x13e/0x141
> > > > [  904.196559] [     C33]  </TASK>
> > > >
> > > > Best, Fred

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] ipmi_si: watchdog: Watchdog detected hard LOCKUP
  2025-08-06 22:51       ` Corey Minyard
@ 2025-08-07 19:43         ` Frederick Lawler
  2025-08-07 20:29           ` Corey Minyard
  0 siblings, 1 reply; 8+ messages in thread
From: Frederick Lawler @ 2025-08-07 19:43 UTC (permalink / raw)
  To: Corey Minyard; +Cc: openipmi-developer, linux-kernel, kernel-team

On Wed, Aug 06, 2025 at 05:51:29PM -0500, Corey Minyard wrote:
> On Wed, Aug 06, 2025 at 04:36:41PM -0500, Frederick Lawler wrote:
> > On Wed, Aug 06, 2025 at 04:16:18PM -0500, Corey Minyard wrote:
> > > On Wed, Aug 06, 2025 at 03:19:02PM -0500, Fred Lawler wrote:
> > > > + CC: Corey Minyard <corey@minyard.net>
> > > > 
> > 
> > > I'm wondering if something is happening with the BMC resetting and
> > > interactions with ACPI involved in that.  Adding the extra part of
> > > trying to talk to the BMC while it's being reset could cause the BMC to
> > > get confused and do bad things?
> > > 
> > 
> > Sure, it's a possibility we explored. We have a lot of automation.
> > Predominately of which is a prometheus module exporting IPMI information
> > from the sysfs files. And we also have config management that's querying
> > sysfs files to regulate updates etc... Sometimes, the config management
> > automation will attempt to reset the BMC.
> 
> Ok.  I have tests that do BMC resets, but I can't run at the scale you
> do, and I'm running in a simulator so it's not going to be have the
> same.
> 
> The other possibility is the processor goes into the idle code while
> interrupts are off, but I think the kernel has checks all around that.
> I can't think of how else a processor would get stuck in idle.
> 

Yes, it's a bit of an odd case. There's nothing obvious reported by the
crash utility. By the time we get the NMI/panic, the CPUs are off doing
something else in our crash typical case. That said, earlier this week I got a
hard lockup outside of a BMC reset, but the node had too many MCE
correctable memory errors.

For sake of completeness, I'll post that stack trace here anyway since
that may provide some more context clues. In this case, I did catch two
separate reads to sysfs files, and then they appear to have competed.
The cat process seemed to already be off CPU, but the KCS
message is still coming in at the same time the python script was being
processed too. Only the python run was on CPU at time of crash. But NMI
panic was still on a idle CPU. Unfortunately, I didn't write down all
the logs this one, so it's missing the idle state NMI for watchdog, but
hopefully the snippets show what's happening. I posted this below.

> > 
> > > > >
> > > > > I tried also tried to load the CPUs with stress-ng, but the best I can do
> > > > > are the hung tasks.
> > > > >
> > > > > I identified that sni_send()[1] could be locked behind the
> > > > > spin_lock_irqsave() and within the KCS send handler, there's another irq
> > > > > save lock. I suspect this is where we're getting hung up. Below is a
> > > > > sample stack trace + log output.
> > > 
> > > Yeah, I don't see that in the traceback.  There is a lock in the KCS
> > > sender, but I don't see how that could do anything.
> > > 
> > > Maybe you could try changing the cpuidle handler?  That would be at
> > > least something to try.
> > > 
> > 
> > Would that help in forming a reproducer? I'd need to deploy any kernel
> > modifications fleet wide to cast a wide enough net. The lockups arn't
> > extremely consistent. We may get a couple or more a week.
> 
> Ah, so this isn't readily reproducable.  Bummer.
> 
> If the problem goes away if you change the cpuidle handler to something
> non-ACPI, that would be a big clue that it's an ACPI issue.
> 
> > 
> > Lastly, I have the rate limit patch backported. I'll be able to start
> > testing with that tomorrow, and same with loading the IPMI watchdog
> > module.
> 
> Ok.  I don't have much hope for it making much difference, but it's safe
> and will be coming in the next kernel release.
>

It occurred to me last night that I'd probably like a rate limit on the KCS
messages as well. I didn't see if a patch for that was made. I can whip
that up sometime next week, that could be of use to anyone.

[1533534.869508] [Hardware Error]: Corrected error, no action required.
[1533534.884635] [Hardware Error]: CPU:1 (17:31:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
[1533534.912122] [Hardware Error]: Error Addr: 0x0000000313c7a020
[1533534.926641] [Hardware Error]: IPID: 0x0000009600350f00, Syndrome: 0x9fec08000a800a01
[1533534.943278] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
[1533534.946635] EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#1channel#3 (csrow:1 channel:3 page:0x0 offset:0x0 grain:64 syndrome:0x800)
[1533535.369487] INFO: task cat:1844873 blocked for more than 10 seconds.
[1533535.385145]       Tainted: G        W  O       6.12.35-cloudflare-2025.6.15 #1
[1533535.401614] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1533535.418715] task:cat             state:D stack:0     pid:1844873 tgid:1844873 ppid:1844872 task_flags:0x400000 flags:0x00004002
[1533535.447475] Call Trace:
[1533535.458691]  <TASK>
[1533535.469154]  __schedule+0x4fa/0xbf0
[1533535.481433]  schedule+0x27/0xf0
[1533535.493181]  __get_guid+0xf4/0x130 [ipmi_msghandler]
[1533535.506325]  ? __pfx_autoremove_wake_function+0x10/0x10
[1533535.519910]  __bmc_get_device_id+0xd6/0xa30 [ipmi_msghandler]
[1533535.534459]  ? srso_return_thunk+0x5/0x5f
[1533535.546509]  ? srso_return_thunk+0x5/0x5f
[1533535.558540]  ? __memcg_slab_post_alloc_hook+0x21b/0x410
[1533535.571722]  aux_firmware_rev_show+0x38/0x90 [ipmi_msghandler]
[1533535.585304]  ? __kmalloc_node_noprof+0x3f6/0x450
[1533535.598144]  ? seq_read_iter+0x376/0x460
[1533535.609621]  dev_attr_show+0x1c/0x40
[1533535.621024]  sysfs_kf_seq_show+0x8f/0xe0
[1533535.632316]  seq_read_iter+0x11f/0x460
[1533535.643172]  ? security_file_permission+0x9/0xb0
[1533535.655102]  vfs_read+0x260/0x330
[1533535.665368]  ksys_read+0x65/0xe0
[1533535.675559]  do_syscall_64+0x4b/0x110
[1533535.686324]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[1533535.698530] RIP: 0033:0x7f72b587125d
[1533535.708857] RSP: 002b:00007ffccc21bb48 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[1533535.723411] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f72b587125d
[1533535.737361] RDX: 0000000000020000 RSI: 00007f72b5755000 RDI: 0000000000000003
[1533535.751191] RBP: 0000000000020000 R08: 00000000ffffffff R09: 0000000000000000
[1533535.764847] R10: 00007f72b5788b60 R11: 0000000000000246 R12: 00007f72b5755000
[1533535.778536] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000000000
[1533535.792210]  </TASK>

crash> bt -l 1781073
PID: 1781073  TASK: ffff9d91c7040000  CPU: 81   COMMAND: "/usr/bin/python"
 #0 [ffffb3a171683c00] __schedule at ffffffff9d559eea
    /cfsetup_build/build/linux/kernel/sched/core.c: 5338
 #1 [ffffb3a171683c80] schedule at ffffffff9d55a617
    /cfsetup_build/build/linux/arch/x86/include/asm/preempt.h: 84
 #2 [ffffb3a171683c90] __get_guid at ffffffffc22aa574 [ipmi_msghandler]
 #3 [ffffb3a171683ce8] __bmc_get_device_id at ffffffffc22aa696 [ipmi_msghandler]
 #4 [ffffb3a171683da0] aux_firmware_rev_show at ffffffffc22ab1c8 [ipmi_msghandler]
 #5 [ffffb3a171683dd0] dev_attr_show at ffffffff9d1175dc
    /cfsetup_build/build/linux/drivers/base/core.c: 2425
 #6 [ffffb3a171683de8] sysfs_kf_seq_show at ffffffff9cc64caf
    /cfsetup_build/build/linux/fs/sysfs/file.c: 60
 #7 [ffffb3a171683e10] seq_read_iter at ffffffff9cbddf7f
    /cfsetup_build/build/linux/fs/seq_file.c: 230
 #8 [ffffb3a171683e68] vfs_read at ffffffff9cba8590
    /cfsetup_build/build/linux/fs/read_write.c: 489
 #9 [ffffb3a171683f00] ksys_read at ffffffff9cba9165
    /cfsetup_build/build/linux/fs/read_write.c: 713
#10 [ffffb3a171683f38] do_syscall_64 at ffffffff9d550c8b
    /cfsetup_build/build/linux/arch/x86/entry/common.c: 52
#11 [ffffb3a171683f50] entry_SYSCALL_64_after_hwframe at ffffffff9d60012f
    /cfsetup_build/build/linux/arch/x86/entry/entry_64.S: 130
    RIP: 00007f04e1b7c29c  RSP: 00007ffea7aaf6c0  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000a840f8  RCX: 00007f04e1b7c29c
    RDX: 0000000000001001  RSI: 000000002fd06ef0  RDI: 00000000000000c1
    RBP: 00007f04e1a82fc0   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000001001
    R13: 000000002fd06ef0  R14: 00000000000000c1  R15: 0000000000a41520
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

crash> files 1781073
...
193 ffff9db5132e5800 ffff9dafb18bd200 ffff9da7b780bcf0 REG  /sys/devices/platform/ipmi_bmc.0/aux_firmware_revision

crash> log -c
...
[1533553.998160] [      C7] ipmi_si IPI0001:00: KCS in invalid state 6
[1533554.009156] [      C7] ipmi_si IPI0001:00: KCS in invalid state 8
[1533554.019973] [T1844873] ipmi_si IPI0001:00: KCS in invalid state 9
[1533554.031005] [     C81] ipmi_si IPI0001:00: IPMI message handler: device id fetch failed: 0xd5


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] ipmi_si: watchdog: Watchdog detected hard LOCKUP
  2025-08-07 19:43         ` Frederick Lawler
@ 2025-08-07 20:29           ` Corey Minyard
  0 siblings, 0 replies; 8+ messages in thread
From: Corey Minyard @ 2025-08-07 20:29 UTC (permalink / raw)
  To: Frederick Lawler; +Cc: openipmi-developer, linux-kernel, kernel-team

On Thu, Aug 07, 2025 at 02:43:14PM -0500, Frederick Lawler wrote:
> 
> It occurred to me last night that I'd probably like a rate limit on the KCS
> messages as well. I didn't see if a patch for that was made. I can whip
> that up sometime next week, that could be of use to anyone.

That jogged my memory a bit; there is something called "maintenance
mode" in the IPMI driver.  It's used primarily for firmware updates,
but it's triggered by reset commands in addition to firmware update
commands.  It has three basic affects:

* It turns off automatic messages sent to the BMC by the driver
  (only fetching flags, I think).

* It changes the way the timing works to check for the BMC being ready
  a lot more often.  (This is a hardware check and shouldn't affect
  the BMC, but maybe it does on some.)

* It changes the timing for messages routed to the IPMB bus to give
  them more time.

It solved two problems:

* For systems without IPMI interrupts, firmware updates were taking
  forever.  

* When you would reset the BMC, the driver's automatic messages would
  generally time out.  And IPMB messages pending would time out.

The theory was that if the user reset the BMC, they wouldn't issue any
IPMI commands, and the driver wouldn't either, so it would leave the
BMC interface alone until it's done resetting.

It's not perfect, the reset or firmware update can happen over the LAN
interface, but it seemed to help a lot of people.

Anyway, after that long explaination, maybe that needs to be extended
and if the driver goes into maintenance mode have all sysfs accesses
to the BMC return an error.

It also might be a good idea to differentiate between resets and
firmware update commands.  After a reset nothing will probably work, but
the BMC is still partially function during a firmware update.  So no
IPMI commands at all for a little while after a reset.  That is a
behavioral change, but it's probably not a lot different that what would
happen, anyway.  The error just comes back faster.

None of this solves the basic issue, though.

I'm not exactly sure what you mean by a rate limit on KCS messages.  It
would lower the probability, perhaps, but it wouldn't eliminate the
problem, either.  Just not allowing anything during these times is
probably better.

> 
> [1533534.869508] [Hardware Error]: Corrected error, no action required.
> [1533534.884635] [Hardware Error]: CPU:1 (17:31:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
> [1533534.912122] [Hardware Error]: Error Addr: 0x0000000313c7a020
> [1533534.926641] [Hardware Error]: IPID: 0x0000009600350f00, Syndrome: 0x9fec08000a800a01
> [1533534.943278] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
> [1533534.946635] EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#1channel#3 (csrow:1 channel:3 page:0x0 offset:0x0 grain:64 syndrome:0x800)
> [1533535.369487] INFO: task cat:1844873 blocked for more than 10 seconds.
> [1533535.385145]       Tainted: G        W  O       6.12.35-cloudflare-2025.6.15 #1
> [1533535.401614] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [1533535.418715] task:cat             state:D stack:0     pid:1844873 tgid:1844873 ppid:1844872 task_flags:0x400000 flags:0x00004002
> [1533535.447475] Call Trace:
> [1533535.458691]  <TASK>
> [1533535.469154]  __schedule+0x4fa/0xbf0
> [1533535.481433]  schedule+0x27/0xf0
> [1533535.493181]  __get_guid+0xf4/0x130 [ipmi_msghandler]
> [1533535.506325]  ? __pfx_autoremove_wake_function+0x10/0x10
> [1533535.519910]  __bmc_get_device_id+0xd6/0xa30 [ipmi_msghandler]

Yeah, this is what I would expect to see if you are doing this operation
and the BMC is in reset.  It's going to sit there until it times out and
returns an error.

-corey

> [1533535.534459]  ? srso_return_thunk+0x5/0x5f
> [1533535.546509]  ? srso_return_thunk+0x5/0x5f
> [1533535.558540]  ? __memcg_slab_post_alloc_hook+0x21b/0x410
> [1533535.571722]  aux_firmware_rev_show+0x38/0x90 [ipmi_msghandler]
> [1533535.585304]  ? __kmalloc_node_noprof+0x3f6/0x450
> [1533535.598144]  ? seq_read_iter+0x376/0x460
> [1533535.609621]  dev_attr_show+0x1c/0x40
> [1533535.621024]  sysfs_kf_seq_show+0x8f/0xe0
> [1533535.632316]  seq_read_iter+0x11f/0x460
> [1533535.643172]  ? security_file_permission+0x9/0xb0
> [1533535.655102]  vfs_read+0x260/0x330
> [1533535.665368]  ksys_read+0x65/0xe0
> [1533535.675559]  do_syscall_64+0x4b/0x110
> [1533535.686324]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [1533535.698530] RIP: 0033:0x7f72b587125d
> [1533535.708857] RSP: 002b:00007ffccc21bb48 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [1533535.723411] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f72b587125d
> [1533535.737361] RDX: 0000000000020000 RSI: 00007f72b5755000 RDI: 0000000000000003
> [1533535.751191] RBP: 0000000000020000 R08: 00000000ffffffff R09: 0000000000000000
> [1533535.764847] R10: 00007f72b5788b60 R11: 0000000000000246 R12: 00007f72b5755000
> [1533535.778536] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000000000
> [1533535.792210]  </TASK>
> 
> crash> bt -l 1781073
> PID: 1781073  TASK: ffff9d91c7040000  CPU: 81   COMMAND: "/usr/bin/python"
>  #0 [ffffb3a171683c00] __schedule at ffffffff9d559eea
>     /cfsetup_build/build/linux/kernel/sched/core.c: 5338
>  #1 [ffffb3a171683c80] schedule at ffffffff9d55a617
>     /cfsetup_build/build/linux/arch/x86/include/asm/preempt.h: 84
>  #2 [ffffb3a171683c90] __get_guid at ffffffffc22aa574 [ipmi_msghandler]
>  #3 [ffffb3a171683ce8] __bmc_get_device_id at ffffffffc22aa696 [ipmi_msghandler]
>  #4 [ffffb3a171683da0] aux_firmware_rev_show at ffffffffc22ab1c8 [ipmi_msghandler]
>  #5 [ffffb3a171683dd0] dev_attr_show at ffffffff9d1175dc
>     /cfsetup_build/build/linux/drivers/base/core.c: 2425
>  #6 [ffffb3a171683de8] sysfs_kf_seq_show at ffffffff9cc64caf
>     /cfsetup_build/build/linux/fs/sysfs/file.c: 60
>  #7 [ffffb3a171683e10] seq_read_iter at ffffffff9cbddf7f
>     /cfsetup_build/build/linux/fs/seq_file.c: 230
>  #8 [ffffb3a171683e68] vfs_read at ffffffff9cba8590
>     /cfsetup_build/build/linux/fs/read_write.c: 489
>  #9 [ffffb3a171683f00] ksys_read at ffffffff9cba9165
>     /cfsetup_build/build/linux/fs/read_write.c: 713
> #10 [ffffb3a171683f38] do_syscall_64 at ffffffff9d550c8b
>     /cfsetup_build/build/linux/arch/x86/entry/common.c: 52
> #11 [ffffb3a171683f50] entry_SYSCALL_64_after_hwframe at ffffffff9d60012f
>     /cfsetup_build/build/linux/arch/x86/entry/entry_64.S: 130
>     RIP: 00007f04e1b7c29c  RSP: 00007ffea7aaf6c0  RFLAGS: 00000246
>     RAX: ffffffffffffffda  RBX: 0000000000a840f8  RCX: 00007f04e1b7c29c
>     RDX: 0000000000001001  RSI: 000000002fd06ef0  RDI: 00000000000000c1
>     RBP: 00007f04e1a82fc0   R8: 0000000000000000   R9: 0000000000000000
>     R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000001001
>     R13: 000000002fd06ef0  R14: 00000000000000c1  R15: 0000000000a41520
>     ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b
> 
> crash> files 1781073
> ...
> 193 ffff9db5132e5800 ffff9dafb18bd200 ffff9da7b780bcf0 REG  /sys/devices/platform/ipmi_bmc.0/aux_firmware_revision
> 
> crash> log -c
> ...
> [1533553.998160] [      C7] ipmi_si IPI0001:00: KCS in invalid state 6
> [1533554.009156] [      C7] ipmi_si IPI0001:00: KCS in invalid state 8
> [1533554.019973] [T1844873] ipmi_si IPI0001:00: KCS in invalid state 9
> [1533554.031005] [     C81] ipmi_si IPI0001:00: IPMI message handler: device id fetch failed: 0xd5
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-08-07 20:29 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-06 20:14 [BUG] ipmi_si: watchdog: Watchdog detected hard LOCKUP Frederick Lawler
2025-08-06 20:19 ` Fred Lawler
2025-08-06 20:39   ` Corey Minyard
2025-08-06 21:16   ` Corey Minyard
2025-08-06 21:36     ` Frederick Lawler
2025-08-06 22:51       ` Corey Minyard
2025-08-07 19:43         ` Frederick Lawler
2025-08-07 20:29           ` Corey Minyard

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).