From: Frederick Lawler <fred@cloudflare.com>
To: Corey Minyard <cory@minyard.net>
Cc: openipmi-developer@lists.sourceforge.net,
linux-kernel@vger.kernel.org, kernel-team@cloudflare.com
Subject: [BUG] ipmi_si: watchdog: Watchdog detected hard LOCKUP
Date: Wed, 6 Aug 2025 15:14:35 -0500
Message-ID: <aJO3q8JiVXKewMjW@CMGLRV3>
Hi Corey,
In kernel 6.12.y, while resetting the BMC, we can sometimes hit a hard LOCKUP
watchdog event, especially while querying the BMC for basic device
information via sysfs.
I haven't been able to create a consistent reproducer yet, but I suspect
that these occur under high traffic, while the BMC is resetting and the
sysfs files are being read in parallel. We're also using KCS to interface
with the BMC.
I can consistently reproduce hung tasks trivially with the following,
during a BMC reset:
while true; do cat aux_firmware_revision &>/dev/null; done &
I also tried to load the CPUs with stress-ng, but the best I can get
are the hung tasks.
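For anyone trying to reproduce this, the one-liner above can be generalized
into a small helper that runs several readers in parallel. This is only a
sketch: the function names (spawn_readers, stop_readers) and the reader
count are illustrative assumptions, and the attribute path has to be
adjusted for the system under test.

```sh
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from any existing tool).

# spawn_readers FILE COUNT
# Start COUNT background loops that each read FILE as fast as they can.
# The PIDs of the loops are collected in the global variable "pids".
spawn_readers() {
    pids=""
    i=0
    while [ "$i" -lt "$2" ]; do
        ( while true; do cat "$1" >/dev/null 2>&1; done ) &
        pids="$pids $!"
        i=$((i + 1))
    done
}

# stop_readers
# Kill every loop started by spawn_readers and wait for them to exit.
stop_readers() {
    for p in $pids; do
        kill "$p" 2>/dev/null
    done
    wait 2>/dev/null
}

# Example (commented out so that sourcing this file has no side effects):
#   spawn_readers aux_firmware_revision 4
#   ... reset the BMC here and watch dmesg ...
#   stop_readers
```

Running more than one reader only raises the odds of racing with the BMC
reset; I have no evidence that it turns the hung tasks into a hard LOCKUP.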
I identified that smi_send()[1] could be locked behind the
spin_lock_irqsave(), and within the KCS send handler there's another
irqsave lock. I suspect this is where we're getting hung up. Below is a
sample stack trace + log output.
I'm happy to provide traces and additional information; just let me know.
Links:
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/char/ipmi/ipmi_msghandler.c?h=linux-6.12.y#n1899
[ 499.564572] [ T27255] ip6_tunnel: pni_gre_814 xmit: Local address not yet configured!
[ 499.588176] [ T27255] ip6_tunnel: pni_gre_868 xmit: Local address not yet configured!
[ 499.605284] [ T27255] ip6_tunnel: pni_gre_871 xmit: Local address not yet configured!
[ 805.906999] [ T12765] usb 1-1: USB disconnect, device number 2
[ 845.346020] [ T12765] usb 1-1: new high-speed USB device number 3 using xhci_hcd
[ 845.485453] [ T12765] usb 1-1: New USB device found, idVendor=1d6b, idProduct=0107, bcdDevice= 1.00
[ 845.496823] [ T12765] usb 1-1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[ 845.507242] [ T12765] usb 1-1: Product: USB Virtual Hub
[ 845.514946] [ T12765] usb 1-1: Manufacturer: Aspeed
[ 845.522363] [ T12765] usb 1-1: SerialNumber: 00000000
[ 845.530454] [ T12765] usb 1-1: Device is not authorized for usage
[ 853.774910] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
[ 853.783794] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
[ 853.792649] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
[ 853.801461] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
[ 853.810291] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
[ 853.819069] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
[ 853.827816] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
[ 853.836581] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
[ 853.845326] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
[ 853.854074] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
[ 853.862813] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
[ 863.934436] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
[ 863.943420] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
[ 863.952363] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
[ 863.961296] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
[ 878.616336] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
[ 878.624905] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
[ 878.633427] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
[ 878.641954] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
[ 880.310112] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
[ 880.318682] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
[ 880.327083] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
[ 880.335483] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
[ 904.196122] [ C33] watchdog: Watchdog detected hard LOCKUP on cpu 33
[ 904.196127] [ C97] Uhhuh. NMI received for unknown reason 3d on CPU 97.
[ 904.196126] [ C6] Uhhuh. NMI received for unknown reason 3d on CPU 6.
[ 904.196130] [ C33] Modules linked in:
[ 904.196129] [ C101] Uhhuh. NMI received for unknown reason 3d on CPU 101.
[ 904.196131] [ C97] Dazed and confused, but trying to continue
[ 904.196131] [ C33] nft_fwd_netdev
[ 904.196131] [ C99] Uhhuh. NMI received for unknown reason 2d on CPU 99.
[ 904.196133] [ C6] Dazed and confused, but trying to continue
[ 904.196133] [ C102] Uhhuh. NMI received for unknown reason 2d on CPU 102.
[ 904.196134] [ C33] nf_dup_netdev
[ 904.196134] [ C35] Uhhuh. NMI received for unknown reason 2d on CPU 35.
[ 904.196135] [ C101] Dazed and confused, but trying to continue
[ 904.196137] [ C99] Dazed and confused, but trying to continue
[ 904.196137] [ C33] xfrm_interface
[ 904.196136] [ C69] Uhhuh. NMI received for unknown reason 2d on CPU 69.
[ 904.196140] [ C102] Dazed and confused, but trying to continue
[ 904.196140] [ C33] xfrm6_tunnel
[ 904.196138] [ C121] Uhhuh. NMI received for unknown reason 2d on CPU 121.
[ 904.196140] [ C123] Uhhuh. NMI received for unknown reason 2d on CPU 123.
[ 904.196142] [ C35] Dazed and confused, but trying to continue
[ 904.196143] [ C69] Dazed and confused, but trying to continue
[ 904.196143] [ C33] nft_numgen
[ 904.196143] [ C61] Uhhuh. NMI received for unknown reason 2d on CPU 61.
[ 904.196144] [ C62] Uhhuh. NMI received for unknown reason 3d on CPU 62.
[ 904.196146] [ C123] Dazed and confused, but trying to continue
[ 904.196147] [ C121] Dazed and confused, but trying to continue
[ 904.196148] [ C58] Dazed and confused, but trying to continue
[ 904.196150] [ C33] nft_log nft_limit sit dummy ipip tunnel4 ip_gre gre xfrm_user xfrm_algo tls mpls_iptunnel mpls_router nft_ct nf_tables iptable_raw iptable_nat iptable_mangle ipt_REJECT nf_reject_ipv4 ip6table_security xt_CT ip6table_raw xt_nat ip6table_nat nf_nat xt_TCPMSS xt_owner xt_DSCP xt_NFLOG xt_connbytes xt_connlabel xt_statistic xt_connmark ip6table_mangle xt_limit xt_LOG nf_log_syslog xt_mark xt_conntrack ip6t_REJECT nf_reject_ipv6 xt_multiport xt_set xt_tcpmss xt_comment xt_tcpudp ip6table_filter ip6_tables nfnetlink_log udp_diag dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio iptable_filter veth tcp_diag inet_diag mpls_gso act_mpls cls_flower cls_bpf sch_ingress ip_set_hash_ip ip_set_hash_net ip_set tcp_bbr sch_fq tun xt_bpf nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 fou6 fou ip_tunnel ip6_udp_tunnel udp_tunnel ip6_tunnel tunnel6 nvme_fabrics raid0 md_mod essiv dm_crypt trusted asn1_encoder tee dm_mod dax 8021q garp mrp stp llc ipmi_ssif amd64_edac kvm_amd kvm irqbypass crc32_pclmul crc32c_intel
[ 904.196247] [ C33] sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd xhci_pci binfmt_misc acpi_ipmi cryptd ipmi_si nvme rapl ipmi_devintf i2c_piix4 tiny_power_button bnxt_en xhci_hcd nvme_core ccp i2c_smbus ipmi_msghandler button fuse configfs nfnetlink efivarfs ip_tables x_tables bcmcrypt(O)
[ 904.196281] [ C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G O 6.12.34-cloudflare-2025.6.9 #1
[ 904.196286] [ C33] Tainted: [O]=OOT_MODULE
[ 904.196287] [ C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
[ 904.196290] [ C33] RIP: 0010:io_idle+0x3/0x30
[ 904.196298] [ C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
[ 904.196301] [ C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
[ 904.196304] [ C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
[ 904.196306] [ C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
[ 904.196308] [ C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
[ 904.196309] [ C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
[ 904.196311] [ C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
[ 904.196313] [ C33] FS: 0000000000000000(0000) GS:ffff8aadcf680000(0000) knlGS:0000000000000000
[ 904.196316] [ C33] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 904.196318] [ C33] CR2: 00005632ce239000 CR3: 0000003944f72004 CR4: 0000000000770ef0
[ 904.196320] [ C33] PKRU: 55555554
[ 904.196322] [ C33] Call Trace:
[ 904.196324] [ C33] <TASK>
[ 904.196326] [ C33] acpi_idle_do_entry+0x22/0x50
[ 904.196336] [ C33] acpi_idle_enter+0x7b/0xd0
[ 904.196340] [ C33] cpuidle_enter_state+0x79/0x420
[ 904.196345] [ C33] cpuidle_enter+0x2d/0x40
[ 904.196352] [ C33] do_idle+0x176/0x1c0
[ 904.196358] [ C33] cpu_startup_entry+0x29/0x30
[ 904.196362] [ C33] start_secondary+0xf7/0x100
[ 904.196366] [ C33] common_startup_64+0x13e/0x141
[ 904.196374] [ C33] </TASK>
[ 904.196377] [ C33] Kernel panic - not syncing: Hard LOCKUP
[ 904.196379] [ C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G O 6.12.34-cloudflare-2025.6.9 #1
[ 904.196383] [ C33] Tainted: [O]=OOT_MODULE
[ 904.196384] [ C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
[ 904.196385] [ C33] Call Trace:
[ 904.196387] [ C33] <NMI>
[ 904.196389] [ C33] dump_stack_lvl+0x4b/0x70
[ 904.196394] [ C33] panic+0x106/0x2c4
[ 904.196401] [ C33] nmi_panic.cold+0xc/0xc
[ 904.196404] [ C33] watchdog_hardlockup_check.cold+0xc6/0xe8
[ 904.196409] [ C33] __perf_event_overflow+0x15a/0x450
[ 904.196416] [ C33] ? srso_alias_return_thunk+0x5/0xfbef5
[ 904.196421] [ C33] x86_pmu_handle_irq+0x18a/0x1c0
[ 904.196436] [ C33] ? set_pte_vaddr+0x40/0x50
[ 904.196439] [ C33] ? srso_alias_return_thunk+0x5/0xfbef5
[ 904.196442] [ C33] ? srso_alias_return_thunk+0x5/0xfbef5
[ 904.196445] [ C33] ? native_set_fixmap+0x63/0xb0
[ 904.196448] [ C33] ? srso_alias_return_thunk+0x5/0xfbef5
[ 904.196451] [ C33] ? ghes_copy_tofrom_phys+0x7a/0x100
[ 904.196457] [ C33] ? srso_alias_return_thunk+0x5/0xfbef5
[ 904.196460] [ C33] ? __ghes_peek_estatus.isra.0+0x49/0xa0
[ 904.196465] [ C33] amd_pmu_handle_irq+0x4b/0xc0
[ 904.196469] [ C33] perf_event_nmi_handler+0x2a/0x50
[ 904.196473] [ C33] nmi_handle.part.0+0x59/0x110
[ 904.196479] [ C33] default_do_nmi+0x127/0x180
[ 904.196483] [ C33] exc_nmi+0x103/0x180
[ 904.196486] [ C33] end_repeat_nmi+0xf/0x53
[ 904.196489] [ C33] RIP: 0010:io_idle+0x3/0x30
[ 904.196493] [ C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
[ 904.196495] [ C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
[ 904.196497] [ C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
[ 904.196499] [ C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
[ 904.196501] [ C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
[ 904.196502] [ C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
[ 904.196504] [ C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
[ 904.196510] [ C33] ? io_idle+0x3/0x30
[ 904.196515] [ C33] ? io_idle+0x3/0x30
[ 904.196519] [ C33] </NMI>
[ 904.196520] [ C33] <TASK>
[ 904.196521] [ C33] acpi_idle_do_entry+0x22/0x50
[ 904.196526] [ C33] acpi_idle_enter+0x7b/0xd0
[ 904.196529] [ C33] cpuidle_enter_state+0x79/0x420
[ 904.196535] [ C33] cpuidle_enter+0x2d/0x40
[ 904.196539] [ C33] do_idle+0x176/0x1c0
[ 904.196544] [ C33] cpu_startup_entry+0x29/0x30
[ 904.196548] [ C33] start_secondary+0xf7/0x100
[ 904.196552] [ C33] common_startup_64+0x13e/0x141
[ 904.196559] [ C33] </TASK>
Best, Fred