From: Corey Minyard <corey@minyard.net>
To: Frederick Lawler <fred@cloudflare.com>
Cc: openipmi-developer@lists.sourceforge.net,
linux-kernel@vger.kernel.org, kernel-team@cloudflare.com
Subject: Re: [BUG] ipmi_si: watchdog: Watchdog detected hard LOCKUP
Date: Wed, 6 Aug 2025 17:51:29 -0500 [thread overview]
Message-ID: <aJPccSayM2nXk891@mail.minyard.net> (raw)
In-Reply-To: <aJPK6Vuxc1jL-uu_@CMGLRV3>
On Wed, Aug 06, 2025 at 04:36:41PM -0500, Frederick Lawler wrote:
> On Wed, Aug 06, 2025 at 04:16:18PM -0500, Corey Minyard wrote:
> > On Wed, Aug 06, 2025 at 03:19:02PM -0500, Fred Lawler wrote:
> > > + CC: Corey Minyard <corey@minyard.net>
> > >
>
> > I'm wondering if something is happening with the BMC resetting and
> > interactions with ACPI involved in that. Adding the extra part of
> > trying to talk to the BMC while it's being reset could cause the BMC to
> > get confused and do bad things?
> >
>
> Sure, it's a possibility we explored. We have a lot of automation.
> Predominately of which is a prometheus module exporting IPMI information
> from the sysfs files. And we also have config management that's querying
> sysfs files to regulate updates etc... Sometimes, the config management
> automation will attempt to reset the BMC.
Ok. I have tests that do BMC resets, but I can't run at the scale you
do, and I'm running in a simulator so it's not going to be have the
same.
The other possibility is the processor goes into the idle code while
interrupts are off, but I think the kernel has checks all around that.
I can't think of how else a processor would get stuck in idle.
>
> > > >
> > > > I tried also tried to load the CPUs with stress-ng, but the best I can do
> > > > are the hung tasks.
> > > >
> > > > I identified that sni_send()[1] could be locked behind the
> > > > spin_lock_irqsave() and within the KCS send handler, there's another irq
> > > > save lock. I suspect this is where we're getting hung up. Below is a
> > > > sample stack trace + log output.
> >
> > Yeah, I don't see that in the traceback. There is a lock in the KCS
> > sender, but I don't see how that could do anything.
> >
> > Maybe you could try changing the cpuidle handler? That would be at
> > least something to try.
> >
>
> Would that help in forming a reproducer? I'd need to deploy any kernel
> modifications fleet wide to cast a wide enough net. The lockups arn't
> extremely consistent. We may get a couple or more a week.
Ah, so this isn't readily reproducable. Bummer.
If the problem goes away if you change the cpuidle handler to something
non-ACPI, that would be a big clue that it's an ACPI issue.
>
> Lastly, I have the rate limit patch backported. I'll be able to start
> testing with that tomorrow, and same with loading the IPMI watchdog
> module.
Ok. I don't have much hope for it making much difference, but it's safe
and will be coming in the next kernel release.
-corey
>
> > -corey
> >
> > > >
> > > > I'm happy to provide traces and additional information, let me know.
> > > >
> > > > Links:
> > > > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/char/ipmi/ipmi_msghandler.c?h=linux-6.12.y#n1899
> > > >
> > > > [ 499.564572] [ T27255] ip6_tunnel: pni_gre_814 xmit: Local address not yet configured!
> > > > [ 499.588176] [ T27255] ip6_tunnel: pni_gre_868 xmit: Local address not yet configured!
> > > > [ 499.605284] [ T27255] ip6_tunnel: pni_gre_871 xmit: Local address not yet configured!
> > > > [ 805.906999] [ T12765] usb 1-1: USB disconnect, device number 2
> > > > [ 845.346020] [ T12765] usb 1-1: new high-speed USB device number 3 using xhci_hcd
> > > > [ 845.485453] [ T12765] usb 1-1: New USB device found, idVendor=1d6b, idProduct=0107, bcdDevice= 1.00
> > > > [ 845.496823] [ T12765] usb 1-1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> > > > [ 845.507242] [ T12765] usb 1-1: Product: USB Virtual Hub
> > > > [ 845.514946] [ T12765] usb 1-1: Manufacturer: Aspeed
> > > > [ 845.522363] [ T12765] usb 1-1: SerialNumber: 00000000
> > > > [ 845.530454] [ T12765] usb 1-1: Device is not authorized for usage
> > > > [ 853.774910] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.783794] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.792649] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.801461] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.810291] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.819069] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.827816] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.836581] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.845326] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.854074] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.862813] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 863.934436] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 863.943420] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 863.952363] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 863.961296] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 878.616336] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 878.624905] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 878.633427] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 878.641954] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 880.310112] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 880.318682] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 880.327083] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 880.335483] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 904.196122] [ C33] watchdog: Watchdog detected hard LOCKUP on cpu 33
> > > > [ 904.196127] [ C97] Uhhuh. NMI received for unknown reason 3d on CPU 97.
> > > > [ 904.196126] [ C6] Uhhuh. NMI received for unknown reason 3d on CPU 6.
> > > > [ 904.196130] [ C33] Modules linked in:
> > > > [ 904.196129] [ C101] Uhhuh. NMI received for unknown reason 3d on CPU 101.
> > > > [ 904.196131] [ C97] Dazed and confused, but trying to continue
> > > > [ 904.196131] [ C33] nft_fwd_netdev
> > > > [ 904.196131] [ C99] Uhhuh. NMI received for unknown reason 2d on CPU 99.
> > > > [ 904.196133] [ C6] Dazed and confused, but trying to continue
> > > > [ 904.196133] [ C102] Uhhuh. NMI received for unknown reason 2d on CPU 102.
> > > > [ 904.196134] [ C33] nf_dup_netdev
> > > > [ 904.196134] [ C35] Uhhuh. NMI received for unknown reason 2d on CPU 35.
> > > > [ 904.196135] [ C101] Dazed and confused, but trying to continue
> > > > [ 904.196137] [ C99] Dazed and confused, but trying to continue
> > > > [ 904.196137] [ C33] xfrm_interface
> > > > [ 904.196136] [ C69] Uhhuh. NMI received for unknown reason 2d on CPU 69.
> > > > [ 904.196140] [ C102] Dazed and confused, but trying to continue
> > > > [ 904.196140] [ C33] xfrm6_tunnel
> > > > [ 904.196138] [ C121] Uhhuh. NMI received for unknown reason 2d on CPU 121.
> > > > [ 904.196140] [ C123] Uhhuh. NMI received for unknown reason 2d on CPU 123.
> > > > [ 904.196142] [ C35] Dazed and confused, but trying to continue
> > > > [ 904.196143] [ C69] Dazed and confused, but trying to continue
> > > > [ 904.196143] [ C33] nft_numgen
> > > > [ 904.196143] [ C61] Uhhuh. NMI received for unknown reason 2d on CPU 61.
> > > > [ 904.196144] [ C62] Uhhuh. NMI received for unknown reason 3d on CPU 62.
> > > > [ 904.196146] [ C123] Dazed and confused, but trying to continue
> > > > [ 904.196147] [ C121] Dazed and confused, but trying to continue
> > > > [ 904.196148] [ C58] Dazed and confused, but trying to continue
> > > > [ 904.196150] [ C33] nft_log nft_limit sit dummy ipip tunnel4 ip_gre gre xfrm_user xfrm_algo tls mpls_iptunnel mpls_router nft_ct nf_tables iptable_raw iptable_nat iptable_mangle ipt_REJECT nf_reject_ipv4 ip6table_security xt_CT ip6table_raw xt_nat ip6table_nat nf_nat xt_TCPMSS xt_owner xt_DSCP xt_NFLOG xt_connbytes xt_connlabel xt_statistic xt_connmark ip6table_mangle xt_limit xt_LOG nf_log_syslog xt_mark xt_conntrack ip6t_REJECT nf_reject_ipv6 xt_multiport xt_set xt_tcpmss xt_comment xt_tcpudp ip6table_filter ip6_tables nfnetlink_log udp_diag dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio iptable_filter veth tcp_diag inet_diag mpls_gso act_mpls cls_flower cls_bpf sch_ingress ip_set_hash_ip ip_set_hash_net ip_set tcp_bbr sch_fq tun xt_bpf nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 fou6 fou ip_tunnel ip6_udp_tunnel udp_tunnel ip6_tunnel tunnel6 nvme_fabrics raid0 md_mod essiv dm_crypt trusted asn1_encoder tee dm_mod dax 8021q garp mrp stp llc ipmi_ssif amd64_edac kvm_amd kvm irqbypass crc32_pclmul crc32c_intel
> > > > [ 904.196247] [ C33] sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd xhci_pci binfmt_misc acpi_ipmi cryptd ipmi_si nvme rapl ipmi_devintf i2c_piix4 tiny_power_button bnxt_en xhci_hcd nvme_core ccp i2c_smbus ipmi_msghandler button fuse configfs nfnetlink efivarfs ip_tables x_tables bcmcrypt(O)
> > > > [ 904.196281] [ C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G O 6.12.34-cloudflare-2025.6.9 #1
> > > > [ 904.196286] [ C33] Tainted: [O]=OOT_MODULE
> > > > [ 904.196287] [ C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
> > > > [ 904.196290] [ C33] RIP: 0010:io_idle+0x3/0x30
> > > > [ 904.196298] [ C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
> > > > [ 904.196301] [ C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
> > > > [ 904.196304] [ C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
> > > > [ 904.196306] [ C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
> > > > [ 904.196308] [ C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
> > > > [ 904.196309] [ C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
> > > > [ 904.196311] [ C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
> > > > [ 904.196313] [ C33] FS: 0000000000000000(0000) GS:ffff8aadcf680000(0000) knlGS:0000000000000000
> > > > [ 904.196316] [ C33] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [ 904.196318] [ C33] CR2: 00005632ce239000 CR3: 0000003944f72004 CR4: 0000000000770ef0
> > > > [ 904.196320] [ C33] PKRU: 55555554
> > > > [ 904.196322] [ C33] Call Trace:
> > > > [ 904.196324] [ C33] <TASK>
> > > > [ 904.196326] [ C33] acpi_idle_do_entry+0x22/0x50
> > > > [ 904.196336] [ C33] acpi_idle_enter+0x7b/0xd0
> > > > [ 904.196340] [ C33] cpuidle_enter_state+0x79/0x420
> > > > [ 904.196345] [ C33] cpuidle_enter+0x2d/0x40
> > > > [ 904.196352] [ C33] do_idle+0x176/0x1c0
> > > > [ 904.196358] [ C33] cpu_startup_entry+0x29/0x30
> > > > [ 904.196362] [ C33] start_secondary+0xf7/0x100
> > > > [ 904.196366] [ C33] common_startup_64+0x13e/0x141
> > > > [ 904.196374] [ C33] </TASK>
> > > > [ 904.196377] [ C33] Kernel panic - not syncing: Hard LOCKUP
> > > > [ 904.196379] [ C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G O 6.12.34-cloudflare-2025.6.9 #1
> > > > [ 904.196383] [ C33] Tainted: [O]=OOT_MODULE
> > > > [ 904.196384] [ C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
> > > > [ 904.196385] [ C33] Call Trace:
> > > > [ 904.196387] [ C33] <NMI>
> > > > [ 904.196389] [ C33] dump_stack_lvl+0x4b/0x70
> > > > [ 904.196394] [ C33] panic+0x106/0x2c4
> > > > [ 904.196401] [ C33] nmi_panic.cold+0xc/0xc
> > > > [ 904.196404] [ C33] watchdog_hardlockup_check.cold+0xc6/0xe8
> > > > [ 904.196409] [ C33] __perf_event_overflow+0x15a/0x450
> > > > [ 904.196416] [ C33] ? srso_alias_return_thunk+0x5/0xfbef5
> > > > [ 904.196421] [ C33] x86_pmu_handle_irq+0x18a/0x1c0
> > > > [ 904.196436] [ C33] ? set_pte_vaddr+0x40/0x50
> > > > [ 904.196439] [ C33] ? srso_alias_return_thunk+0x5/0xfbef5
> > > > [ 904.196442] [ C33] ? srso_alias_return_thunk+0x5/0xfbef5
> > > > [ 904.196445] [ C33] ? native_set_fixmap+0x63/0xb0
> > > > [ 904.196448] [ C33] ? srso_alias_return_thunk+0x5/0xfbef5
> > > > [ 904.196451] [ C33] ? ghes_copy_tofrom_phys+0x7a/0x100
> > > > [ 904.196457] [ C33] ? srso_alias_return_thunk+0x5/0xfbef5
> > > > [ 904.196460] [ C33] ? __ghes_peek_estatus.isra.0+0x49/0xa0
> > > > [ 904.196465] [ C33] amd_pmu_handle_irq+0x4b/0xc0
> > > > [ 904.196469] [ C33] perf_event_nmi_handler+0x2a/0x50
> > > > [ 904.196473] [ C33] nmi_handle.part.0+0x59/0x110
> > > > [ 904.196479] [ C33] default_do_nmi+0x127/0x180
> > > > [ 904.196483] [ C33] exc_nmi+0x103/0x180
> > > > [ 904.196486] [ C33] end_repeat_nmi+0xf/0x53
> > > > [ 904.196489] [ C33] RIP: 0010:io_idle+0x3/0x30
> > > > [ 904.196493] [ C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
> > > > [ 904.196495] [ C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
> > > > [ 904.196497] [ C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
> > > > [ 904.196499] [ C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
> > > > [ 904.196501] [ C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
> > > > [ 904.196502] [ C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
> > > > [ 904.196504] [ C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
> > > > [ 904.196510] [ C33] ? io_idle+0x3/0x30
> > > > [ 904.196515] [ C33] ? io_idle+0x3/0x30
> > > > [ 904.196519] [ C33] </NMI>
> > > > [ 904.196520] [ C33] <TASK>
> > > > [ 904.196521] [ C33] acpi_idle_do_entry+0x22/0x50
> > > > [ 904.196526] [ C33] acpi_idle_enter+0x7b/0xd0
> > > > [ 904.196529] [ C33] cpuidle_enter_state+0x79/0x420
> > > > [ 904.196535] [ C33] cpuidle_enter+0x2d/0x40
> > > > [ 904.196539] [ C33] do_idle+0x176/0x1c0
> > > > [ 904.196544] [ C33] cpu_startup_entry+0x29/0x30
> > > > [ 904.196548] [ C33] start_secondary+0xf7/0x100
> > > > [ 904.196552] [ C33] common_startup_64+0x13e/0x141
> > > > [ 904.196559] [ C33] </TASK>
> > > >
> > > > Best, Fred
next prev parent reply other threads:[~2025-08-06 22:51 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-08-06 20:14 [BUG] ipmi_si: watchdog: Watchdog detected hard LOCKUP Frederick Lawler
2025-08-06 20:19 ` Fred Lawler
2025-08-06 20:39 ` Corey Minyard
2025-08-06 21:16 ` Corey Minyard
2025-08-06 21:36 ` Frederick Lawler
2025-08-06 22:51 ` Corey Minyard [this message]
2025-08-07 19:43 ` Frederick Lawler
2025-08-07 20:29 ` Corey Minyard
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aJPccSayM2nXk891@mail.minyard.net \
--to=corey@minyard.net \
--cc=fred@cloudflare.com \
--cc=kernel-team@cloudflare.com \
--cc=linux-kernel@vger.kernel.org \
--cc=openipmi-developer@lists.sourceforge.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).