From: Corey Minyard <corey@minyard.net>
To: Frederick Lawler <fred@cloudflare.com>
Cc: openipmi-developer@lists.sourceforge.net,
linux-kernel@vger.kernel.org, kernel-team@cloudflare.com
Subject: Re: [BUG] ipmi_si: watchdog: Watchdog detected hard LOCKUP
Date: Wed, 6 Aug 2025 17:51:29 -0500
Message-ID: <aJPccSayM2nXk891@mail.minyard.net>
In-Reply-To: <aJPK6Vuxc1jL-uu_@CMGLRV3>
On Wed, Aug 06, 2025 at 04:36:41PM -0500, Frederick Lawler wrote:
> On Wed, Aug 06, 2025 at 04:16:18PM -0500, Corey Minyard wrote:
> > On Wed, Aug 06, 2025 at 03:19:02PM -0500, Fred Lawler wrote:
> > > + CC: Corey Minyard <corey@minyard.net>
> > >
>
> > I'm wondering if something is happening with the BMC resetting and
> > interactions with ACPI involved in that. Adding the extra part of
> > trying to talk to the BMC while it's being reset could cause the BMC to
> > get confused and do bad things?
> >
>
> Sure, it's a possibility we explored. We have a lot of automation,
> predominantly a Prometheus module exporting IPMI information
> from the sysfs files. We also have config management that queries
> sysfs files to regulate updates, etc. Sometimes the config management
> automation will attempt to reset the BMC.
Ok. I have tests that do BMC resets, but I can't run at the scale you
do, and I'm running in a simulator, so it's not going to behave the
same.
The other possibility is the processor goes into the idle code while
interrupts are off, but I think the kernel has checks all around that.
I can't think of how else a processor would get stuck in idle.
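
As a quick sanity check on the interrupt state, the EFLAGS value from the
trace (EFLAGS: 00000093) can be decoded by hand. Below is a minimal sketch
in Python; the helper names are made up for illustration, and the bit
positions are the standard x86 ones. It shows that IF (bit 9) is clear at
the sampled RIP. As far as I know the idle entry path normally runs with
interrupts disabled anyway, so this alone is not conclusive.

```python
# Decode an x86 EFLAGS value such as the one in the trace (00000093).
# Helper names are hypothetical; bit positions are architectural.
EFLAGS_BITS = {
    0: "CF",
    2: "PF",
    4: "AF",
    6: "ZF",
    7: "SF",
    9: "IF",   # interrupt enable flag
    10: "DF",
    11: "OF",
}


def decode_eflags(value: int) -> list:
    """Return the names of the known flags set in value, lowest bit first."""
    return [name for bit, name in sorted(EFLAGS_BITS.items())
            if value & (1 << bit)]


def interrupts_enabled(value: int) -> bool:
    """True if the IF bit (bit 9) is set."""
    return bool(value & (1 << 9))


if __name__ == "__main__":
    eflags = int("00000093", 16)
    print(decode_eflags(eflags))
    print("interrupts enabled:", interrupts_enabled(eflags))
```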
>
> > > >
> > > > I also tried to load the CPUs with stress-ng, but the best I could
> > > > get was hung tasks.
> > > >
> > > > I identified that smi_send()[1] could be locked behind the
> > > > spin_lock_irqsave() and within the KCS send handler, there's another irq
> > > > save lock. I suspect this is where we're getting hung up. Below is a
> > > > sample stack trace + log output.
> >
> > Yeah, I don't see that in the traceback. There is a lock in the KCS
> > sender, but I don't see how that could do anything.
> >
> > Maybe you could try changing the cpuidle handler? That would be at
> > least something to try.
> >
>
> Would that help in forming a reproducer? I'd need to deploy any kernel
> modifications fleet wide to cast a wide enough net. The lockups aren't
> extremely consistent. We may get a couple or more a week.
Ah, so this isn't readily reproducible. Bummer.
If the problem goes away if you change the cpuidle handler to something
non-ACPI, that would be a big clue that it's an ACPI issue.
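
To confirm which cpuidle driver a given machine is actually using before
and after such a change, something like the following could be run across
the fleet. This is only a sketch: the function names are hypothetical, and
it assumes the usual sysfs file
/sys/devices/system/cpu/cpuidle/current_driver is present. How you switch
the driver depends on your platform and is not covered here.

```python
from pathlib import Path
from typing import Optional

# Usual sysfs location for cpuidle information (assumed to be present).
CPUIDLE_DIR = Path("/sys/devices/system/cpu/cpuidle")


def read_cpuidle_driver(cpuidle_dir: Path = CPUIDLE_DIR) -> Optional[str]:
    """Return the current cpuidle driver name, or None if it is unreadable."""
    try:
        return (cpuidle_dir / "current_driver").read_text().strip()
    except OSError:
        return None


def is_acpi_idle(driver: Optional[str]) -> bool:
    """True if the driver name looks like the ACPI idle driver (acpi_idle)."""
    return driver is not None and driver.startswith("acpi")


if __name__ == "__main__":
    driver = read_cpuidle_driver()
    print("cpuidle driver:", driver)
    print("ACPI idle in use:", is_acpi_idle(driver))
```

If the lockups only ever show up on machines reporting acpi_idle, that
would line up with the traceback above.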
>
> Lastly, I have the rate limit patch backported. I'll be able to start
> testing with that tomorrow, and same with loading the IPMI watchdog
> module.
Ok. I don't have much hope for it making much difference, but it's safe
and will be coming in the next kernel release.
-corey
>
> > -corey
> >
> > > >
> > > > I'm happy to provide traces and additional information, let me know.
> > > >
> > > > Links:
> > > > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/char/ipmi/ipmi_msghandler.c?h=linux-6.12.y#n1899
> > > >
> > > > [ 499.564572] [ T27255] ip6_tunnel: pni_gre_814 xmit: Local address not yet configured!
> > > > [ 499.588176] [ T27255] ip6_tunnel: pni_gre_868 xmit: Local address not yet configured!
> > > > [ 499.605284] [ T27255] ip6_tunnel: pni_gre_871 xmit: Local address not yet configured!
> > > > [ 805.906999] [ T12765] usb 1-1: USB disconnect, device number 2
> > > > [ 845.346020] [ T12765] usb 1-1: new high-speed USB device number 3 using xhci_hcd
> > > > [ 845.485453] [ T12765] usb 1-1: New USB device found, idVendor=1d6b, idProduct=0107, bcdDevice= 1.00
> > > > [ 845.496823] [ T12765] usb 1-1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> > > > [ 845.507242] [ T12765] usb 1-1: Product: USB Virtual Hub
> > > > [ 845.514946] [ T12765] usb 1-1: Manufacturer: Aspeed
> > > > [ 845.522363] [ T12765] usb 1-1: SerialNumber: 00000000
> > > > [ 845.530454] [ T12765] usb 1-1: Device is not authorized for usage
> > > > [ 853.774910] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.783794] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.792649] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.801461] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.810291] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.819069] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.827816] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.836581] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.845326] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.854074] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 853.862813] [ C119] ipmi_si IPI0001:00: KCS in invalid state 6
> > > > [ 863.934436] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 863.943420] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 863.952363] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 863.961296] [ T124929] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 878.616336] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 878.624905] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 878.633427] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 878.641954] [ T126542] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 880.310112] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 880.318682] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 880.327083] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 880.335483] [ T126681] ipmi_si IPI0001:00: KCS in invalid state 7
> > > > [ 904.196122] [ C33] watchdog: Watchdog detected hard LOCKUP on cpu 33
> > > > [ 904.196127] [ C97] Uhhuh. NMI received for unknown reason 3d on CPU 97.
> > > > [ 904.196126] [ C6] Uhhuh. NMI received for unknown reason 3d on CPU 6.
> > > > [ 904.196130] [ C33] Modules linked in:
> > > > [ 904.196129] [ C101] Uhhuh. NMI received for unknown reason 3d on CPU 101.
> > > > [ 904.196131] [ C97] Dazed and confused, but trying to continue
> > > > [ 904.196131] [ C33] nft_fwd_netdev
> > > > [ 904.196131] [ C99] Uhhuh. NMI received for unknown reason 2d on CPU 99.
> > > > [ 904.196133] [ C6] Dazed and confused, but trying to continue
> > > > [ 904.196133] [ C102] Uhhuh. NMI received for unknown reason 2d on CPU 102.
> > > > [ 904.196134] [ C33] nf_dup_netdev
> > > > [ 904.196134] [ C35] Uhhuh. NMI received for unknown reason 2d on CPU 35.
> > > > [ 904.196135] [ C101] Dazed and confused, but trying to continue
> > > > [ 904.196137] [ C99] Dazed and confused, but trying to continue
> > > > [ 904.196137] [ C33] xfrm_interface
> > > > [ 904.196136] [ C69] Uhhuh. NMI received for unknown reason 2d on CPU 69.
> > > > [ 904.196140] [ C102] Dazed and confused, but trying to continue
> > > > [ 904.196140] [ C33] xfrm6_tunnel
> > > > [ 904.196138] [ C121] Uhhuh. NMI received for unknown reason 2d on CPU 121.
> > > > [ 904.196140] [ C123] Uhhuh. NMI received for unknown reason 2d on CPU 123.
> > > > [ 904.196142] [ C35] Dazed and confused, but trying to continue
> > > > [ 904.196143] [ C69] Dazed and confused, but trying to continue
> > > > [ 904.196143] [ C33] nft_numgen
> > > > [ 904.196143] [ C61] Uhhuh. NMI received for unknown reason 2d on CPU 61.
> > > > [ 904.196144] [ C62] Uhhuh. NMI received for unknown reason 3d on CPU 62.
> > > > [ 904.196146] [ C123] Dazed and confused, but trying to continue
> > > > [ 904.196147] [ C121] Dazed and confused, but trying to continue
> > > > [ 904.196148] [ C58] Dazed and confused, but trying to continue
> > > > [ 904.196150] [ C33] nft_log nft_limit sit dummy ipip tunnel4 ip_gre gre xfrm_user xfrm_algo tls mpls_iptunnel mpls_router nft_ct nf_tables iptable_raw iptable_nat iptable_mangle ipt_REJECT nf_reject_ipv4 ip6table_security xt_CT ip6table_raw xt_nat ip6table_nat nf_nat xt_TCPMSS xt_owner xt_DSCP xt_NFLOG xt_connbytes xt_connlabel xt_statistic xt_connmark ip6table_mangle xt_limit xt_LOG nf_log_syslog xt_mark xt_conntrack ip6t_REJECT nf_reject_ipv6 xt_multiport xt_set xt_tcpmss xt_comment xt_tcpudp ip6table_filter ip6_tables nfnetlink_log udp_diag dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio iptable_filter veth tcp_diag inet_diag mpls_gso act_mpls cls_flower cls_bpf sch_ingress ip_set_hash_ip ip_set_hash_net ip_set tcp_bbr sch_fq tun xt_bpf nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 fou6 fou ip_tunnel ip6_udp_tunnel udp_tunnel ip6_tunnel tunnel6 nvme_fabrics raid0 md_mod essiv dm_crypt trusted asn1_encoder tee dm_mod dax 8021q garp mrp stp llc ipmi_ssif amd64_edac kvm_amd kvm irqbypass crc32_pclmul crc32c_intel
> > > > [ 904.196247] [ C33] sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd xhci_pci binfmt_misc acpi_ipmi cryptd ipmi_si nvme rapl ipmi_devintf i2c_piix4 tiny_power_button bnxt_en xhci_hcd nvme_core ccp i2c_smbus ipmi_msghandler button fuse configfs nfnetlink efivarfs ip_tables x_tables bcmcrypt(O)
> > > > [ 904.196281] [ C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G O 6.12.34-cloudflare-2025.6.9 #1
> > > > [ 904.196286] [ C33] Tainted: [O]=OOT_MODULE
> > > > [ 904.196287] [ C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
> > > > [ 904.196290] [ C33] RIP: 0010:io_idle+0x3/0x30
> > > > [ 904.196298] [ C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
> > > > [ 904.196301] [ C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
> > > > [ 904.196304] [ C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
> > > > [ 904.196306] [ C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
> > > > [ 904.196308] [ C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
> > > > [ 904.196309] [ C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
> > > > [ 904.196311] [ C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
> > > > [ 904.196313] [ C33] FS: 0000000000000000(0000) GS:ffff8aadcf680000(0000) knlGS:0000000000000000
> > > > [ 904.196316] [ C33] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [ 904.196318] [ C33] CR2: 00005632ce239000 CR3: 0000003944f72004 CR4: 0000000000770ef0
> > > > [ 904.196320] [ C33] PKRU: 55555554
> > > > [ 904.196322] [ C33] Call Trace:
> > > > [ 904.196324] [ C33] <TASK>
> > > > [ 904.196326] [ C33] acpi_idle_do_entry+0x22/0x50
> > > > [ 904.196336] [ C33] acpi_idle_enter+0x7b/0xd0
> > > > [ 904.196340] [ C33] cpuidle_enter_state+0x79/0x420
> > > > [ 904.196345] [ C33] cpuidle_enter+0x2d/0x40
> > > > [ 904.196352] [ C33] do_idle+0x176/0x1c0
> > > > [ 904.196358] [ C33] cpu_startup_entry+0x29/0x30
> > > > [ 904.196362] [ C33] start_secondary+0xf7/0x100
> > > > [ 904.196366] [ C33] common_startup_64+0x13e/0x141
> > > > [ 904.196374] [ C33] </TASK>
> > > > [ 904.196377] [ C33] Kernel panic - not syncing: Hard LOCKUP
> > > > [ 904.196379] [ C33] CPU: 33 UID: 0 PID: 0 Comm: swapper/33 Kdump: loaded Tainted: G O 6.12.34-cloudflare-2025.6.9 #1
> > > > [ 904.196383] [ C33] Tainted: [O]=OOT_MODULE
> > > > [ 904.196384] [ C33] Hardware name: GIGABYTE R162-Z12-CD-G11P5/MZ12-HD4-CD, BIOS M10-sig 02/17/2025
> > > > [ 904.196385] [ C33] Call Trace:
> > > > [ 904.196387] [ C33] <NMI>
> > > > [ 904.196389] [ C33] dump_stack_lvl+0x4b/0x70
> > > > [ 904.196394] [ C33] panic+0x106/0x2c4
> > > > [ 904.196401] [ C33] nmi_panic.cold+0xc/0xc
> > > > [ 904.196404] [ C33] watchdog_hardlockup_check.cold+0xc6/0xe8
> > > > [ 904.196409] [ C33] __perf_event_overflow+0x15a/0x450
> > > > [ 904.196416] [ C33] ? srso_alias_return_thunk+0x5/0xfbef5
> > > > [ 904.196421] [ C33] x86_pmu_handle_irq+0x18a/0x1c0
> > > > [ 904.196436] [ C33] ? set_pte_vaddr+0x40/0x50
> > > > [ 904.196439] [ C33] ? srso_alias_return_thunk+0x5/0xfbef5
> > > > [ 904.196442] [ C33] ? srso_alias_return_thunk+0x5/0xfbef5
> > > > [ 904.196445] [ C33] ? native_set_fixmap+0x63/0xb0
> > > > [ 904.196448] [ C33] ? srso_alias_return_thunk+0x5/0xfbef5
> > > > [ 904.196451] [ C33] ? ghes_copy_tofrom_phys+0x7a/0x100
> > > > [ 904.196457] [ C33] ? srso_alias_return_thunk+0x5/0xfbef5
> > > > [ 904.196460] [ C33] ? __ghes_peek_estatus.isra.0+0x49/0xa0
> > > > [ 904.196465] [ C33] amd_pmu_handle_irq+0x4b/0xc0
> > > > [ 904.196469] [ C33] perf_event_nmi_handler+0x2a/0x50
> > > > [ 904.196473] [ C33] nmi_handle.part.0+0x59/0x110
> > > > [ 904.196479] [ C33] default_do_nmi+0x127/0x180
> > > > [ 904.196483] [ C33] exc_nmi+0x103/0x180
> > > > [ 904.196486] [ C33] end_repeat_nmi+0xf/0x53
> > > > [ 904.196489] [ C33] RIP: 0010:io_idle+0x3/0x30
> > > > [ 904.196493] [ C33] Code: 8b 00 a8 08 75 07 e8 2c e4 ff ff 90 fa e9 c0 b3 1a 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 fa ec <48> 8b 05 96 42 d4 01 a9 00 00 00 80 75 11 80 3d 4a 42 d4 01 00 75
> > > > [ 904.196495] [ C33] RSP: 0018:ffff9afa88307e70 EFLAGS: 00000093
> > > > [ 904.196497] [ C33] RAX: 0000000000000000 RBX: ffff8abdf2b5d898 RCX: 0000000000000040
> > > > [ 904.196499] [ C33] RDX: 0000000000000814 RSI: ffff8abdf2b5d800 RDI: 0000000000000814
> > > > [ 904.196501] [ C33] RBP: 0000000000000002 R08: ffffffffa9dff860 R09: 0000000000000007
> > > > [ 904.196502] [ C33] R10: 000000e65239d580 R11: 071c71c71c71c71c R12: ffffffffa9dff860
> > > > [ 904.196504] [ C33] R13: ffffffffa9dff948 R14: 0000000000000002 R15: 0000000000000000
> > > > [ 904.196510] [ C33] ? io_idle+0x3/0x30
> > > > [ 904.196515] [ C33] ? io_idle+0x3/0x30
> > > > [ 904.196519] [ C33] </NMI>
> > > > [ 904.196520] [ C33] <TASK>
> > > > [ 904.196521] [ C33] acpi_idle_do_entry+0x22/0x50
> > > > [ 904.196526] [ C33] acpi_idle_enter+0x7b/0xd0
> > > > [ 904.196529] [ C33] cpuidle_enter_state+0x79/0x420
> > > > [ 904.196535] [ C33] cpuidle_enter+0x2d/0x40
> > > > [ 904.196539] [ C33] do_idle+0x176/0x1c0
> > > > [ 904.196544] [ C33] cpu_startup_entry+0x29/0x30
> > > > [ 904.196548] [ C33] start_secondary+0xf7/0x100
> > > > [ 904.196552] [ C33] common_startup_64+0x13e/0x141
> > > > [ 904.196559] [ C33] </TASK>
> > > >
> > > > Best, Fred
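
For comparing logs from several lockups, the KCS messages and the gap
before the watchdog fires can be pulled out mechanically. The sketch below
is only an illustration: the function name and regular expressions are
assumptions based on the line format in the log above, not an existing
tool. It counts "KCS in invalid state N" messages per state and reports
the time between the last one and the first hard-lockup report.

```python
import re
from collections import Counter

# Matches "[ 853.774910] [ C119] message", with or without quote prefixes.
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s*\[\s*([CT]\d+)\]\s*(.*)")
KCS_RE = re.compile(r"KCS in invalid state (\d+)")
LOCKUP_TEXT = "Watchdog detected hard LOCKUP"


def summarize(lines):
    """Return (per-state KCS counts, seconds from last KCS error to lockup).

    The gap is None if either kind of message is missing.
    """
    counts = Counter()
    last_kcs = None
    lockup = None
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        ts, msg = float(m.group(1)), m.group(3)
        k = KCS_RE.search(msg)
        if k and lockup is None:
            # Only consider KCS errors seen before the lockup report.
            counts[int(k.group(1))] += 1
            last_kcs = ts
        elif LOCKUP_TEXT in msg and lockup is None:
            lockup = ts
    gap = None if last_kcs is None or lockup is None else lockup - last_kcs
    return counts, gap


if __name__ == "__main__":
    import sys
    counts, gap = summarize(sys.stdin)
    print(dict(counts), gap)
```

In the excerpt above, there are 11 state 6 messages and 12 state 7
messages, and the last of them is logged roughly 24 seconds before the
watchdog report.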