Hi,
On my machine I sometimes get a kernel panic when the wireless driver
crashes (fw dump).
This is a standard Fedora kernel, so I can grab debuginfo and symbolize
the stack trace if you'd like, but I think this message might be enough
in itself without extra lines (longer excerpt attached).
---
UBSAN: array-index-out-of-bounds in drivers/net/wireless/realtek/rtw89/pci.c:593:24
index 32767 is out of range for type 'rtw89_pci_tx_wd [512]'
CPU: 11 UID: 0 PID: 2179 Comm: irq/153-rtw89_p Not tainted 6.14.6-300.fc42.x86_64 #1
Call Trace:
dump_stack_lvl+0x5d/0x80
ubsan_epilogue+0x5/0x30
__ubsan_handle_out_of_bounds.cold+0x54/0x59
rtw89_pci_release_tx_skbs.isra.0+0x291/0x2d0 [rtw89_pci]
rtw89_pci_release_tx+0x1c/0x50 [rtw89_pci]
rtw89_pci_napi_poll+0x99/0x170 [rtw89_pci]
__napi_poll+0x2e/0x1b0
? nohz_balance_exit_idle+0x88/0x100
net_rx_action+0x333/0x420
handle_softirqs+0xed/0x340
? __pfx_irq_thread_fn+0x10/0x10
do_softirq.part.0+0x3b/0x60
__local_bh_enable_ip+0x60/0x70
rtw89_pci_interrupt_threadfn+0xf7/0x260 [rtw89_pci]
? __pfx_irq_thread+0x10/0x10
? __pfx_irq_thread_fn+0x10/0x10
irq_thread_fn+0x22/0x60
irq_thread+0xea/0x1c0
---
The line in question would be:
> txwd = &wd_ring->pages[seq];
(which matches, as pages is an array of 512 rtw89_pci_tx_wd structs)
Checking seq < RTW89_PCI_TXWD_NUM_MAX is trivial and I could send a
patch, but if that data is really bogus I assume any local check could
be fooled, e.g. the value could be < 512 and still incorrect.
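For illustration, here is a minimal userspace sketch of the kind of
range check I mean. The struct and function names are made-up
stand-ins, not the driver's real definitions; only the array size of
512 is taken from the UBSAN report.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for RTW89_PCI_TXWD_NUM_MAX; 512 matches the UBSAN report. */
#define TXWD_NUM_MAX 512

/* Hypothetical, reduced stand-ins for the driver's structs. */
struct tx_wd {
	uint32_t seq;
};

struct tx_wd_ring {
	struct tx_wd pages[TXWD_NUM_MAX];
};

/*
 * Return the descriptor for seq, or NULL if seq cannot index pages[].
 * This only prevents the out-of-bounds access; an in-range but stale
 * or corrupted seq is still accepted.
 */
static struct tx_wd *tx_wd_lookup(struct tx_wd_ring *ring, uint32_t seq)
{
	if (seq >= TXWD_NUM_MAX)
		return NULL;
	return &ring->pages[seq];
}
```

With this, the reported index 32767 yields NULL instead of a pointer
past the end of the array, and the caller would have to drop that
release report.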
I'm afraid I don't have a reproducer. This machine is unusual in that I
have (had) ethernet and wlan both up in the same subnet, so packets such
as broadcasts would be received on both interfaces and things might get
weird; I also suspend/resume. I set up netconsole after the 3rd crash in
~2 weeks, so it's fairly rare.
It's a new machine so I don't know if it's a regression.
That 'FW status' message comes from rtw89_mac_get_err_status()
-> rtw89_fw_st_dbg_dump() so another error obviously came in first --
perhaps in that case just give up and don't process incoming packets?
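To make that suggestion concrete, a rough sketch of the idea, assuming
a per-device error flag that is set when the firmware reports an
error. All names here are hypothetical and do not correspond to
existing rtw89 code; it only shows the "stop trusting reports after an
error" shape.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device state; not the driver's real layout. */
struct dev_state {
	bool fw_error_seen;	/* set once the firmware reported an error */
	size_t released;	/* TX release reports processed so far */
};

/* Called from the (hypothetical) firmware error path. */
static void note_fw_error(struct dev_state *d)
{
	d->fw_error_seen = true;
}

/*
 * Process up to 'pending' release reports and return how many were
 * handled. After a firmware error the reports are skipped entirely,
 * since their contents can no longer be trusted.
 */
static size_t release_tx_reports(struct dev_state *d, size_t pending)
{
	if (d->fw_error_seen)
		return 0;
	d->released += pending;
	return pending;
}
```

Whether the flag should be cleared again after recovery is a separate
question that this sketch does not address.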
I've just turned off the radio as I don't actually use it, so I'll leave
what to do about this up to you; if you want me to test something I can
rebuild a kernel and test, but I'm not sure how long it'd take to
actually hit the bug.
Thanks,
--
Dominique