* [PATCH v3 1/3] wifi: iwlwifi: add STATUS_FW_ERROR guards to NAPI/TX-notif paths
2026-04-20 17:44 [PATCH v3 0/3] wifi: iwlwifi: mld: stability fixes around firmware error recovery Cole Leavitt
@ 2026-04-20 17:44 ` Cole Leavitt
2026-04-20 17:44 ` [PATCH v3 2/3] wifi: iwlwifi: mld: fix TSO segmentation when AMSDU is disabled Cole Leavitt
2026-04-20 17:44 ` [PATCH v3 3/3] wifi: iwlwifi: mld: skip TX when firmware is dead Cole Leavitt
2 siblings, 0 replies; 4+ messages in thread
From: Cole Leavitt @ 2026-04-20 17:44 UTC (permalink / raw)
To: linux-wireless; +Cc: greearb, miriam.rachel.korenblit, johannes, cole, stable
After firmware error is detected and STATUS_FW_ERROR is set, NAPI may
still be in-flight from a prior interrupt or get scheduled by the MSIX
IRQ handler before the error bit is processed. The NAPI poll functions
have no STATUS_FW_ERROR check and will continue processing stale RX ring
entries from dying firmware.
iwl_trans_reclaim() already early-returns on STATUS_FW_ERROR, so any
TX-response notification that makes it through to reclaim is a no-op.
What remains is:
* CPU spent parsing stale RX inside iwl_pcie_rx_handle() before
dispatching to the op_mode.
* No signal in the logs when the race fires, making the
post-FW-error sequence harder to debug.
Add STATUS_FW_ERROR early-returns with WARN_ONCE() in four places:
* iwl_pcie_napi_poll() (legacy NAPI poll)
* iwl_pcie_napi_poll_msix() (MSIX NAPI poll)
* iwl_mld_handle_tx_resp_notif()
* iwl_mld_handle_compressed_ba_notif()
Rationale:
1. Stop NAPI from consuming any more RX budget once firmware is
declared dead; the restart path will re-initialise the rings.
2. Provide a single, one-shot log line via WARN_ONCE so we can tell
from a user's dmesg whether the post-error race actually fired in
their configuration, which has been hard to reproduce outside
Ben Greear's test rig.
_iwl_trans_pcie_gen2_stop_device() already calls iwl_pcie_rx_napi_sync()
to quiesce NAPI during device teardown, but that runs much later in the
restart sequence; these checks close the window between error detection
and device stop.
Fixes: d1e879ec600f ("wifi: iwlwifi: add iwlmld sub-driver")
Cc: stable@vger.kernel.org
Tested-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Cole Leavitt <cole@unwrap.rs>
---
drivers/net/wireless/intel/iwlwifi/mld/tx.c | 19 +++++++++++++++++++
drivers/net/wireless/intel/iwlwifi/pcie/gen1_2/rx.c | 18 ++++++++++++++++++
2 files changed, 37 insertions(+)
diff --git a/drivers/net/wireless/intel/iwlwifi/mld/tx.c b/drivers/net/wireless/intel/iwlwifi/mld/tx.c
index 546d09a38dab..e341d12e5233 100644
--- a/drivers/net/wireless/intel/iwlwifi/mld/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/mld/tx.c
@@ -1082,6 +1082,15 @@ void iwl_mld_handle_tx_resp_notif(struct iwl_mld *mld,
bool mgmt = false;
bool tx_failure = (status & TX_STATUS_MSK) != TX_STATUS_SUCCESS;
+ /* iwl_trans_reclaim() already guards on STATUS_FW_ERROR, but
+ * bail out earlier (and log once) so we can tell from dmesg
+ * whether this race actually fires in the field.
+ */
+ if (unlikely(test_bit(STATUS_FW_ERROR, &mld->trans->status))) {
+ WARN_ONCE(1, "iwlwifi: TX resp notif (sta=%d txq=%d) after FW error\n",
+ sta_id, txq_id);
+ return;
+ }
if (IWL_FW_CHECK(mld, tx_resp->frame_count != 1,
"Invalid tx_resp notif frame_count (%d)\n",
tx_resp->frame_count))
@@ -1360,6 +1369,16 @@ void iwl_mld_handle_compressed_ba_notif(struct iwl_mld *mld,
u8 sta_id = ba_res->sta_id;
struct ieee80211_link_sta *link_sta;
+ /* Same rationale as iwl_mld_handle_tx_resp_notif: redundant with
+ * iwl_trans_reclaim()'s own STATUS_FW_ERROR check, but fails fast
+ * and logs via WARN_ONCE when the race is actually hit.
+ */
+ if (unlikely(test_bit(STATUS_FW_ERROR, &mld->trans->status))) {
+ WARN_ONCE(1, "iwlwifi: BA notif (sta=%d) after FW error\n",
+ sta_id);
+ return;
+ }
+
if (!tfd_cnt)
return;
diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/gen1_2/rx.c b/drivers/net/wireless/intel/iwlwifi/pcie/gen1_2/rx.c
index fe263cdc2e4f..554c22777ec1 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/gen1_2/rx.c
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/gen1_2/rx.c
@@ -1012,6 +1012,15 @@ static int iwl_pcie_napi_poll(struct napi_struct *napi, int budget)
trans_pcie = iwl_netdev_to_trans_pcie(napi->dev);
trans = trans_pcie->trans;
+ /* Don't process RX for dying firmware; the restart path will
+ * re-init the rings. WARN_ONCE helps surface whether this race
+ * actually fires in user dmesg.
+ */
+ if (unlikely(test_bit(STATUS_FW_ERROR, &trans->status))) {
+ WARN_ONCE(1, "iwlwifi: NAPI poll[%d] invoked after FW error\n",
+ rxq->id);
+ napi_complete_done(napi, 0);
+ return 0;
+ }
+
ret = iwl_pcie_rx_handle(trans, rxq->id, budget);
IWL_DEBUG_ISR(trans, "[%d] handled %d, budget %d\n",
@@ -1039,6 +1048,15 @@ static int iwl_pcie_napi_poll_msix(struct napi_struct *napi, int budget)
trans_pcie = iwl_netdev_to_trans_pcie(napi->dev);
trans = trans_pcie->trans;
+ if (unlikely(test_bit(STATUS_FW_ERROR, &trans->status))) {
+ WARN_ONCE(1,
+ "iwlwifi: NAPI MSIX poll[%d] invoked after FW error\n",
+ rxq->id);
+ napi_complete_done(napi, 0);
+ return 0;
+ }
+
ret = iwl_pcie_rx_handle(trans, rxq->id, budget);
IWL_DEBUG_ISR(trans, "[%d] handled %d, budget %d\n", rxq->id, ret,
budget);
--
2.52.0
^ permalink raw reply related [flat|nested] 4+ messages in thread* [PATCH v3 2/3] wifi: iwlwifi: mld: fix TSO segmentation when AMSDU is disabled
2026-04-20 17:44 [PATCH v3 0/3] wifi: iwlwifi: mld: stability fixes around firmware error recovery Cole Leavitt
2026-04-20 17:44 ` [PATCH v3 1/3] wifi: iwlwifi: add STATUS_FW_ERROR guards to NAPI/TX-notif paths Cole Leavitt
@ 2026-04-20 17:44 ` Cole Leavitt
2026-04-20 17:44 ` [PATCH v3 3/3] wifi: iwlwifi: mld: skip TX when firmware is dead Cole Leavitt
2 siblings, 0 replies; 4+ messages in thread
From: Cole Leavitt @ 2026-04-20 17:44 UTC (permalink / raw)
To: linux-wireless; +Cc: greearb, miriam.rachel.korenblit, johannes, cole, stable
When the TLC notification disables AMSDU for a TID, mld/tlc.c (line 858)
sets link_sta->agg.max_tid_amsdu_len[TID] to the sentinel value 1. The
TSO segmentation path in iwl_mld_tx_tso_segment() guards only against
zero, not this sentinel, so it reaches the num_subframes calculation:
num_subframes = (max_tid_amsdu_len + pad) / (subf_len + pad)
= (1 + 2) / (1534 + 2) = 0
and then passes num_subframes=0 to iwl_tx_tso_segment(), which sets
gso_size = num_subframes * mss = 0
Calling skb_gso_segment() with gso_size=0 produces an unbounded number
of tiny segments from a single GSO skb. On a BE200 we've observed the
expansion push thousands of micro-frames into the TX ring before the
rest are purged.
The MVM driver is immune because it gates on mvmsta->amsdu_enabled (see
mvm/tx.c lines 910-913) before reaching the num_subframes calculation.
MLD has no equivalent bitmap and relies solely on max_tid_amsdu_len,
which does not recognise the sentinel 1.
Fix by checking max_tid_amsdu_len == 1 at the existing guard and
falling back to non-AMSDU TSO segmentation (Suggested-by Miriam
Korenblit). Also add WARN_ON_ONCE(!num_subframes) after the division
as defense-in-depth so any future path that produces zero through a
different mechanism is logged rather than silently creating a
pathological GSO skb.
Downstream user-visible symptoms (TCP retransmit queue corruption on
some setups, firmware MMIO-poll hang on BE200 with c102 firmware) have
been reported in connection with this bug but the causal chain between
the GSO explosion and those symptoms is being investigated separately
with Ben Greear, so they are not claimed here.
Suggested-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com>
Fixes: d1e879ec600f ("wifi: iwlwifi: add iwlmld sub-driver")
Cc: stable@vger.kernel.org
Signed-off-by: Cole Leavitt <cole@unwrap.rs>
---
drivers/net/wireless/intel/iwlwifi/mld/tx.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/net/wireless/intel/iwlwifi/mld/tx.c b/drivers/net/wireless/intel/iwlwifi/mld/tx.c
index e341d12e5233..8af58aabcd68 100644
--- a/drivers/net/wireless/intel/iwlwifi/mld/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/mld/tx.c
@@ -823,7 +823,7 @@ static int iwl_mld_tx_tso_segment(struct iwl_mld *mld, struct sk_buff *skb,
return -EINVAL;
max_tid_amsdu_len = sta->cur->max_tid_amsdu_len[tid];
- if (!max_tid_amsdu_len)
+ if (!max_tid_amsdu_len || max_tid_amsdu_len == 1)
return iwl_tx_tso_segment(skb, 1, netdev_flags, mpdus_skbs);
/* Sub frame header + SNAP + IP header + TCP header + MSS */
@@ -835,6 +835,9 @@ static int iwl_mld_tx_tso_segment(struct iwl_mld *mld, struct sk_buff *skb,
*/
num_subframes = (max_tid_amsdu_len + pad) / (subf_len + pad);
+ if (WARN_ON_ONCE(!num_subframes))
+ return iwl_tx_tso_segment(skb, 1, netdev_flags, mpdus_skbs);
+
if (sta->max_amsdu_subframes &&
num_subframes > sta->max_amsdu_subframes)
num_subframes = sta->max_amsdu_subframes;
--
2.52.0
^ permalink raw reply related [flat|nested] 4+ messages in thread* [PATCH v3 3/3] wifi: iwlwifi: mld: skip TX when firmware is dead
2026-04-20 17:44 [PATCH v3 0/3] wifi: iwlwifi: mld: stability fixes around firmware error recovery Cole Leavitt
2026-04-20 17:44 ` [PATCH v3 1/3] wifi: iwlwifi: add STATUS_FW_ERROR guards to NAPI/TX-notif paths Cole Leavitt
2026-04-20 17:44 ` [PATCH v3 2/3] wifi: iwlwifi: mld: fix TSO segmentation when AMSDU is disabled Cole Leavitt
@ 2026-04-20 17:44 ` Cole Leavitt
2 siblings, 0 replies; 4+ messages in thread
From: Cole Leavitt @ 2026-04-20 17:44 UTC (permalink / raw)
To: linux-wireless; +Cc: greearb, miriam.rachel.korenblit, johannes, cole
When firmware encounters an error, STATUS_FW_ERROR is set but the
mac80211 TX path continues pulling frames from TXQs. Each frame
fails at iwl_trans_tx() which checks STATUS_FW_ERROR and returns
-EIO, but iwl_mld_tx_from_txq() keeps looping over every queued
frame. This burns CPU in a tight loop on dead firmware and can
cause soft lockups during firmware error recovery.
Add a STATUS_FW_ERROR check at the top of iwl_mld_tx_from_txq()
to stop pulling frames from mac80211 TXQs when firmware is dead.
Also guard iwl_mld_mac80211_tx() which bypasses the TXQ path
entirely and would otherwise continue feeding frames to dead
firmware.
Once STATUS_FW_ERROR is cleared during firmware restart, TX
resumes naturally with no explicit wake needed.
Fixes: d1e879ec600f ("wifi: iwlwifi: add iwlmld sub-driver")
Signed-off-by: Cole Leavitt <cole@unwrap.rs>
---
drivers/net/wireless/intel/iwlwifi/mld/mac80211.c | 4 ++++
drivers/net/wireless/intel/iwlwifi/mld/tx.c | 3 +++
2 files changed, 7 insertions(+)
diff --git a/drivers/net/wireless/intel/iwlwifi/mld/mac80211.c b/drivers/net/wireless/intel/iwlwifi/mld/mac80211.c
index 71a9a72c9ac0..0df3be3089c3 100644
--- a/drivers/net/wireless/intel/iwlwifi/mld/mac80211.c
+++ b/drivers/net/wireless/intel/iwlwifi/mld/mac80211.c
@@ -519,6 +519,10 @@ iwl_mld_mac80211_tx(struct ieee80211_hw *hw,
u32 link_id = u32_get_bits(info->control.flags,
IEEE80211_TX_CTRL_MLO_LINK);
+ if (unlikely(test_bit(STATUS_FW_ERROR, &mld->trans->status))) {
+ ieee80211_free_txskb(hw, skb);
+ return;
+ }
/* In AP mode, mgmt frames are sent on the bcast station,
* so the FW can't translate the MLD addr to the link addr. Do it here
*/
diff --git a/drivers/net/wireless/intel/iwlwifi/mld/tx.c b/drivers/net/wireless/intel/iwlwifi/mld/tx.c
index 8af58aabcd68..33bd2e336166 100644
--- a/drivers/net/wireless/intel/iwlwifi/mld/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/mld/tx.c
@@ -962,6 +962,9 @@ void iwl_mld_tx_from_txq(struct iwl_mld *mld, struct ieee80211_txq *txq)
struct sk_buff *skb = NULL;
u8 zero_addr[ETH_ALEN] = {};
+ if (unlikely(test_bit(STATUS_FW_ERROR, &mld->trans->status)))
+ return;
+
/*
* No need for threads to be pending here, they can leave the first
* taker all the work.
--
2.52.0
^ permalink raw reply related [flat|nested] 4+ messages in thread