public inbox for linux-wireless@vger.kernel.org
* [PATCH v3 0/3] wifi: iwlwifi: mld: stability fixes around firmware error recovery
@ 2026-04-20 17:44 Cole Leavitt
  2026-04-20 17:44 ` [PATCH v3 1/3] wifi: iwlwifi: add STATUS_FW_ERROR guards to NAPI/TX-notif paths Cole Leavitt
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Cole Leavitt @ 2026-04-20 17:44 UTC (permalink / raw)
  To: linux-wireless; +Cc: greearb, miriam.rachel.korenblit, johannes, cole

Three fixes for issues in the iwlmld sub-driver, observed on Intel BE200
(Wi-Fi 7).

1/3 adds STATUS_FW_ERROR guards to the NAPI poll functions and to the
    TX/BA notification handlers.  iwl_trans_reclaim() already has its
    own STATUS_FW_ERROR early-return, so these are defense-in-depth
    with WARN_ONCE instrumentation: if the suspected post-FW-error race
    fires, we now catch it early (before reclaim) and log it.  Tested by
    Ben Greear, who confirmed the WARN fires on his systems during
    firmware error recovery.
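
    As a rough illustration of the guard pattern described in 1/3, here
    is a userspace model, not the in-tree code.  Every name is
    hypothetical (napi_model_*), the status bit value is made up, and the
    once-only latch lives in the struct rather than per call site as with
    the kernel's WARN_ONCE().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the STATUS_FW_ERROR bit; not the in-tree value. */
#define NAPI_MODEL_FW_ERROR (1u << 0)

struct napi_model_trans {
	unsigned int status; /* status bitmask */
	bool warned;         /* models the once-only latch of WARN_ONCE() */
	int warnings;        /* how many warnings were actually logged */
	int reclaims;        /* how many times the reclaim step ran */
};

/* Returns true if firmware is flagged dead; logs only the first time. */
static bool napi_model_fw_dead(struct napi_model_trans *t, const char *where)
{
	assert(t != NULL);

	if (!(t->status & NAPI_MODEL_FW_ERROR))
		return false;

	if (!t->warned) {
		t->warned = true;
		t->warnings++;
		fprintf(stderr, "model: %s ran after FW error\n", where);
	}
	return true;
}

/* Models a TX/BA notification handler: bail out before the reclaim step. */
static int napi_model_handle_tx_notif(struct napi_model_trans *t)
{
	if (napi_model_fw_dead(t, "tx notif"))
		return -1;

	t->reclaims++; /* stands in for the reclaim call */
	return 0;
}
```

    In this model the handler keeps returning early for as long as the
    error bit is set, but the log line is emitted once, which is the
    property that makes the guard usable as instrumentation.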

2/3 fixes a real TSO segmentation bug.  When the TLC notification
    disables AMSDU for a TID, max_tid_amsdu_len is set to the sentinel
    value 1 (see mld/tlc.c line 858).  The existing zero-only check in
    iwl_mld_tx_tso_segment() lets this sentinel through, producing
    num_subframes=0, which feeds gso_size=0 into iwl_tx_tso_segment()
    and downstream skb_gso_segment().

    MVM is immune because it gates on mvmsta->amsdu_enabled; MLD has no
    equivalent bitmap.  Fix by also treating the sentinel 1 as "AMSDU
    disabled" at the existing guard, and add a WARN_ON_ONCE(!num_subframes)
    after the division so any future path that produces zero through a
    different mechanism is caught and reported rather than silently
    creating a pathological GSO skb.
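
    The arithmetic behind 2/3 can be modelled in userspace as follows.
    This is a sketch under assumptions: the header sizes are fixed
    constants for plain IPv4/TCP, the division mirrors the usual
    "A-MSDU length over padded subframe length" form and may differ in
    detail from the in-tree code, and all model_* names are hypothetical.
    Falling back to one subframe when A-MSDU is disabled is also an
    assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed header sizes (IPv4, no options); not read from the driver. */
#define MODEL_ETH_HLEN 14u
#define MODEL_SNAP_LEN  8u
#define MODEL_IP_HLEN  20u
#define MODEL_TCP_HLEN 20u

/* Sentinel stored when A-MSDU is disabled for a TID, per the text above. */
#define MODEL_AMSDU_DISABLED 1u

/*
 * Returns how many subframes fit into one A-MSDU.
 * patched == false models the zero-only guard, patched == true also
 * treats the sentinel as "A-MSDU disabled".
 */
static unsigned int model_num_subframes(unsigned int max_tid_amsdu_len,
					unsigned int mss, bool patched)
{
	unsigned int subf_len, pad;

	assert(mss > 0);

	if (!max_tid_amsdu_len ||
	    (patched && max_tid_amsdu_len == MODEL_AMSDU_DISABLED))
		return 1; /* model assumption: one MSS per frame */

	/* Subframe header + SNAP + IP + TCP + payload, padded to 4 bytes. */
	subf_len = MODEL_ETH_HLEN + MODEL_SNAP_LEN + MODEL_IP_HLEN +
		   MODEL_TCP_HLEN + mss;
	pad = (4 - subf_len) & 0x3;

	return (max_tid_amsdu_len + pad) / (subf_len + pad);
}

/* gso_size as derived from the subframe count. */
static unsigned int model_gso_size(unsigned int num_subframes,
				   unsigned int mss)
{
	return num_subframes * mss;
}
```

    With mss = 1460 the unpatched model maps the sentinel 1 to zero
    subframes and therefore to gso_size = 0, while the patched guard
    returns before the division is reached.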

3/3 adds STATUS_FW_ERROR checks at the top of iwl_mld_tx_from_txq() and
    iwl_mld_mac80211_tx() to stop pulling frames for dead firmware,
    eliminating an observed soft lockup during firmware error recovery.
    Revised per Johannes Berg's feedback to use status-bit checks rather
    than ieee80211_stop_queues()/wake_queues() which do not interact
    well with TXQ-based APIs.
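
    The status-bit approach in 3/3 can be pictured with the userspace
    model below.  It is only a sketch: txq_model_* names, the struct
    layout and the bit value are hypothetical, and the real functions
    dequeue from mac80211 TXQs rather than decrementing a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the STATUS_FW_ERROR bit; not the in-tree value. */
#define TXQ_MODEL_FW_ERROR (1u << 0)

struct txq_model {
	unsigned int status; /* status bitmask */
	int queued;          /* frames still waiting in the queue */
	int sent;            /* frames handed to the modelled firmware */
};

/* Drains the queue unless firmware is flagged dead; returns frames pulled. */
static int txq_model_tx_from_txq(struct txq_model *q)
{
	int pulled = 0;

	assert(q != NULL);

	/* Early return: leave frames queued instead of feeding dead firmware. */
	if (q->status & TXQ_MODEL_FW_ERROR)
		return 0;

	while (q->queued > 0) {
		q->queued--;
		q->sent++;
		pulled++;
	}
	return pulled;
}
```

    Frames stay queued while the error bit is set, so nothing has to be
    stopped or woken at the queue level; the next call after recovery
    drains them.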

Changes since v2:
  - 1/3:
      * Stripped inadvertent clang-format style churn from v2; the diff
        is now only the functional STATUS_FW_ERROR guards (four hunks,
        ~30 lines added).
      * Rewrote commit message: the v2 message claimed a proven
        SSN-corruption -> iwl_trans_reclaim() UAF chain, but
        iwl_trans_reclaim() already checks STATUS_FW_ERROR itself
        (iwl-trans.c:~663), so that chain cannot actually reach the
        queue walk.  The patch is more accurately described as
        "earlier STATUS_FW_ERROR guards with WARN_ONCE instrumentation
        for diagnosis of suspected post-FW-error NAPI scheduling."
      * Kept Tested-by: Ben Greear.

  - 2/3:
      * Removed the speculative "TCP retransmit queue UAF / refcount
        underflow in tcp_shifted_skb / NULL deref in tcp_rack_detect_loss"
        chain from the commit message per Ben Greear's feedback.  Those
        symptoms are real but the causal link to this bug was not
        directly traced; describing them as consequences of this bug
        was overclaiming.
      * Commit message now states only what can be traced in-tree:
        sentinel 1 -> num_subframes=0 -> gso_size=0 -> unbounded
        skb_gso_segment() output.  Downstream symptom attribution is
        left for the separate investigation Ben and I have underway.
      * Code is unchanged from v2.

  - 3/3: Unchanged from v2 beyond rebase context.

To Miriam's question in the v2 thread ("Was the soft lockup happening
as a consequence of the bug fixed in 2/3?"):  Yes, our typical trace is
2/3's GSO explosion -> firmware receives malformed AMSDU descriptors
-> firmware hangs in an MMIO poll (FSEQ_ERROR_CODE 0x67A00000,
SYSTEM_STATISTICS_CMD timeout) -> 3/3's dead-firmware TX path keeps
spinning -> soft lockup.  The full dmesg is attached to a follow-up on
the v2 thread so the Intel firmware team can investigate the c102 MMIO
poll hang separately; the kernel-side chain is independently
reproducible with a small test case.

Cole Leavitt (3):
  wifi: iwlwifi: add STATUS_FW_ERROR guards to NAPI/TX-notif paths
  wifi: iwlwifi: mld: fix TSO segmentation when AMSDU is disabled
  wifi: iwlwifi: mld: skip TX when firmware is dead

 drivers/net/wireless/intel/iwlwifi/mld/mac80211.c   |  4 ++++
 drivers/net/wireless/intel/iwlwifi/mld/tx.c         | 22 ++++++++++++++++++++++
 drivers/net/wireless/intel/iwlwifi/pcie/gen1_2/rx.c | 18 ++++++++++++++++++
 3 files changed, 44 insertions(+)

base-commit: 3aae9383f42f687221c011d7ee87529398e826b3
-- 
2.52.0

Thread overview: 4+ messages
2026-04-20 17:44 [PATCH v3 0/3] wifi: iwlwifi: mld: stability fixes around firmware error recovery Cole Leavitt
2026-04-20 17:44 ` [PATCH v3 1/3] wifi: iwlwifi: add STATUS_FW_ERROR guards to NAPI/TX-notif paths Cole Leavitt
2026-04-20 17:44 ` [PATCH v3 2/3] wifi: iwlwifi: mld: fix TSO segmentation when AMSDU is disabled Cole Leavitt
2026-04-20 17:44 ` [PATCH v3 3/3] wifi: iwlwifi: mld: skip TX when firmware is dead Cole Leavitt
