public inbox for linux-mediatek@lists.infradead.org
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Felix Fietkau <nbd@nbd.name>, Sasha Levin <sashal@kernel.org>,
	lorenzo@kernel.org, ryder.lee@mediatek.com,
	matthias.bgg@gmail.com, angelogioacchino.delregno@collabora.com,
	linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-mediatek@lists.infradead.org
Subject: [PATCH AUTOSEL 7.0-6.18] wifi: mt76: flush pending TX before channel switch
Date: Mon, 20 Apr 2026 09:18:46 -0400	[thread overview]
Message-ID: <20260420132314.1023554-132-sashal@kernel.org> (raw)
In-Reply-To: <20260420132314.1023554-1-sashal@kernel.org>

From: Felix Fietkau <nbd@nbd.name>

[ Upstream commit 0dcef1cbae27d806cd29c296cc03ad6e8ece771d ]

mt76_tx() queues frames on wcid->tx_pending for async processing by
tx_worker. In __mt76_set_channel(), the worker gets disabled before it
may have run, and the subsequent wait only checks DMA ring queues, not
the software pending list. This means frames like nullfunc PS frames
from mt76_offchannel_notify() may never be transmitted on the correct
channel.

Fix this by running mt76_txq_schedule_pending() synchronously after
disabling the tx_worker but before setting MT76_RESET, which would
otherwise cause mt76_txq_schedule_pending_wcid() to bail out.

Link: https://patch.msgid.link/20260309060730.87840-8-nbd@nbd.name
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Good. I now have a complete picture. Let me compile the full analysis.

---

## PHASE 1: COMMIT MESSAGE FORENSICS

### Step 1.1: Subject Line
- **Subsystem**: wifi: mt76 (MediaTek WiFi driver)
- **Action verb**: "flush" — implies fixing missing work that should
  have been done
- **Summary**: Flush pending TX frames before channel switch to ensure
  they're sent on the correct channel

Record: [wifi: mt76] [flush] [Ensure pending TX frames are transmitted
before channel switch]

### Step 1.2: Tags
- **Link**:
  `https://patch.msgid.link/20260309060730.87840-8-nbd@nbd.name` — part
  of a series (patch 8)
- **Signed-off-by**: Felix Fietkau `<nbd@nbd.name>` — the mt76
  maintainer and original author

Record: No Fixes: tag (expected for autosel candidates). No Reported-by.
No Cc: stable. Author is subsystem maintainer. Part number "8" in msgid
suggests this is part of a series.

### Step 1.3: Commit Body Analysis
The commit clearly explains the bug:
- `mt76_tx()` queues frames on `wcid->tx_pending` for async processing
  by `tx_worker`
- In `__mt76_set_channel()`, the worker gets disabled *after*
  `MT76_RESET` is set
- `mt76_txq_schedule_pending_wcid()` bails out when `MT76_RESET` is set
  (line 626 of tx.c)
- The `wait_event_timeout` only checks DMA ring queues via
  `mt76_has_tx_pending()`, NOT the software pending list
- Result: Frames like nullfunc PS frames may never be transmitted on the
  correct channel

Record: Bug = TX frames lost during channel switch due to ordering issue
between MT76_RESET flag and tx_worker disable. Symptom = nullfunc power-
save frames not transmitted. Root cause = MT76_RESET set before
schedule_pending runs, causing bail-out.

### Step 1.4: Hidden Bug Fix Detection
This is an explicit bug fix, not disguised. The commit clearly describes
lost TX frames.

Record: Explicit bug fix — not a hidden fix.

---

## PHASE 2: DIFF ANALYSIS

### Step 2.1: Inventory
- **mac80211.c**: 3 lines reordered (moved `mt76_worker_disable` before
  `set_bit(MT76_RESET)`, added `mt76_txq_schedule_pending()` call)
- **mt76.h**: 1 line added (function declaration)
- **tx.c**: 1 line changed (`static void` → `void`)
- **Total**: ~5 lines of meaningful change
- **Functions modified**: `__mt76_set_channel()`,
  `mt76_txq_schedule_pending()` (visibility only)

Record: [3 files, ~5 lines changed] [__mt76_set_channel reordering,
mt76_txq_schedule_pending visibility] [Single-subsystem surgical fix]

### Step 2.2: Code Flow Change

**Before** (`__mt76_set_channel()`):
1. `set_bit(MT76_RESET, &phy->state)` — blocks
   `mt76_txq_schedule_pending_wcid`
2. `mt76_worker_disable(&dev->tx_worker)` — stops worker
3. `wait_event_timeout(... !mt76_has_tx_pending ...)` — only checks DMA
   queues

**After**:
1. `mt76_worker_disable(&dev->tx_worker)` — stops worker first
2. `mt76_txq_schedule_pending(phy)` — synchronously flush software
   pending list to DMA
3. `set_bit(MT76_RESET, &phy->state)` — now safe to set
4. `wait_event_timeout(... !mt76_has_tx_pending ...)` — DMA queues now
   include flushed frames

Record: Fix reorders operations so pending frames get flushed to DMA
rings before MT76_RESET blocks further processing.

### Step 2.3: Bug Mechanism
This is a **logic/ordering bug** leading to **TX frame loss**. The
`MT76_RESET` flag acts as a gate in `mt76_txq_schedule_pending_wcid()`
(line 626), and it was being set too early, preventing software-queued
frames from ever reaching the hardware.

Record: [Logic/ordering bug] [MT76_RESET set too early prevents software
TX queue flushing → frame loss]

### Step 2.4: Fix Quality
- Obviously correct: The reordering is logically sound — disable worker,
  flush pending, then set reset flag
- Minimal and surgical: ~5 lines changed
- Regression risk: Very low — the only new code path is calling
  `mt76_txq_schedule_pending()` synchronously, which already runs as
  part of `mt76_txq_schedule_all()` via the tx_worker. The worker is
  already disabled at this point, so no concurrency concern.

Record: [High quality fix, obviously correct, minimal scope, very low
regression risk]

---

## PHASE 3: GIT HISTORY INVESTIGATION

### Step 3.1: Blame
The buggy code pattern (`set_bit` before `mt76_worker_disable`) was
introduced by:
- `f4fdd7716290a2` ("wifi: mt76: partially move channel change code to
  core") — v6.12
- `0b3be9d1d34e21` ("wifi: mt76: add separate tx scheduling queue for
  off-channel tx") — v6.12

These two commits together created the bug: one added the off-channel TX
pending mechanism, the other moved channel change code to core with the
wrong ordering.

Record: Bug introduced in v6.12 by commits f4fdd7716290a2 and
0b3be9d1d34e21.

### Step 3.2: Fixes Tag
No Fixes: tag present (expected for autosel). However, the bug clearly
traces back to `0b3be9d1d34e` — two related fixes (228bc0e79c852 and
49fba87205bec) already reference it with `Fixes:` tags.

Record: Related fixes 228bc0e79c852 and 49fba87205bec both fix
0b3be9d1d34e — this is a third fix for the same problematic commit.

### Step 3.3: File History
Related recent commits:
- `228bc0e79c852` (v6.14): "only enable tx worker after setting the
  channel" — Fixes: 0b3be9d1d34e
- `49fba87205bec`: "fix linked list corruption" — Fixes: 0b3be9d1d34e
- `bdeac7815629c`: "free pending offchannel tx frames on wcid cleanup"

Record: Multiple follow-up fixes to the same offchannel TX code. This
commit is standalone — only needs the pre-existing
mt76_txq_schedule_pending function.

### Step 3.4: Author
Felix Fietkau (`nbd@nbd.name`) is the mt76 subsystem maintainer and
original author of the driver. Very high confidence in fix correctness.

Record: Author is the mt76 maintainer — highest trust level.

### Step 3.5: Dependencies
- `mt76_txq_schedule_pending()` exists since v6.12 (commit 0b3be9d1d34e)
- `__mt76_set_channel()` exists since v6.14 (commit 82334623af0cd2)
- For v6.12 backport: function is called `mt76_set_channel()` with
  different context — needs adaptation
- For v6.14+/7.0: should apply cleanly or with minimal context
  adjustment
- Commit `228bc0e79c852` (v6.14) should ideally be present first, as it
  repositions `mt76_worker_enable()`. The v6.12 code has enable before
  `set_channel`, which was moved by that fix.

Record: Dependencies on 0b3be9d1d34e (present since v6.12) and
82334623af0cd2 (v6.14). For v6.12 backport, adaptation is needed.

---

## PHASE 4: MAILING LIST AND EXTERNAL RESEARCH

### Step 4.1: Patch Discussion
b4 dig could not find the exact commit by the Link msgid. The msgid
suggests this is patch 8 of a series from Felix Fietkau dated
2026-03-09.

Record: Part of a larger series. Could not retrieve full lore discussion
due to lore anti-bot protections.

### Step 4.2: Reviewers
Felix Fietkau is both author and maintainer — self-reviewed. This is
normal for mt76 where he is the primary maintainer.

Record: Author is subsystem maintainer.

### Step 4.3-4.5: Bug Report
No Reported-by tag, no syzbot report. This appears to be found through
code review by the maintainer. No stable-specific discussion found.

Record: Found by maintainer code review.

---

## PHASE 5: CODE SEMANTIC ANALYSIS

### Step 5.1-5.2: Key Functions
- `__mt76_set_channel()` — called from `mt76_set_channel()`
  (mac80211.c:1071) and `mt76_chanctx_assign_vif()` (channel.c:29)
- `mt76_set_channel()` is the mac80211 callback for channel changes
- This is called during every channel switch, scan, remain-on-channel —
  a **common operation**

### Step 5.3-5.4: Call Chain
`mac80211 → mt76_set_channel() → __mt76_set_channel()` — this is the
standard channel switch path, reachable during normal WiFi operation
(scanning, roaming, channel changes).

Record: Commonly triggered during WiFi scanning and channel switching.

### Step 5.5: Similar Patterns
The MT76_RESET bail-out pattern in `mt76_txq_schedule_pending_wcid()`
(line 626) is the direct cause. The same flag check exists in other TX
scheduling paths (lines 492, 546) which are also affected by the
ordering.

Record: MT76_RESET acts as gating mechanism in multiple TX paths.

---

## PHASE 6: STABLE TREE ANALYSIS

### Step 6.1: Bug Existence in Stable
- **v6.6 and earlier**: Bug does NOT exist —
  `mt76_txq_schedule_pending()` and the offchannel TX mechanism weren't
  added until v6.12
- **v6.12**: Bug EXISTS — has both the offchannel TX pending mechanism
  and the wrong ordering in `mt76_set_channel()`
- **v6.14+**: Bug EXISTS — has `__mt76_set_channel()` with the wrong
  ordering

Record: Bug exists in v6.12+ stable trees.

### Step 6.2: Backport Complications
- **v7.0**: Should apply cleanly
- **v6.14**: Should apply cleanly or near-cleanly (function name same)
- **v6.12**: Needs adaptation — different function name
  (`mt76_set_channel` vs `__mt76_set_channel`), different surrounding
  code (mutex_lock, cancel_delayed_work), may also need 228bc0e79c852 as
  prerequisite

Record: Clean for v6.14+; needs rework for v6.12.

---

## PHASE 7: SUBSYSTEM CONTEXT

### Step 7.1: Subsystem Criticality
WiFi driver (mt76) — **IMPORTANT**. MediaTek MT76xx chipsets are
extremely common in consumer routers, laptops (mt7921/mt7922), and
access points (mt7915, mt7996). This is one of the most widely used WiFi
driver families in Linux.

Record: [drivers/net/wireless/mediatek/mt76] [IMPORTANT — very common
WiFi hardware]

### Step 7.2: Activity
Very active subsystem with frequent fixes from the maintainer.

---

## PHASE 8: IMPACT AND RISK ASSESSMENT

### Step 8.1: Affected Users
All users of MT76-based WiFi devices (mt7603, mt76x0, mt76x2, mt7615,
mt7915, mt7921/mt7922, mt7996) — a very large user base including laptop
users, router users, and embedded systems.

Record: [Driver-specific but very widely used]

### Step 8.2: Trigger Conditions
- Triggered during any channel switch (scanning, roaming, offchannel
  operations)
- WiFi scanning happens automatically and frequently
- The bug window is when frames are queued on `tx_pending` just before a
  channel switch

Record: [Common trigger — scanning happens regularly on all WiFi
devices]

### Step 8.3: Failure Mode
- TX frame loss — nullfunc PS frames not transmitted → AP may not know
  client is going off-channel → potential packet loss, connectivity
  issues
- Not a crash, but a functional correctness issue affecting WiFi
  reliability

Record: [Failure mode: TX frame loss during channel switch] [Severity:
MEDIUM-HIGH — affects WiFi reliability]

### Step 8.4: Risk-Benefit
- **Benefit**: Fixes TX frame loss during channel switch on widely-used
  WiFi hardware. Medium-high benefit.
- **Risk**: Very low — 5 lines of reordering, obviously correct,
  authored by maintainer
- **Ratio**: Strong benefit, minimal risk

Record: [Benefit: medium-high] [Risk: very low] [Favorable ratio]

---

## PHASE 9: FINAL SYNTHESIS

### Step 9.1: Evidence

**FOR backporting**:
- Fixes a real bug: TX frames lost during channel switch
- Small, surgical fix: ~5 lines of meaningful change
- Authored by the mt76 maintainer (Felix Fietkau)
- Affects widely-used WiFi hardware
- Common trigger (WiFi scanning)
- Obviously correct: just reordering operations and adding a synchronous
  flush
- Very low regression risk
- Related to same code that already has two other Fixes: tagged commits

**AGAINST backporting**:
- No Fixes: tag (expected for autosel)
- No Reported-by (found by maintainer review)
- Part of a larger series (patch 8) — but the fix is self-contained
- v6.12 backport would need adaptation
- Not a crash — "just" frame loss (but impacts WiFi reliability)

### Step 9.2: Stable Rules Checklist
1. Obviously correct? **YES** — ordering fix is straightforward
2. Fixes a real bug? **YES** — TX frame loss during channel switch
3. Important issue? **YES** — WiFi reliability on common hardware
4. Small and contained? **YES** — ~5 lines across 3 files in same
   subsystem
5. No new features? **YES** — just fixes ordering
6. Can apply to stable? **YES** for v6.14+; **needs rework** for v6.12

### Step 9.3: Exception Categories
Not an exception category — this is a straightforward bug fix.

### Step 9.4: Decision
The fix addresses a real TX frame loss bug during WiFi channel switching
on widely-used MT76 hardware. It is small, obviously correct, authored
by the subsystem maintainer, and carries minimal regression risk. The
bug affects WiFi reliability for a large user base.

---

## Verification

- [Phase 1] Parsed tags: Link to patch.msgid.link, Signed-off-by Felix
  Fietkau (mt76 maintainer)
- [Phase 2] Diff analysis: Reorders
  set_bit(MT76_RESET)/mt76_worker_disable, adds
  mt76_txq_schedule_pending() call, makes function non-static
- [Phase 2] Confirmed MT76_RESET bail-out at tx.c line 626:
  `test_bit(MT76_RESET, &phy->state)` causes schedule_pending_wcid to
  return -1
- [Phase 2] Confirmed mt76_has_tx_pending() only checks DMA queues
  (q->queued), not software pending list
- [Phase 3] git blame: `set_bit(MT76_RESET)` ordering from
  f4fdd7716290a2 (v6.12), offchannel pending from 0b3be9d1d34e21 (v6.12)
- [Phase 3] git show 228bc0e79c852: Related fix "only enable tx worker
  after setting the channel", Fixes: 0b3be9d1d34e (v6.14)
- [Phase 3] git show 49fba87205bec: Related fix "fix linked list
  corruption", Fixes: 0b3be9d1d34e
- [Phase 3] Confirmed __mt76_set_channel() introduced in 82334623af0cd2
  (v6.14)
- [Phase 3] Felix Fietkau confirmed as mt76 maintainer via git log
  --author
- [Phase 5] mt76_txq_schedule_pending called from mt76_txq_schedule_all
  → mt76_tx_worker_run → tx_worker — normal TX path
- [Phase 6] Confirmed mt76_txq_schedule_pending does NOT exist in v6.6
  (bug not present pre-v6.12)
- [Phase 6] Confirmed v6.12 has the same bug pattern (set_bit before
  worker_disable, no schedule_pending call)
- [Phase 6] v6.12 uses `mt76_set_channel()` not `__mt76_set_channel()` —
  backport needs adaptation
- UNVERIFIED: Could not access lore.kernel.org discussion thread due to
  anti-bot protections

**YES**

 drivers/net/wireless/mediatek/mt76/mac80211.c | 5 +++--
 drivers/net/wireless/mediatek/mt76/mt76.h     | 1 +
 drivers/net/wireless/mediatek/mt76/tx.c       | 2 +-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/mediatek/mt76/mac80211.c b/drivers/net/wireless/mediatek/mt76/mac80211.c
index d0c522909e980..73d252e0a7bf3 100644
--- a/drivers/net/wireless/mediatek/mt76/mac80211.c
+++ b/drivers/net/wireless/mediatek/mt76/mac80211.c
@@ -1030,9 +1030,10 @@ int __mt76_set_channel(struct mt76_phy *phy, struct cfg80211_chan_def *chandef,
 	int timeout = HZ / 5;
 	int ret;
 
-	set_bit(MT76_RESET, &phy->state);
-
 	mt76_worker_disable(&dev->tx_worker);
+	mt76_txq_schedule_pending(phy);
+
+	set_bit(MT76_RESET, &phy->state);
 	wait_event_timeout(dev->tx_wait, !mt76_has_tx_pending(phy), timeout);
 	mt76_update_survey(phy);
 
diff --git a/drivers/net/wireless/mediatek/mt76/mt76.h b/drivers/net/wireless/mediatek/mt76/mt76.h
index d05e83ea1cacc..7bba0831bc0eb 100644
--- a/drivers/net/wireless/mediatek/mt76/mt76.h
+++ b/drivers/net/wireless/mediatek/mt76/mt76.h
@@ -1518,6 +1518,7 @@ void mt76_stop_tx_queues(struct mt76_phy *phy, struct ieee80211_sta *sta,
 void mt76_tx_check_agg_ssn(struct ieee80211_sta *sta, struct sk_buff *skb);
 void mt76_txq_schedule(struct mt76_phy *phy, enum mt76_txq_id qid);
 void mt76_txq_schedule_all(struct mt76_phy *phy);
+void mt76_txq_schedule_pending(struct mt76_phy *phy);
 void mt76_tx_worker_run(struct mt76_dev *dev);
 void mt76_tx_worker(struct mt76_worker *w);
 void mt76_release_buffered_frames(struct ieee80211_hw *hw,
diff --git a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c
index 0753acf2eccb8..ab62591b7a260 100644
--- a/drivers/net/wireless/mediatek/mt76/tx.c
+++ b/drivers/net/wireless/mediatek/mt76/tx.c
@@ -660,7 +660,7 @@ mt76_txq_schedule_pending_wcid(struct mt76_phy *phy, struct mt76_wcid *wcid,
 	return ret;
 }
 
-static void mt76_txq_schedule_pending(struct mt76_phy *phy)
+void mt76_txq_schedule_pending(struct mt76_phy *phy)
 {
 	LIST_HEAD(tx_list);
 	int ret = 0;
-- 
2.53.0



  parent reply	other threads:[~2026-04-20 13:26 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20260420132314.1023554-1-sashal@kernel.org>
2026-04-20 13:17 ` [PATCH AUTOSEL 7.0-6.19] wifi: mt76: avoid to set ACK for MCU command if wait_resp is not set Sasha Levin
2026-04-20 13:17 ` [PATCH AUTOSEL 7.0-6.18] phy: phy-mtk-tphy: Update names and format of kernel-doc comments Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.12] Bluetooth: btmtk: add MT7902 MCU support Sasha Levin
2026-04-20 13:18 ` Sasha Levin [this message]
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.6] wifi: mt76: fix list corruption in mt76_wcid_cleanup Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.12] wifi: mt76: add missing lock protection in mt76_sta_state for sta_event callback Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0-6.1] Bluetooth: btmtk: improve mt79xx firmware setup retry flow Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.18] wifi: mt76: mt7996: Disable Rx hdr_trans in monitor mode Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.12] wifi: mt76: mt7925: Skip scan process during suspend Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-5.10] wifi: mt76: mt76x02: wake queues after reconfig Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.12] wifi: mt76: mt7925: resolve link after acquiring mt76 mutex Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-6.18] wifi: mt76: mt7996: fix queue pause after scan due to wrong channel switch reason Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.12] net: ethernet: mtk_eth_soc: avoid writing to ESW registers on MT7628 Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0-6.6] wifi: mt76: mt7996: reset device after MCU message timeout Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-5.10] drm/mediatek: mtk_dsi: enable hs clock during pre-enable Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.18] wifi: mt76: mt7996: fix frequency separation for station STR mode Sasha Levin
2026-04-20 13:22 ` [PATCH AUTOSEL 6.18] net: airoha: Fix memory leak in airoha_qdma_rx_process() Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260420132314.1023554-132-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=angelogioacchino.delregno@collabora.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mediatek@lists.infradead.org \
    --cc=linux-wireless@vger.kernel.org \
    --cc=lorenzo@kernel.org \
    --cc=matthias.bgg@gmail.com \
    --cc=nbd@nbd.name \
    --cc=patches@lists.linux.dev \
    --cc=ryder.lee@mediatek.com \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox