From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Chad Monroe <chad@monroe.io>, Felix Fietkau <nbd@nbd.name>,
Sasha Levin <sashal@kernel.org>,
lorenzo@kernel.org, ryder.lee@mediatek.com,
matthias.bgg@gmail.com, angelogioacchino.delregno@collabora.com,
linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-mediatek@lists.infradead.org
Subject: [PATCH AUTOSEL 7.0-6.6] wifi: mt76: mt7996: reset device after MCU message timeout
Date: Mon, 20 Apr 2026 09:20:45 -0400
Message-ID: <20260420132314.1023554-251-sashal@kernel.org>
In-Reply-To: <20260420132314.1023554-1-sashal@kernel.org>
From: Chad Monroe <chad@monroe.io>
[ Upstream commit d2b860454ea2df8f336e9b859da7ffb27f43444d ]
Trigger a full reset after MCU message timeout.
Signed-off-by: Chad Monroe <chad@monroe.io>
Link: https://patch.msgid.link/6e05ed063f3763ad3457633c56b60a728a49a6f0.1765203753.git.chad@monroe.io
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
---
## PHASE 1: COMMIT MESSAGE FORENSICS
**Step 1.1: Subject Line**
- Subsystem: `wifi: mt76: mt7996`
- Action verb: "reset" (implies recovery from a failure state)
- Summary: Trigger a device reset when MCU message timeouts occur on
mt7996
- Record: [wifi/mt76/mt7996] [reset] [Trigger full device recovery after
MCU message timeout]
**Step 1.2: Tags**
- `Signed-off-by: Chad Monroe <chad@monroe.io>` - patch author
- `Link: https://patch.msgid.link/...` - reference to lore
- `Signed-off-by: Felix Fietkau <nbd@nbd.name>` - maintainer/committer
of mt76
- No Fixes: tag (expected for manual review candidates)
- No Cc: stable (expected)
- No Reported-by tag
- Record: Author is Chad Monroe; applied by Felix Fietkau (the mt76
subsystem maintainer)
**Step 1.3: Commit Body**
- Body is very brief: "Trigger a full reset after MCU message timeout."
- No stack traces or reproduction steps given
- The mt7915 equivalent (commit 10f73bb3938f7c5) provides more context:
"MCU hangs do not trigger watchdog interrupts, so they can only be
detected through MCU message timeouts. Ensure that the hardware gets
restarted when that happens in order to prevent a permanent stuck
state."
- Record: Bug = MCU hang leaves device permanently stuck. Symptom = WiFi
device becomes non-functional, requires reboot. Root cause = MCU hang
without watchdog interrupt, only detectable via message timeout, no
recovery triggered.
**Step 1.4: Hidden Bug Fix Detection**
- "reset device after MCU message timeout" - this is clearly a fix for a
missing recovery path. Without it, a firmware hang results in a
permanent stuck state.
- Record: This IS a bug fix. The device becomes permanently stuck
without it.
---
## PHASE 2: DIFF ANALYSIS
**Step 2.1: Inventory**
- `mcu.c`: +9 lines (in `mt7996_mcu_parse_response`)
- `mac.c`: +5 lines (in `mt7996_reset`)
- Total: 14 lines added, 0 removed
- Scope: Single-driver, surgical fix in two closely-related functions
- Record: 2 files, +14 lines, functions: mt7996_mcu_parse_response,
mt7996_reset
**Step 2.2: Code Flow Changes**
Hunk 1 (mcu.c): In `mt7996_mcu_parse_response()`, when `skb == NULL`
(MCU timeout):
- **Before**: Log error, return -ETIMEDOUT. No recovery action.
- **After**: Log error, atomically set `MT76_MCU_RESET` bit (via
`test_and_set_bit` to prevent duplicates), set `recovery.restart =
true`, wake up MCU wait queue, queue `reset_work`, wake up
`reset_wait`, then return -ETIMEDOUT.
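
The guard described for Hunk 1 can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `model_dev`, `model_test_and_set`, `model_parse_response` and `MODEL_MCU_RESET_BIT` are hypothetical stand-ins, C11 `atomic_fetch_or` plays the role of the kernel's `test_and_set_bit`, and a counter replaces `queue_work()`/`wake_up()`. It is not driver code.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the MT76_MCU_RESET state bit. */
#define MODEL_MCU_RESET_BIT (1u << 0)

/* Hypothetical stand-in for the driver state touched by the timeout path. */
struct model_dev {
	atomic_uint state;   /* role of dev->mphy.state */
	bool restart;        /* role of dev->recovery.restart */
	int resets_queued;   /* counts simulated queue_work() calls */
};

/* Sets the bit and reports whether it was already set beforehand. */
static bool model_test_and_set(atomic_uint *word, unsigned int mask)
{
	assert(mask != 0);
	return (atomic_fetch_or(word, mask) & mask) != 0;
}

/* have_response == false models the skb == NULL (timeout) case. */
static int model_parse_response(struct model_dev *dev, bool have_response)
{
	if (!have_response) {
		/* Only the first timeout finds the bit clear and queues a reset. */
		if (!model_test_and_set(&dev->state, MODEL_MCU_RESET_BIT)) {
			dev->restart = true;
			dev->resets_queued++;   /* stands in for queue_work() + wake_up() */
		}
		return -ETIMEDOUT;
	}
	return 0;
}
```

In this model, two consecutive timeouts both return `-ETIMEDOUT`, but only the first one increments `resets_queued`, which is the duplicate-suppression property the analysis attributes to `test_and_set_bit`.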
Hunk 2 (mac.c): In `mt7996_reset()`, before the existing `queue_work`:
- **Before**: Always queue reset_work and wake reset_wait
unconditionally.
- **After**: If `MT_MCU_CMD_STOP_DMA` is set, additionally set
`MT76_MCU_RESET` bit and wake up MCU wait queue, aborting pending MCU
operations before reset.
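
The Hunk 2 behaviour can be sketched the same way. The names below (`model_reset_dev`, `model_reset`, `MODEL_CMD_STOP_DMA`, `MODEL_RESET_FLAG`) and the flag values are assumptions chosen for illustration, and the model covers only the tail of the function shown in the diff, not the earlier code that precedes the hunk.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical flag values; the real constants live in the driver headers. */
#define MODEL_CMD_STOP_DMA (1u << 2)   /* role of MT_MCU_CMD_STOP_DMA */
#define MODEL_RESET_FLAG   (1u << 0)   /* role of MT76_MCU_RESET */

struct model_reset_dev {
	atomic_uint recovery_state;   /* role of dev->recovery.state */
	atomic_uint phy_state;        /* role of dev->mphy.state */
	int mcu_wakeups;              /* counts simulated wake_up(&mcu.wait) */
	int resets_queued;            /* counts simulated queue_work() calls */
};

/* Models the end of the reset entry point after the patch. */
static void model_reset(struct model_reset_dev *dev)
{
	assert(dev != NULL);

	/* New step: on STOP_DMA, flag the reset and release MCU waiters. */
	if (atomic_load(&dev->recovery_state) & MODEL_CMD_STOP_DMA) {
		atomic_fetch_or(&dev->phy_state, MODEL_RESET_FLAG);
		dev->mcu_wakeups++;
	}

	/* Pre-existing step: the reset work is queued in both cases. */
	dev->resets_queued++;
}
```

Calling `model_reset` with the STOP_DMA flag clear only queues the reset, while calling it with the flag set also marks the reset bit and wakes the simulated MCU waiters first.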
**Step 2.3: Bug Mechanism**
- Category: Missing error recovery / permanent hardware stuck state
- The MCU can hang in a way that doesn't trigger a hardware watchdog
interrupt. The only indication is MCU message timeouts. Without this
patch, timeouts just return an error code but never trigger device
recovery. The device becomes permanently non-functional.
- Record: Missing recovery mechanism. MCU hang → timeout → error return
→ no recovery → permanent stuck state.
**Step 2.4: Fix Quality**
- Obviously correct: Mirrors the exact same pattern used in mt7915
(commit 10f73bb3938f7c5) and mt7915's STOP_DMA handling (commit
b13cd593ef2402).
- Minimal/surgical: Only adds recovery trigger code at the exact points
needed.
- `test_and_set_bit` prevents duplicate resets.
- Regression risk: Very low. The reset_work handler already handles
`recovery.restart = true` properly. The STOP_DMA path already exists
for other triggers.
- Record: High quality fix, obviously correct, mirrors established
patterns.
---
## PHASE 3: GIT HISTORY INVESTIGATION
**Step 3.1: Blame**
- `mt7996_mcu_parse_response()`: Unchanged since original driver
addition by Shayne Chen (commit 98686cd21624c7, November 2022, v6.2).
- `mt7996_reset()`: Added by Bo Jiao (commit 27015b6fbcca83, April 2023,
v6.4) as "enable full system reset support".
- The buggy code (missing recovery trigger) has been present since the
driver was first created.
- Record: Bug present since v6.2 (mcu.c) and v6.4 (mac.c had
mt7996_reset without STOP_DMA handling).
**Step 3.2: No Fixes: Tag**
- N/A - no Fixes: tag present (expected).
**Step 3.3: File History**
- The mt7996 reset infrastructure was significantly improved in v6.18
(ace5d3b6b49e8 "improve hardware restart reliability"). However, the
basic recovery mechanism has been in place since v6.4.
- The commit `beb01caa570c52` in v6.18 decreased MCU timeouts to allow
faster recovery - this patch's logic works with either timeout value.
- Record: This commit is standalone; no prerequisites needed beyond the
v6.4 reset infrastructure.
**Step 3.4: Author**
- Chad Monroe is a contributor to mt76 (5 commits found in the driver).
- Felix Fietkau (nbd@nbd.name) is THE mt76 subsystem maintainer - he
applied the patch.
- Felix also authored the identical fix for mt7915 (10f73bb3938f7c5).
- Record: Applied by subsystem maintainer. Author is a regular
contributor.
**Step 3.5: Dependencies**
- All structures/flags used already exist: `MT76_MCU_RESET`,
`recovery.restart`, `mcu.wait`, `reset_work`, `reset_wait`,
`MT_MCU_CMD_STOP_DMA`.
- No new functions or data structures introduced.
- Record: Fully self-contained, no dependencies on other uncommitted
patches.
---
## PHASE 4: MAILING LIST AND EXTERNAL RESEARCH
**Step 4.1: Original Discussion**
- lore.kernel.org responded with an anti-bot challenge, so direct web
access was blocked.
- b4 dig of the mt7915 equivalent found it was part of a 24-patch series
(v2) by Felix Fietkau from August 2024.
- The mt7996 version is by Chad Monroe and was ported from the mt7915
fix.
- Record: Could not access lore directly due to anti-bot protection. b4
confirmed the mt7915 version was part of Felix Fietkau's cleanup
series.
**Step 4.2: Reviewer**
- Applied by Felix Fietkau, the mt76 subsystem maintainer.
- Record: Subsystem maintainer applied the patch directly.
**Step 4.3-4.5**: Blocked by lore anti-bot protection. No additional
information could be gathered.
---
## PHASE 5: CODE SEMANTIC ANALYSIS
**Step 5.1: Functions Modified**
- `mt7996_mcu_parse_response()` - MCU response parser (callback)
- `mt7996_reset()` - device reset entry point
**Step 5.2: Callers of `mt7996_mcu_parse_response`**
- Registered as `.mcu_parse_response` in `mt7996_mcu_ops` (mcu.c line
3363).
- Called from `mt76_mcu_skb_send_and_get_msg()` in `mcu.c` (core mt76
code, line 122).
- This is the universal MCU message response handler - called for EVERY
MCU command the driver issues.
- Record: Called for every MCU message. Critical, high-frequency path.
**Step 5.3: Callers of `mt7996_reset`**
- Called from interrupt context and error recovery paths.
- Used by `mt7996_irq_tasklet()` when MCU command interrupts occur.
- Record: Called from interrupt handler / tasklet context.
**Step 5.4: Call Chain**
- Any WiFi operation → MCU command → `mt76_mcu_skb_send_and_get_msg()` →
wait for response → `mt7996_mcu_parse_response()` → if timeout →
trigger reset
- This path is reachable from normal WiFi operations (scan, associate,
channel switch, etc.)
- Record: Fully reachable from normal user operations.
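
The call chain above can be pictured with a minimal ops-table model. Everything here is a hypothetical illustration: `model_msg`, `model_mcu_ops`, `model_chip_parse_response` and `model_send_and_get_msg` are invented names, and a `NULL` response pointer stands in for the `skb == NULL` timeout case. The real core code and its signatures differ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a response buffer. */
struct model_msg {
	int status;
};

/* Hypothetical per-chip callback table, in the spirit of mt7996_mcu_ops. */
struct model_mcu_ops {
	int (*parse_response)(int cmd, const struct model_msg *rsp, int seq);
};

/* Chip-specific parser: a NULL response means the wait timed out. */
static int model_chip_parse_response(int cmd, const struct model_msg *rsp,
				     int seq)
{
	(void)cmd;
	(void)seq;
	if (rsp == NULL)
		return -ETIMEDOUT;   /* the patched driver also triggers a reset here */
	return rsp->status;
}

/* Core helper: hands whatever response arrived (or NULL) to the callback. */
static int model_send_and_get_msg(const struct model_mcu_ops *ops, int cmd,
				  const struct model_msg *rsp_or_null)
{
	static int seq;

	assert(ops != NULL && ops->parse_response != NULL);
	return ops->parse_response(cmd, rsp_or_null, ++seq);
}
```

Because every command goes through the same callback in this model, any command that times out reaches the timeout branch, which matches the claim that the path is reachable from ordinary operations.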
**Step 5.5: Similar Patterns**
- mt7915 has identical recovery logic (10f73bb3938f7c5 +
b13cd593ef2402).
- mt7921/mt7925 have similar reset mechanisms.
- Record: Well-established pattern across the mt76 driver family.
---
## PHASE 6: STABLE TREE ANALYSIS
**Step 6.1: Buggy Code in Stable Trees**
- mt7996 driver added in v6.2.
- `mt7996_reset()` added in v6.4.
- `mt7996_mcu_parse_response()` unchanged since v6.2.
- The mcu.c part of the fix applies to 6.2+. The mac.c part applies to
6.4+.
- Affected stable trees: 6.6.y, 6.12.y, and any other active LTS that
includes mt7996.
- Record: Bug exists in 6.6.y and all later stable trees.
**Step 6.2: Backport Complications**
- The code being modified is unchanged since original introduction.
- Should apply cleanly to 6.6.y.
- Record: Expected clean apply.
**Step 6.3: No Related Fixes Already in Stable**
- No similar fix found in stable trees.
- Record: No existing fix for this issue in stable.
---
## PHASE 7: SUBSYSTEM CONTEXT
**Step 7.1: Subsystem**
- WiFi driver (drivers/net/wireless/mediatek/mt76/mt7996)
- MT7996 is MediaTek's Wi-Fi 7 (802.11be) chipset - used in routers and
access points.
- Criticality: IMPORTANT - WiFi is critical infrastructure for many
users.
- Record: [WiFi driver] [IMPORTANT - affects mt7996 hardware users]
**Step 7.2: Activity**
- Very actively developed - dozens of commits in recent releases.
- Active MLO/Wi-Fi 7 development ongoing.
- Record: Highly active subsystem.
---
## PHASE 8: IMPACT AND RISK ASSESSMENT
**Step 8.1: Affected Users**
- All users of mt7996/mt7992 WiFi hardware (routers, access points, PCIe
WiFi cards).
- Record: Hardware-specific, but MT7996 is a current-generation popular
WiFi chipset.
**Step 8.2: Trigger Conditions**
- Triggers when MCU firmware hangs without issuing a watchdog interrupt.
- Can happen during normal operation (firmware bugs, hardware glitches).
- Not user-triggered in the security sense, but can happen during
routine WiFi operation.
- Record: Firmware hang during normal operation. Not predictable but
happens in practice (same fix was needed for mt7915).
**Step 8.3: Failure Mode**
- Without the fix: WiFi device becomes permanently non-functional until
reboot.
- From the WiFi perspective this is equivalent to a hang: the interface
stays unusable until the system is rebooted.
- Severity: HIGH (permanent loss of WiFi connectivity, requires reboot)
- Record: Permanent device stuck state. Severity: HIGH.
**Step 8.4: Risk-Benefit Ratio**
- Benefit: HIGH - prevents permanent WiFi device failure, enables
automatic recovery.
- Risk: VERY LOW - 14 lines, uses `test_and_set_bit` for safety, mirrors
proven mt7915 pattern, no changes to public APIs or data structures.
- Record: Benefit HIGH, Risk VERY LOW. Excellent ratio.
---
## PHASE 9: FINAL SYNTHESIS
**Step 9.1: Evidence Summary**
FOR backporting:
- Fixes permanent device stuck state (WiFi becomes completely
non-functional)
- Small and surgical fix (14 lines, 2 files in same driver)
- Mirrors identical fix already applied to sister chip mt7915 (proven
pattern)
- Applied by mt76 subsystem maintainer (Felix Fietkau)
- No dependencies on other patches
- Buggy code present since driver introduction (v6.2/v6.4)
- `test_and_set_bit` prevents duplicate resets (safe concurrent
behavior)
- Should apply cleanly to stable trees
AGAINST backporting:
- Brief commit message (but supplemented by identical mt7915 fix
explanation)
- No explicit stable nomination or Reported-by (but this is expected for
manual review candidates)
**Step 9.2: Stable Rules Checklist**
1. Obviously correct and tested? **YES** - mirrors proven mt7915
pattern, applied by maintainer
2. Fixes a real bug? **YES** - permanent device stuck state
3. Important issue? **YES** - device becomes permanently non-functional
4. Small and contained? **YES** - 14 lines in single driver
5. No new features or APIs? **YES** - only adds error recovery
6. Can apply to stable? **YES** - code unchanged since introduction
**Step 9.3: Exception Categories**
- Not an exception category - this is a standard bug fix.
**Step 9.4: Decision**
This is a clear bug fix that prevents a permanent device stuck state.
It's small, obviously correct, follows established patterns, and was
applied by the subsystem maintainer. The benefit-to-risk ratio is
excellent.
---
## Verification
- [Phase 1] Parsed tags: Link to patch.msgid, signed by author (Chad
Monroe) and maintainer (Felix Fietkau)
- [Phase 2] Diff analysis: +9 lines in mcu.c (reset trigger on timeout),
+5 lines in mac.c (STOP_DMA handling)
- [Phase 3] git blame: mt7996_mcu_parse_response unchanged since
98686cd21624c7 (v6.2); mt7996_reset from 27015b6fbcca83 (v6.4)
- [Phase 3] git show 10f73bb3938f7c5: confirmed identical mt7915 fix
exists with clear explanation of the permanent stuck state bug
- [Phase 3] git show b13cd593ef2402: confirmed identical mt7915 STOP_DMA
handling pattern
- [Phase 3] git log --author: Chad Monroe has 5 commits in mt76, is a
regular contributor
- [Phase 4] b4 dig: found mt7915 version in Felix Fietkau's series at
lore
- [Phase 4] lore direct access blocked by anti-bot protection
- [Phase 5] Grep MT76_MCU_RESET: confirmed flag is used across 15+ files
in mt76 for reset coordination
- [Phase 5] Grep mt7996_mcu_parse_response: confirmed it's registered as
mcu_ops callback (line 3363)
- [Phase 5] Grep reset_work: confirmed reset_work/reset_wait
infrastructure exists since v6.4
- [Phase 6] git tag --contains: mt7996_reset present since v6.4; driver
since v6.2
- [Phase 6] Current 7.0 code confirmed: mt7996_reset lacks STOP_DMA
handling, parse_response lacks reset trigger
- [Phase 8] Failure mode: permanent WiFi device stuck state requiring
reboot - severity HIGH
- UNVERIFIED: Exact lore discussion content could not be accessed due to
anti-bot protection
**YES**
drivers/net/wireless/mediatek/mt76/mt7996/mac.c | 5 +++++
drivers/net/wireless/mediatek/mt76/mt7996/mcu.c | 9 +++++++++
2 files changed, 14 insertions(+)
diff --git a/drivers/net/wireless/mediatek/mt76/mt7996/mac.c b/drivers/net/wireless/mediatek/mt76/mt7996/mac.c
index d4f3ee943b472..b7aa51481ce82 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7996/mac.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7996/mac.c
@@ -2737,6 +2737,11 @@ void mt7996_reset(struct mt7996_dev *dev)
 		return;
 	}
 
+	if (READ_ONCE(dev->recovery.state) & MT_MCU_CMD_STOP_DMA) {
+		set_bit(MT76_MCU_RESET, &dev->mphy.state);
+		wake_up(&dev->mt76.mcu.wait);
+	}
+
 	queue_work(dev->mt76.wq, &dev->reset_work);
 	wake_up(&dev->reset_wait);
 }
diff --git a/drivers/net/wireless/mediatek/mt76/mt7996/mcu.c b/drivers/net/wireless/mediatek/mt76/mt7996/mcu.c
index c0c042de477b8..54776f0703876 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7996/mcu.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7996/mcu.c
@@ -209,6 +209,7 @@ static int
 mt7996_mcu_parse_response(struct mt76_dev *mdev, int cmd,
 			  struct sk_buff *skb, int seq)
 {
+	struct mt7996_dev *dev = container_of(mdev, struct mt7996_dev, mt76);
 	struct mt7996_mcu_rxd *rxd;
 	struct mt7996_mcu_uni_event *event;
 	int mcu_cmd = FIELD_GET(__MCU_CMD_FIELD_ID, cmd);
@@ -217,6 +218,14 @@ mt7996_mcu_parse_response(struct mt76_dev *mdev, int cmd,
 	if (!skb) {
 		dev_err(mdev->dev, "Message %08x (seq %d) timeout\n",
 			cmd, seq);
+
+		if (!test_and_set_bit(MT76_MCU_RESET, &dev->mphy.state)) {
+			dev->recovery.restart = true;
+			wake_up(&dev->mt76.mcu.wait);
+			queue_work(dev->mt76.wq, &dev->reset_work);
+			wake_up(&dev->reset_wait);
+		}
+
 		return -ETIMEDOUT;
 	}
 
--
2.53.0