From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Chad Monroe <chad@monroe.io>, Felix Fietkau <nbd@nbd.name>,
	Sasha Levin <sashal@kernel.org>,
	lorenzo@kernel.org, ryder.lee@mediatek.com,
	matthias.bgg@gmail.com, angelogioacchino.delregno@collabora.com,
	linux-wireless@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-mediatek@lists.infradead.org
Subject: [PATCH AUTOSEL 7.0-6.6] wifi: mt76: mt7996: reset device after MCU message timeout
Date: Mon, 20 Apr 2026 09:20:45 -0400
Message-ID: <20260420132314.1023554-251-sashal@kernel.org>
In-Reply-To: <20260420132314.1023554-1-sashal@kernel.org>

From: Chad Monroe <chad@monroe.io>

[ Upstream commit d2b860454ea2df8f336e9b859da7ffb27f43444d ]

Trigger a full reset after MCU message timeout.

Signed-off-by: Chad Monroe <chad@monroe.io>
Link: https://patch.msgid.link/6e05ed063f3763ad3457633c56b60a728a49a6f0.1765203753.git.chad@monroe.io
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now I have all the information needed for the complete analysis. Let me
compile the findings.

---

## PHASE 1: COMMIT MESSAGE FORENSICS

**Step 1.1: Subject Line**
- Subsystem: `wifi: mt76: mt7996`
- Action verb: "reset" (implies recovery from a failure state)
- Summary: Trigger a device reset when MCU message timeouts occur on
  mt7996
- Record: [wifi/mt76/mt7996] [reset] [Trigger full device recovery after
  MCU message timeout]

**Step 1.2: Tags**
- `Signed-off-by: Chad Monroe <chad@monroe.io>` - patch author
- `Link: https://patch.msgid.link/...` - reference to lore
- `Signed-off-by: Felix Fietkau <nbd@nbd.name>` - maintainer/committer
  of mt76
- No Fixes: tag (expected for manual review candidates)
- No Cc: stable (expected)
- No Reported-by tag
- Record: Author is Chad Monroe; applied by Felix Fietkau (the mt76
  subsystem maintainer)

**Step 1.3: Commit Body**
- Body is very brief: "Trigger a full reset after MCU message timeout."
- No stack traces or reproduction steps given
- The mt7915 equivalent (commit 10f73bb3938f7c5) provides more context:
  "MCU hangs do not trigger watchdog interrupts, so they can only be
  detected through MCU message timeouts. Ensure that the hardware gets
  restarted when that happens in order to prevent a permanent stuck
  state."
- Record: Bug = MCU hang leaves device permanently stuck. Symptom = WiFi
  device becomes non-functional, requires reboot. Root cause = MCU hang
  without watchdog interrupt, only detectable via message timeout, no
  recovery triggered.

**Step 1.4: Hidden Bug Fix Detection**
- "reset device after MCU message timeout" - this is clearly a fix for a
  missing recovery path. Without it, a firmware hang results in a
  permanent stuck state.
- Record: This IS a bug fix. The device becomes permanently stuck
  without it.

---

## PHASE 2: DIFF ANALYSIS

**Step 2.1: Inventory**
- `mcu.c`: +9 lines (in `mt7996_mcu_parse_response`)
- `mac.c`: +5 lines (in `mt7996_reset`)
- Total: 14 lines added, 0 removed
- Scope: Single-driver, surgical fix in two closely-related functions
- Record: 2 files, +14 lines, functions: mt7996_mcu_parse_response,
  mt7996_reset

**Step 2.2: Code Flow Changes**

Hunk 1 (mcu.c): In `mt7996_mcu_parse_response()`, when `skb == NULL`
(MCU timeout):
- **Before**: Log error, return -ETIMEDOUT. No recovery action.
- **After**: Log error, atomically set `MT76_MCU_RESET` bit (via
  `test_and_set_bit` to prevent duplicates), set `recovery.restart =
  true`, wake up MCU wait queue, queue `reset_work`, wake up
  `reset_wait`, then return -ETIMEDOUT.
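
The "After" behaviour of Hunk 1 can be sketched in user space as follows.
This is a minimal model, not the driver code: `model_dev`,
`MODEL_MCU_RESET_BIT`, `model_test_and_set_bit()` and
`model_handle_timeout()` are hypothetical names, C11 atomics stand in for
the kernel's `test_and_set_bit()`, and the work queue and wait-queue
wake-ups are reduced to a counter.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the MT76_MCU_RESET bit in dev->mphy.state. */
#define MODEL_MCU_RESET_BIT 0x1UL

struct model_dev {
	atomic_ulong state;     /* models dev->mphy.state */
	bool restart;           /* models dev->recovery.restart */
	int reset_work_queued;  /* counts modelled queue_work() calls */
};

/* Atomically set a bit and report whether it was already set. */
static bool model_test_and_set_bit(unsigned long bit, atomic_ulong *state)
{
	return (atomic_fetch_or(state, bit) & bit) != 0;
}

/* Models the patched "!skb" branch: only the first timeout asks for a reset. */
static int model_handle_timeout(struct model_dev *dev)
{
	if (!model_test_and_set_bit(MODEL_MCU_RESET_BIT, &dev->state)) {
		dev->restart = true;
		dev->reset_work_queued++;
	}

	/* The caller still sees the timeout, exactly as before the patch. */
	return -ETIMEDOUT;
}
```

In this model, calling `model_handle_timeout()` twice on the same device
queues the reset only once, because the second call finds the bit already
set. Both calls still return `-ETIMEDOUT`.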

Hunk 2 (mac.c): In `mt7996_reset()`, before the existing `queue_work`:
- **Before**: Queue `reset_work` and wake `reset_wait`, with no handling
  of pending MCU operations.
- **After**: If `MT_MCU_CMD_STOP_DMA` is set in `dev->recovery.state`,
  first set the `MT76_MCU_RESET` bit and wake up the MCU wait queue,
  aborting pending MCU operations before the reset work is queued.
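
A rough user-space model of the Hunk 2 logic, for illustration only. The
names `sketch_dev`, `sketch_reset()` and the two bit values are
hypothetical (the real definitions live in the mt76 headers), wake-ups and
work queueing are modelled as counters, and the early-return branch that
precedes this code in the real `mt7996_reset()` is left out.

```c
#include <stdatomic.h>

/* Hypothetical bit values chosen for this sketch only. */
#define SKETCH_MCU_RESET_BIT 0x1UL /* stands in for MT76_MCU_RESET */
#define SKETCH_CMD_STOP_DMA  0x2UL /* stands in for MT_MCU_CMD_STOP_DMA */

struct sketch_dev {
	atomic_ulong phy_state;       /* models dev->mphy.state */
	unsigned long recovery_state; /* models dev->recovery.state */
	int mcu_waiters_woken;        /* counts modelled wake_up(&mcu.wait) */
	int reset_work_queued;        /* counts modelled queue_work() calls */
};

/* Models the tail of the patched mt7996_reset(). */
static void sketch_reset(struct sketch_dev *dev)
{
	if (dev->recovery_state & SKETCH_CMD_STOP_DMA) {
		/* Flag the reset and wake MCU waiters so they can bail out. */
		atomic_fetch_or(&dev->phy_state, SKETCH_MCU_RESET_BIT);
		dev->mcu_waiters_woken++;
	}

	/* Unchanged by the patch: the reset work is always queued. */
	dev->reset_work_queued++;
}
```

With `SKETCH_CMD_STOP_DMA` set in `recovery_state`, the model marks the
reset bit and wakes MCU waiters before queueing the work. Without it, only
the work is queued, which matches the pre-patch behaviour.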

**Step 2.3: Bug Mechanism**
- Category: Missing error recovery / permanent hardware stuck state
- The MCU can hang in a way that doesn't trigger a hardware watchdog
  interrupt. The only indication is MCU message timeouts. Without this
  patch, timeouts just return an error code but never trigger device
  recovery. The device becomes permanently non-functional.
- Record: Missing recovery mechanism. MCU hang → timeout → error return
  → no recovery → permanent stuck state.

**Step 2.4: Fix Quality**
- Obviously correct: Mirrors the exact same pattern used in mt7915
  (commit 10f73bb3938f7c5) and mt7915's STOP_DMA handling (commit
  b13cd593ef2402).
- Minimal/surgical: Only adds recovery trigger code at the exact points
  needed.
- `test_and_set_bit` prevents duplicate resets.
- Regression risk: Very low. The reset_work handler already handles
  `recovery.restart = true` properly. The STOP_DMA path already exists
  for other triggers.
- Record: High quality fix, obviously correct, mirrors established
  patterns.

---

## PHASE 3: GIT HISTORY INVESTIGATION

**Step 3.1: Blame**
- `mt7996_mcu_parse_response()`: Unchanged since original driver
  addition by Shayne Chen (commit 98686cd21624c7, November 2022, v6.2).
- `mt7996_reset()`: Added by Bo Jiao (commit 27015b6fbcca83, April 2023,
  v6.4) as "enable full system reset support".
- The buggy code (missing recovery trigger) has been present since the
  driver was first created.
- Record: Bug present since v6.2 (mcu.c) and v6.4 (mac.c had
  mt7996_reset without STOP_DMA handling).

**Step 3.2: No Fixes: Tag**
- N/A - no Fixes: tag present (expected).

**Step 3.3: File History**
- The mt7996 reset infrastructure was significantly improved in v6.18
  (ace5d3b6b49e8 "improve hardware restart reliability"). However, the
  basic recovery mechanism has been in place since v6.4.
- The commit `beb01caa570c52` in v6.18 decreased MCU timeouts to allow
  faster recovery - this patch's logic works with either timeout value.
- Record: This commit is standalone; no prerequisites needed beyond the
  v6.4 reset infrastructure.

**Step 3.4: Author**
- Chad Monroe is a contributor to mt76 (5 commits found in the driver).
- Felix Fietkau (nbd@nbd.name) is THE mt76 subsystem maintainer - he
  applied the patch.
- Felix also authored the identical fix for mt7915 (10f73bb3938f7c5).
- Record: Applied by subsystem maintainer. Author is a regular
  contributor.

**Step 3.5: Dependencies**
- All structures/flags used already exist: `MT76_MCU_RESET`,
  `recovery.restart`, `mcu.wait`, `reset_work`, `reset_wait`,
  `MT_MCU_CMD_STOP_DMA`.
- No new functions or data structures introduced.
- Record: Fully self-contained, no dependencies on other uncommitted
  patches.

---

## PHASE 4: MAILING LIST AND EXTERNAL RESEARCH

**Step 4.1: Original Discussion**
- lore.kernel.org returned anti-bot protection; direct web access was
  blocked.
- b4 dig of the mt7915 equivalent found it was part of a 24-patch series
  (v2) by Felix Fietkau from August 2024.
- The mt7996 version is by Chad Monroe and was ported from the mt7915
  fix.
- Record: Could not access lore directly due to anti-bot protection. b4
  confirmed the mt7915 version was part of Felix Fietkau's cleanup
  series.

**Step 4.2: Reviewer**
- Applied by Felix Fietkau, the mt76 subsystem maintainer.
- Record: Subsystem maintainer applied the patch directly.

**Steps 4.3-4.5**
- Blocked by lore anti-bot protection. No additional information could
  be gathered.

---

## PHASE 5: CODE SEMANTIC ANALYSIS

**Step 5.1: Functions Modified**
- `mt7996_mcu_parse_response()` - MCU response parser (callback)
- `mt7996_reset()` - device reset entry point

**Step 5.2: Callers of `mt7996_mcu_parse_response`**
- Registered as `.mcu_parse_response` in `mt7996_mcu_ops` (mt7996
  `mcu.c`, line 3363).
- Called from `mt76_mcu_skb_send_and_get_msg()` in the core mt76 `mcu.c`
  (line 122).
- This is the universal MCU message response handler - called for EVERY
  MCU command the driver issues.
- Record: Called for every MCU message. Critical, high-frequency path.

**Step 5.3: Callers of `mt7996_reset`**
- Called from interrupt context and error recovery paths.
- Used by `mt7996_irq_tasklet()` when MCU command interrupts occur.
- Record: Called from interrupt handler / tasklet context.

**Step 5.4: Call Chain**
- Any WiFi operation → MCU command → `mt76_mcu_skb_send_and_get_msg()` →
  wait for response → `mt7996_mcu_parse_response()` → if timeout →
  trigger reset
- This path is reachable from normal WiFi operations (scan, associate,
  channel switch, etc.)
- Record: Fully reachable from normal user operations.
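
The call chain above can be approximated with a small user-space sketch.
Everything here is hypothetical and simplified: `chain_dev`,
`chain_mcu_ops`, `chain_parse_response()` and `chain_send_and_get_msg()`
only mimic the shape of a core helper that hands the response (or NULL on
timeout) to a driver callback registered in an ops table. The real mt76
signatures differ.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct chain_dev;

/* Hypothetical, simplified ops table with a single driver callback. */
struct chain_mcu_ops {
	int (*parse_response)(struct chain_dev *dev, const char *resp);
};

struct chain_dev {
	const struct chain_mcu_ops *ops;
	bool firmware_alive;  /* false models a hung MCU */
	bool reset_requested; /* set when the timeout path asks for recovery */
};

/* Driver-side callback: a NULL response stands for a timed-out message. */
static int chain_parse_response(struct chain_dev *dev, const char *resp)
{
	if (!resp) {
		dev->reset_requested = true;
		return -ETIMEDOUT;
	}

	return 0;
}

static const struct chain_mcu_ops chain_ops = {
	.parse_response = chain_parse_response,
};

/* Core-side helper: "send" a command, then pass the result to the driver. */
static int chain_send_and_get_msg(struct chain_dev *dev)
{
	const char *resp = dev->firmware_alive ? "ok" : NULL;

	return dev->ops->parse_response(dev, resp);
}
```

A device with `firmware_alive == false` returns `-ETIMEDOUT` from
`chain_send_and_get_msg()` and ends up with `reset_requested` set, while a
responsive one returns 0 and requests no reset.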

**Step 5.5: Similar Patterns**
- mt7915 has identical recovery logic (10f73bb3938f7c5 +
  b13cd593ef2402).
- mt7921/mt7925 have similar reset mechanisms.
- Record: Well-established pattern across the mt76 driver family.

---

## PHASE 6: STABLE TREE ANALYSIS

**Step 6.1: Buggy Code in Stable Trees**
- mt7996 driver added in v6.2.
- `mt7996_reset()` added in v6.4.
- `mt7996_mcu_parse_response()` unchanged since v6.2.
- The mcu.c part of the fix applies to 6.2+. The mac.c part applies to
  6.4+.
- Affected stable trees: 6.6.y, 6.12.y, and any other active LTS that
  includes mt7996.
- Record: Bug exists in 6.6.y and all later stable trees.

**Step 6.2: Backport Complications**
- The code being modified is unchanged since original introduction.
- Should apply cleanly to 6.6.y.
- Record: Expected clean apply.

**Step 6.3: No Related Fixes Already in Stable**
- No similar fix found in stable trees.
- Record: No existing fix for this issue in stable.

---

## PHASE 7: SUBSYSTEM CONTEXT

**Step 7.1: Subsystem**
- WiFi driver (drivers/net/wireless/mediatek/mt76/mt7996)
- MT7996 is MediaTek's Wi-Fi 7 (802.11be) chipset - used in routers and
  access points.
- Criticality: IMPORTANT - WiFi is critical infrastructure for many
  users.
- Record: [WiFi driver] [IMPORTANT - affects mt7996 hardware users]

**Step 7.2: Activity**
- Very actively developed - dozens of commits in recent releases.
- Active MLO/Wi-Fi 7 development ongoing.
- Record: Highly active subsystem.

---

## PHASE 8: IMPACT AND RISK ASSESSMENT

**Step 8.1: Affected Users**
- All users of mt7996/mt7992 WiFi hardware (routers, access points, PCIe
  WiFi cards).
- Record: Hardware-specific, but MT7996 is a current-generation popular
  WiFi chipset.

**Step 8.2: Trigger Conditions**
- Triggers when MCU firmware hangs without issuing a watchdog interrupt.
- Can happen during normal operation (firmware bugs, hardware glitches).
- Not user-triggered in the security sense, but can happen during
  routine WiFi operation.
- Record: Firmware hang during normal operation. Not predictable but
  happens in practice (same fix was needed for mt7915).

**Step 8.3: Failure Mode**
- Without the fix: WiFi device becomes permanently non-functional until
  reboot.
- From the WiFi perspective, this is equivalent to a system hang.
- Severity: HIGH (permanent loss of WiFi connectivity, requires reboot)
- Record: Permanent device stuck state. Severity: HIGH.

**Step 8.4: Risk-Benefit Ratio**
- Benefit: HIGH - prevents permanent WiFi device failure, enables
  automatic recovery.
- Risk: VERY LOW - 14 lines, uses `test_and_set_bit` for safety, mirrors
  proven mt7915 pattern, no changes to public APIs or data structures.
- Record: Benefit HIGH, Risk VERY LOW. Excellent ratio.

---

## PHASE 9: FINAL SYNTHESIS

**Step 9.1: Evidence Summary**

FOR backporting:
- Fixes permanent device stuck state (WiFi becomes completely
  non-functional)
- Small and surgical fix (14 lines, 2 files in same driver)
- Mirrors identical fix already applied to sister chip mt7915 (proven
  pattern)
- Applied by mt76 subsystem maintainer (Felix Fietkau)
- No dependencies on other patches
- Buggy code present since driver introduction (v6.2/v6.4)
- `test_and_set_bit` prevents duplicate resets (safe concurrent
  behavior)
- Should apply cleanly to stable trees

AGAINST backporting:
- Brief commit message (but supplemented by identical mt7915 fix
  explanation)
- No explicit stable nomination or Reported-by (but this is expected for
  manual review candidates)

**Step 9.2: Stable Rules Checklist**
1. Obviously correct and tested? **YES** - mirrors proven mt7915
   pattern, applied by maintainer
2. Fixes a real bug? **YES** - permanent device stuck state
3. Important issue? **YES** - device becomes permanently non-functional
4. Small and contained? **YES** - 14 lines in single driver
5. No new features or APIs? **YES** - only adds error recovery
6. Can apply to stable? **YES** - code unchanged since introduction

**Step 9.3: Exception Categories**
- Not an exception category - this is a standard bug fix.

**Step 9.4: Decision**
This is a clear bug fix that prevents a permanent device stuck state.
It's small, obviously correct, follows established patterns, and was
applied by the subsystem maintainer. The benefit-to-risk ratio is
excellent.

---

## Verification

- [Phase 1] Parsed tags: Link to patch.msgid, signed by author (Chad
  Monroe) and maintainer (Felix Fietkau)
- [Phase 2] Diff analysis: +9 lines in mcu.c (reset trigger on timeout),
  +5 lines in mac.c (STOP_DMA handling)
- [Phase 3] git blame: mt7996_mcu_parse_response unchanged since
  98686cd21624c7 (v6.2); mt7996_reset from 27015b6fbcca83 (v6.4)
- [Phase 3] git show 10f73bb3938f7c5: confirmed identical mt7915 fix
  exists with clear explanation of the permanent stuck state bug
- [Phase 3] git show b13cd593ef2402: confirmed identical mt7915 STOP_DMA
  handling pattern
- [Phase 3] git log --author: Chad Monroe has 5 commits in mt76, is a
  regular contributor
- [Phase 4] b4 dig: found mt7915 version in Felix Fietkau's series at
  lore
- [Phase 4] lore direct access blocked by anti-bot protection
- [Phase 5] Grep MT76_MCU_RESET: confirmed flag is used across 15+ files
  in mt76 for reset coordination
- [Phase 5] Grep mt7996_mcu_parse_response: confirmed it's registered as
  mcu_ops callback (line 3363)
- [Phase 5] Grep reset_work: confirmed reset_work/reset_wait
  infrastructure exists since v6.4
- [Phase 6] git tag --contains: mt7996_reset present since v6.4; driver
  since v6.2
- [Phase 6] Current 7.0 code confirmed: mt7996_reset lacks STOP_DMA
  handling, parse_response lacks reset trigger
- [Phase 8] Failure mode: permanent WiFi device stuck state requiring
  reboot - severity HIGH
- UNVERIFIED: Exact lore discussion content could not be accessed due to
  anti-bot protection

**YES**

 drivers/net/wireless/mediatek/mt76/mt7996/mac.c | 5 +++++
 drivers/net/wireless/mediatek/mt76/mt7996/mcu.c | 9 +++++++++
 2 files changed, 14 insertions(+)

diff --git a/drivers/net/wireless/mediatek/mt76/mt7996/mac.c b/drivers/net/wireless/mediatek/mt76/mt7996/mac.c
index d4f3ee943b472..b7aa51481ce82 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7996/mac.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7996/mac.c
@@ -2737,6 +2737,11 @@ void mt7996_reset(struct mt7996_dev *dev)
 		return;
 	}
 
+	if (READ_ONCE(dev->recovery.state) & MT_MCU_CMD_STOP_DMA) {
+		set_bit(MT76_MCU_RESET, &dev->mphy.state);
+		wake_up(&dev->mt76.mcu.wait);
+	}
+
 	queue_work(dev->mt76.wq, &dev->reset_work);
 	wake_up(&dev->reset_wait);
 }
diff --git a/drivers/net/wireless/mediatek/mt76/mt7996/mcu.c b/drivers/net/wireless/mediatek/mt76/mt7996/mcu.c
index c0c042de477b8..54776f0703876 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7996/mcu.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7996/mcu.c
@@ -209,6 +209,7 @@ static int
 mt7996_mcu_parse_response(struct mt76_dev *mdev, int cmd,
 			  struct sk_buff *skb, int seq)
 {
+	struct mt7996_dev *dev = container_of(mdev, struct mt7996_dev, mt76);
 	struct mt7996_mcu_rxd *rxd;
 	struct mt7996_mcu_uni_event *event;
 	int mcu_cmd = FIELD_GET(__MCU_CMD_FIELD_ID, cmd);
@@ -217,6 +218,14 @@ mt7996_mcu_parse_response(struct mt76_dev *mdev, int cmd,
 	if (!skb) {
 		dev_err(mdev->dev, "Message %08x (seq %d) timeout\n",
 			cmd, seq);
+
+		if (!test_and_set_bit(MT76_MCU_RESET, &dev->mphy.state)) {
+			dev->recovery.restart = true;
+			wake_up(&dev->mt76.mcu.wait);
+			queue_work(dev->mt76.wq, &dev->reset_work);
+			wake_up(&dev->reset_wait);
+		}
+
 		return -ETIMEDOUT;
 	}
 
-- 
2.53.0


