[PATCH AUTOSEL 7.0-5.10] net: qrtr: fix endian handling of confirm

public inbox for linux-arm-msm@vger.kernel.org

* [PATCH AUTOSEL 7.0-5.10] net: qrtr: fix endian handling of confirm_rx field
       [not found] <20260420131539.986432-1-sashal@kernel.org>
@ 2026-04-20 13:07 ` Sasha Levin
  2026-04-20 13:08 ` [PATCH AUTOSEL 7.0-6.1] Bluetooth: hci_qca: disable power control for WCN7850 when bt_en is not defined Sasha Levin
  2026-04-20 13:08 ` [PATCH AUTOSEL 7.0-6.6] Bluetooth: hci_qca: Fix missing wakeup during SSR memdump handling Sasha Levin
  2 siblings, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2026-04-20 13:07 UTC (permalink / raw)
  To: patches, stable
  Cc: Alexander Wilhelm, Manivannan Sadhasivam, Simon Horman,
	David S. Miller, Sasha Levin, edumazet, kuba, pabeni,
	linux-arm-msm, netdev, linux-kernel

From: Alexander Wilhelm <alexander.wilhelm@westermo.com>

[ Upstream commit e4cf6087cab382c7031e6b436ec55202fa9f2d7b ]

Convert confirm_rx to little endian when enqueueing and convert it back on
receive. This fixes control flow on big endian hosts, little endian is
unaffected.

On transmit, store confirm_rx as __le32 using cpu_to_le32(). On receive,
apply le32_to_cpu() before using the value. !! ensures the value is 0 or 1
in native endianness, so the conversion isn’t strictly required here, but
it is kept for consistency and clarity.

Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Alexander Wilhelm <alexander.wilhelm@westermo.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now I have all the information needed for a comprehensive analysis.

---

## PHASE 1: COMMIT MESSAGE FORENSICS

### Step 1.1: PARSE THE SUBJECT LINE
Record: **[net: qrtr]** **[fix]** Fix endian handling of `confirm_rx`
field. The word "fix" is used explicitly, targeting an endianness bug
in the QRTR protocol.

### Step 1.2: PARSE ALL COMMIT MESSAGE TAGS
- **Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>** —
  QRTR/Qualcomm ARM SoC maintainer
- **Signed-off-by: Alexander Wilhelm <alexander.wilhelm@westermo.com>**
  — Author (Westermo, embedded networking)
- **Reviewed-by: Simon Horman <horms@kernel.org>** — Netdev reviewer
- **Signed-off-by: David S. Miller <davem@davemloft.net>** — Net
  subsystem maintainer merged it

No Fixes: tag (it was deliberately removed at Simon Horman's request
during review — see Phase 4). No Reported-by tag. No Cc: stable tag.

### Step 1.3: ANALYZE THE COMMIT BODY TEXT
The commit explains that:
- On transmit, `confirm_rx` needs `cpu_to_le32()` because the header
  struct field is `__le32`
- On receive, `le32_to_cpu()` should be applied before use
- The `!!` ensures the value is 0 or 1, so the receive-side conversion
  isn't strictly required but is kept for consistency
- This "fixes control flow on big endian hosts"

Record: Bug: Missing endian conversion for the `confirm_rx` field on
both TX and RX paths. Symptom: Broken flow control on big-endian hosts.
Little-endian unaffected. Root cause: `confirm_rx` was written to and
read from a `__le32` field in native byte order.

### Step 1.4: DETECT HIDDEN BUG FIXES
Not hidden — this is an explicit endianness bug fix. The subject says
"fix" directly.

---

## PHASE 2: DIFF ANALYSIS — LINE BY LINE

### Step 2.1: INVENTORY THE CHANGES
- **File**: `net/qrtr/af_qrtr.c` — 2 lines changed (1 modified in TX
  path, 1 modified in RX path)
- **Functions modified**: `qrtr_node_enqueue()` (TX),
  `qrtr_endpoint_post()` (RX)
- **Scope**: Single-file, extremely surgical fix

### Step 2.2: UNDERSTAND THE CODE FLOW CHANGE

**Hunk 1 (line 364, TX path in `qrtr_node_enqueue`):**
- Before: `hdr->confirm_rx = !!confirm_rx;` — stores native-endian int
  into `__le32` field
- After: `hdr->confirm_rx = cpu_to_le32(!!confirm_rx);` — properly
  converts to little-endian
- On LE hosts: `cpu_to_le32` is a no-op, identical behavior
- On BE hosts: value 1 was stored in native (big-endian) byte order,
  bytes `00 00 00 01`, which a little-endian reader decodes as
  `0x01000000`. It is now correctly stored as little-endian 1.

**Hunk 2 (line 465, RX path in `qrtr_endpoint_post`):**
- Before: `cb->confirm_rx = !!v1->confirm_rx;` — reads `__le32` as
  native int
- After: `cb->confirm_rx = !!le32_to_cpu(v1->confirm_rx);` — properly
  converts from LE first
- Due to `!!`, the result on the receive side was already correct (any
  non-zero becomes 1). The fix adds the conversion for
  correctness/consistency.
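
The byte-order effect described in the two hunks can be reproduced in
userspace. The sketch below is a minimal illustration, not kernel code:
`store_native_be32()`, `store_le32()` and `load_le32()` are hypothetical
helpers that emulate a big-endian host's native store, the layout
`cpu_to_le32()` is meant to produce, and a little-endian reader. They
work on a byte buffer, so the result should not depend on the build
machine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes a big-endian CPU writes for a native 32-bit store. */
static void store_native_be32(uint8_t buf[4], uint32_t v)
{
	buf[0] = (uint8_t)(v >> 24);
	buf[1] = (uint8_t)(v >> 16);
	buf[2] = (uint8_t)(v >> 8);
	buf[3] = (uint8_t)v;
}

/* Hypothetical helper: the little-endian wire layout of a 32-bit value. */
static void store_le32(uint8_t buf[4], uint32_t v)
{
	buf[0] = (uint8_t)v;
	buf[1] = (uint8_t)(v >> 8);
	buf[2] = (uint8_t)(v >> 16);
	buf[3] = (uint8_t)(v >> 24);
}

/* Hypothetical helper: how a little-endian reader decodes the field. */
static uint32_t load_le32(const uint8_t buf[4])
{
	return (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
	       ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
}

/* Walks through the cases discussed for Hunk 1 and Hunk 2. */
static void confirm_rx_layout_demo(void)
{
	uint8_t wire[4];

	/* Old TX path on a big-endian host: native store, no conversion. */
	store_native_be32(wire, 1);
	assert(load_le32(wire) == 0x01000000u);	/* reader does not see 1 */
	assert(!!load_le32(wire) == 1);		/* "!!" still normalizes to 1 */

	/* Fixed TX path: explicit little-endian layout. */
	store_le32(wire, 1);
	assert(load_le32(wire) == 1u);
}
```

Calling `confirm_rx_layout_demo()` from a `main()` exercises both
layouts. The second assertion is the reason a receiver that applies `!!`
tolerates the unconverted value.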

### Step 2.3: IDENTIFY THE BUG MECHANISM
Category: **Endianness/type bug (f)**. The `qrtr_hdr_v1` struct declares
`confirm_rx` as `__le32`, and every other field in the struct uses
proper `cpu_to_le32()`/`le32_to_cpu()` conversions — except
`confirm_rx`. This is the one field that was missed.

### Step 2.4: ASSESS THE FIX QUALITY
- **Obviously correct**: Yes — it follows the exact same pattern as all
  adjacent fields (type, src_node_id, etc.)
- **Minimal**: Yes — 2 lines, exactly matching the existing code pattern
- **Regression risk**: Essentially zero. On LE hosts (the vast
  majority), these are no-ops. On BE hosts, this makes the behavior
  correct.

---

## PHASE 3: GIT HISTORY INVESTIGATION

### Step 3.1: BLAME THE CHANGED LINES
- TX line (364): `hdr->confirm_rx = !!confirm_rx;` — introduced by
  commit **5fdeb0d372ab** ("net: qrtr: Implement outgoing flow
  control"), authored 2020-01-13, first appeared in **v5.6-rc1**
- RX line (465): `cb->confirm_rx = !!v1->confirm_rx;` — introduced by
  commit **194ccc88297ae** ("net: qrtr: Support decoding incoming v2
  packets"), authored 2017-10-10, first appeared in **v4.15**

Record: The buggy TX code has been present since v5.6. The buggy RX code
since v4.15. Both are in all active stable trees (5.10, 5.15, 6.1, 6.6,
6.12, 7.0).

### Step 3.2: FOLLOW THE FIXES: TAG
The v2 submission HAD `Fixes: 5fdeb0d372ab` but it was removed at Simon
Horman's request. The original buggy commit 5fdeb0d372ab ("Implement
outgoing flow control") is present in v5.6+ and all active stable trees.

### Step 3.3: CHECK FILE HISTORY
Recent changes to `af_qrtr.c` are unrelated refactoring (xarray
conversion, treewide changes, proto_ops changes). No recent endianness
fixes.

### Step 3.4: CHECK THE AUTHOR'S OTHER COMMITS
Alexander Wilhelm from Westermo has a clear pattern of fixing endianness
bugs in Qualcomm subsystems: QMI encoding/decoding, MHI BHI vector
table, ath12k QMI data. This is part of an effort to make Qualcomm
subsystems work on big-endian platforms.

### Step 3.5: CHECK FOR DEPENDENT/PREREQUISITE COMMITS
None. The fix applies directly to the original buggy lines without any
prerequisites.

---

## PHASE 4: MAILING LIST AND EXTERNAL RESEARCH

### Step 4.1: FIND THE ORIGINAL PATCH DISCUSSION
Found via yhbt.net mirror of lore.kernel.org. The patch went through 3
versions:
- **v1** (2026-03-20): Initial submission with Fixes tag, targeted at
  `net`
- **v2** (2026-03-24): Rebase on latest net tree, improved commit
  message, still had Fixes tag
- **v3** (2026-03-26): Rebase on `net-next`, Fixes tag removed at Simon
  Horman's request

### Step 4.2: KEY REVIEWER FEEDBACK
**Simon Horman** (netdev reviewer): "But as this isn't strictly
necessary let's target net-next and drop the Fixes tag." This is a
**negative signal** for stable backport — the netdev reviewer explicitly
downgraded from fix to enhancement.

**Manivannan Sadhasivam** (QRTR maintainer) disagreed: "FWIW: Adding
Fixes tag doesn't mean that the patch should be queued for -rcS." Mani
thought the Fixes tag was appropriate.

### Step 4.3: BUG REPORT
No external bug report. The author found this during systematic
endianness auditing.

### Step 4.4: RELATED PATCHES
This is a standalone fix. Not part of a series.

### Step 4.5: STABLE MAILING LIST HISTORY
No stable-specific discussion found.

---

## PHASE 5: CODE SEMANTIC ANALYSIS

### Step 5.1: IDENTIFY KEY FUNCTIONS
- `qrtr_node_enqueue()` — TX path
- `qrtr_endpoint_post()` — RX path

### Step 5.2: TRACE CALLERS
- `qrtr_node_enqueue()` is called from: `qrtr_sendmsg()` (the main
  sendmsg path), `qrtr_send_resume_tx()`, and the broadcast path. It's
  the core TX function.
- `qrtr_endpoint_post()` is called from: MHI driver (`qrtr_mhi.c`), SMD
  driver (`qrtr_smd.c`), tun driver (`qrtr_tun.c`). It's the core RX
  entry point — called for EVERY incoming QRTR packet.

### Step 5.3-5.4: CALL CHAIN
`qrtr_endpoint_post()` is called directly from hardware transport
drivers on every received packet. `qrtr_node_enqueue()` is called on
every transmitted packet. Both are hot-path functions.

### Step 5.5: SIMILAR PATTERNS
All other fields in `qrtr_hdr_v1` already use proper endian conversions.
`confirm_rx` was the only one missed.

---

## PHASE 6: CROSS-REFERENCING AND STABLE TREE ANALYSIS

### Step 6.1: DOES THE BUGGY CODE EXIST IN STABLE TREES?
The TX bug (5fdeb0d372ab) exists in **v5.6+**, so it is present in all
active stable trees: 5.10.y, 5.15.y, 6.1.y, 6.6.y, 6.12.y.
The RX bug (194ccc88297ae) has existed since **v4.15**.

### Step 6.2: BACKPORT COMPLICATIONS
The code at these two lines has not changed since introduction. The
patch should apply cleanly to all active stable trees.

### Step 6.3: RELATED FIXES ALREADY IN STABLE
None found.

---

## PHASE 7: SUBSYSTEM AND MAINTAINER CONTEXT

### Step 7.1: SUBSYSTEM CRITICALITY
**net/qrtr** — Qualcomm IPC Router, used for communication between Linux
and Qualcomm firmware (modem, WiFi, etc.).
Criticality: **PERIPHERAL** — affects users of Qualcomm SoC platforms
running big-endian kernels (very niche). Qualcomm SoCs are little-endian
ARM, so the primary users are unaffected.

### Step 7.2: SUBSYSTEM ACTIVITY
Moderate activity — mostly maintenance fixes, not heavy development.

---

## PHASE 8: IMPACT AND RISK ASSESSMENT

### Step 8.1: WHO IS AFFECTED
Only big-endian hosts that use QRTR. This is extremely niche — Qualcomm
SoCs are LE ARM. However, Westermo (author's company) apparently runs BE
systems with QRTR, and there could be other embedded platforms.

### Step 8.2: TRIGGER CONDITIONS
Every QRTR data transmission on a big-endian host. The TX side stores
the wrong endianness, which means the remote end receives a malformed
`confirm_rx` value. The RX side is actually mitigated by `!!` (any
non-zero value normalizes to 1).

### Step 8.3: FAILURE MODE SEVERITY
On big-endian hosts: The flow control mechanism (confirm_rx/resume_tx)
breaks. The TX side sends `confirm_rx` in wrong byte order. If the
remote firmware compares `confirm_rx` directly to 1 (rather than using
`!!`), it won't send RESUME_TX messages, causing the sender to
eventually block or exhaust remote resources.
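
How the failure shows up depends on how the remote side tests the field.
The sketch below is purely illustrative: the actual firmware logic is
not known from this thread, and both `strict_peer_sends_resume()` and
`tolerant_peer_sends_resume()` are hypothetical models of the two
possibilities named above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value a little-endian peer decodes when a big-endian host stores 1 natively. */
#define CONFIRM_RX_UNCONVERTED 0x01000000u

/* Hypothetical strict peer: only an exact 1 triggers a RESUME_TX. */
static bool strict_peer_sends_resume(uint32_t confirm_rx)
{
	return confirm_rx == 1u;
}

/* Hypothetical tolerant peer: any non-zero value triggers a RESUME_TX. */
static bool tolerant_peer_sends_resume(uint32_t confirm_rx)
{
	return confirm_rx != 0u;
}

/* Compares both models for the unconverted and the converted value. */
static void peer_interpretation_demo(void)
{
	/* Unconverted value: only the tolerant model keeps flow control alive. */
	assert(!strict_peer_sends_resume(CONFIRM_RX_UNCONVERTED));
	assert(tolerant_peer_sends_resume(CONFIRM_RX_UNCONVERTED));

	/* Converted value: both models agree. */
	assert(strict_peer_sends_resume(1u));
	assert(tolerant_peer_sends_resume(1u));
}
```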
Severity: **MEDIUM** — broken flow control on a niche platform.

### Step 8.4: RISK-BENEFIT RATIO
- **Benefit**: Fixes real protocol-level correctness bug on BE hosts.
  Low impact population but real for those affected.
- **Risk**: Essentially zero. `cpu_to_le32`/`le32_to_cpu` are no-ops on
  LE. The fix follows the established pattern used by every other field
  in the same struct.

---

## PHASE 9: FINAL SYNTHESIS

### Step 9.1: COMPILE THE EVIDENCE

**FOR backporting:**
- Real endianness correctness bug
- Extremely small fix (2 lines), obviously correct
- Zero regression risk on LE systems (no-op), fixes BE systems
- Follows the identical pattern of all adjacent header fields
- Present in all stable trees since v5.6
- Reviewed by 2 reviewers, merged by David Miller
- Manivannan Sadhasivam (QRTR maintainer) considered the Fixes tag
  appropriate

**AGAINST backporting:**
- Simon Horman explicitly said "not strictly necessary," asked to drop
  Fixes tag and target net-next
- Big-endian QRTR usage is extremely niche
- The `!!` on the RX path mitigates the receive side (only TX is truly
  broken)
- No user bug report — found during code audit

### Step 9.2: STABLE RULES CHECKLIST
1. Obviously correct and tested? **Yes** — standard endian conversion
   pattern
2. Fixes a real bug? **Yes** — broken wire protocol on BE hosts
3. Important issue? **Borderline** — broken flow control, but only on BE
   hosts using QRTR
4. Small and contained? **Yes** — 2 lines in 1 file
5. No new features? **Correct**
6. Can apply to stable? **Yes** — clean apply expected

### Step 9.3: EXCEPTION CATEGORIES
None applicable.

### Step 9.4: DECISION
This is a genuine endianness bug fix with zero regression risk. While
the affected population is niche, the fix is so small and obviously
correct that the risk/benefit ratio strongly favors inclusion. The
netdev reviewer's pushback was about targeting net vs. net-next (merge
window process), not about whether the fix is correct. The QRTR
maintainer (Mani) disagreed with dropping the Fixes tag. For stable
users on big-endian platforms running QRTR, this fixes a real
protocol-level correctness issue in the flow control mechanism.

---

## Verification

- [Phase 1] Parsed tags: Reviewed-by from Manivannan Sadhasivam and
  Simon Horman, SOB from David S. Miller (merger)
- [Phase 2] Diff analysis: 2 lines changed — adding `cpu_to_le32()` on
  TX, `le32_to_cpu()` on RX, matching pattern of all other fields
- [Phase 2] Verified `qrtr_hdr_v1.confirm_rx` is declared as `__le32`
  (line 44 of af_qrtr.c)
- [Phase 3] git blame: TX line introduced by 5fdeb0d372ab (v5.6,
  2020-01-13), RX line by 194ccc88297ae (v4.15, 2017-10-10)
- [Phase 3] git merge-base: Confirmed 5fdeb0d372ab is in v5.10, v5.15,
  v6.1, v6.6 (all active stable trees)
- [Phase 3] Author check: Alexander Wilhelm has 7 commits all fixing
  Qualcomm endianness bugs
- [Phase 4] Mailing list (yhbt.net mirror): Found full v2 thread. Simon
  Horman said "not strictly necessary," Mani disagreed
- [Phase 4] Patch went v1->v2->v3; v3 dropped Fixes tag, targeted
  net-next at reviewer request
- [Phase 5] Callers verified: `qrtr_node_enqueue` is core TX path,
  `qrtr_endpoint_post` is core RX entry point (EXPORT_SYMBOL_GPL)
- [Phase 5] Verified all other `qrtr_hdr_v1` fields use proper endian
  conversions — only `confirm_rx` was missed
- [Phase 6] Code is unchanged at buggy lines since introduction — clean
  apply expected
- [Phase 8] Risk assessment: zero risk on LE (no-op conversions), fixes
  correctness on BE

**YES**

 net/qrtr/af_qrtr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/qrtr/af_qrtr.c b/net/qrtr/af_qrtr.c
index d77e9c8212da5..7cec6a7859b03 100644
--- a/net/qrtr/af_qrtr.c
+++ b/net/qrtr/af_qrtr.c
@@ -361,7 +361,7 @@ static int qrtr_node_enqueue(struct qrtr_node *node, struct sk_buff *skb,
 	}

 	hdr->size = cpu_to_le32(len);
-	hdr->confirm_rx = !!confirm_rx;
+	hdr->confirm_rx = cpu_to_le32(!!confirm_rx);

 	rc = skb_put_padto(skb, ALIGN(len, 4) + sizeof(*hdr));

@@ -462,7 +462,7 @@ int qrtr_endpoint_post(struct qrtr_endpoint *ep, const void *data, size_t len)
 		cb->type = le32_to_cpu(v1->type);
 		cb->src_node = le32_to_cpu(v1->src_node_id);
 		cb->src_port = le32_to_cpu(v1->src_port_id);
-		cb->confirm_rx = !!v1->confirm_rx;
+		cb->confirm_rx = !!le32_to_cpu(v1->confirm_rx);
 		cb->dst_node = le32_to_cpu(v1->dst_node_id);
 		cb->dst_port = le32_to_cpu(v1->dst_port_id);

-- 
2.53.0


* [PATCH AUTOSEL 7.0-6.1] Bluetooth: hci_qca: disable power control for WCN7850 when bt_en is not defined
       [not found] <20260420131539.986432-1-sashal@kernel.org>
  2026-04-20 13:07 ` [PATCH AUTOSEL 7.0-5.10] net: qrtr: fix endian handling of confirm_rx field Sasha Levin
@ 2026-04-20 13:08 ` Sasha Levin
  2026-04-20 13:08 ` [PATCH AUTOSEL 7.0-6.6] Bluetooth: hci_qca: Fix missing wakeup during SSR memdump handling Sasha Levin
  2 siblings, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2026-04-20 13:08 UTC (permalink / raw)
  To: patches, stable
  Cc: Shuai Zhang, Bartosz Golaszewski, Luiz Augusto von Dentz,
	Sasha Levin, brgl, marcel, luiz.dentz, linux-arm-msm,
	linux-bluetooth, linux-kernel

From: Shuai Zhang <shuai.zhang@oss.qualcomm.com>

[ Upstream commit 7b75867803a8712bdf7683c31d71d3d5e28ce821 ]

On platforms using an M.2 slot with both UART and USB support, bt_en is
pulled high by hardware. In this case, software-based power control
should be disabled. The current platforms are Lemans-EVK and Monaco-EVK.

Add QCA_WCN7850 to the existing condition so that power_ctrl_enabled is
cleared when bt_en is not software-controlled (or absent), aligning its
behavior with WCN6750 and WCN6855.

Signed-off-by: Shuai Zhang <shuai.zhang@oss.qualcomm.com>
Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now I have all the information needed for a complete analysis. Let me
compile the results.

---

## PHASE 1: COMMIT MESSAGE FORENSICS

**Step 1.1:** [Subsystem: Bluetooth/hci_qca] [Action: "disable" / "add"]
[Summary: Disable software power control for WCN7850 when bt_en GPIO is
not defined (HW-managed)]

**Step 1.2:** Tags found:
- `Signed-off-by: Shuai Zhang <shuai.zhang@oss.qualcomm.com>` - Author,
  Qualcomm BT developer
- `Reviewed-by: Bartosz Golaszewski
  <bartosz.golaszewski@oss.qualcomm.com>` - This is the author of the
  prerequisite commit `0fb410c914eb03` that introduced the
  WCN6750/WCN6855-only check. His review is highly significant.
- `Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>` -
  BT subsystem maintainer applied it
- No Fixes: tag, no Cc: stable, no Reported-by (expected for candidates)

**Step 1.3:** The commit body explains: On Lemans-EVK and Monaco-EVK
platforms (M.2 slot with UART+USB), bt_en is pulled high by hardware.
Software power control must be disabled. Without this,
`power_ctrl_enabled` remains true for WCN7850, causing
`HCI_QUIRK_NON_PERSISTENT_SETUP` and `qca_power_off` shutdown handler to
be set incorrectly.

**Step 1.4:** This IS a bug fix disguised as "aligning behavior." The
commit adds WCN7850 to a condition that was already handling the same
scenario for WCN6750/WCN6855. Its omission from that condition left
WCN7850 broken on affected platforms.

## PHASE 2: DIFF ANALYSIS

**Step 2.1:** Single file changed: `drivers/bluetooth/hci_qca.c`, +2/-1
lines. Function modified: `qca_serdev_probe()`. Scope: single-file,
single-hunk surgical fix.

**Step 2.2:** Before: When `bt_en` is NULL, only WCN6750 and WCN6855 got
`power_ctrl_enabled=false`. After: WCN7850 also gets
`power_ctrl_enabled=false`. This affects the probe path where the power
control strategy is decided.

**Step 2.3:** Bug category: Logic/correctness fix - missing SoC type in
a condition. When `power_ctrl_enabled` remains incorrectly true:
- `HCI_QUIRK_NON_PERSISTENT_SETUP` is set (line 2532)
- `hdev->shutdown = qca_power_off` is set (line 2533)
- The SSR recovery in `fce1a9244a0f8` checks this quirk and takes the
  wrong path

**Step 2.4:** Fix is obviously correct - follows established pattern.
Zero regression risk (only adds a SoC type to an OR chain). Reviewed by
the author of the prerequisite code.

## PHASE 3: GIT HISTORY INVESTIGATION

**Step 3.1:** `git blame` shows the condition was introduced by
`0fb410c914eb03` (Bartosz Golaszewski, 2025-05-27), which restructured
the code to restrict the `power_ctrl_enabled=false` check to
WCN6750/WCN6855 only. WCN7850 was inadvertently omitted.

**Step 3.2:** No Fixes: tag. The root cause is `0fb410c914eb03` which
has its own `Fixes: 3d05fc82237a` and `Cc: stable@vger.kernel.org`.
WCN7850 was missed when `0fb410c914eb03` restricted the condition.

**Step 3.3:** Related commits by same author: `fce1a9244a0f8` "Fix SSR
fail when BT_EN is pulled up by hw" - this is the companion fix that
depends on `HCI_QUIRK_NON_PERSISTENT_SETUP` being correctly set. This
commit is standalone but paired with `0fb410c914eb03`.

**Step 3.4:** Shuai Zhang is a regular Qualcomm BT contributor. Bartosz
Golaszewski (reviewer) wrote the prerequisite code.

**Step 3.5:** This commit depends on `0fb410c914eb03` being present.
That commit is in mainline (first tagged in v6.16/v6.17) but NOT yet in
any stable tree (not in v6.6, v6.12, or v6.14). In stable trees without
`0fb410c914eb03`, the code has an unconditional check (`if
(!qcadev->bt_en) power_ctrl_enabled = false;`) that covers ALL SoC types
including WCN7850. The bug only manifests after `0fb410c914eb03` is
applied.
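
The three states of the check discussed above can be compared side by
side. This is a minimal sketch that models only this one condition: the
`enum soc_type` values and the three function names are hypothetical
stand-ins, and the real probe function may adjust `power_ctrl_enabled`
elsewhere as well.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver's SoC type enum; values are arbitrary. */
enum soc_type { SOC_OTHER, SOC_WCN6750, SOC_WCN6855, SOC_WCN7850 };

/* Trees without 0fb410c914eb03: every SoC lacking bt_en loses power control. */
static bool power_ctrl_unconditional(bool has_bt_en, enum soc_type soc)
{
	bool enabled = true;

	(void)soc;	/* SoC type is not consulted in this variant */
	if (!has_bt_en)
		enabled = false;
	return enabled;
}

/* After 0fb410c914eb03: only WCN6750 and WCN6855 are covered. */
static bool power_ctrl_restricted(bool has_bt_en, enum soc_type soc)
{
	bool enabled = true;

	if (!has_bt_en && (soc == SOC_WCN6750 || soc == SOC_WCN6855))
		enabled = false;
	return enabled;
}

/* With this patch: WCN7850 joins the condition. */
static bool power_ctrl_fixed(bool has_bt_en, enum soc_type soc)
{
	bool enabled = true;

	if (!has_bt_en && (soc == SOC_WCN6750 || soc == SOC_WCN6855 ||
			   soc == SOC_WCN7850))
		enabled = false;
	return enabled;
}

/* WCN7850 without a software-controlled bt_en, in each of the three states. */
static void wcn7850_power_ctrl_demo(void)
{
	assert(!power_ctrl_unconditional(false, SOC_WCN7850));
	assert(power_ctrl_restricted(false, SOC_WCN7850));	/* the regression */
	assert(!power_ctrl_fixed(false, SOC_WCN7850));
}
```

Only the middle variant leaves power control enabled for WCN7850 without
bt_en, which matches the observation that the bug appears only once
`0fb410c914eb03` is present.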

## PHASE 4: MAILING LIST RESEARCH

The patch went through v1 -> v2 -> v3. v1 had review feedback from
Dmitry Baryshkov requesting more context about affected platforms. v2/v3
added platform details (Lemans-EVK, Monaco-EVK). Bartosz Golaszewski
(who wrote the prerequisite commit) gave Reviewed-by on v3. Luiz von
Dentz (BT maintainer) applied it to bluetooth-next. No NAKs, no concerns
about the code change itself.

## PHASE 5: CODE SEMANTIC ANALYSIS

`power_ctrl_enabled` controls two behaviors in `qca_serdev_probe()`:
1. Setting `HCI_QUIRK_NON_PERSISTENT_SETUP` quirk
2. Registering `qca_power_off` as shutdown handler

When `power_ctrl_enabled` is incorrectly true for WCN7850 with
HW-managed bt_en:
- `qca_power_off` -> `qca_power_shutdown()` falls to default case:
  `gpiod_set_value_cansleep(NULL, 0)` which is a no-op (safe)
- But the quirk `HCI_QUIRK_NON_PERSISTENT_SETUP` being set causes the
  SSR recovery code (`fce1a9244a0f8`) to skip critical recovery steps,
  leading to SSR failure (HCI reset timeout)

## PHASE 6: STABLE TREE ANALYSIS

- **v6.6**: WCN7850 exists (12 references). Code structure is completely
  different (`IS_ERR_OR_NULL` pattern). Bug exists differently but this
  patch wouldn't apply without significant rework.
- **v6.12/v6.14**: `3d05fc82237aa9` is present but `0fb410c914eb03` is
  NOT. The check is unconditional (`if (!qcadev->bt_en)
  power_ctrl_enabled = false;`), so the bug does NOT exist yet. However,
  when `0fb410c914eb03` (tagged `Cc: stable`) is backported, it WILL
  introduce this bug by restricting the check to WCN6750/WCN6855 only.
- This patch must be paired with `0fb410c914eb03` when backporting.

## PHASE 7: SUBSYSTEM AND MAINTAINER CONTEXT

- Subsystem: Bluetooth driver (IMPORTANT, affects Qualcomm BT hardware
  users)
- Criticality: Driver-specific, but WCN7850 is a widely-used Qualcomm BT
  chip (SM8550 platforms and others)
- Active subsystem with regular contributions

## PHASE 8: IMPACT AND RISK ASSESSMENT

- **Who is affected**: WCN7850 users on platforms where bt_en is
  HW-controlled (Lemans-EVK, Monaco-EVK with M.2 slot)
- **Trigger**: Always, during probe on affected hardware. Not
  timing-dependent.
- **Failure mode**: SSR failure - BT controller cannot recover from
  firmware crash. HCI reset times out. MEDIUM-HIGH severity (Bluetooth
  becomes non-functional after FW crash until reboot)
- **Benefit**: High for affected hardware users
- **Risk**: Very low - 1 line addition to an OR condition, obviously
  correct pattern

## PHASE 9: FINAL SYNTHESIS

**Evidence FOR:**
- Fixes real hardware issue on WCN7850 platforms with HW-managed bt_en
- Trivially small (1 line), obviously correct
- Reviewed by the author of the prerequisite code
- Applied by BT subsystem maintainer
- Follows established pattern (WCN6750/WCN6855 already handled)
- Without this fix, SSR recovery fails on affected platforms
- Falls under "hardware quirk/workaround" exception category

**Evidence AGAINST:**
- Depends on `0fb410c914eb03` (not yet in stable trees)
- Limited platform scope (Lemans-EVK, Monaco-EVK)
- In current stable trees, the bug doesn't exist yet (unconditional
  check)

**Stable rules checklist:**
1. Obviously correct and tested? YES (trivial 1-line addition, reviewed)
2. Fixes a real bug? YES (SSR failure on affected hardware)
3. Important issue? YES (BT becomes non-functional after FW crash)
4. Small and contained? YES (1 line change)
5. No new features? YES (just extends existing condition)
6. Can apply to stable? YES, when paired with `0fb410c914eb03`

## Verification

- [Phase 1] Parsed tags: Reviewed-by from Bartosz Golaszewski
  (prerequisite commit author), applied by BT maintainer Luiz von Dentz
- [Phase 2] Diff: 1 line added to OR condition in `qca_serdev_probe()`,
  adding `QCA_WCN7850`
- [Phase 3] `git blame`: condition introduced by `0fb410c914eb03`
  (2025-05-27), which restricted check to WCN6750/WCN6855
- [Phase 3] `0fb410c914eb03` has `Cc: stable`, `Fixes: 3d05fc82237a` -
  will be backported
- [Phase 3] `0fb410c914eb03` NOT in v6.6, v6.12, v6.14 (verified with
  `git merge-base`)
- [Phase 3] WCN7850 support (`e0c1278ac89b0`) is in v6.6 (2023-08-24)
- [Phase 4] Patchew: v3 reviewed by Bartosz Golaszewski, applied by Luiz
  von Dentz. No NAKs.
- [Phase 5] `power_ctrl_enabled=true` causes incorrect
  `HCI_QUIRK_NON_PERSISTENT_SETUP` and SSR recovery failure
- [Phase 5] Verified `qca_power_shutdown()` default case:
  `gpiod_set_value_cansleep(NULL, 0)` is safe (no-op)
- [Phase 6] v6.12/v6.14: unconditional check exists, bug not present.
  Bug will appear when `0fb410c914eb03` is backported
- [Phase 6] v6.6: different code structure, would need rework
- [Phase 8] Failure mode: SSR failure (HCI reset timeout, BT
  non-functional until reboot), MEDIUM-HIGH severity

**YES**

 drivers/bluetooth/hci_qca.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c
index bb9f002aa85e9..edc907c4e870a 100644
--- a/drivers/bluetooth/hci_qca.c
+++ b/drivers/bluetooth/hci_qca.c
@@ -2471,7 +2471,8 @@ static int qca_serdev_probe(struct serdev_device *serdev)

 		if (!qcadev->bt_en &&
 		    (data->soc_type == QCA_WCN6750 ||
-		     data->soc_type == QCA_WCN6855))
+		     data->soc_type == QCA_WCN6855 ||
+		     data->soc_type == QCA_WCN7850))
 			power_ctrl_enabled = false;

 		qcadev->sw_ctrl = devm_gpiod_get_optional(&serdev->dev, "swctrl",
-- 
2.53.0


* [PATCH AUTOSEL 7.0-6.6] Bluetooth: hci_qca: Fix missing wakeup during SSR memdump handling
       [not found] <20260420131539.986432-1-sashal@kernel.org>
  2026-04-20 13:07 ` [PATCH AUTOSEL 7.0-5.10] net: qrtr: fix endian handling of confirm_rx field Sasha Levin
  2026-04-20 13:08 ` [PATCH AUTOSEL 7.0-6.1] Bluetooth: hci_qca: disable power control for WCN7850 when bt_en is not defined Sasha Levin
@ 2026-04-20 13:08 ` Sasha Levin
  2 siblings, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2026-04-20 13:08 UTC (permalink / raw)
  To: patches, stable
  Cc: Shuai Zhang, Bartosz Golaszewski, Paul Menzel,
	Luiz Augusto von Dentz, Sasha Levin, brgl, marcel, luiz.dentz,
	linux-arm-msm, linux-bluetooth, linux-kernel

From: Shuai Zhang <shuai.zhang@oss.qualcomm.com>

[ Upstream commit c347ca17d62a32c25564fee0ca3a2a7bc2d5fd6f ]

When a Bluetooth controller encounters a coredump, it triggers the
Subsystem Restart (SSR) mechanism. The controller first reports the
coredump data and, once the upload is complete, sends a hw_error
event. The host relies on this event to proceed with subsequent
recovery actions.

If the host has not finished processing the coredump data when the
hw_error event is received, it waits until either the processing is
complete or the 8-second timeout expires before handling the event.

The current implementation clears QCA_MEMDUMP_COLLECTION using
clear_bit(), which does not wake up waiters sleeping in
wait_on_bit_timeout(). As a result, the waiting thread may remain
blocked until the timeout expires even if the coredump collection
has already completed.

Fix this by clearing QCA_MEMDUMP_COLLECTION with
clear_and_wake_up_bit(), which also wakes up the waiting thread and
allows the hw_error handling to proceed immediately.

Test case:
- Trigger a controller coredump using:
    hcitool cmd 0x3f 0c 26
- Tested on QCA6390.
- Capture HCI logs using btmon.
- Verify that the delay between receiving the hw_error event and
  initiating the power-off sequence is reduced compared to the
  timeout-based behavior.

Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Shuai Zhang <shuai.zhang@oss.qualcomm.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now I have all the information needed for a complete analysis. Let me
compile it.

---

## PHASE 1: COMMIT MESSAGE FORENSICS

### Step 1.1: PARSE THE SUBJECT LINE
- **Subsystem prefix:** Bluetooth: hci_qca
- **Action verb:** "Fix" - explicitly a bug fix
- **One-line summary:** Fix missing wakeup during SSR memdump handling -
  `clear_bit()` doesn't wake up waiters sleeping in
  `wait_on_bit_timeout()`.

### Step 1.2: PARSE ALL COMMIT MESSAGE TAGS
- **Reviewed-by:** Bartosz Golaszewski (Qualcomm contributor,
  knowledgeable in this driver)
- **Reviewed-by:** Paul Menzel (known active reviewer)
- **Signed-off-by:** Shuai Zhang <shuai.zhang@oss.qualcomm.com> (author,
  Qualcomm - QCA chipset vendor)
- **Signed-off-by:** Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
  (Bluetooth maintainer)
- No Fixes: tag, no Reported-by, no syzbot. Absence of Fixes: is
  expected.

### Step 1.3: ANALYZE THE COMMIT BODY TEXT
- **Bug:** When Bluetooth controller encounters a coredump (SSR), it
  sends memdump data then sends `hw_error` event. The host calls
  `wait_on_bit_timeout()` on `QCA_MEMDUMP_COLLECTION` to wait for the
  collection to complete. But the collection worker clears the bit with
  `clear_bit()`, which does NOT wake up the waiter.
- **Symptom:** The waiting thread blocks for the full 8-second timeout
  (`MEMDUMP_TIMEOUT_MS = 8000`) even when collection finishes early.
- **Root cause:** API misuse - `wait_on_bit_timeout()` documentation
  explicitly requires wakeup via `wake_up_bit()` or
  `clear_and_wake_up_bit()`.
- **Test:** Tested on QCA6390 hardware using `hcitool` and btmon.

### Step 1.4: DETECT HIDDEN BUG FIXES
This is an explicitly stated bug fix, not disguised. The
`wait_on_bit_timeout` API documentation (in `include/linux/wait_bit.h`)
states: "The clearing of the bit must be signalled with wake_up_bit(),
often as clear_and_wake_up_bit()." Using plain `clear_bit()` is an API
violation.

---

## PHASE 2: DIFF ANALYSIS - LINE BY LINE

### Step 2.1: INVENTORY THE CHANGES
- **File:** `drivers/bluetooth/hci_qca.c` only
- **Changes:** 2 lines changed (2 `clear_bit` → `clear_and_wake_up_bit`)
- **Functions modified:** `qca_controller_memdump()` (2 locations)
- **Scope:** Single-file, single-function surgical fix

### Step 2.2: UNDERSTAND THE CODE FLOW CHANGE
**Hunk 1 (line 1108):** Error path when `hci_devcd_init()` fails:
- Before: `clear_bit(QCA_MEMDUMP_COLLECTION, &qca->flags)` — clears bit
  but no wakeup
- After: `clear_and_wake_up_bit(QCA_MEMDUMP_COLLECTION, &qca->flags)` —
  clears bit AND wakes waiting thread

**Hunk 2 (line 1186):** Normal completion path (last sequence received):
- Before: same `clear_bit()` without wakeup
- After: same `clear_and_wake_up_bit()` with wakeup

### Step 2.3: IDENTIFY THE BUG MECHANISM
This is a **synchronization bug**: missing wakeup. The
`qca_wait_for_dump_collection()` function calls `wait_on_bit_timeout()`
which puts the thread to sleep waiting for the bit to be cleared AND a
wakeup signal. Without the wakeup, the thread sleeps for the full
8-second timeout.
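
The missing-wakeup behaviour can be illustrated with a toy model. The
sketch below is a deliberately simplified, single-threaded simulation
and not kernel code: `struct model`, `model_clear_bit()`,
`model_clear_and_wake_up_bit()` and `model_wait()` are hypothetical
names, and the only assumption carried over is that a sleeping waiter
rechecks the bit when it is woken or when its timeout expires.

```c
#include <assert.h>
#include <stdbool.h>

#define TIMEOUT_MS 8000u	/* mirrors MEMDUMP_TIMEOUT_MS from the analysis */

struct model {
	bool bit_set;		/* stands in for QCA_MEMDUMP_COLLECTION */
	bool waiter_woken;	/* set only by an explicit wakeup */
};

/* Models clear_bit(): the bit changes, the sleeping waiter is not notified. */
static void model_clear_bit(struct model *m)
{
	m->bit_set = false;
}

/* Models clear_and_wake_up_bit(): clear the bit, then wake the waiter. */
static void model_clear_and_wake_up_bit(struct model *m)
{
	m->bit_set = false;
	m->waiter_woken = true;
}

/*
 * Simulates a waiter that went to sleep at t = 0 and a worker that finishes
 * the dump at done_ms. Returns the simulated time at which the waiter resumes.
 */
static unsigned int model_wait(unsigned int done_ms, bool use_wakeup)
{
	struct model m = { .bit_set = true, .waiter_woken = false };
	unsigned int t;

	for (t = 0; t < TIMEOUT_MS; t++) {
		if (t == done_ms) {
			if (use_wakeup)
				model_clear_and_wake_up_bit(&m);
			else
				model_clear_bit(&m);
		}
		/* A woken waiter rechecks the bit and resumes if it is clear. */
		if (m.waiter_woken && !m.bit_set)
			return t;
	}
	return TIMEOUT_MS;	/* nobody woke the waiter; it resumes at timeout */
}

/* Dump finishes after 500 simulated ms in both runs. */
static void wakeup_model_demo(void)
{
	assert(model_wait(500, false) == TIMEOUT_MS);	/* plain clear */
	assert(model_wait(500, true) == 500);		/* clear plus wakeup */
}
```

In this model the plain clear always costs the full timeout regardless
of when the worker finishes, while the clear-and-wake variant resumes
the waiter at the moment the bit is cleared.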

### Step 2.4: ASSESS THE FIX QUALITY
- **Obviously correct:** Yes. The `wait_on_bit_timeout` documentation
  explicitly states wakeup is required.
- **Minimal:** Yes, 2 line changes.
- **Regression risk:** Negligible. `clear_and_wake_up_bit()` does
  exactly what `clear_bit()` does plus a wakeup. No new side effects.

---

## PHASE 3: GIT HISTORY INVESTIGATION

### Step 3.1: BLAME THE CHANGED LINES
- Line 1108 (`clear_bit`): Introduced by `06d3fdfcdf5cef` (Sai Teja
  Aluvala, 2023-06-14) — v6.6-rc1
- Line 1186 (`clear_bit`): Introduced by `7c2c3e63e1e97c` (Venkata
  Lakshmi, 2020-02-14) — v5.7-rc1
- `wait_on_bit_timeout` (line 1606): Introduced by `d841502c79e3fd`
  (Balakrishna Godavarthi, 2020-01-02) — v5.6-rc1

So the bug at line 1186 has existed since v5.7, and the bug at line 1108
since v6.6.

### Step 3.2: FOLLOW THE FIXES: TAG
No Fixes: tag (expected).

### Step 3.3: CHECK FILE HISTORY
`hci_qca.c` is actively maintained, with 73 commits since v5.15.

### Step 3.4: CHECK THE AUTHOR'S OTHER COMMITS
Shuai Zhang is a Qualcomm contributor with multiple commits to the QCA
Bluetooth stack. The fix was reviewed by the Bluetooth maintainer (Luiz
Augusto von Dentz).

### Step 3.5: CHECK FOR DEPENDENT/PREREQUISITE COMMITS
None. `clear_and_wake_up_bit()` has existed since v4.17. The fix is a
drop-in replacement for `clear_bit()` at two locations.

---

## PHASE 4: MAILING LIST AND EXTERNAL RESEARCH

### Step 4.1-4.2: FIND THE ORIGINAL PATCH DISCUSSION
Found at:
https://yhbt.net/lore/lkml/177583080679.2077665.8641347877052929776.git-patchwork-notify@kernel.org/T/

The patch went through **7 revisions** (v1 through v7), indicating
extensive review:
- v5→v6: Changed from `wake_up_bit` to `clear_and_wake_up_bit` (the
  proper API)
- Applied to bluetooth-next by Luiz Augusto von Dentz (Bluetooth
  maintainer)
- Commit in bluetooth-next: `9f07d5d04826`

### Step 4.3: BUG REPORT
No external bug report; the author appears to have found the issue
through code inspection or testing.

### Step 4.4-4.5: RELATED PATCHES AND STABLE HISTORY
This is a standalone single-patch fix. No series dependencies.

---

## PHASE 5: CODE SEMANTIC ANALYSIS

### Step 5.1-5.4: KEY FUNCTIONS AND CALL CHAINS
The affected path:
1. Bluetooth controller crashes → sends memdump data → sends `hw_error`
   event
2. `qca_hw_error()` or `qca_reset()` → calls
   `qca_wait_for_dump_collection()` → `wait_on_bit_timeout()` on
   `QCA_MEMDUMP_COLLECTION`
3. Concurrently, `qca_controller_memdump()` (workqueue) processes dump
   packets
4. On completion, `qca_controller_memdump()` clears
   `QCA_MEMDUMP_COLLECTION` — but without waking up the waiter in step 2
5. Result: waiter in step 2 sleeps for full 8 seconds even though
   collection finished

Both `qca_hw_error()` and `qca_reset()` call
`qca_wait_for_dump_collection()`, so both paths are affected.
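
The effect of step 5 can be put in numbers with a deliberately simplified model. This is a sketch under stated assumptions, not driver code: `blocked_ms` is a hypothetical helper, the 8000 ms value is the `MEMDUMP_TIMEOUT_MS` figure quoted in this analysis, and scheduling latency is ignored.

```c
#include <stdbool.h>

/* Timeout value quoted in this analysis for the driver's MEMDUMP_TIMEOUT_MS. */
#define MEMDUMP_TIMEOUT_MS 8000L

/*
 * Simplified model of how long the waiter stays blocked.
 *
 * done_ms: when the memdump worker clears the bit, measured from the
 *          moment the waiter starts waiting (<= 0: bit already clear).
 * wakes:   true if the clear is paired with a wakeup (the fixed code).
 */
static long blocked_ms(long done_ms, bool wakes)
{
	if (done_ms <= 0)
		return 0;		/* bit already clear, no sleep needed */
	if (wakes && done_ms < MEMDUMP_TIMEOUT_MS)
		return done_ms;		/* woken as soon as the dump completes */
	return MEMDUMP_TIMEOUT_MS;	/* nobody wakes us: sleep until timeout */
}
```

In this model a dump that completes 150 ms into the wait blocks the caller for 8000 ms before the fix (`blocked_ms(150, false)`) and for 150 ms after it (`blocked_ms(150, true)`).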

### Step 5.5: SIMILAR PATTERNS
No other `clear_bit`/`wait_on_bit_timeout` mismatches found in this
file.

---

## PHASE 6: CROSS-REFERENCING AND STABLE TREE ANALYSIS

### Step 6.1: DOES THE BUGGY CODE EXIST IN STABLE TREES?
- The `clear_bit` at the completion path (line 1186) has been present
  since v5.7, so it exists in stable trees 5.10.y, 5.15.y, 6.1.y, 6.6.y,
  and 6.12.y.
- The `clear_bit` at the error path (line 1108) was introduced in v6.6,
  so it exists only in 6.6.y and 6.12.y.
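
The mapping from stable tree to applicable hunk follows directly from the two introduction versions. A minimal sketch, assuming the v5.7 and v6.6 figures from the blame in Step 3.1 are correct; `struct kver` and the helper names are hypothetical and only illustrate the version comparison:

```c
#include <stdbool.h>

/* Kernel release as (major, minor), e.g. {6, 6} for v6.6 or 6.6.y. */
struct kver {
	int major;
	int minor;
};

/* True if v is the same release as min or a later one. */
static bool kver_at_least(struct kver v, struct kver min)
{
	return v.major > min.major ||
	       (v.major == min.major && v.minor >= min.minor);
}

/* Hunk 2 (completion path): the plain clear_bit() dates back to v5.7. */
static bool needs_completion_hunk(struct kver tree)
{
	return kver_at_least(tree, (struct kver){ 5, 7 });
}

/* Hunk 1 (error path): the plain clear_bit() was added in v6.6. */
static bool needs_error_hunk(struct kver tree)
{
	return kver_at_least(tree, (struct kver){ 6, 6 });
}
```

For a 6.1.y tree, for example, `needs_completion_hunk` returns true and `needs_error_hunk` returns false, so only the second hunk would be relevant there.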

### Step 6.2: BACKPORT COMPLICATIONS
The patch should apply cleanly or with minor context adjustments, since
both changes are simple API call replacements. Trees older than 6.6.y
lack the code touched by the first hunk (line 1108), so only the second
hunk is relevant there.

### Step 6.3: RELATED FIXES ALREADY IN STABLE
No related fixes for this specific bug found.

---

## PHASE 7: SUBSYSTEM AND MAINTAINER CONTEXT

### Step 7.1: SUBSYSTEM CRITICALITY
- **Subsystem:** drivers/bluetooth — Bluetooth driver for Qualcomm
  chipsets
- **Criticality:** IMPORTANT — QCA Bluetooth chipsets are widely used in
  laptops, phones, and embedded systems

### Step 7.2: SUBSYSTEM ACTIVITY
Active subsystem with regular commits.

---

## PHASE 8: IMPACT AND RISK ASSESSMENT

### Step 8.1: AFFECTED POPULATION
Users of QCA Bluetooth chipsets (QCA6390 and similar) — a significant
population in the Android and laptop ecosystem.

### Step 8.2: TRIGGER CONDITIONS
Triggered when the Bluetooth controller crashes and SSR begins. This is
uncommon in normal operation, but whenever it happens (coredump, hw
error) the unnecessary 8-second delay occurs.

### Step 8.3: FAILURE MODE SEVERITY
- **Failure mode:** Unnecessary 8-second delay during Bluetooth recovery
  after controller crash
- **Severity:** MEDIUM — Not a crash, not data corruption, not a
  security issue. It's a latency bug during error recovery that affects
  usability.

### Step 8.4: RISK-BENEFIT RATIO
- **Benefit:** Eliminates unnecessary 8-second delay during SSR
  recovery. Correct API usage per documentation.
- **Risk:** Very low — 2-line change, `clear_and_wake_up_bit` is a
  strict superset of `clear_bit` (clear + wakeup). No new side effects.
- **Ratio:** Very favorable — minimal risk for a real improvement in
  error recovery behavior.

---

## PHASE 9: FINAL SYNTHESIS

### Step 9.1: COMPILE THE EVIDENCE

**FOR backporting:**
- Fixes a genuine API misuse bug (`clear_bit` instead of
  `clear_and_wake_up_bit`)
- 2-line surgical fix, trivially correct per API documentation
- Extensively reviewed (7 revisions), 2 Reviewed-by tags
- Applied by Bluetooth maintainer
- Tested on real hardware (QCA6390)
- Very low regression risk
- Bug exists since v5.7 for one path and v6.6 for the other
- `clear_and_wake_up_bit()` exists since v4.17, no dependency issues

**AGAINST backporting:**
- The bug doesn't cause crashes, data corruption, or security issues
- Only manifests during SSR (error recovery), not during normal
  operation
- It's a latency/usability improvement rather than a fix for a hard
  failure

### Step 9.2: STABLE RULES CHECKLIST
1. **Obviously correct and tested?** YES — API documentation is
   explicit, tested on hardware
2. **Fixes a real bug?** YES — missing wakeup causing unnecessary
   8-second delay
3. **Important issue?** MEDIUM — recovery delay, not crash/corruption
4. **Small and contained?** YES — 2 lines in 1 file
5. **No new features or APIs?** YES — just corrects API usage
6. **Can apply to stable trees?** YES — `clear_and_wake_up_bit` exists
   since v4.17

### Step 9.3: EXCEPTION CATEGORIES
Not applicable — this is a standard bug fix.

### Step 9.4: DECISION
The fix is tiny, obviously correct, well-reviewed, and addresses a real
bug where `wait_on_bit_timeout()` is used with `clear_bit()` instead of
`clear_and_wake_up_bit()`, causing an unnecessary 8-second blocking
delay during Bluetooth error recovery. While the impact is "only" a
latency issue during recovery (not a crash or data loss), the extremely
low risk of the fix and the real user-visible improvement make this
appropriate for stable.

---

## Verification

- [Phase 1] Parsed tags: Reviewed-by from Bartosz Golaszewski and Paul
  Menzel; SOB from author and Bluetooth maintainer
- [Phase 2] Diff analysis: 2 lines changed in
  `qca_controller_memdump()`, replacing `clear_bit()` with
  `clear_and_wake_up_bit()`
- [Phase 3] git blame: Line 1108 introduced in `06d3fdfcdf5cef`
  (v6.6-rc1); Line 1186 introduced in `7c2c3e63e1e97c` (v5.7-rc1);
  `wait_on_bit_timeout` introduced in `d841502c79e3fd` (v5.6-rc1)
- [Phase 3] git describe: `clear_and_wake_up_bit` introduced in
  `8236b0ae31c83` (v4.17-rc4), present in all active stable trees
- [Phase 4] lore thread found: patch went through v1→v7, applied to
  bluetooth-next by maintainer as `9f07d5d04826`
- [Phase 4] No NAKs or objections in the discussion thread
- [Phase 5] Call chain: `qca_hw_error()`/`qca_reset()` →
  `qca_wait_for_dump_collection()` → `wait_on_bit_timeout()` waits for
  bit cleared by `qca_controller_memdump()` workqueue
- [Phase 5] Verified `wait_on_bit_timeout()` documentation in
  `include/linux/wait_bit.h` lines 118-120 explicitly requires the clear
  to be signalled with `wake_up_bit()`, often as
  `clear_and_wake_up_bit()`
- [Phase 6] Buggy code exists in stable trees 5.10.y, 5.15.y, 6.1.y,
  6.6.y, and 6.12.y (second hunk); 6.6.y and 6.12.y (first hunk)
- [Phase 6] `MEMDUMP_TIMEOUT_MS` is 8000 (8 seconds) — confirmed at line
  54
- [Phase 8] Failure mode: 8-second unnecessary delay during Bluetooth
  SSR recovery, severity MEDIUM

**YES**

 drivers/bluetooth/hci_qca.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c
index edc907c4e870a..524e47392f919 100644
--- a/drivers/bluetooth/hci_qca.c
+++ b/drivers/bluetooth/hci_qca.c
@@ -1105,7 +1105,7 @@ static void qca_controller_memdump(struct work_struct *work)
 				qca->qca_memdump = NULL;
 				qca->memdump_state = QCA_MEMDUMP_COLLECTED;
 				cancel_delayed_work(&qca->ctrl_memdump_timeout);
-				clear_bit(QCA_MEMDUMP_COLLECTION, &qca->flags);
+				clear_and_wake_up_bit(QCA_MEMDUMP_COLLECTION, &qca->flags);
 				clear_bit(QCA_IBS_DISABLED, &qca->flags);
 				mutex_unlock(&qca->hci_memdump_lock);
 				return;
@@ -1183,7 +1183,7 @@ static void qca_controller_memdump(struct work_struct *work)
 			kfree(qca->qca_memdump);
 			qca->qca_memdump = NULL;
 			qca->memdump_state = QCA_MEMDUMP_COLLECTED;
-			clear_bit(QCA_MEMDUMP_COLLECTION, &qca->flags);
+			clear_and_wake_up_bit(QCA_MEMDUMP_COLLECTION, &qca->flags);
 		}

 		mutex_unlock(&qca->hci_memdump_lock);
-- 
2.53.0

