[PATCH AUTOSEL 7.0-6.19] spi: tegra210-quad: Fix false positive WARN on interrupt timeout with transfer complete

public inbox for linux-spi@vger.kernel.org
 help / color / mirror / Atom feed

From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Breno Leitao <leitao@debian.org>, Mark Brown <broonie@kernel.org>,
	Sasha Levin <sashal@kernel.org>,
	ldewangan@nvidia.com, thierry.reding@kernel.org,
	jonathanh@nvidia.com, skomatineni@nvidia.com,
	linux-tegra@vger.kernel.org, linux-spi@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 7.0-6.19] spi: tegra210-quad: Fix false positive WARN on interrupt timeout with transfer complete
Date: Mon, 20 Apr 2026 09:19:21 -0400	[thread overview]
Message-ID: <20260420132314.1023554-167-sashal@kernel.org> (raw)
In-Reply-To: <20260420132314.1023554-1-sashal@kernel.org>

From: Breno Leitao <leitao@debian.org>

[ Upstream commit 5b94c94caafcad3c77cc6b1d213a93bf5dc0a98e ]

The WARN_ON_ONCE/WARN_ON fired unconditionally on any completion
timeout, including the recoverable case where the interrupt was lost but
the hardware actually finished the transfer. This produced a noisy splat
with a full call trace even though the driver successfully recovered via
tegra_qspi_handle_timeout().

Since tegra210 uses threaded interrupts, the transfer completion can be
signaled before the interrupt fires, making this false positive case
common in practice.

Almost all the hosts I sysadmin in my fleet produce the following splat:

	WARNING: CPU: 47 PID: 844 at drivers/spi/spi-tegra210-quad.c:1226 tegra_qspi_transfer_one_message+0x8a4/0xba8
	....
	tegra-qspi NVDA1513:00: QSPI interrupt timeout, but transfer complete

Move WARN_ON_ONCE/WARN_ON to fire only on real unrecoverable timeouts,
i.e., when tegra_qspi_handle_timeout() confirms the hardware did NOT
complete. This makes the warning actionable instead of just polluting
the metrics.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260408-tegra_warn-v1-1-669a3bc74d77@debian.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now I have all the data needed for a comprehensive analysis. Here is the
full walkthrough:

---

## PHASE 1: COMMIT MESSAGE FORENSICS

### Step 1.1: PARSE THE SUBJECT LINE
Record: [spi: tegra210-quad] [Fix] [false positive WARN on interrupt
timeout with transfer complete]

### Step 1.2: PARSE ALL COMMIT MESSAGE TAGS
- **Signed-off-by:** Breno Leitao <leitao@debian.org> (author)
- **Link:** https://patch.msgid.link/20260408-tegra_warn-v1-1-
  669a3bc74d77@debian.org
- **Signed-off-by:** Mark Brown <broonie@kernel.org> (SPI subsystem
  maintainer)
- No Fixes: tag (expected for autosel candidates)
- No Reported-by: (author is the reporter, via commit message text)
- No Cc: stable tag (expected)

Record: Author is a regular tegra210-quad contributor (9 commits in this
driver). Patch was accepted by the SPI subsystem maintainer Mark Brown.

### Step 1.3: ANALYZE THE COMMIT BODY TEXT
The bug: `WARN_ON_ONCE`/`WARN_ON` fires unconditionally on any
completion timeout, including the recoverable case where the interrupt
was lost but hardware finished the transfer. The driver already has
`tegra_qspi_handle_timeout()` that recovers from this situation.

Symptom: Full kernel warning stack trace on every recoverable timeout:
```
WARNING: CPU: 47 PID: 844 at drivers/spi/spi-tegra210-quad.c:1226
tegra_qspi_transfer_one_message+0x8a4/0xba8
```

Scale: Author states "Almost all the hosts I sysadmin in my fleet
produce the following splat." This is common because tegra210 uses
threaded interrupts where transfer completion can be signaled before the
IRQ fires.

Root cause: The WARN fires before `tegra_qspi_handle_timeout()` has a
chance to determine if the hardware actually completed.

Record: [False positive WARN on recoverable timeout] [WARNING splat with
full call trace on every occurrence] [Common in practice across fleet of
Tegra machines] [Root cause: WARN placed before recovery check]

### Step 1.4: DETECT HIDDEN BUG FIXES
This IS a bug fix - the WARN fires when there's no actual problem. With
`panic_on_warn=1` (verified exists in `kernel/panic.c`), this causes
kernel panics on recoverable situations.

Record: [Yes, this is a real bug fix. WARN fires on non-error
conditions, which causes panics with panic_on_warn=1 and pollutes
logs/monitoring on all other systems.]

---

## PHASE 2: DIFF ANALYSIS - LINE BY LINE

### Step 2.1: INVENTORY THE CHANGES
- **File:** `drivers/spi/spi-tegra210-quad.c`
- **Two hunks:** Both doing the same logical change
  - Hunk 1 (`tegra_qspi_combined_seq_xfer`): -1 line, +2 lines (net +1)
  - Hunk 2 (`tegra_qspi_non_combined_seq_xfer`): -1 line, +2 lines (net
    +1)
- **Functions modified:** `tegra_qspi_combined_seq_xfer`,
  `tegra_qspi_non_combined_seq_xfer`
- **Scope:** Single-file surgical fix, two identical pattern changes

Record: [1 file, 2 functions, +2/-2 logical changes, scope: surgical
fix]

### Step 2.2: UNDERSTAND THE CODE FLOW CHANGE
**Hunk 1 (combined_seq_xfer):**
- BEFORE: `if (WARN_ON_ONCE(ret == 0))` fires WARN then calls
  `tegra_qspi_handle_timeout()` inside the `if` body
- AFTER: `if (ret == 0)` (no WARN), then calls
  `tegra_qspi_handle_timeout()`, and only `WARN_ON_ONCE(1)` if
  handle_timeout returns < 0 (real unrecoverable timeout)

**Hunk 2 (non_combined_seq_xfer):**
- BEFORE: `if (WARN_ON(ret == 0))` fires WARN then calls
  `tegra_qspi_handle_timeout()`
- AFTER: `if (ret == 0)` (no WARN), then calls
  `tegra_qspi_handle_timeout()`, and only `WARN_ON(1)` if handle_timeout
  returns < 0

Record: [Both hunks: WARN moved from unconditional timeout to only real-
timeout branch after recovery check]

### Step 2.3: IDENTIFY THE BUG MECHANISM
Category: **Logic/correctness fix** - The WARN was placed at the wrong
scope level. It fires for all timeouts (including recoverable ones where
hardware completed but interrupt was delayed), when it should only fire
for genuine failures.

Record: [Logic bug: WARN fires at wrong scope. Fix moves WARN to only
trigger after confirming hardware did NOT complete.]

### Step 2.4: ASSESS THE FIX QUALITY
- Obviously correct: YES - the WARN should only fire for real errors,
  not recoverable conditions
- Minimal/surgical: YES - just 2 identical changes moving WARN placement
- Regression risk: VERY LOW - the same WARN still fires on real
  timeouts, and the recovery path is untouched
- No red flags: single file, same subsystem, no API changes

Record: [Fix is obviously correct, minimal, zero regression risk]

---

## PHASE 3: GIT HISTORY INVESTIGATION

### Step 3.1: BLAME THE CHANGED LINES
- Line 1226 (`WARN_ON_ONCE`): introduced by `41c721fc09393` (Breno
  Leitao, 2025-04-01) in v6.15
- Line 1343 (`WARN_ON`): original from `921fc1838fb036` (Sowjanya
  Komatineni, 2020-12-21) in v5.11
- Lines 1227-1248 (timeout handling): introduced by `380fd29d57abe`
  (Vishwaroop A, 2025-10-28) in v6.19

Record: [WARN_ON from v5.11, WARN_ON_ONCE since v6.15, recovery logic
since v6.19. The false positive only became apparent after 380fd29d57abe
added recovery, revealing the WARN fires even when recovery succeeds.]

### Step 3.2: FOLLOW THE FIXES: TAG
No Fixes: tag present. However, logically this fixes the combination of
`41c721fc09393` (which introduced WARN_ON_ONCE in combined path) and
`380fd29d57abe` (which added the recovery but didn't move the WARN).

Record: [No Fixes: tag. Logically fixes the interaction between WARN
placement and the recovery path added in 380fd29d57abe.]

### Step 3.3: CHECK FILE HISTORY FOR RELATED CHANGES
Recent commits show heavy work on this driver by the same author:
- Spinlock protection for curr_xfer (6 patches, v6.19)
- IRQ_HANDLED fix (v6.19)
- Rate limiting (v6.15)
- WARN_ON_ONCE (v6.15)

Record: [Standalone fix. No prerequisites beyond the already-merged
recovery handler. Author is the most active recent contributor to this
driver.]

### Step 3.4: CHECK THE AUTHOR'S OTHER COMMITS
Breno Leitao has 9 commits to this driver, including the prior
WARN_ON_ONCE change and the spinlock protection series. He is clearly
intimately familiar with this driver and has been fixing real production
issues.

Record: [Active contributor to this driver, fixing real production
issues across a fleet.]

### Step 3.5: CHECK FOR DEPENDENT/PREREQUISITE COMMITS
This patch depends on `380fd29d57abe` ("Check hardware status on
timeout") which introduced `tegra_qspi_handle_timeout()`. That commit is
in v6.19+.

Record: [Requires tegra_qspi_handle_timeout() from 380fd29d57abe
(v6.19). Patch cannot apply to older stable trees.]

---

## PHASE 4: MAILING LIST AND EXTERNAL RESEARCH

### Step 4.1: FIND THE ORIGINAL PATCH DISCUSSION
Lore.kernel.org is behind Anubis anti-bot protection. B4 dig could not
be used directly on the new commit hash (not yet in tree).

The patch was submitted as v1 on 2026-04-08 (based on msgid). It was a
single patch (not part of a series). Mark Brown applied it directly,
suggesting straightforward acceptance.

Record: [Patch accepted directly by SPI maintainer. No concerns or NAKs
visible. v1 was applied directly (no revisions needed).]

### Step 4.2: CHECK WHO REVIEWED THE PATCH
Mark Brown (SPI subsystem maintainer) signed off, indicating direct
review and acceptance.

Record: [Mark Brown (subsystem maintainer) reviewed and applied.]

### Step 4.3-4.5: BUG REPORT / RELATED PATCHES / STABLE DISCUSSION
The commit message itself serves as the bug report - the author
describes fleet-wide impact. No separate bug report.

Record: [Author is the reporter. Fleet-wide impact documented in commit
message.]

---

## PHASE 5: CODE SEMANTIC ANALYSIS

### Step 5.1: IDENTIFY KEY FUNCTIONS IN THE DIFF
- `tegra_qspi_combined_seq_xfer()` - Combined sequence transfer path
- `tegra_qspi_non_combined_seq_xfer()` - Non-combined sequence transfer
  path

### Step 5.2: TRACE CALLERS
Both functions are called from `tegra_qspi_transfer_one_message()`,
which is the SPI core transfer callback. This is the main transfer path
for ALL QSPI operations on Tegra.

Record: [Called from core SPI transfer path - every QSPI transaction
goes through this code.]

### Step 5.3-5.4: TRACE CALLEES / CALL CHAIN
The affected code calls `tegra_qspi_handle_timeout()` which checks
hardware status (QSPI_RDY bit) and manually processes completion if
hardware finished. This is a standard SPI data path.

Record: [Standard SPI data transfer path, reachable from any QSPI
operation on Tegra platforms.]

### Step 5.5: SEARCH FOR SIMILAR PATTERNS
The exact same pattern exists in both functions (combined and non-
combined), and the fix addresses both identically.

Record: [Two instances of the same pattern, both fixed.]

---

## PHASE 6: CROSS-REFERENCING AND STABLE TREE ANALYSIS

### Step 6.1: DOES THE BUGGY CODE EXIST IN STABLE TREES?
- **v6.19.y:** YES - has `tegra_qspi_handle_timeout()` and both
  WARN_ON/WARN_ON_ONCE. Patch applies cleanly (verified code is
  identical to v7.0).
- **v6.18.y and earlier:** NO - `tegra_qspi_handle_timeout()` does not
  exist. The timeout handling is completely different. This patch is
  INAPPLICABLE.
- **v6.6.y, v6.1.y, v5.15.y:** NO - completely different timeout code.

Record: [Applicable ONLY to v6.19.y and v7.0. No applicability to older
stable trees.]

### Step 6.2: CHECK FOR BACKPORT COMPLICATIONS
For v6.19.y: The code is identical - clean apply expected.

Record: [Clean apply to v6.19.y. Not applicable to older trees.]

### Step 6.3: CHECK IF RELATED FIXES ARE ALREADY IN STABLE
No related fix is in stable for this specific false positive WARN issue.

Record: [No existing fix in stable for this issue.]

---

## PHASE 7: SUBSYSTEM AND MAINTAINER CONTEXT

### Step 7.1: IDENTIFY THE SUBSYSTEM AND ITS CRITICALITY
- Subsystem: `drivers/spi` - SPI bus driver for Tegra QSPI controller
- Criticality: IMPORTANT for Tegra (NVIDIA) platforms (widely used in
  embedded/automotive/Jetson)

Record: [SPI driver subsystem, IMPORTANT for Tegra/Jetson platforms]

### Step 7.2: ASSESS SUBSYSTEM ACTIVITY
Very active - 20 recent commits to this file, ongoing improvements and
bug fixes.

Record: [Very active subsystem with ongoing fixes]

---

## PHASE 8: IMPACT AND RISK ASSESSMENT

### Step 8.1: DETERMINE WHO IS AFFECTED
All users of Tegra210+ QSPI hardware. This includes NVIDIA Jetson
platforms widely used in robotics, automotive, and embedded
applications.

Record: [All Tegra QSPI users - Jetson/embedded platforms]

### Step 8.2: DETERMINE THE TRIGGER CONDITIONS
- Trigger: Any QSPI transfer where interrupt is delayed (common with
  threaded IRQs)
- Frequency: Very common - author reports "almost all hosts in my fleet"
- No privilege required - happens during normal SPI operations

Record: [Very common, happens during normal QSPI operations, no special
trigger needed]

### Step 8.3: DETERMINE THE FAILURE MODE SEVERITY
- Without `panic_on_warn`: MEDIUM - log pollution, monitoring noise,
  full stack trace on recoverable events
- With `panic_on_warn=1`: CRITICAL - kernel panic on a non-error
  condition
- The warning count can be extremely high (94,451 in commit
  41c721fc09393's message)

Record: [MEDIUM severity (CRITICAL with panic_on_warn=1). False positive
warnings pollute logs and can cause panics.]

### Step 8.4: CALCULATE RISK-BENEFIT RATIO
- **BENEFIT:** HIGH - eliminates false positive warnings that are common
  across Tegra fleet. Prevents panics on `panic_on_warn=1` systems.
  Makes the warning actionable (only fires on real errors).
- **RISK:** VERY LOW - 2-line change per function, obviously correct, no
  behavioral change except WARN placement.

Record: [HIGH benefit, VERY LOW risk. Excellent ratio.]

---

## PHASE 9: FINAL SYNTHESIS

### Step 9.1: COMPILE THE EVIDENCE

**FOR backporting:**
- Fixes a real, common false positive WARN that fires during normal
  recoverable operation
- Can cause kernel panics on `panic_on_warn=1` systems
- Author documents fleet-wide impact ("almost all hosts in my fleet")
- Fix is tiny (2 identical 1-line changes), obviously correct, zero
  regression risk
- Accepted directly by SPI maintainer Mark Brown without revisions
- Author is an experienced contributor to this driver with 9 commits
- Applies cleanly to v6.19.y stable tree

**AGAINST backporting:**
- Only applicable to v6.19.y (depends on v6.19's
  `tegra_qspi_handle_timeout()`)
- Without `panic_on_warn`, this is "just" log noise (though significant
  noise)
- Limited to Tegra QSPI hardware users

### Step 9.2: APPLY THE STABLE RULES CHECKLIST
1. Obviously correct and tested? **YES** - trivially verifiable, tested
   in author's fleet
2. Fixes a real bug that affects users? **YES** - false positive WARN,
   panic with panic_on_warn
3. Important issue? **YES** - WARN with full call trace on every
   recoverable timeout; panic on panic_on_warn
4. Small and contained? **YES** - 2 identical single-line changes in one
   file
5. No new features or APIs? **YES** - just moves WARN placement
6. Can apply to stable trees? **YES** - cleanly to v6.19.y

### Step 9.3: CHECK FOR EXCEPTION CATEGORIES
Not an exception category - this is a straightforward bug fix.

### Step 9.4: MAKE YOUR DECISION
The fix is small, surgical, obviously correct, and fixes a real problem
that affects real Tegra users in production. It eliminates false
positive warnings that can cause panics on `panic_on_warn=1` systems and
significantly reduces log noise. The fix meets all stable kernel
criteria.

---

## Verification

- [Phase 1] Parsed tags: Signed-off-by author (Breno Leitao), Signed-
  off-by maintainer (Mark Brown), Link to patch
- [Phase 2] Diff analysis: 2 identical changes in 2 functions, moving
  WARN from outer scope to real-timeout-only branch
- [Phase 3] git blame: WARN_ON_ONCE at line 1226 from commit
  41c721fc09393 (v6.15); recovery handler from 380fd29d57abe (v6.19)
- [Phase 3] git show 380fd29d57abe: confirmed this introduced
  tegra_qspi_handle_timeout(), the function that makes the false
  positive detectable
- [Phase 3] git show 41c721fc09393: confirmed 94,451 warnings reported
  per host
- [Phase 3] git log --author: author has 9 commits to this driver
- [Phase 4] b4 dig found prior series by same author; current patch is
  standalone v1 accepted directly
- [Phase 5] Functions modified are in the core SPI transfer path, called
  from tegra_qspi_transfer_one_message
- [Phase 6] v6.19 code verified identical to v7.0 for affected areas
  (tegra_qspi_handle_timeout exists, WARN_ON locations match)
- [Phase 6] v6.12, v6.6 confirmed: NO tegra_qspi_handle_timeout
  function, completely different timeout code - patch inapplicable
- [Phase 6] v6.15, v6.18 confirmed: WARN_ON_ONCE exists but NO
  tegra_qspi_handle_timeout - patch inapplicable
- [Phase 8] panic_on_warn confirmed in kernel/panic.c: WARN triggers
  check_panic_on_warn() which calls panic()
- UNVERIFIED: Could not access lore.kernel.org due to Anubis bot
  protection; unable to verify full review discussion

**YES**

 drivers/spi/spi-tegra210-quad.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index 7ea5aa993596f..05f1753e2bbb1 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -1223,7 +1223,7 @@ static int tegra_qspi_combined_seq_xfer(struct tegra_qspi *tqspi,
 					(&tqspi->xfer_completion,
 					QSPI_DMA_TIMEOUT);

-			if (WARN_ON_ONCE(ret == 0)) {
+			if (ret == 0) {
 				/*
 				 * Check if hardware completed the transfer
 				 * even though interrupt was lost or delayed.
@@ -1232,6 +1232,7 @@ static int tegra_qspi_combined_seq_xfer(struct tegra_qspi *tqspi,
 				ret = tegra_qspi_handle_timeout(tqspi);
 				if (ret < 0) {
 					/* Real timeout - clean up and fail */
+					WARN_ON_ONCE(1);
 					dev_err(tqspi->dev, "transfer timeout\n");

 					/* Abort transfer by resetting pio/dma bit */
@@ -1340,7 +1341,7 @@ static int tegra_qspi_non_combined_seq_xfer(struct tegra_qspi *tqspi,

 		ret = wait_for_completion_timeout(&tqspi->xfer_completion,
 						  QSPI_DMA_TIMEOUT);
-		if (WARN_ON(ret == 0)) {
+		if (ret == 0) {
 			/*
 			 * Check if hardware completed the transfer even though
 			 * interrupt was lost or delayed. If so, process the
@@ -1349,6 +1350,7 @@ static int tegra_qspi_non_combined_seq_xfer(struct tegra_qspi *tqspi,
 			ret = tegra_qspi_handle_timeout(tqspi);
 			if (ret < 0) {
 				/* Real timeout - clean up and fail */
+				WARN_ON(1);
 				dev_err(tqspi->dev, "transfer timeout\n");

 				if (tqspi->is_curr_dma_xfer)
-- 
2.53.0

next      parent reply	other threads:[~2026-04-20 13:27 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20260420132314.1023554-1-sashal@kernel.org>
2026-04-20 13:19 ` Sasha Levin [this message]
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0] spi: rzv2h-rspi: Fix max_speed_hz advertising prohibited bit rate Sasha Levin
2026-04-20 13:20 ` [PATCH AUTOSEL 7.0] spi: stm32: fix rx DMA request error handling Sasha Levin

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:7ea5aa993596 dfblob:05f1753e2bbb )
 OR (
bs:"[PATCH AUTOSEL 7.0-6.19] spi: tegra210-quad: Fix false positive WARN on interrupt timeout with transfer complete" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260420132314.1023554-167-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=broonie@kernel.org \
    --cc=jonathanh@nvidia.com \
    --cc=ldewangan@nvidia.com \
    --cc=leitao@debian.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-spi@vger.kernel.org \
    --cc=linux-tegra@vger.kernel.org \
    --cc=patches@lists.linux.dev \
    --cc=skomatineni@nvidia.com \
    --cc=stable@vger.kernel.org \
    --cc=thierry.reding@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox