* [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ
@ 2026-01-16 10:41 Breno Leitao
2026-01-16 10:41 ` [PATCH 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer Breno Leitao
` (6 more replies)
0 siblings, 7 replies; 16+ messages in thread
From: Breno Leitao @ 2026-01-16 10:41 UTC (permalink / raw)
To: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, Mark Brown, Vishwaroop A
Cc: Thierry Reding, linux-tegra, linux-spi, linux-kernel,
Breno Leitao, kernel-team, puranjay, usamaarif642
The tegra-quad-spi driver is crashing on some hosts. Analysis revealed
the following failure sequence:
1) After running for a while, the interrupt gets marked as spurious:
irq 63: nobody cared (try booting with the "irqpoll" option)
Disabling IRQ #63
2) The IRQ handler (tegra_qspi_isr_thread->handle_cpu_based_xfer) is
responsible for signaling xfer_completion.
Once the interrupt is disabled, xfer_completion is never completed, causing
transfers to hit the timeout:
WARNING: CPU: 64 PID: 844224 at drivers/spi/spi-tegra210-quad.c:1222 tegra_qspi_transfer_one_message+0x7a0/0x9b0
3) The timeout handler completes the transfer:
tegra-qspi NVDA1513:00: QSPI interrupt timeout, but transfer complete
4) Later, the ISR thread finally runs and crashes trying to dereference
curr_xfer which the timeout handler already set to NULL:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
pc : handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad]
lr : tegra_qspi_handle_timeout+0xb4/0xf0 [spi_tegra210_quad]
Call trace:
handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad] (P)
Root cause analysis identified three issues:
1) Race condition on tqspi->curr_xfer
The curr_xfer pointer can change during ISR execution without proper
synchronization. The timeout path clears curr_xfer while the ISR
thread may still be accessing it.
This is trivially reproducible by decreasing QSPI_DMA_TIMEOUT and
adding instrumentation to tegra_qspi_isr_thread() to check curr_xfer
at entry and exit - the value changes mid-execution. I've used the
following test to reproduce this issue:
https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh
The existing comment in the ISR acknowledges this race but the
protection is insufficient:
/*
* Occasionally the IRQ thread takes a long time to wake up (usually
* when the CPU that it's running on is excessively busy) and we have
* already reached the timeout before and cleaned up the timed out
* transfer. Avoid any processing in that case and bail out early.
*/
This check is insufficient because tqspi->curr_xfer can still be set
to NULL after the unlocked check, while the ISR thread is using it.
2) Incorrect IRQ_NONE return causing spurious IRQ detection
When the timeout handler processes a transfer before the ISR thread
runs, tegra_qspi_isr_thread() returns IRQ_NONE.
After enough IRQ_NONE returns, the kernel marks the interrupt as spurious
and disables it - but these were legitimate interrupts that happened to be
processed by the timeout path first.
An interrupt handler should not return IRQ_NONE when the driver did in
fact handle the interrupt, even if it was handled via the timeout path.
3) Complex locking makes full protection difficult
Ideally the entire tqspi structure would be protected by tqspi->lock,
but handle_dma_based_xfer() calls wait_for_completion_interruptible_timeout()
which can sleep, preventing the lock from being held across the entire
ISR execution.
Usama Arif has some ideas here, and he can share more.
This patchset addresses these issues:
Return IRQ_HANDLED instead of IRQ_NONE when the timeout path has
already processed the transfer. Use the QSPI_RDY bit in
QSPI_TRANS_STATUS (same approach as tegra_qspi_handle_timeout()) to
distinguish real interrupts from truly spurious ones.
Protect all curr_xfer accesses with the spinlock, since interrupt
handling can run in parallel with the timeout and transfer paths.
This prevents the NULL pointer dereference by ensuring curr_xfer cannot
be cleared while being checked.
While this may not provide complete protection for all tqspi fields
(which might be necessary?!), it fixes the observed crashes and prevents
the spurious IRQ detection that was disabling the interrupt entirely.
This was tested with a simple TPM application, where the TPM lives
behind the tegra qspi driver:
https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh
Special thanks to Usama Arif for his help investigating the problem
and with the fixes.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Breno Leitao (6):
spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
spi: tegra210-quad: Move curr_xfer read inside spinlock
spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one
spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer
spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer
spi: tegra210-quad: Protect curr_xfer check in IRQ handler
drivers/spi/spi-tegra210-quad.c | 54 ++++++++++++++++++++++++++++++++++++++---
1 file changed, 50 insertions(+), 4 deletions(-)
---
base-commit: 9b7977f9e39b7768c70c2aa497f04e7569fd3e00
change-id: 20260112-tegra_xfer-6acb30a6720f
Best regards,
--
Breno Leitao <leitao@debian.org>
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
2026-01-16 10:41 [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
@ 2026-01-16 10:41 ` Breno Leitao
2026-01-16 11:33 ` Thierry Reding
2026-01-16 11:48 ` Usama Arif
2026-01-16 10:41 ` [PATCH 2/6] spi: tegra210-quad: Move curr_xfer read inside spinlock Breno Leitao
` (5 subsequent siblings)
6 siblings, 2 replies; 16+ messages in thread
From: Breno Leitao @ 2026-01-16 10:41 UTC (permalink / raw)
To: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, Mark Brown, Vishwaroop A
Cc: Thierry Reding, linux-tegra, linux-spi, linux-kernel,
Breno Leitao, kernel-team, puranjay, usamaarif642
When the ISR thread wakes up late and finds that the timeout handler
has already processed the transfer (curr_xfer is NULL), return
IRQ_HANDLED instead of IRQ_NONE.
Use a similar approach to tegra_qspi_handle_timeout() by reading
QSPI_TRANS_STATUS and checking the QSPI_RDY bit to determine if the
hardware actually completed the transfer. If QSPI_RDY is set, the
interrupt was legitimate and triggered by real hardware activity.
The fact that the timeout path handled it first doesn't make it
spurious. Returning IRQ_NONE incorrectly suggests the interrupt
wasn't for this device, which can cause issues with shared interrupt
lines and interrupt accounting.
Fixes: b4e002d8a7ce ("spi: tegra210-quad: Fix timeout handling")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
drivers/spi/spi-tegra210-quad.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index cdc3cb7c01f9..f0408c0b4b98 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -1552,15 +1552,30 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
static irqreturn_t tegra_qspi_isr_thread(int irq, void *context_data)
{
struct tegra_qspi *tqspi = context_data;
+ u32 status;
+
+ /*
+ * Read transfer status to check if interrupt was triggered by transfer
+ * completion
+ */
+ status = tegra_qspi_readl(tqspi, QSPI_TRANS_STATUS);
/*
* Occasionally the IRQ thread takes a long time to wake up (usually
* when the CPU that it's running on is excessively busy) and we have
* already reached the timeout before and cleaned up the timed out
* transfer. Avoid any processing in that case and bail out early.
+ *
+ * If no transfer is in progress, check if this was a real interrupt
+ * that the timeout handler already processed, or a spurious one.
*/
- if (!tqspi->curr_xfer)
- return IRQ_NONE;
+ if (!tqspi->curr_xfer) {
+ /* Spurious interrupt - transfer not ready */
+ if (!(status & QSPI_RDY))
+ return IRQ_NONE;
+ /* Real interrupt, already handled by timeout path */
+ return IRQ_HANDLED;
+ }
tqspi->status_reg = tegra_qspi_readl(tqspi, QSPI_FIFO_STATUS);
--
2.47.3
* [PATCH 2/6] spi: tegra210-quad: Move curr_xfer read inside spinlock
2026-01-16 10:41 [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
2026-01-16 10:41 ` [PATCH 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer Breno Leitao
@ 2026-01-16 10:41 ` Breno Leitao
2026-01-16 11:38 ` Thierry Reding
2026-01-16 10:41 ` [PATCH 3/6] spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one Breno Leitao
` (4 subsequent siblings)
6 siblings, 1 reply; 16+ messages in thread
From: Breno Leitao @ 2026-01-16 10:41 UTC (permalink / raw)
To: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, Mark Brown, Vishwaroop A
Cc: Thierry Reding, linux-tegra, linux-spi, linux-kernel,
Breno Leitao, kernel-team, puranjay, usamaarif642
Move the assignment of the transfer pointer from curr_xfer inside the
spinlock critical section in both handle_cpu_based_xfer() and
handle_dma_based_xfer().
Previously, curr_xfer was read before acquiring the lock, creating a
window where the timeout path could clear curr_xfer between reading it
and using it. By moving the read inside the lock, the handlers are
guaranteed to see a consistent value that cannot be modified by the
timeout path.
Fixes: 921fc1838fb0 ("spi: tegra210-quad: Add support for Tegra210 QSPI controller")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
drivers/spi/spi-tegra210-quad.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index f0408c0b4b98..ee291b9e9e9c 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -1440,10 +1440,11 @@ static int tegra_qspi_transfer_one_message(struct spi_controller *host,
static irqreturn_t handle_cpu_based_xfer(struct tegra_qspi *tqspi)
{
- struct spi_transfer *t = tqspi->curr_xfer;
+ struct spi_transfer *t;
unsigned long flags;
spin_lock_irqsave(&tqspi->lock, flags);
+ t = tqspi->curr_xfer;
if (tqspi->tx_status || tqspi->rx_status) {
tegra_qspi_handle_error(tqspi);
@@ -1474,7 +1475,7 @@ static irqreturn_t handle_cpu_based_xfer(struct tegra_qspi *tqspi)
static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
{
- struct spi_transfer *t = tqspi->curr_xfer;
+ struct spi_transfer *t;
unsigned int total_fifo_words;
unsigned long flags;
long wait_status;
@@ -1513,6 +1514,7 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
}
spin_lock_irqsave(&tqspi->lock, flags);
+ t = tqspi->curr_xfer;
if (num_errors) {
tegra_qspi_dma_unmap_xfer(tqspi, t);
--
2.47.3
* [PATCH 3/6] spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one
2026-01-16 10:41 [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
2026-01-16 10:41 ` [PATCH 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer Breno Leitao
2026-01-16 10:41 ` [PATCH 2/6] spi: tegra210-quad: Move curr_xfer read inside spinlock Breno Leitao
@ 2026-01-16 10:41 ` Breno Leitao
2026-01-16 10:41 ` [PATCH 4/6] spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer Breno Leitao
` (3 subsequent siblings)
6 siblings, 0 replies; 16+ messages in thread
From: Breno Leitao @ 2026-01-16 10:41 UTC (permalink / raw)
To: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, Mark Brown, Vishwaroop A
Cc: Thierry Reding, linux-tegra, linux-spi, linux-kernel,
Breno Leitao, kernel-team, puranjay, usamaarif642
When the timeout handler processes a completed transfer and signals
completion, the transfer thread can immediately set up the next transfer
and assign curr_xfer to point to it.
If a delayed ISR from the previous transfer then runs, it checks
if (!tqspi->curr_xfer) to detect stale interrupts (currently also
without holding the lock; fixed later in this series), but the check
passes because curr_xfer now points to the new transfer. The ISR then
incorrectly processes the new transfer's context.
Protect the curr_xfer assignment with the spinlock to ensure the ISR
either sees NULL (and bails out) or sees the new value only after the
assignment is complete.
Fixes: 921fc1838fb0 ("spi: tegra210-quad: Add support for Tegra210 QSPI controller")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
drivers/spi/spi-tegra210-quad.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index ee291b9e9e9c..15c110c00aca 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -839,6 +839,7 @@ static u32 tegra_qspi_setup_transfer_one(struct spi_device *spi, struct spi_tran
u32 command1, command2, speed = t->speed_hz;
u8 bits_per_word = t->bits_per_word;
u32 tx_tap = 0, rx_tap = 0;
+ unsigned long flags;
int req_mode;
if (!has_acpi_companion(tqspi->dev) && speed != tqspi->cur_speed) {
@@ -846,10 +847,12 @@ static u32 tegra_qspi_setup_transfer_one(struct spi_device *spi, struct spi_tran
tqspi->cur_speed = speed;
}
+ spin_lock_irqsave(&tqspi->lock, flags);
tqspi->cur_pos = 0;
tqspi->cur_rx_pos = 0;
tqspi->cur_tx_pos = 0;
tqspi->curr_xfer = t;
+ spin_unlock_irqrestore(&tqspi->lock, flags);
if (is_first_of_msg) {
tegra_qspi_mask_clear_irq(tqspi);
--
2.47.3
* [PATCH 4/6] spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer
2026-01-16 10:41 [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
` (2 preceding siblings ...)
2026-01-16 10:41 ` [PATCH 3/6] spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one Breno Leitao
@ 2026-01-16 10:41 ` Breno Leitao
2026-01-16 10:41 ` [PATCH 5/6] spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer Breno Leitao
` (2 subsequent siblings)
6 siblings, 0 replies; 16+ messages in thread
From: Breno Leitao @ 2026-01-16 10:41 UTC (permalink / raw)
To: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, Mark Brown, Vishwaroop A
Cc: Thierry Reding, linux-tegra, linux-spi, linux-kernel,
Breno Leitao, kernel-team, puranjay, usamaarif642
The curr_xfer field is read by the IRQ handler, without holding the
lock, to check whether a transfer is in progress. Protect the clearing
of curr_xfer, both in the combined sequence transfer loop and at the
exit path of tegra_qspi_combined_seq_xfer(), with the spinlock to
prevent a race with the interrupt handler.
Without this protection, the IRQ handler could read a partially updated
curr_xfer value, leading to NULL pointer dereference or use-after-free.
Fixes: b4e002d8a7ce ("spi: tegra210-quad: Fix timeout handling")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
drivers/spi/spi-tegra210-quad.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index 15c110c00aca..669e01d3f56a 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -1161,6 +1161,7 @@ static int tegra_qspi_combined_seq_xfer(struct tegra_qspi *tqspi,
u32 address_value = 0;
u32 cmd_config = 0, addr_config = 0;
u8 cmd_value = 0, val = 0;
+ unsigned long flags;
/* Enable Combined sequence mode */
val = tegra_qspi_readl(tqspi, QSPI_GLOBAL_CONFIG);
@@ -1264,13 +1265,17 @@ static int tegra_qspi_combined_seq_xfer(struct tegra_qspi *tqspi,
tegra_qspi_transfer_end(spi);
spi_transfer_delay_exec(xfer);
}
+ spin_lock_irqsave(&tqspi->lock, flags);
tqspi->curr_xfer = NULL;
+ spin_unlock_irqrestore(&tqspi->lock, flags);
transfer_phase++;
}
ret = 0;
exit:
+ spin_lock_irqsave(&tqspi->lock, flags);
tqspi->curr_xfer = NULL;
+ spin_unlock_irqrestore(&tqspi->lock, flags);
msg->status = ret;
return ret;
--
2.47.3
* [PATCH 5/6] spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer
2026-01-16 10:41 [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
` (3 preceding siblings ...)
2026-01-16 10:41 ` [PATCH 4/6] spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer Breno Leitao
@ 2026-01-16 10:41 ` Breno Leitao
2026-01-16 10:41 ` [PATCH 6/6] spi: tegra210-quad: Protect curr_xfer check in IRQ handler Breno Leitao
2026-01-20 11:22 ` [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Vishwaroop A
6 siblings, 0 replies; 16+ messages in thread
From: Breno Leitao @ 2026-01-16 10:41 UTC (permalink / raw)
To: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, Mark Brown, Vishwaroop A
Cc: Thierry Reding, linux-tegra, linux-spi, linux-kernel,
Breno Leitao, kernel-team, puranjay, usamaarif642
Protect the curr_xfer clearing in tegra_qspi_non_combined_seq_xfer()
with the spinlock to prevent a race with the interrupt handler that
reads this field to check if a transfer is in progress.
Fixes: b4e002d8a7ce ("spi: tegra210-quad: Fix timeout handling")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
drivers/spi/spi-tegra210-quad.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index 669e01d3f56a..79aeb80aa4a7 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -1288,6 +1288,7 @@ static int tegra_qspi_non_combined_seq_xfer(struct tegra_qspi *tqspi,
struct spi_transfer *transfer;
bool is_first_msg = true;
int ret = 0, val = 0;
+ unsigned long flags;
msg->status = 0;
msg->actual_length = 0;
@@ -1368,7 +1369,9 @@ static int tegra_qspi_non_combined_seq_xfer(struct tegra_qspi *tqspi,
msg->actual_length += xfer->len + dummy_bytes;
complete_xfer:
+ spin_lock_irqsave(&tqspi->lock, flags);
tqspi->curr_xfer = NULL;
+ spin_unlock_irqrestore(&tqspi->lock, flags);
if (ret < 0) {
tegra_qspi_transfer_end(spi);
--
2.47.3
* [PATCH 6/6] spi: tegra210-quad: Protect curr_xfer check in IRQ handler
2026-01-16 10:41 [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
` (4 preceding siblings ...)
2026-01-16 10:41 ` [PATCH 5/6] spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer Breno Leitao
@ 2026-01-16 10:41 ` Breno Leitao
2026-01-20 11:22 ` [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Vishwaroop A
6 siblings, 0 replies; 16+ messages in thread
From: Breno Leitao @ 2026-01-16 10:41 UTC (permalink / raw)
To: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, Mark Brown, Vishwaroop A
Cc: Thierry Reding, linux-tegra, linux-spi, linux-kernel,
Breno Leitao, kernel-team, puranjay, usamaarif642
Now that all other accesses to curr_xfer are done under the lock,
protect the curr_xfer NULL check in tegra_qspi_isr_thread() with the
spinlock. Without this protection, the following race can occur:
CPU0 (ISR thread) CPU1 (timeout path)
---------------- -------------------
if (!tqspi->curr_xfer)
// sees non-NULL
spin_lock()
tqspi->curr_xfer = NULL
spin_unlock()
handle_*_xfer()
spin_lock()
t = tqspi->curr_xfer // NULL!
... t->len ... // NULL dereference!
With this patch, all curr_xfer accesses are properly synchronized.
However, tegra_qspi_isr_thread() checks curr_xfer for NULL, releases
the lock, and only reacquires it later in
handle_cpu_based_xfer()/handle_dma_based_xfer(). curr_xfer could still
be updated in that window, which would cause a NULL pointer
dereference.
To handle this, add a NULL check inside the handlers after acquiring
the lock. This ensures that if the timeout path has already cleared
curr_xfer, the handler will safely return without dereferencing the
NULL pointer.
Fixes: b4e002d8a7ce ("spi: tegra210-quad: Fix timeout handling")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
drivers/spi/spi-tegra210-quad.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index 79aeb80aa4a7..c0443c986dd9 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -1457,6 +1457,11 @@ static irqreturn_t handle_cpu_based_xfer(struct tegra_qspi *tqspi)
spin_lock_irqsave(&tqspi->lock, flags);
t = tqspi->curr_xfer;
+ if (!t) {
+ spin_unlock_irqrestore(&tqspi->lock, flags);
+ return IRQ_HANDLED;
+ }
+
if (tqspi->tx_status || tqspi->rx_status) {
tegra_qspi_handle_error(tqspi);
complete(&tqspi->xfer_completion);
@@ -1527,6 +1532,11 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
spin_lock_irqsave(&tqspi->lock, flags);
t = tqspi->curr_xfer;
+ if (!t) {
+ spin_unlock_irqrestore(&tqspi->lock, flags);
+ return IRQ_HANDLED;
+ }
+
if (num_errors) {
tegra_qspi_dma_unmap_xfer(tqspi, t);
tegra_qspi_handle_error(tqspi);
@@ -1565,6 +1575,7 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
static irqreturn_t tegra_qspi_isr_thread(int irq, void *context_data)
{
struct tegra_qspi *tqspi = context_data;
+ unsigned long flags;
u32 status;
/*
@@ -1582,7 +1593,9 @@ static irqreturn_t tegra_qspi_isr_thread(int irq, void *context_data)
* If no transfer is in progress, check if this was a real interrupt
* that the timeout handler already processed, or a spurious one.
*/
+ spin_lock_irqsave(&tqspi->lock, flags);
if (!tqspi->curr_xfer) {
+ spin_unlock_irqrestore(&tqspi->lock, flags);
/* Spurious interrupt - transfer not ready */
if (!(status & QSPI_RDY))
return IRQ_NONE;
@@ -1599,7 +1612,12 @@ static irqreturn_t tegra_qspi_isr_thread(int irq, void *context_data)
tqspi->rx_status = tqspi->status_reg & (QSPI_RX_FIFO_OVF | QSPI_RX_FIFO_UNF);
tegra_qspi_mask_clear_irq(tqspi);
+ spin_unlock_irqrestore(&tqspi->lock, flags);
+ /*
+ * TODO: Keep tqspi->lock held for both handle functions below to avoid
+ * releasing and reacquiring the lock between calls.
+ */
if (!tqspi->is_curr_dma_xfer)
return handle_cpu_based_xfer(tqspi);
--
2.47.3
* Re: [PATCH 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
2026-01-16 10:41 ` [PATCH 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer Breno Leitao
@ 2026-01-16 11:33 ` Thierry Reding
2026-01-16 11:48 ` Usama Arif
1 sibling, 0 replies; 16+ messages in thread
From: Thierry Reding @ 2026-01-16 11:33 UTC (permalink / raw)
To: Breno Leitao
Cc: Jonathan Hunter, Sowjanya Komatineni, Laxman Dewangan, Mark Brown,
Vishwaroop A, Thierry Reding, linux-tegra, linux-spi,
linux-kernel, kernel-team, puranjay, usamaarif642
On Fri, Jan 16, 2026 at 02:41:41AM -0800, Breno Leitao wrote:
> When the ISR thread wakes up late and finds that the timeout handler
> has already processed the transfer (curr_xfer is NULL), return
> IRQ_HANDLED instead of IRQ_NONE.
>
> Use a similar approach to tegra_qspi_handle_timeout() by reading
> QSPI_TRANS_STATUS and checking the QSPI_RDY bit to determine if the
> hardware actually completed the transfer. If QSPI_RDY is set, the
> interrupt was legitimate and triggered by real hardware activity.
> The fact that the timeout path handled it first doesn't make it
> spurious. Returning IRQ_NONE incorrectly suggests the interrupt
> wasn't for this device, which can cause issues with shared interrupt
> lines and interrupt accounting.
>
> Fixes: b4e002d8a7ce ("spi: tegra210-quad: Fix timeout handling")
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> drivers/spi/spi-tegra210-quad.c | 19 +++++++++++++++++--
> 1 file changed, 17 insertions(+), 2 deletions(-)
Hi Breno,
we've seen reports of similar failure and we're reviewing a similar
series internally. I think this patch looks good, though, but I'd like
for Vishwaroop to take a look at this and compare with his notes.
Thierry
* Re: [PATCH 2/6] spi: tegra210-quad: Move curr_xfer read inside spinlock
2026-01-16 10:41 ` [PATCH 2/6] spi: tegra210-quad: Move curr_xfer read inside spinlock Breno Leitao
@ 2026-01-16 11:38 ` Thierry Reding
0 siblings, 0 replies; 16+ messages in thread
From: Thierry Reding @ 2026-01-16 11:38 UTC (permalink / raw)
To: Breno Leitao
Cc: Jonathan Hunter, Sowjanya Komatineni, Laxman Dewangan, Mark Brown,
Vishwaroop A, Thierry Reding, linux-tegra, linux-spi,
linux-kernel, kernel-team, puranjay, usamaarif642
On Fri, Jan 16, 2026 at 02:41:42AM -0800, Breno Leitao wrote:
> Move the assignment of the transfer pointer from curr_xfer inside the
> spinlock critical section in both handle_cpu_based_xfer() and
> handle_dma_based_xfer().
>
> Previously, curr_xfer was read before acquiring the lock, creating a
> window where the timeout path could clear curr_xfer between reading it
> and using it. By moving the read inside the lock, the handlers are
> guaranteed to see a consistent value that cannot be modified by the
> timeout path.
>
> Fixes: 921fc1838fb0 ("spi: tegra210-quad: Add support for Tegra210 QSPI controller")
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> drivers/spi/spi-tegra210-quad.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
I think this is the root problem of one of the crashes that were
reported. The problem seems to be that the high CPU load can lead to a
case where tqspi->curr_xfer is modified after being copied to the local
variable, and before the check. The window for this is very slim for the
CPU based transfer, but for DMA based transfers I can see that happening
more easily.
Acked-by: Thierry Reding <treding@nvidia.com>
* Re: [PATCH 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
2026-01-16 10:41 ` [PATCH 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer Breno Leitao
2026-01-16 11:33 ` Thierry Reding
@ 2026-01-16 11:48 ` Usama Arif
2026-01-16 12:06 ` Thierry Reding
1 sibling, 1 reply; 16+ messages in thread
From: Usama Arif @ 2026-01-16 11:48 UTC (permalink / raw)
To: Breno Leitao, Thierry Reding, Jonathan Hunter,
Sowjanya Komatineni, Laxman Dewangan, Mark Brown, Vishwaroop A
Cc: Thierry Reding, linux-tegra, linux-spi, linux-kernel, kernel-team,
puranjay
On 16/01/2026 10:41, Breno Leitao wrote:
> When the ISR thread wakes up late and finds that the timeout handler
> has already processed the transfer (curr_xfer is NULL), return
> IRQ_HANDLED instead of IRQ_NONE.
>
> Use a similar approach to tegra_qspi_handle_timeout() by reading
> QSPI_TRANS_STATUS and checking the QSPI_RDY bit to determine if the
> hardware actually completed the transfer. If QSPI_RDY is set, the
> interrupt was legitimate and triggered by real hardware activity.
> The fact that the timeout path handled it first doesn't make it
> spurious. Returning IRQ_NONE incorrectly suggests the interrupt
> wasn't for this device, which can cause issues with shared interrupt
> lines and interrupt accounting.
>
> Fixes: b4e002d8a7ce ("spi: tegra210-quad: Fix timeout handling")
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> drivers/spi/spi-tegra210-quad.c | 19 +++++++++++++++++--
> 1 file changed, 17 insertions(+), 2 deletions(-)
Hi,
I am not familiar with tegra SPI specifically, so I might be missing something, but
I was wondering if it's actually better to not handle the transfer in timeout at all.
We have 1000ms before timeout and I feel like that should have been enough.
Looking at other SPI drivers: imx, fsl, ... these don't handle transfers in timeout
and are therefore lockless. tegra_qspi_handle_timeout also checks at the end
whether the transfer failed for some reason (although it looks like handle_cpu/dma_based_xfer
can't really fail?), and would just return failure.
Removing the attempt to transfer in timeout will get rid of the spinlock and all the bugs,
make ISR handling quicker as you don't need to acquire a spinlock (which might lead to fewer
timeouts?), and makes the whole driver much simpler. Or maybe I am missing something?
I have a potential untested patch below for how it would look. We can work on testing this
if it makes sense to the maintainers?
commit 09af307fbb33717e8de9489cef2abd743e3f6a93
Author: Usama Arif <usamaarif642@gmail.com>
Date: Wed Jan 14 16:47:06 2026 +0000
spi: tegra210-quad: remove tqspi->lock and xfer handling in timeout
The tqspi->lock spinlock existed to protect against races between the
IRQ handler and the timeout recovery path, which could both call
handle_cpu_based_xfer() or handle_dma_based_xfer(). However, this
locking is not implemented correctly and is causing a race condition.
Looking at other SPI drivers that have an IRQ (e.g. spi-fsl-dspi.c),
it is simpler to just not handle transfers in the timeout path.
Remove xfer handling on timeout, eliminating the need for the lock
completely, simplifying the code and eliminating the race condition.
The existing code already handles xfer failure by returning -EIO.
This brings tegra210-quad in line with other drivers like
spi-fsl-dspi, spi-imx and spi-fsl-lpspi. Making the IRQ handling
lockless might also make it faster, which could lead to fewer
timeouts.
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index cdc3cb7c01f9b..ba67e628bd9ef 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -184,8 +184,6 @@ struct tegra_qspi_client_data {
struct tegra_qspi {
struct device *dev;
struct spi_controller *host;
- /* lock to protect data accessed by irq */
- spinlock_t lock;
struct clk *clk;
void __iomem *base;
@@ -968,7 +966,6 @@ static int tegra_qspi_setup(struct spi_device *spi)
{
struct tegra_qspi *tqspi = spi_controller_get_devdata(spi->controller);
struct tegra_qspi_client_data *cdata = spi->controller_data;
- unsigned long flags;
u32 val;
int ret;
@@ -982,8 +979,6 @@ static int tegra_qspi_setup(struct spi_device *spi)
cdata = tegra_qspi_parse_cdata_dt(spi);
spi->controller_data = cdata;
}
- spin_lock_irqsave(&tqspi->lock, flags);
-
/* keep default cs state to inactive */
val = tqspi->def_command1_reg;
val |= QSPI_CS_SEL(spi_get_chipselect(spi, 0));
@@ -995,8 +990,6 @@ static int tegra_qspi_setup(struct spi_device *spi)
tqspi->def_command1_reg = val;
tegra_qspi_writel(tqspi, tqspi->def_command1_reg, QSPI_COMMAND1);
- spin_unlock_irqrestore(&tqspi->lock, flags);
-
pm_runtime_put(tqspi->dev);
return 0;
@@ -1048,48 +1041,6 @@ static void tegra_qspi_transfer_end(struct spi_device *spi)
tegra_qspi_writel(tqspi, tqspi->def_command1_reg, QSPI_COMMAND1);
}
-static irqreturn_t handle_cpu_based_xfer(struct tegra_qspi *tqspi);
-static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi);
-
-/**
- * tegra_qspi_handle_timeout - Handle transfer timeout with hardware check
- * @tqspi: QSPI controller instance
- *
- * When a timeout occurs but hardware has completed the transfer (interrupt
- * was lost or delayed), manually trigger transfer completion processing.
- * This avoids failing transfers that actually succeeded.
- *
- * Returns: 0 if transfer was completed, -ETIMEDOUT if real timeout
- */
-static int tegra_qspi_handle_timeout(struct tegra_qspi *tqspi)
-{
- irqreturn_t ret;
- u32 status;
-
- /* Check if hardware actually completed the transfer */
- status = tegra_qspi_readl(tqspi, QSPI_TRANS_STATUS);
- if (!(status & QSPI_RDY))
- return -ETIMEDOUT;
-
- /*
- * Hardware completed but interrupt was lost/delayed. Manually
- * process the completion by calling the appropriate handler.
- */
- dev_warn_ratelimited(tqspi->dev,
- "QSPI interrupt timeout, but transfer complete\n");
-
- /* Clear the transfer status */
- status = tegra_qspi_readl(tqspi, QSPI_TRANS_STATUS);
- tegra_qspi_writel(tqspi, status, QSPI_TRANS_STATUS);
-
- /* Manually trigger completion handler */
- if (!tqspi->is_curr_dma_xfer)
- ret = handle_cpu_based_xfer(tqspi);
- else
- ret = handle_dma_based_xfer(tqspi);
-
- return (ret == IRQ_HANDLED) ? 0 : -EIO;
-}
static u32 tegra_qspi_cmd_config(bool is_ddr, u8 bus_width, u8 len)
{
@@ -1220,28 +1171,19 @@ static int tegra_qspi_combined_seq_xfer(struct tegra_qspi *tqspi,
QSPI_DMA_TIMEOUT);
if (WARN_ON_ONCE(ret == 0)) {
- /*
- * Check if hardware completed the transfer
- * even though interrupt was lost or delayed.
- * If so, process the completion and continue.
- */
- ret = tegra_qspi_handle_timeout(tqspi);
- if (ret < 0) {
- /* Real timeout - clean up and fail */
- dev_err(tqspi->dev, "transfer timeout\n");
-
- /* Abort transfer by resetting pio/dma bit */
- if (tqspi->is_curr_dma_xfer)
- tegra_qspi_dma_stop(tqspi);
- else
- tegra_qspi_pio_stop(tqspi);
-
- /* Reset controller if timeout happens */
- tegra_qspi_reset(tqspi);
-
- ret = -EIO;
- goto exit;
- }
+ dev_err(tqspi->dev, "transfer timeout\n");
+
+ /* Abort transfer by resetting pio/dma bit */
+ if (tqspi->is_curr_dma_xfer)
+ tegra_qspi_dma_stop(tqspi);
+ else
+ tegra_qspi_pio_stop(tqspi);
+
+ /* Reset controller if timeout happens */
+ tegra_qspi_reset(tqspi);
+
+ ret = -EIO;
+ goto exit;
}
if (tqspi->tx_status || tqspi->rx_status) {
@@ -1332,23 +1274,14 @@ static int tegra_qspi_non_combined_seq_xfer(struct tegra_qspi *tqspi,
ret = wait_for_completion_timeout(&tqspi->xfer_completion,
QSPI_DMA_TIMEOUT);
if (WARN_ON(ret == 0)) {
- /*
- * Check if hardware completed the transfer even though
- * interrupt was lost or delayed. If so, process the
- * completion and continue.
- */
- ret = tegra_qspi_handle_timeout(tqspi);
- if (ret < 0) {
- /* Real timeout - clean up and fail */
- dev_err(tqspi->dev, "transfer timeout\n");
+ dev_err(tqspi->dev, "transfer timeout\n");
- if (tqspi->is_curr_dma_xfer)
- tegra_qspi_dma_stop(tqspi);
+ if (tqspi->is_curr_dma_xfer)
+ tegra_qspi_dma_stop(tqspi);
- tegra_qspi_handle_error(tqspi);
- ret = -EIO;
- goto complete_xfer;
- }
+ tegra_qspi_handle_error(tqspi);
+ ret = -EIO;
+ goto complete_xfer;
}
if (tqspi->tx_status || tqspi->rx_status) {
@@ -1441,9 +1374,6 @@ static int tegra_qspi_transfer_one_message(struct spi_controller *host,
static irqreturn_t handle_cpu_based_xfer(struct tegra_qspi *tqspi)
{
struct spi_transfer *t = tqspi->curr_xfer;
- unsigned long flags;
-
- spin_lock_irqsave(&tqspi->lock, flags);
if (tqspi->tx_status || tqspi->rx_status) {
tegra_qspi_handle_error(tqspi);
@@ -1468,7 +1398,6 @@ static irqreturn_t handle_cpu_based_xfer(struct tegra_qspi *tqspi)
tegra_qspi_start_cpu_based_transfer(tqspi, t);
exit:
tqspi->curr_xfer = NULL;
- spin_unlock_irqrestore(&tqspi->lock, flags);
return IRQ_HANDLED;
}
@@ -1476,7 +1405,6 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
{
struct spi_transfer *t = tqspi->curr_xfer;
unsigned int total_fifo_words;
- unsigned long flags;
long wait_status;
int num_errors = 0;
@@ -1512,13 +1440,11 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
}
}
- spin_lock_irqsave(&tqspi->lock, flags);
-
if (num_errors) {
tegra_qspi_dma_unmap_xfer(tqspi, t);
tegra_qspi_handle_error(tqspi);
complete(&tqspi->xfer_completion);
- goto exit;
+ return IRQ_HANDLED;
}
if (tqspi->cur_direction & DATA_DIR_RX)
@@ -1532,7 +1458,7 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
if (tqspi->cur_pos == t->len) {
tegra_qspi_dma_unmap_xfer(tqspi, t);
complete(&tqspi->xfer_completion);
- goto exit;
+ return IRQ_HANDLED;
}
tegra_qspi_dma_unmap_xfer(tqspi, t);
@@ -1544,8 +1470,6 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
else
num_errors = tegra_qspi_start_cpu_based_transfer(tqspi, t);
-exit:
- spin_unlock_irqrestore(&tqspi->lock, flags);
return IRQ_HANDLED;
}
@@ -1679,7 +1603,6 @@ static int tegra_qspi_probe(struct platform_device *pdev)
tqspi->host = host;
tqspi->dev = &pdev->dev;
- spin_lock_init(&tqspi->lock);
tqspi->soc_data = device_get_match_data(&pdev->dev);
host->num_chipselect = tqspi->soc_data->cs_count;
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
2026-01-16 11:48 ` Usama Arif
@ 2026-01-16 12:06 ` Thierry Reding
0 siblings, 0 replies; 16+ messages in thread
From: Thierry Reding @ 2026-01-16 12:06 UTC (permalink / raw)
To: Usama Arif
Cc: Breno Leitao, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, Mark Brown, Vishwaroop A, Thierry Reding,
linux-tegra, linux-spi, linux-kernel, kernel-team, puranjay
On Fri, Jan 16, 2026 at 11:48:44AM +0000, Usama Arif wrote:
>
>
> On 16/01/2026 10:41, Breno Leitao wrote:
> > When the ISR thread wakes up late and finds that the timeout handler
> > has already processed the transfer (curr_xfer is NULL), return
> > IRQ_HANDLED instead of IRQ_NONE.
> >
> > Use a similar approach to tegra_qspi_handle_timeout() by reading
> > QSPI_TRANS_STATUS and checking the QSPI_RDY bit to determine if the
> > hardware actually completed the transfer. If QSPI_RDY is set, the
> > interrupt was legitimate and triggered by real hardware activity.
> > The fact that the timeout path handled it first doesn't make it
> > spurious. Returning IRQ_NONE incorrectly suggests the interrupt
> > wasn't for this device, which can cause issues with shared interrupt
> > lines and interrupt accounting.
> >
> > Fixes: b4e002d8a7ce ("spi: tegra210-quad: Fix timeout handling")
> > Signed-off-by: Breno Leitao <leitao@debian.org>
> > ---
> > drivers/spi/spi-tegra210-quad.c | 19 +++++++++++++++++--
> > 1 file changed, 17 insertions(+), 2 deletions(-)
>
> Hi,
>
> I am not familiar with tegra SPI specifically, so I might be missing something, but
> I was wondering if it's actually better to not handle the transfer in the timeout path at all.
>
> We have 1000ms before the timeout, and I feel like that should have been enough.
> Looking at other SPI drivers (imx, fsl, ...): these don't handle transfers in the
> timeout path and are therefore lockless. tegra_qspi_handle_timeout also checks at
> the end whether the transfer failed for some reason (although it looks like
> handle_cpu/dma_based_xfer can't really fail?), and would just return failure.
>
> Removing the attempt to complete the transfer in the timeout path gets rid of the
> spinlock and all the bugs, makes ISR handling quicker since you don't need to acquire
> a spinlock (which might lead to fewer timeouts?), and makes the whole driver much
> simpler. Or maybe I am missing something?
>
> I have a potential, untested patch below showing how it would look. We can work on
> testing it if it makes sense to the maintainers.
These issues on Tegra are subtle. The primary reason for these latest
rounds of fixes is that under typical circumstances there are no issues
because transfers complete normally.
However, it turns out that if there's enough load on CPU 0, the threaded
IRQ handler ends up running much too late (we've seen something on the
order of several seconds and in extreme cases even 10s of seconds).
Granted, these are extreme circumstances, such as when an erroneous
driver blocks CPU 0 for prolonged amounts of time during boot (in this
case it was a display driver reading EDID and using mdelay() in a loop)
or high CPU load at runtime caused by things like NVMe I/O stress tests
and such.
I think the root problem is that we rely entirely on the threaded IRQ
handler, which was fine for some of the more classic cases for this
driver but it ends up breaking in these new cases. The problem is that
most of these transfers actually do complete at the hardware level, i.e.
the data ends up in the SPI client chip; it's just that we get the
notification too late. We've verified this by testing with a hard IRQ
handler and that one is never late. The threaded IRQ is the problem, but
we also need threading here because of all the DMA handling, etc.
Given that the transfers at a hardware level complete, we cannot simply
return failure on timeout. The recovery mechanism implemented is the
simplest way to make sure software and hardware state remain consistent.
Unfortunately it's riddled with bugs right now.
So I don't think getting rid of the failure/timeout handling and the
locking is the right move here. I think maybe a significant improvement
would be to not rely on the threaded IRQ handler as much, but instead
employ a combination of a hard IRQ handler and a work queue to defer the
actual processing to. That might also be a bit overkill because the SPI
transfers are already serialized by the core, so we're only ever going
to have a single transfer in flight.
Thierry
* Re: [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ
2026-01-16 10:41 [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
` (5 preceding siblings ...)
2026-01-16 10:41 ` [PATCH 6/6] spi: tegra210-quad: Protect curr_xfer check in IRQ handler Breno Leitao
@ 2026-01-20 11:22 ` Vishwaroop A
2026-01-20 16:49 ` Breno Leitao
6 siblings, 1 reply; 16+ messages in thread
From: Vishwaroop A @ 2026-01-20 11:22 UTC (permalink / raw)
To: leitao
Cc: thierry.reding, treding, jonathanh, skomatineni, ldewangan,
broonie, linux-spi, linux-tegra, linux-kernel, kernel-team,
puranjay, usamaarif642, Vishwaroop A
Hi Breno,
Thanks for posting this series. I've been working closely with our team at
NVIDIA on this exact issue after the reports from Meta's fleet.
Your analysis is correct. I have some technical feedback on the series:
**Patches 1-2 (IRQ_HANDLED + spinlock for curr_xfer):**
These directly address the race we identified. The spinlock-protected read of
curr_xfer in Patch 2 mirrors the fix we developed internally and resolves the
TOCTOU race between the timeout handler and the delayed ISR thread.
**Patch 3 (Spinlock in tegra_qspi_setup_transfer_one): Improves formal correctness**
While our original position was that this lock isn't strictly required due to
call ordering (setup completes before HW start, so no IRQ can fire yet), I
agree that explicit locking here improves maintainability. It makes the
synchronization model clearer for future readers and removes any dependency on
implicit ordering guarantees.
**Patch 4 (device_reset_optional):**
Your observation about ACPI platforms lacking _RST is accurate. On server
platforms, device_reset() fails and logs unnecessary errors.
device_reset_optional() is the right semantic here. I'd suggest coordinating
with the device core maintainers (Greg KH, Rafael Wysocki) to ensure this API
addition gets proper review; it's useful beyond just QSPI.
**Patch 5 (synchronize_irq):**
This ensures any in-flight ISR thread completes before we proceed with
recovery. It closes a potential gap where a delayed ISR could access state
after a reset. Combined with the spinlock in Patch 2, this provides robust
protection.
**Patch 6 (Timeout error path): logic issue, needs rework**
I see a problem here. If QSPI_RDY is not set (hardware actually timed out),
you return success (0) from tegra_qspi_handle_timeout(). This causes
__spi_transfer_message_noqueue() to call spi_finalize_current_message() with
status = 0, signaling success to the client when the transfer actually failed.
The correct behavior should be:
- If QSPI_RDY is set: Hardware completed, recover the data, return success (0)
- If QSPI_RDY is NOT set: True timeout, reset hardware, return error (-ETIMEDOUT)
The current logic inverts this. I'd suggest:
if (fifo_status & QSPI_RDY) {
/* HW completed, recover */
handle_cpu/dma_based_xfer();
return 0;
}
/* True timeout */
dev_err(..., "Transfer timeout");
tegra_qspi_handle_error(tqspi);
return -ETIMEDOUT;
---
I saw your tpm_torture_test.sh in the cover letter. We haven't been able to
reproduce the issue locally yet; it seems to require the specific TPM device
configuration and CPU load patterns present in Meta's fleet. If you have
additional details on the reproduction environment (TPM vendor/model, specific
workload characteristics, CPU affinity settings), that would help us validate
the fix on our end.
---
I'm happy to:
- Test this series on our hardware platforms
- Collaborate on v2 with the fixes above
- I will work on hard IRQ handler follow-up that Thierry suggested for
long-term robustness
Let me know how you'd like to coordinate. Thanks again for tackling this; it's
been a high-priority issue for our server customers.
Best regards,
Vishwaroop
* Re: [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ
2026-01-20 11:22 ` [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Vishwaroop A
@ 2026-01-20 16:49 ` Breno Leitao
2026-01-20 16:52 ` Mark Brown
2026-01-21 17:56 ` Vishwaroop A
0 siblings, 2 replies; 16+ messages in thread
From: Breno Leitao @ 2026-01-20 16:49 UTC (permalink / raw)
To: Vishwaroop A
Cc: thierry.reding, treding, jonathanh, skomatineni, ldewangan,
broonie, linux-spi, linux-tegra, linux-kernel, kernel-team,
puranjay, usamaarif642
Hello Vishwaroop,
First of all, thanks for your answer and help.
On Tue, Jan 20, 2026 at 11:22:42AM +0000, Vishwaroop A wrote:
> Thanks for posting this series. I've been working closely with our team at
> NVIDIA on this exact issue after the reports from Meta's fleet.
>
> Your analysis is correct, I have some technical feedback on the series:
>
> **Patches 1-2 (IRQ_HANDLED + spinlock for curr_xfer):
>
> These directly address the race we identified. The spinlock-protected read of
> curr_xfer in Patch 2 mirrors the fix we developed internally and resolves the
> TOCTOU race between the timeout handler and the delayed ISR thread.
>
> **Patch 3 (Spinlock in tegra_qspi_setup_transfer_one): Improves formal correctness**
>
> While our original position was that this lock isn't strictly required due to
> call ordering (setup completes before HW start, so no IRQ can fire yet), I
> agree that explicit locking here improves maintainability. It makes the
> synchronization model clearer for future readers and removes any dependency on
> implicit ordering guarantees.
>
> **Patch 4 (device_reset_optional):
>
> Your observation about ACPI platforms lacking _RST is accurate. On server
> platforms, device_reset() fails and logs unnecessary errors.
> device_reset_optional() is the right semantic here. I'd suggest coordinating
> with the device core maintainers (Greg KH, Rafael Wysocki) to ensure this API
> addition gets proper review—it's useful beyond just QSPI.
We are counting patch numbers differently. Patch 4 in this patchset is called
"[PATCH 4/6] spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer",
and basically protect the curr_xfer clearing at the exit path of
tegra_qspi_combined_seq_xfer() with the spinlock to prevent a race with the
interrupt handler that reads this field.
https://lore.kernel.org/all/20260116-tegra_xfer-v1-4-02d96c790619@debian.org/
The discussion about device_reset_optional() happened a while ago (March 2025),
but it never landed. Do you want me to include it in this patchset?
Here is the discussion link:
https://lore.kernel.org/all/20250317-tegra-v1-1-78474efc0386@debian.org/
> **Patch 6 (Timeout error path): Logic issue—needs rework**
I understand you are talking about patch "[PATCH 1/6] spi:
tegra210-quad: Return IRQ_HANDLED when timeout already processed
transfer", right?
https://lore.kernel.org/all/20260116-tegra_xfer-v1-1-02d96c790619@debian.org/
> I see a problem here. If QSPI_RDY is not set (hardware actually timed out),
> you return success (0) from tegra_qspi_handle_timeout(). This causes
> __spi_transfer_message_noqueue() to call spi_finalize_current_message() with
> status = 0, signaling success to the client when the transfer actually failed.
>
> The correct behavior should be:
> - If QSPI_RDY is set: Hardware completed, recover the data, return success (0)
> - If QSPI_RDY is NOT set: True timeout, reset hardware, return error (-ETIMEDOUT)
Thanks for the heads-up.
> The current logic inverts this. I'd suggest:
>
> if (fifo_status & QSPI_RDY) {
> /* HW completed, recover */
> handle_cpu/dma_based_xfer();
> return 0;
> }
> /* True timeout */
> dev_err(..., "Transfer timeout");
> tegra_qspi_handle_error(tqspi);
> return -ETIMEDOUT;
I am not sure we can return -ETIMEDOUT here, given that the function's
return type is irqreturn_t, which is:
/**
* enum irqreturn - irqreturn type values
* @IRQ_NONE: interrupt was not from this device or was not handled
* @IRQ_HANDLED: interrupt was handled by this device
* @IRQ_WAKE_THREAD: handler requests to wake the handler thread
*/
enum irqreturn {
IRQ_NONE = (0 << 0),
IRQ_HANDLED = (1 << 0),
IRQ_WAKE_THREAD = (1 << 1),
};
What about something as:
diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index cdc3cb7c01f9b..028b0b0cfdb25 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -1552,15 +1552,30 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
static irqreturn_t tegra_qspi_isr_thread(int irq, void *context_data)
{
struct tegra_qspi *tqspi = context_data;
+ u32 status;
+
+ /*
+ * Read transfer status to check if interrupt was triggered by transfer
+ * completion
+ */
+ status = tegra_qspi_readl(tqspi, QSPI_TRANS_STATUS);
/*
* Occasionally the IRQ thread takes a long time to wake up (usually
* when the CPU that it's running on is excessively busy) and we have
* already reached the timeout before and cleaned up the timed out
* transfer. Avoid any processing in that case and bail out early.
+ *
+ * If no transfer is in progress, check if this was a real interrupt
+ * that the timeout handler already processed, or a spurious one.
*/
- if (!tqspi->curr_xfer)
+ if (!tqspi->curr_xfer) {
+ /* Spurious interrupt - transfer not ready */
+ if (!(status & QSPI_RDY))
+ return IRQ_HANDLED;
+ /* Real interrupt, already handled by timeout path */
return IRQ_NONE;
+ }
tqspi->status_reg = tegra_qspi_readl(tqspi, QSPI_FIFO_STATUS);
> I saw your tpm_torture_test.sh in the cover letter. We haven't been able to
> reproduce the issue locally yet—it seems to require the specific TPM device
> configuration and CPU load patterns present in Meta's fleet.
You can try to simulate a timeout using something like:
diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index cdc3cb7c01f9..7116c876c62b 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -159,7 +159,8 @@
#define DATA_DIR_TX BIT(0)
#define DATA_DIR_RX BIT(1)
-#define QSPI_DMA_TIMEOUT (msecs_to_jiffies(1000))
+/* INSTRUMENTATION: Reduced from 1000ms to 20ms to trigger timeouts */
+#define QSPI_DMA_TIMEOUT (msecs_to_jiffies(20))
#define DEFAULT_QSPI_DMA_BUF_LEN SZ_64K
enum tegra_qspi_transfer_type {
@@ -1552,6 +1553,10 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
static irqreturn_t tegra_qspi_isr_thread(int irq, void *context_data)
{
struct tegra_qspi *tqspi = context_data;
+ mdelay(10);
/*
* Occasionally the IRQ thread takes a long time to wake up (usually
I've also uploaded another stress test at
https://github.com/leitao/debug/tree/main/arm/tegra. Those are the
stress tests I am using to reproduce this issue.
> If you have
> additional details on the reproduction environment (TPM vendor/model, specific
> workload characteristics, CPU affinity settings), that would help us validate
> the fix on our end.
I am not doing anything special; here is a bit more about my host:
# dmesg | grep tegra
[ 8.903792] tegra-qspi NVDA1513:00: cannot use DMA: -19
[ 8.903806] tegra-qspi NVDA1513:00: falling back to PIO
[ 8.903811] tegra-qspi NVDA1513:00: device reset failed
# udevadm info -a /dev/tpm0 | cat
looking at device '/devices/platform/NVDA1513:00/spi_master/spi0/spi-PRP0001:01/tpm/tpm0':
KERNEL=="tpm0"
SUBSYSTEM=="tpm"
DRIVER==""
ATTR{null_name}=="000bb4c148f37daef5c917ebdd9267edad857263b872d9a9cd05f2af96c060212e6f"
ATTR{pcr-sha256/0}=="9CC1646906FD362B5F6D182CBADE226057CD32A6F5D6C6197A49AA78B4241929"
ATTR{pcr-sha256/1}=="BFB3D60EA045EE30E924F239C812E11D7D02D380D94B1A53CDF8E336F4BD5EFF"
ATTR{pcr-sha256/10}=="0000000000000000000000000000000000000000000000000000000000000000"
ATTR{pcr-sha256/11}=="0000000000000000000000000000000000000000000000000000000000000000"
ATTR{pcr-sha256/12}=="0000000000000000000000000000000000000000000000000000000000000000"
ATTR{pcr-sha256/13}=="0000000000000000000000000000000000000000000000000000000000000000"
ATTR{pcr-sha256/14}=="0000000000000000000000000000000000000000000000000000000000000000"
ATTR{pcr-sha256/15}=="0000000000000000000000000000000000000000000000000000000000000000"
ATTR{pcr-sha256/16}=="0000000000000000000000000000000000000000000000000000000000000000"
ATTR{pcr-sha256/17}=="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
ATTR{pcr-sha256/18}=="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
ATTR{pcr-sha256/19}=="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
ATTR{pcr-sha256/2}=="ECE24999889A3DBA6BDC392ED64CA0B6A968DFCAF46D8DBDDAD57A7A0423AD2C"
ATTR{pcr-sha256/20}=="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
ATTR{pcr-sha256/21}=="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
ATTR{pcr-sha256/22}=="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
ATTR{pcr-sha256/23}=="0000000000000000000000000000000000000000000000000000000000000000"
ATTR{pcr-sha256/3}=="3D458CFE55CC03EA1F443F1562BEEC8DF51C75E14A9FCF9A7234A13F198E7969"
ATTR{pcr-sha256/4}=="7E7811C1F6A74895706FFFBED8180452AABA032209D708A7B08066B16F38524F"
ATTR{pcr-sha256/5}=="D679AB040FA69F51A392DC467BA09AC0A5040FAAD386A9DA99A23C465DB0BCE1"
ATTR{pcr-sha256/6}=="3D458CFE55CC03EA1F443F1562BEEC8DF51C75E14A9FCF9A7234A13F198E7969"
ATTR{pcr-sha256/7}=="65CAF8DD1E0EA7A6347B635D2B379C93B9A1351EDC2AFC3ECDA700E534EB3068"
ATTR{pcr-sha256/8}=="0000000000000000000000000000000000000000000000000000000000000000"
ATTR{pcr-sha256/9}=="94BACCCA95B21E689C38511D4FE26D2DCB1E2C20CD5CECC5C4F2FC99C28452BF"
ATTR{power/control}=="auto"
ATTR{power/runtime_active_time}=="0"
ATTR{power/runtime_status}=="unsupported"
ATTR{power/runtime_suspended_time}=="0"
ATTR{tpm_version_major}=="2"
looking at parent device '/devices/platform/NVDA1513:00/spi_master/spi0/spi-PRP0001:01':
KERNELS=="spi-PRP0001:01"
SUBSYSTEMS=="spi"
DRIVERS=="tpm_tis_spi"
ATTRS{driver_override}==""
ATTRS{power/control}=="auto"
ATTRS{power/runtime_active_time}=="0"
ATTRS{power/runtime_status}=="unsupported"
ATTRS{power/runtime_suspended_time}=="0"
ATTRS{statistics/bytes}=="0"
ATTRS{statistics/bytes_rx}=="0"
ATTRS{statistics/bytes_tx}=="0"
ATTRS{statistics/errors}=="0"
ATTRS{statistics/messages}=="0"
ATTRS{statistics/spi_async}=="0"
ATTRS{statistics/spi_sync}=="3542838"
ATTRS{statistics/spi_sync_immediate}=="3542838"
ATTRS{statistics/timedout}=="0"
ATTRS{statistics/transfer_bytes_histo_0-1}=="0"
ATTRS{statistics/transfer_bytes_histo_1024-2047}=="0"
ATTRS{statistics/transfer_bytes_histo_128-255}=="0"
ATTRS{statistics/transfer_bytes_histo_16-31}=="0"
ATTRS{statistics/transfer_bytes_histo_16384-32767}=="0"
ATTRS{statistics/transfer_bytes_histo_2-3}=="0"
ATTRS{statistics/transfer_bytes_histo_2048-4095}=="0"
ATTRS{statistics/transfer_bytes_histo_256-511}=="0"
ATTRS{statistics/transfer_bytes_histo_32-63}=="0"
ATTRS{statistics/transfer_bytes_histo_32768-65535}=="0"
ATTRS{statistics/transfer_bytes_histo_4-7}=="0"
ATTRS{statistics/transfer_bytes_histo_4096-8191}=="0"
ATTRS{statistics/transfer_bytes_histo_512-1023}=="0"
ATTRS{statistics/transfer_bytes_histo_64-127}=="0"
ATTRS{statistics/transfer_bytes_histo_65536+}=="0"
ATTRS{statistics/transfer_bytes_histo_8-15}=="0"
ATTRS{statistics/transfer_bytes_histo_8192-16383}=="0"
ATTRS{statistics/transfers}=="0"
ATTRS{statistics/transfers_split_maxsize}=="0"
looking at parent device '/devices/platform/NVDA1513:00/spi_master/spi0':
KERNELS=="spi0"
SUBSYSTEMS=="spi_master"
DRIVERS==""
ATTRS{power/control}=="auto"
ATTRS{power/runtime_active_time}=="0"
ATTRS{power/runtime_status}=="unsupported"
ATTRS{power/runtime_suspended_time}=="0"
ATTRS{statistics/bytes}=="0"
ATTRS{statistics/bytes_rx}=="0"
ATTRS{statistics/bytes_tx}=="0"
ATTRS{statistics/errors}=="0"
ATTRS{statistics/messages}=="0"
ATTRS{statistics/spi_async}=="0"
ATTRS{statistics/spi_sync}=="3542838"
ATTRS{statistics/spi_sync_immediate}=="3542838"
ATTRS{statistics/timedout}=="0"
ATTRS{statistics/transfer_bytes_histo_0-1}=="0"
ATTRS{statistics/transfer_bytes_histo_1024-2047}=="0"
ATTRS{statistics/transfer_bytes_histo_128-255}=="0"
ATTRS{statistics/transfer_bytes_histo_16-31}=="0"
ATTRS{statistics/transfer_bytes_histo_16384-32767}=="0"
ATTRS{statistics/transfer_bytes_histo_2-3}=="0"
ATTRS{statistics/transfer_bytes_histo_2048-4095}=="0"
ATTRS{statistics/transfer_bytes_histo_256-511}=="0"
ATTRS{statistics/transfer_bytes_histo_32-63}=="0"
ATTRS{statistics/transfer_bytes_histo_32768-65535}=="0"
ATTRS{statistics/transfer_bytes_histo_4-7}=="0"
ATTRS{statistics/transfer_bytes_histo_4096-8191}=="0"
ATTRS{statistics/transfer_bytes_histo_512-1023}=="0"
ATTRS{statistics/transfer_bytes_histo_64-127}=="0"
ATTRS{statistics/transfer_bytes_histo_65536+}=="0"
ATTRS{statistics/transfer_bytes_histo_8-15}=="0"
ATTRS{statistics/transfer_bytes_histo_8192-16383}=="0"
ATTRS{statistics/transfers}=="0"
ATTRS{statistics/transfers_split_maxsize}=="0"
looking at parent device '/devices/platform/NVDA1513:00':
KERNELS=="NVDA1513:00"
SUBSYSTEMS=="platform"
DRIVERS=="tegra-qspi"
ATTRS{driver_override}=="(null)"
ATTRS{power/control}=="auto"
ATTRS{power/runtime_active_time}=="86751"
ATTRS{power/runtime_status}=="suspended"
ATTRS{power/runtime_suspended_time}=="340179486"
looking at parent device '/devices/platform':
KERNELS=="platform"
SUBSYSTEMS==""
DRIVERS==""
ATTRS{power/control}=="auto"
ATTRS{power/runtime_active_time}=="0"
ATTRS{power/runtime_status}=="unsupported"
ATTRS{power/runtime_suspended_time}=="0"
> I'm happy to:
> - Test this series on our hardware platforms
Please test with the patch above (reduced QSPI_DMA_TIMEOUT and the mdelay).
> - Collaborate on v2 with the fixes above
Thanks. I understand that the only concern with v1 is that the QSPI_RDY check
is inverted, which I can fix and send in a v2. Is there anything else you want
fixed in v2?
> - I will work on hard IRQ handler follow-up that Thierry suggested for
> long-term robustness
This is awesome, appreciate!
Thanks for your support,
--breno
* Re: [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ
2026-01-20 16:49 ` Breno Leitao
@ 2026-01-20 16:52 ` Mark Brown
2026-01-21 17:56 ` Vishwaroop A
1 sibling, 0 replies; 16+ messages in thread
From: Mark Brown @ 2026-01-20 16:52 UTC (permalink / raw)
To: Breno Leitao
Cc: Vishwaroop A, thierry.reding, treding, jonathanh, skomatineni,
ldewangan, linux-spi, linux-tegra, linux-kernel, kernel-team,
puranjay, usamaarif642
On Tue, Jan 20, 2026 at 08:49:23AM -0800, Breno Leitao wrote:
> On Tue, Jan 20, 2026 at 11:22:42AM +0000, Vishwaroop A wrote:
> - if (!tqspi->curr_xfer)
> + if (!tqspi->curr_xfer) {
> + /* Spurious interrupt - transfer not ready */
> + if (!(status & QSPI_RDY))
> + return IRQ_HANDLED;
> + /* Real interrupt, already handled by timeout path */
> return IRQ_NONE;
> + }
IRQ_NONE means that there was no interrupt flagged by the hardware, if
there was an interrupt you should return IRQ_HANDLED. You might confuse
genirq if you flag it spuriously.
* Re: [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ
2026-01-20 16:49 ` Breno Leitao
2026-01-20 16:52 ` Mark Brown
@ 2026-01-21 17:56 ` Vishwaroop A
2026-01-22 17:04 ` Breno Leitao
1 sibling, 1 reply; 16+ messages in thread
From: Vishwaroop A @ 2026-01-21 17:56 UTC (permalink / raw)
To: leitao
Cc: thierry.reding, treding, jonathanh, skomatineni, ldewangan,
broonie, linux-spi, linux-tegra, linux-kernel, kernel-team,
puranjay, usamaarif642, Vishwaroop A
Hi Breno,
After reviewing Mark Brown's feedback and the code carefully, let me clarify the
correct logic. This is important to get right.
**IRQ Handler Semantics (per Mark Brown):**
- IRQ_NONE = interrupt was NOT from this device
- IRQ_HANDLED = interrupt WAS from this device (regardless of whether we fully processed it)
**The QSPI_RDY Bit:**
This bit in QSPI_TRANS_STATUS is set by hardware when a transfer completes and triggers
the interrupt. Software clears it by writing 1.
**Why Your Original v1 Logic is Correct:**
Your "[PATCH 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed
transfer" reads QSPI_TRANS_STATUS at the start of tegra_qspi_isr_thread():
if (!tqspi->curr_xfer) {
if (!(status & QSPI_RDY))
return IRQ_NONE; // HW never set RDY → spurious interrupt
return IRQ_HANDLED; // HW did set RDY → real interrupt, timeout processed it
}
**Scenario 1 - Delayed ISR (the race you're fixing):**
1. HW completes transfer, sets QSPI_RDY, interrupt fires
2. ISR thread delayed (CPU busy)
3. Timeout handler runs, processes transfer, clears curr_xfer
4. Delayed ISR finally wakes up
5. Reads QSPI_RDY (may still be set)
6. curr_xfer is NULL
7. Return IRQ_HANDLED → this WAS our interrupt, just processed by timeout
**Scenario 2 - Truly Spurious:**
1. Spurious interrupt fires
2. QSPI_RDY = 0 (no transfer completed)
3. curr_xfer is NULL
4. Return IRQ_NONE → not our interrupt
**Your Latest Proposal Has It Backwards:**
The version in your last email returns IRQ_HANDLED when QSPI_RDY is NOT set, which is
incorrect per Mark's feedback.
**For v2:**
Patches 1-5: Keep as-is from v1 (all correct)
Patch 6 ("[PATCH 6/6] spi: tegra210-quad: Protect curr_xfer check in IRQ handler"):
The TODO comment you added asks about keeping the lock held across the handler call. I'd
suggest removing the TODO and replacing it with a comment explaining why the current
approach is safe:
spin_unlock_irqrestore(&tqspi->lock, flags);
/*
* Lock is released here but handlers safely re-check curr_xfer under lock
* before dereferencing. DMA handler also needs to sleep in
* wait_for_completion_*(), which cannot be done while holding spinlock.
*/
if (!tqspi->is_curr_dma_xfer)
return handle_cpu_based_xfer(tqspi);
This documents the design decision and closes the TODO.
The device_reset_optional() was from your March 2025 series - keep that separate.
**Testing:**
Carol Soto will validate v2 with your test methodology and provide feedback.
**Follow-on:**
I'll implement hard IRQ handler support separately after your fix merges.
Best,
Vishwaroop
* Re: [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ
2026-01-21 17:56 ` Vishwaroop A
@ 2026-01-22 17:04 ` Breno Leitao
0 siblings, 0 replies; 16+ messages in thread
From: Breno Leitao @ 2026-01-22 17:04 UTC (permalink / raw)
To: Vishwaroop A
Cc: thierry.reding, treding, jonathanh, skomatineni, ldewangan,
broonie, linux-spi, linux-tegra, linux-kernel, kernel-team,
puranjay, usamaarif642
Hello Vishwaroop,
On Wed, Jan 21, 2026 at 05:56:17PM +0000, Vishwaroop A wrote:
> Hi Breno,
>
> After reviewing Mark Brown's feedback and the code carefully, let me clarify the
> correct logic. This is important to get right.
>
> **IRQ Handler Semantics (per Mark Brown):**
> - IRQ_NONE = interrupt was NOT from this device
> - IRQ_HANDLED = interrupt WAS from this device (regardless of whether we fully processed it)
>
> **The QSPI_RDY Bit:**
> This bit in QSPI_TRANS_STATUS is set by hardware when a transfer completes and triggers
> the interrupt. Software clears it by writing 1.
>
> **Why Your Original v1 Logic is Correct:**
>
> Your "[PATCH 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed
> transfer" reads QSPI_TRANS_STATUS at the start of tegra_qspi_isr_thread():
>
> if (!tqspi->curr_xfer) {
> if (!(status & QSPI_RDY))
> return IRQ_NONE; // HW never set RDY → spurious interrupt
> return IRQ_HANDLED; // HW did set RDY → real interrupt, timeout processed it
> }
>
> **Scenario 1 - Delayed ISR (the race you're fixing):**
> 1. HW completes transfer, sets QSPI_RDY, interrupt fires
> 2. ISR thread delayed (CPU busy)
> 3. Timeout handler runs, processes transfer, clears curr_xfer
> 4. Delayed ISR finally wakes up
> 5. Reads QSPI_RDY (may still be set)
> 6. curr_xfer is NULL
> 7. Return IRQ_HANDLED → this WAS our interrupt, just processed by timeout
>
> **Scenario 2 - Truly Spurious:**
> 1. Spurious interrupt fires
> 2. QSPI_RDY = 0 (no transfer completed)
> 3. curr_xfer is NULL
> 4. Return IRQ_NONE → not our interrupt
>
> **Your Latest Proposal Has It Backwards:**
>
> The version in your last email returns IRQ_HANDLED when QSPI_RDY is NOT set, which is
> incorrect per Mark's feedback.
>
> **For v2:**
>
> Patches 1-5: Keep as-is from v1 (all correct)
>
> Patch 6 ("[PATCH 6/6] spi: tegra210-quad: Protect curr_xfer check in IRQ handler"):
> The TODO comment you added asks about keeping the lock held across the handler call. I'd
> suggest removing the TODO and replacing it with a comment explaining why the current
> approach is safe:
>
> spin_unlock_irqrestore(&tqspi->lock, flags);
>
> /*
> * Lock is released here but handlers safely re-check curr_xfer under lock
> * before dereferencing. DMA handler also needs to sleep in
> * wait_for_completion_*(), which cannot be done while holding spinlock.
> */
> if (!tqspi->is_curr_dma_xfer)
> return handle_cpu_based_xfer(tqspi);
>
> This documents the design decision and closes the TODO.
Thanks. I will send a v2 with patches 1-5 untouched, and patch
6 updated with this documentation.
Thanks for the review,
--breno
end of thread, other threads:[~2026-01-22 17:04 UTC | newest]
Thread overview: 16+ messages
-- links below jump to the message on this page --
2026-01-16 10:41 [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
2026-01-16 10:41 ` [PATCH 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer Breno Leitao
2026-01-16 11:33 ` Thierry Reding
2026-01-16 11:48 ` Usama Arif
2026-01-16 12:06 ` Thierry Reding
2026-01-16 10:41 ` [PATCH 2/6] spi: tegra210-quad: Move curr_xfer read inside spinlock Breno Leitao
2026-01-16 11:38 ` Thierry Reding
2026-01-16 10:41 ` [PATCH 3/6] spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one Breno Leitao
2026-01-16 10:41 ` [PATCH 4/6] spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer Breno Leitao
2026-01-16 10:41 ` [PATCH 5/6] spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer Breno Leitao
2026-01-16 10:41 ` [PATCH 6/6] spi: tegra210-quad: Protect curr_xfer check in IRQ handler Breno Leitao
2026-01-20 11:22 ` [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Vishwaroop A
2026-01-20 16:49 ` Breno Leitao
2026-01-20 16:52 ` Mark Brown
2026-01-21 17:56 ` Vishwaroop A
2026-01-22 17:04 ` Breno Leitao