* [PATCH v2 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
2026-01-26 17:50 [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
@ 2026-01-26 17:50 ` Breno Leitao
2026-01-26 17:50 ` [PATCH v2 2/6] spi: tegra210-quad: Move curr_xfer read inside spinlock Breno Leitao
` (6 subsequent siblings)
7 siblings, 0 replies; 10+ messages in thread
From: Breno Leitao @ 2026-01-26 17:50 UTC (permalink / raw)
To: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, Mark Brown, Vishwaroop A
Cc: Thierry Reding, linux-tegra, linux-spi, linux-kernel,
Breno Leitao, kernel-team, soto
When the ISR thread wakes up late and finds that the timeout handler
has already processed the transfer (curr_xfer is NULL), return
IRQ_HANDLED instead of IRQ_NONE.
Use a similar approach to tegra_qspi_handle_timeout() by reading
QSPI_TRANS_STATUS and checking the QSPI_RDY bit to determine if the
hardware actually completed the transfer. If QSPI_RDY is set, the
interrupt was legitimate and triggered by real hardware activity.
The fact that the timeout path handled it first doesn't make it
spurious. Returning IRQ_NONE incorrectly suggests the interrupt
wasn't for this device, which can cause issues with shared interrupt
lines and interrupt accounting.
Fixes: b4e002d8a7ce ("spi: tegra210-quad: Fix timeout handling")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
drivers/spi/spi-tegra210-quad.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index cdc3cb7c01f9b..f0408c0b4b981 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -1552,15 +1552,30 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
static irqreturn_t tegra_qspi_isr_thread(int irq, void *context_data)
{
struct tegra_qspi *tqspi = context_data;
+ u32 status;
+
+ /*
+ * Read transfer status to check if interrupt was triggered by transfer
+ * completion
+ */
+ status = tegra_qspi_readl(tqspi, QSPI_TRANS_STATUS);
/*
* Occasionally the IRQ thread takes a long time to wake up (usually
* when the CPU that it's running on is excessively busy) and we have
* already reached the timeout before and cleaned up the timed out
* transfer. Avoid any processing in that case and bail out early.
+ *
+ * If no transfer is in progress, check if this was a real interrupt
+ * that the timeout handler already processed, or a spurious one.
*/
- if (!tqspi->curr_xfer)
- return IRQ_NONE;
+ if (!tqspi->curr_xfer) {
+ /* Spurious interrupt - transfer not ready */
+ if (!(status & QSPI_RDY))
+ return IRQ_NONE;
+ /* Real interrupt, already handled by timeout path */
+ return IRQ_HANDLED;
+ }
tqspi->status_reg = tegra_qspi_readl(tqspi, QSPI_FIFO_STATUS);
--
2.47.3
^ permalink raw reply related [flat|nested] 10+ messages in thread* [PATCH v2 2/6] spi: tegra210-quad: Move curr_xfer read inside spinlock
2026-01-26 17:50 [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
2026-01-26 17:50 ` [PATCH v2 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer Breno Leitao
@ 2026-01-26 17:50 ` Breno Leitao
2026-01-26 17:50 ` [PATCH v2 3/6] spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one Breno Leitao
` (5 subsequent siblings)
7 siblings, 0 replies; 10+ messages in thread
From: Breno Leitao @ 2026-01-26 17:50 UTC (permalink / raw)
To: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, Mark Brown, Vishwaroop A
Cc: Thierry Reding, linux-tegra, linux-spi, linux-kernel,
Breno Leitao, kernel-team, soto
Move the assignment of the transfer pointer from curr_xfer inside the
spinlock critical section in both handle_cpu_based_xfer() and
handle_dma_based_xfer().
Previously, curr_xfer was read before acquiring the lock, creating a
window where the timeout path could clear curr_xfer between reading it
and using it. By moving the read inside the lock, the handlers are
guaranteed to see a consistent value that cannot be modified by the
timeout path.
Fixes: 921fc1838fb0 ("spi: tegra210-quad: Add support for Tegra210 QSPI controller")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
drivers/spi/spi-tegra210-quad.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index f0408c0b4b981..ee291b9e9e9c0 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -1440,10 +1440,11 @@ static int tegra_qspi_transfer_one_message(struct spi_controller *host,
static irqreturn_t handle_cpu_based_xfer(struct tegra_qspi *tqspi)
{
- struct spi_transfer *t = tqspi->curr_xfer;
+ struct spi_transfer *t;
unsigned long flags;
spin_lock_irqsave(&tqspi->lock, flags);
+ t = tqspi->curr_xfer;
if (tqspi->tx_status || tqspi->rx_status) {
tegra_qspi_handle_error(tqspi);
@@ -1474,7 +1475,7 @@ static irqreturn_t handle_cpu_based_xfer(struct tegra_qspi *tqspi)
static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
{
- struct spi_transfer *t = tqspi->curr_xfer;
+ struct spi_transfer *t;
unsigned int total_fifo_words;
unsigned long flags;
long wait_status;
@@ -1513,6 +1514,7 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
}
spin_lock_irqsave(&tqspi->lock, flags);
+ t = tqspi->curr_xfer;
if (num_errors) {
tegra_qspi_dma_unmap_xfer(tqspi, t);
--
2.47.3
^ permalink raw reply related [flat|nested] 10+ messages in thread* [PATCH v2 3/6] spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one
2026-01-26 17:50 [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
2026-01-26 17:50 ` [PATCH v2 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer Breno Leitao
2026-01-26 17:50 ` [PATCH v2 2/6] spi: tegra210-quad: Move curr_xfer read inside spinlock Breno Leitao
@ 2026-01-26 17:50 ` Breno Leitao
2026-01-26 17:50 ` [PATCH v2 4/6] spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer Breno Leitao
` (4 subsequent siblings)
7 siblings, 0 replies; 10+ messages in thread
From: Breno Leitao @ 2026-01-26 17:50 UTC (permalink / raw)
To: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, Mark Brown, Vishwaroop A
Cc: Thierry Reding, linux-tegra, linux-spi, linux-kernel,
Breno Leitao, kernel-team, soto
When the timeout handler processes a completed transfer and signals
completion, the transfer thread can immediately set up the next transfer
and assign curr_xfer to point to it.
If a delayed ISR from the previous transfer then runs, it checks if
(!tqspi->curr_xfer) (currently without the lock also -- to be fixed
soon) to detect stale interrupts, but this check passes because
curr_xfer now points to the new transfer. The ISR then incorrectly
processes the new transfer's context.
Protect the curr_xfer assignment with the spinlock to ensure the ISR
either sees NULL (and bails out) or sees the new value only after the
assignment is complete.
Fixes: 921fc1838fb0 ("spi: tegra210-quad: Add support for Tegra210 QSPI controller")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
drivers/spi/spi-tegra210-quad.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index ee291b9e9e9c0..15c110c00aca5 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -839,6 +839,7 @@ static u32 tegra_qspi_setup_transfer_one(struct spi_device *spi, struct spi_tran
u32 command1, command2, speed = t->speed_hz;
u8 bits_per_word = t->bits_per_word;
u32 tx_tap = 0, rx_tap = 0;
+ unsigned long flags;
int req_mode;
if (!has_acpi_companion(tqspi->dev) && speed != tqspi->cur_speed) {
@@ -846,10 +847,12 @@ static u32 tegra_qspi_setup_transfer_one(struct spi_device *spi, struct spi_tran
tqspi->cur_speed = speed;
}
+ spin_lock_irqsave(&tqspi->lock, flags);
tqspi->cur_pos = 0;
tqspi->cur_rx_pos = 0;
tqspi->cur_tx_pos = 0;
tqspi->curr_xfer = t;
+ spin_unlock_irqrestore(&tqspi->lock, flags);
if (is_first_of_msg) {
tegra_qspi_mask_clear_irq(tqspi);
--
2.47.3
^ permalink raw reply related [flat|nested] 10+ messages in thread* [PATCH v2 4/6] spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer
2026-01-26 17:50 [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
` (2 preceding siblings ...)
2026-01-26 17:50 ` [PATCH v2 3/6] spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one Breno Leitao
@ 2026-01-26 17:50 ` Breno Leitao
2026-01-26 17:50 ` [PATCH v2 5/6] spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer Breno Leitao
` (3 subsequent siblings)
7 siblings, 0 replies; 10+ messages in thread
From: Breno Leitao @ 2026-01-26 17:50 UTC (permalink / raw)
To: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, Mark Brown, Vishwaroop A
Cc: Thierry Reding, linux-tegra, linux-spi, linux-kernel,
Breno Leitao, kernel-team, soto
The curr_xfer field is read by the IRQ handler without holding the lock
to check if a transfer is in progress. When clearing curr_xfer in the
combined sequence transfer loop, protect it with the spinlock to prevent
a race with the interrupt handler.
Protect the curr_xfer clearing at the exit path of
tegra_qspi_combined_seq_xfer() with the spinlock to prevent a race
with the interrupt handler that reads this field.
Without this protection, the IRQ handler could read a partially updated
curr_xfer value, leading to NULL pointer dereference or use-after-free.
Fixes: b4e002d8a7ce ("spi: tegra210-quad: Fix timeout handling")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
drivers/spi/spi-tegra210-quad.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index 15c110c00aca5..669e01d3f56a6 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -1161,6 +1161,7 @@ static int tegra_qspi_combined_seq_xfer(struct tegra_qspi *tqspi,
u32 address_value = 0;
u32 cmd_config = 0, addr_config = 0;
u8 cmd_value = 0, val = 0;
+ unsigned long flags;
/* Enable Combined sequence mode */
val = tegra_qspi_readl(tqspi, QSPI_GLOBAL_CONFIG);
@@ -1264,13 +1265,17 @@ static int tegra_qspi_combined_seq_xfer(struct tegra_qspi *tqspi,
tegra_qspi_transfer_end(spi);
spi_transfer_delay_exec(xfer);
}
+ spin_lock_irqsave(&tqspi->lock, flags);
tqspi->curr_xfer = NULL;
+ spin_unlock_irqrestore(&tqspi->lock, flags);
transfer_phase++;
}
ret = 0;
exit:
+ spin_lock_irqsave(&tqspi->lock, flags);
tqspi->curr_xfer = NULL;
+ spin_unlock_irqrestore(&tqspi->lock, flags);
msg->status = ret;
return ret;
--
2.47.3
^ permalink raw reply related [flat|nested] 10+ messages in thread* [PATCH v2 5/6] spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer
2026-01-26 17:50 [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
` (3 preceding siblings ...)
2026-01-26 17:50 ` [PATCH v2 4/6] spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer Breno Leitao
@ 2026-01-26 17:50 ` Breno Leitao
2026-01-26 17:50 ` [PATCH v2 6/6] spi: tegra210-quad: Protect curr_xfer check in IRQ handler Breno Leitao
` (2 subsequent siblings)
7 siblings, 0 replies; 10+ messages in thread
From: Breno Leitao @ 2026-01-26 17:50 UTC (permalink / raw)
To: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, Mark Brown, Vishwaroop A
Cc: Thierry Reding, linux-tegra, linux-spi, linux-kernel,
Breno Leitao, kernel-team, soto
Protect the curr_xfer clearing in tegra_qspi_non_combined_seq_xfer()
with the spinlock to prevent a race with the interrupt handler that
reads this field to check if a transfer is in progress.
Fixes: b4e002d8a7ce ("spi: tegra210-quad: Fix timeout handling")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
drivers/spi/spi-tegra210-quad.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index 669e01d3f56a6..79aeb80aa4a70 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -1288,6 +1288,7 @@ static int tegra_qspi_non_combined_seq_xfer(struct tegra_qspi *tqspi,
struct spi_transfer *transfer;
bool is_first_msg = true;
int ret = 0, val = 0;
+ unsigned long flags;
msg->status = 0;
msg->actual_length = 0;
@@ -1368,7 +1369,9 @@ static int tegra_qspi_non_combined_seq_xfer(struct tegra_qspi *tqspi,
msg->actual_length += xfer->len + dummy_bytes;
complete_xfer:
+ spin_lock_irqsave(&tqspi->lock, flags);
tqspi->curr_xfer = NULL;
+ spin_unlock_irqrestore(&tqspi->lock, flags);
if (ret < 0) {
tegra_qspi_transfer_end(spi);
--
2.47.3
^ permalink raw reply related [flat|nested] 10+ messages in thread* [PATCH v2 6/6] spi: tegra210-quad: Protect curr_xfer check in IRQ handler
2026-01-26 17:50 [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
` (4 preceding siblings ...)
2026-01-26 17:50 ` [PATCH v2 5/6] spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer Breno Leitao
@ 2026-01-26 17:50 ` Breno Leitao
2026-01-30 10:16 ` [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Thierry Reding
2026-01-30 17:07 ` Mark Brown
7 siblings, 0 replies; 10+ messages in thread
From: Breno Leitao @ 2026-01-26 17:50 UTC (permalink / raw)
To: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, Mark Brown, Vishwaroop A
Cc: Thierry Reding, linux-tegra, linux-spi, linux-kernel,
Breno Leitao, kernel-team, soto
Now that all other accesses to curr_xfer are done under the lock,
protect the curr_xfer NULL check in tegra_qspi_isr_thread() with the
spinlock. Without this protection, the following race can occur:
CPU0 (ISR thread) CPU1 (timeout path)
---------------- -------------------
if (!tqspi->curr_xfer)
// sees non-NULL
spin_lock()
tqspi->curr_xfer = NULL
spin_unlock()
handle_*_xfer()
spin_lock()
t = tqspi->curr_xfer // NULL!
... t->len ... // NULL dereference!
With this patch, all curr_xfer accesses are now properly synchronized.
Although all accesses to curr_xfer are done under the lock, in
tegra_qspi_isr_thread() it checks for NULL, releases the lock and
reacquires it later in handle_cpu_based_xfer()/handle_dma_based_xfer().
There is a potential for an update in between, which could cause a NULL
pointer dereference.
To handle this, add a NULL check inside the handlers after acquiring
the lock. This ensures that if the timeout path has already cleared
curr_xfer, the handler will safely return without dereferencing the
NULL pointer.
Fixes: b4e002d8a7ce ("spi: tegra210-quad: Fix timeout handling")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
drivers/spi/spi-tegra210-quad.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index 79aeb80aa4a70..f425d62e0c276 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -1457,6 +1457,11 @@ static irqreturn_t handle_cpu_based_xfer(struct tegra_qspi *tqspi)
spin_lock_irqsave(&tqspi->lock, flags);
t = tqspi->curr_xfer;
+ if (!t) {
+ spin_unlock_irqrestore(&tqspi->lock, flags);
+ return IRQ_HANDLED;
+ }
+
if (tqspi->tx_status || tqspi->rx_status) {
tegra_qspi_handle_error(tqspi);
complete(&tqspi->xfer_completion);
@@ -1527,6 +1532,11 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
spin_lock_irqsave(&tqspi->lock, flags);
t = tqspi->curr_xfer;
+ if (!t) {
+ spin_unlock_irqrestore(&tqspi->lock, flags);
+ return IRQ_HANDLED;
+ }
+
if (num_errors) {
tegra_qspi_dma_unmap_xfer(tqspi, t);
tegra_qspi_handle_error(tqspi);
@@ -1565,6 +1575,7 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
static irqreturn_t tegra_qspi_isr_thread(int irq, void *context_data)
{
struct tegra_qspi *tqspi = context_data;
+ unsigned long flags;
u32 status;
/*
@@ -1582,7 +1593,9 @@ static irqreturn_t tegra_qspi_isr_thread(int irq, void *context_data)
* If no transfer is in progress, check if this was a real interrupt
* that the timeout handler already processed, or a spurious one.
*/
+ spin_lock_irqsave(&tqspi->lock, flags);
if (!tqspi->curr_xfer) {
+ spin_unlock_irqrestore(&tqspi->lock, flags);
/* Spurious interrupt - transfer not ready */
if (!(status & QSPI_RDY))
return IRQ_NONE;
@@ -1599,7 +1612,14 @@ static irqreturn_t tegra_qspi_isr_thread(int irq, void *context_data)
tqspi->rx_status = tqspi->status_reg & (QSPI_RX_FIFO_OVF | QSPI_RX_FIFO_UNF);
tegra_qspi_mask_clear_irq(tqspi);
+ spin_unlock_irqrestore(&tqspi->lock, flags);
+ /*
+ * Lock is released here but handlers safely re-check curr_xfer under
+ * lock before dereferencing.
+ * DMA handler also needs to sleep in wait_for_completion_*(), which
+ * cannot be done while holding spinlock.
+ */
if (!tqspi->is_curr_dma_xfer)
return handle_cpu_based_xfer(tqspi);
--
2.47.3
^ permalink raw reply related [flat|nested] 10+ messages in thread* Re: [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ
2026-01-26 17:50 [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
` (5 preceding siblings ...)
2026-01-26 17:50 ` [PATCH v2 6/6] spi: tegra210-quad: Protect curr_xfer check in IRQ handler Breno Leitao
@ 2026-01-30 10:16 ` Thierry Reding
2026-01-30 13:41 ` Jon Hunter
2026-01-30 17:07 ` Mark Brown
7 siblings, 1 reply; 10+ messages in thread
From: Thierry Reding @ 2026-01-30 10:16 UTC (permalink / raw)
To: Breno Leitao
Cc: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, Mark Brown, Vishwaroop A, Thierry Reding,
linux-tegra, linux-spi, linux-kernel, kernel-team, soto
[-- Attachment #1: Type: text/plain, Size: 5409 bytes --]
On Mon, Jan 26, 2026 at 09:50:25AM -0800, Breno Leitao wrote:
> The tegra-quad-spi driver is crashing on some hosts. Analysis revealed
> the following failure sequence:
>
> 1) After running for a while, the interrupt gets marked as spurious:
>
> irq 63: nobody cared (try booting with the "irqpoll" option)
> Disabling IRQ #63
>
> 2) The IRQ handler (tegra_qspi_isr_thread->handle_cpu_based_xfer) is
> responsible for signaling xfer_completion.
> Once the interrupt is disabled, xfer_completion is never completed, causing
> transfers to hit the timeout:
>
> WARNING: CPU: 64 PID: 844224 at drivers/spi/spi-tegra210-quad.c:1222 tegra_qspi_transfer_one_message+0x7a0/0x9b0
>
> 3) The timeout handler completes the transfer:
>
> tegra-qspi NVDA1513:00: QSPI interrupt timeout, but transfer complete
>
> 4) Later, the ISR thread finally runs and crashes trying to dereference
> curr_xfer which the timeout handler already set to NULL:
>
> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
> pc : handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad]
> lr : tegra_qspi_handle_timeout+0xb4/0xf0 [spi_tegra210_quad]
> Call trace:
> handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad] (P)
>
> Root cause analysis identified three issues:
>
> 1) Race condition on tqspi->curr_xfer
>
> The curr_xfer pointer can change during ISR execution without proper
> synchronization. The timeout path clears curr_xfer while the ISR
> thread may still be accessing it.
>
> This is trivially reproducible by decreasing QSPI_DMA_TIMEOUT and
> adding instrumentation to tegra_qspi_isr_thread() to check curr_xfer
> at entry and exit - the value changes mid-execution. I've used the
> following test to reproduce this issue:
>
> https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh
>
> The existing comment in the ISR acknowledges this race but the
> protection is insufficient:
>
> /*
> * Occasionally the IRQ thread takes a long time to wake up (usually
> * when the CPU that it's running on is excessively busy) and we have
> * already reached the timeout before and cleaned up the timed out
> * transfer. Avoid any processing in that case and bail out early.
> */
>
> This is bad because tqspi->curr_xfer can just get NULLed
>
> 2) Incorrect IRQ_NONE return causing spurious IRQ detection
>
> When the timeout handler processes a transfer before the ISR thread
> runs, tegra_qspi_isr_thread() returns IRQ_NONE.
>
> After enough IRQ_NONE returns, the kernel marks the interrupt as spurious
> and disables it - but these were legitimate interrupts that happened to be
> processed by the timeout path first.
>
> Interrupt handlers shouldn't return IRQ_NONE, if the driver somehow handled
> the interrupt (!?)
>
> 3) Complex locking makes full protection difficult
>
> Ideally the entire tqspi structure would be protected by tqspi->lock,
> but handle_dma_based_xfer() calls wait_for_completion_interruptible_timeout()
> which can sleep, preventing the lock from being held across the entire
> ISR execution.
>
> Usama Arif has some ideas here, and he can share more.
>
> This patchset addresses these issues:
>
> Return IRQ_HANDLED instead of IRQ_NONE when the timeout path has
> already processed the transfer. Use the QSPI_RDY bit in
> QSPI_TRANS_STATUS (same approach as tegra_qspi_handle_timeout()) to
> distinguish real interrupts from truly spurious ones.
>
> Protect curr_xfer access with spinlock everywhere in the code, given
> Interrupt handling can run in parallel with timeout and transfer.
> This prevents the NULL pointer dereference by ensuring curr_xfer cannot
> be cleared while being checked.
>
> While this may not provide complete protection for all tqspi fields
> (which might be necessary?!), it fixes the observed crashes and prevents
> the spurious IRQ detection that was disabling the interrupt entirely.
>
> This was tested with a simple TPM application, where the TPM lives
> behind the tegra qspi driver:
>
> https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh
>
> A special thanks for Usama Arif for his help investigating the problem
> and helping with the fixes.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> Changes in v2:
> - Replaced the TODO comment to clarify why the lock is being released.
> - Link to v1: https://patch.msgid.link/20260116-tegra_xfer-v1-0-02d96c790619@debian.org
>
> ---
> Breno Leitao (6):
> spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
> spi: tegra210-quad: Move curr_xfer read inside spinlock
> spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one
> spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer
> spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer
> spi: tegra210-quad: Protect curr_xfer check in IRQ handler
>
> drivers/spi/spi-tegra210-quad.c | 56 ++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 52 insertions(+), 4 deletions(-)
For the series:
Acked-by: Thierry Reding <treding@nvidia.com>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ
2026-01-30 10:16 ` [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Thierry Reding
@ 2026-01-30 13:41 ` Jon Hunter
0 siblings, 0 replies; 10+ messages in thread
From: Jon Hunter @ 2026-01-30 13:41 UTC (permalink / raw)
To: Thierry Reding, Breno Leitao
Cc: Thierry Reding, Sowjanya Komatineni, Laxman Dewangan, Mark Brown,
Vishwaroop A, Thierry Reding, linux-tegra, linux-spi,
linux-kernel, kernel-team, soto
On 30/01/2026 10:16, Thierry Reding wrote:
> On Mon, Jan 26, 2026 at 09:50:25AM -0800, Breno Leitao wrote:
>> The tegra-quad-spi driver is crashing on some hosts. Analysis revealed
>> the following failure sequence:
>>
>> 1) After running for a while, the interrupt gets marked as spurious:
>>
>> irq 63: nobody cared (try booting with the "irqpoll" option)
>> Disabling IRQ #63
>>
>> 2) The IRQ handler (tegra_qspi_isr_thread->handle_cpu_based_xfer) is
>> responsible for signaling xfer_completion.
>> Once the interrupt is disabled, xfer_completion is never completed, causing
>> transfers to hit the timeout:
>>
>> WARNING: CPU: 64 PID: 844224 at drivers/spi/spi-tegra210-quad.c:1222 tegra_qspi_transfer_one_message+0x7a0/0x9b0
>>
>> 3) The timeout handler completes the transfer:
>>
>> tegra-qspi NVDA1513:00: QSPI interrupt timeout, but transfer complete
>>
>> 4) Later, the ISR thread finally runs and crashes trying to dereference
>> curr_xfer which the timeout handler already set to NULL:
>>
>> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
>> pc : handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad]
>> lr : tegra_qspi_handle_timeout+0xb4/0xf0 [spi_tegra210_quad]
>> Call trace:
>> handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad] (P)
>>
>> Root cause analysis identified three issues:
>>
>> 1) Race condition on tqspi->curr_xfer
>>
>> The curr_xfer pointer can change during ISR execution without proper
>> synchronization. The timeout path clears curr_xfer while the ISR
>> thread may still be accessing it.
>>
>> This is trivially reproducible by decreasing QSPI_DMA_TIMEOUT and
>> adding instrumentation to tegra_qspi_isr_thread() to check curr_xfer
>> at entry and exit - the value changes mid-execution. I've used the
>> following test to reproduce this issue:
>>
>> https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh
>>
>> The existing comment in the ISR acknowledges this race but the
>> protection is insufficient:
>>
>> /*
>> * Occasionally the IRQ thread takes a long time to wake up (usually
>> * when the CPU that it's running on is excessively busy) and we have
>> * already reached the timeout before and cleaned up the timed out
>> * transfer. Avoid any processing in that case and bail out early.
>> */
>>
>> This is bad because tqspi->curr_xfer can just get NULLed
>>
>> 2) Incorrect IRQ_NONE return causing spurious IRQ detection
>>
>> When the timeout handler processes a transfer before the ISR thread
>> runs, tegra_qspi_isr_thread() returns IRQ_NONE.
>>
>> After enough IRQ_NONE returns, the kernel marks the interrupt as spurious
>> and disables it - but these were legitimate interrupts that happened to be
>> processed by the timeout path first.
>>
>> Interrupt handlers shouldn't return IRQ_NONE, if the driver somehow handled
>> the interrupt (!?)
>>
>> 3) Complex locking makes full protection difficult
>>
>> Ideally the entire tqspi structure would be protected by tqspi->lock,
>> but handle_dma_based_xfer() calls wait_for_completion_interruptible_timeout()
>> which can sleep, preventing the lock from being held across the entire
>> ISR execution.
>>
>> Usama Arif has some ideas here, and he can share more.
>>
>> This patchset addresses these issues:
>>
>> Return IRQ_HANDLED instead of IRQ_NONE when the timeout path has
>> already processed the transfer. Use the QSPI_RDY bit in
>> QSPI_TRANS_STATUS (same approach as tegra_qspi_handle_timeout()) to
>> distinguish real interrupts from truly spurious ones.
>>
>> Protect curr_xfer access with spinlock everywhere in the code, given
>> Interrupt handling can run in parallel with timeout and transfer.
>> This prevents the NULL pointer dereference by ensuring curr_xfer cannot
>> be cleared while being checked.
>>
>> While this may not provide complete protection for all tqspi fields
>> (which might be necessary?!), it fixes the observed crashes and prevents
>> the spurious IRQ detection that was disabling the interrupt entirely.
>>
>> This was tested with a simple TPM application, where the TPM lives
>> behind the tegra qspi driver:
>>
>> https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh
>>
>> A special thanks for Usama Arif for his help investigating the problem
>> and helping with the fixes.
>>
>> Signed-off-by: Breno Leitao <leitao@debian.org>
>> ---
>> Changes in v2:
>> - Replaced the TODO comment to clarify why the lock is being released.
>> - Link to v1: https://patch.msgid.link/20260116-tegra_xfer-v1-0-02d96c790619@debian.org
>>
>> ---
>> Breno Leitao (6):
>> spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
>> spi: tegra210-quad: Move curr_xfer read inside spinlock
>> spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one
>> spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer
>> spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer
>> spi: tegra210-quad: Protect curr_xfer check in IRQ handler
>>
>> drivers/spi/spi-tegra210-quad.c | 56 ++++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 52 insertions(+), 4 deletions(-)
>
> For the series:
>
> Acked-by: Thierry Reding <treding@nvidia.com>
This also resolves a NULL pointer deference I see on Tegra194 Jetson
Xavier NX and so ...
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Jon Hunter <jonathanh@nvidia.com>
Thanks!
Jon
--
nvpublic
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ
2026-01-26 17:50 [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
` (6 preceding siblings ...)
2026-01-30 10:16 ` [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Thierry Reding
@ 2026-01-30 17:07 ` Mark Brown
7 siblings, 0 replies; 10+ messages in thread
From: Mark Brown @ 2026-01-30 17:07 UTC (permalink / raw)
To: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, Vishwaroop A, Breno Leitao
Cc: Thierry Reding, linux-tegra, linux-spi, linux-kernel, kernel-team,
soto
On Mon, 26 Jan 2026 09:50:25 -0800, Breno Leitao wrote:
> The tegra-quad-spi driver is crashing on some hosts. Analysis revealed
> the following failure sequence:
>
> 1) After running for a while, the interrupt gets marked as spurious:
>
> irq 63: nobody cared (try booting with the "irqpoll" option)
> Disabling IRQ #63
>
> [...]
Applied to
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-next
Thanks!
[1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
commit: aabd8ea0aa253d40cf5f20a609fc3d6f61e38299
[2/6] spi: tegra210-quad: Move curr_xfer read inside spinlock
commit: ef13ba357656451d6371940d8414e3e271df97e3
[3/6] spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one
commit: f5a4d7f5e32ba163cff893493ec1cbb0fd2fb0d5
[4/6] spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer
commit: bf4528ab28e2bf112c3a2cdef44fd13f007781cd
[5/6] spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer
commit: 6d7723e8161f3c3f14125557e19dd080e9d882be
[6/6] spi: tegra210-quad: Protect curr_xfer check in IRQ handler
commit: edf9088b6e1d6d88982db7eb5e736a0e4fbcc09e
All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.
You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.
If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.
Please add any relevant lists and maintainers to the CCs when replying
to this mail.
Thanks,
Mark
^ permalink raw reply [flat|nested] 10+ messages in thread