public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ
@ 2026-01-16 10:41 Breno Leitao
  2026-01-16 10:41 ` [PATCH 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer Breno Leitao
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Breno Leitao @ 2026-01-16 10:41 UTC (permalink / raw)
  To: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
	Laxman Dewangan, Mark Brown, Vishwaroop A
  Cc: Thierry Reding, linux-tegra, linux-spi, linux-kernel,
	Breno Leitao, kernel-team, puranjay, usamaarif642

The tegra-quad-spi driver is crashing on some hosts. Analysis revealed
the following failure sequence:

1) After running for a while, the interrupt gets marked as spurious:

    irq 63: nobody cared (try booting with the "irqpoll" option)
    Disabling IRQ #63

2) The IRQ handler (tegra_qspi_isr_thread->handle_cpu_based_xfer) is
   responsible for signaling xfer_completion.
   Once the interrupt is disabled, xfer_completion is never completed, causing
   transfers to hit the timeout:

    WARNING: CPU: 64 PID: 844224 at drivers/spi/spi-tegra210-quad.c:1222 tegra_qspi_transfer_one_message+0x7a0/0x9b0

3) The timeout handler completes the transfer:

    tegra-qspi NVDA1513:00: QSPI interrupt timeout, but transfer complete

4) Later, the ISR thread finally runs and crashes trying to dereference
   curr_xfer which the timeout handler already set to NULL:

    Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
    pc : handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad]
    lr : tegra_qspi_handle_timeout+0xb4/0xf0 [spi_tegra210_quad]
    Call trace:
      handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad] (P)

Root cause analysis identified three issues:

1) Race condition on tqspi->curr_xfer

   The curr_xfer pointer can change during ISR execution without proper
   synchronization. The timeout path clears curr_xfer while the ISR
   thread may still be accessing it.

   This is trivially reproducible by decreasing QSPI_DMA_TIMEOUT and
   adding instrumentation to tegra_qspi_isr_thread() to check curr_xfer
   at entry and exit - the value changes mid-execution. I've used the
   following test to reproduce this issue:

   https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh

   The existing comment in the ISR acknowledges this race but the
   protection is insufficient:

       /*
        * Occasionally the IRQ thread takes a long time to wake up (usually
        * when the CPU that it's running on is excessively busy) and we have
        * already reached the timeout before and cleaned up the timed out
        * transfer. Avoid any processing in that case and bail out early.
        */

   This is bad because tqspi->curr_xfer can just get NULLed

2) Incorrect IRQ_NONE return causing spurious IRQ detection

   When the timeout handler processes a transfer before the ISR thread
   runs, tegra_qspi_isr_thread() returns IRQ_NONE.

   After enough IRQ_NONE returns, the kernel marks the interrupt as spurious
   and disables it - but these were legitimate interrupts that happened to be
   processed by the timeout path first.

   Interrupt handlers shouldn't return IRQ_NONE, if the driver somehow handled
   the interrupt (!?)

3) Complex locking makes full protection difficult

   Ideally the entire tqspi structure would be protected by tqspi->lock,
   but handle_dma_based_xfer() calls wait_for_completion_interruptible_timeout()
   which can sleep, preventing the lock from being held across the entire
   ISR execution.

   Usama Arif has some ideas here, and he can share more.

This patchset addresses these issues:

Return IRQ_HANDLED instead of IRQ_NONE when the timeout path has
already processed the transfer. Use the QSPI_RDY bit in
QSPI_TRANS_STATUS (same approach as tegra_qspi_handle_timeout()) to
distinguish real interrupts from truly spurious ones.

Protect curr_xfer access with spinlock everywhere in the code, given
Interrupt handling can run in parallel with timeout and transfer.
This prevents the NULL pointer dereference by ensuring curr_xfer cannot
be cleared while being checked.

While this may not provide complete protection for all tqspi fields
(which might be necessary?!), it fixes the observed crashes and prevents
the spurious IRQ detection that was disabling the interrupt entirely.

This was tested with a simple TPM application, where the TPM lives
behind the tegra qspi driver:

https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh

A special thanks for Usama Arif for his help investigating the problem
and helping with the fixes.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Breno Leitao (6):
      spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
      spi: tegra210-quad: Move curr_xfer read inside spinlock
      spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one
      spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer
      spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer
      spi: tegra210-quad: Protect curr_xfer check in IRQ handler

 drivers/spi/spi-tegra210-quad.c | 54 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 50 insertions(+), 4 deletions(-)
---
base-commit: 9b7977f9e39b7768c70c2aa497f04e7569fd3e00
change-id: 20260112-tegra_xfer-6acb30a6720f

Best regards,
--  
Breno Leitao <leitao@debian.org>


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2026-01-22 17:04 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-01-16 10:41 [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
2026-01-16 10:41 ` [PATCH 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer Breno Leitao
2026-01-16 11:33   ` Thierry Reding
2026-01-16 11:48   ` Usama Arif
2026-01-16 12:06     ` Thierry Reding
2026-01-16 10:41 ` [PATCH 2/6] spi: tegra210-quad: Move curr_xfer read inside spinlock Breno Leitao
2026-01-16 11:38   ` Thierry Reding
2026-01-16 10:41 ` [PATCH 3/6] spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one Breno Leitao
2026-01-16 10:41 ` [PATCH 4/6] spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer Breno Leitao
2026-01-16 10:41 ` [PATCH 5/6] spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer Breno Leitao
2026-01-16 10:41 ` [PATCH 6/6] spi: tegra210-quad: Protect curr_xfer check in IRQ handler Breno Leitao
2026-01-20 11:22 ` [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Vishwaroop A
2026-01-20 16:49   ` Breno Leitao
2026-01-20 16:52     ` Mark Brown
2026-01-21 17:56     ` Vishwaroop A
2026-01-22 17:04       ` Breno Leitao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox