On Mon, Jan 26, 2026 at 09:50:25AM -0800, Breno Leitao wrote: > The tegra-quad-spi driver is crashing on some hosts. Analysis revealed > the following failure sequence: > > 1) After running for a while, the interrupt gets marked as spurious: > > irq 63: nobody cared (try booting with the "irqpoll" option) > Disabling IRQ #63 > > 2) The IRQ handler (tegra_qspi_isr_thread->handle_cpu_based_xfer) is > responsible for signaling xfer_completion. > Once the interrupt is disabled, xfer_completion is never completed, causing > transfers to hit the timeout: > > WARNING: CPU: 64 PID: 844224 at drivers/spi/spi-tegra210-quad.c:1222 tegra_qspi_transfer_one_message+0x7a0/0x9b0 > > 3) The timeout handler completes the transfer: > > tegra-qspi NVDA1513:00: QSPI interrupt timeout, but transfer complete > > 4) Later, the ISR thread finally runs and crashes trying to dereference > curr_xfer which the timeout handler already set to NULL: > > Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 > pc : handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad] > lr : tegra_qspi_handle_timeout+0xb4/0xf0 [spi_tegra210_quad] > Call trace: > handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad] (P) > > Root cause analysis identified three issues: > > 1) Race condition on tqspi->curr_xfer > > The curr_xfer pointer can change during ISR execution without proper > synchronization. The timeout path clears curr_xfer while the ISR > thread may still be accessing it. > > This is trivially reproducible by decreasing QSPI_DMA_TIMEOUT and > adding instrumentation to tegra_qspi_isr_thread() to check curr_xfer > at entry and exit - the value changes mid-execution. I've used the > following test to reproduce this issue: > > https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh > > The existing comment in the ISR acknowledges this race but the > protection is insufficient: > > /* > * Occasionally the IRQ thread takes a long time to wake up (usually > * when the CPU that it's running on is excessively busy) and we have > * already reached the timeout before and cleaned up the timed out > * transfer. Avoid any processing in that case and bail out early. > */ > > This is bad because tqspi->curr_xfer can just get NULLed > > 2) Incorrect IRQ_NONE return causing spurious IRQ detection > > When the timeout handler processes a transfer before the ISR thread > runs, tegra_qspi_isr_thread() returns IRQ_NONE. > > After enough IRQ_NONE returns, the kernel marks the interrupt as spurious > and disables it - but these were legitimate interrupts that happened to be > processed by the timeout path first. > > Interrupt handlers shouldn't return IRQ_NONE, if the driver somehow handled > the interrupt (!?) > > 3) Complex locking makes full protection difficult > > Ideally the entire tqspi structure would be protected by tqspi->lock, > but handle_dma_based_xfer() calls wait_for_completion_interruptible_timeout() > which can sleep, preventing the lock from being held across the entire > ISR execution. > > Usama Arif has some ideas here, and he can share more. > > This patchset addresses these issues: > > Return IRQ_HANDLED instead of IRQ_NONE when the timeout path has > already processed the transfer. Use the QSPI_RDY bit in > QSPI_TRANS_STATUS (same approach as tegra_qspi_handle_timeout()) to > distinguish real interrupts from truly spurious ones. > > Protect curr_xfer access with spinlock everywhere in the code, given > Interrupt handling can run in parallel with timeout and transfer. > This prevents the NULL pointer dereference by ensuring curr_xfer cannot > be cleared while being checked. > > While this may not provide complete protection for all tqspi fields > (which might be necessary?!), it fixes the observed crashes and prevents > the spurious IRQ detection that was disabling the interrupt entirely. > > This was tested with a simple TPM application, where the TPM lives > behind the tegra qspi driver: > > https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh > > A special thanks for Usama Arif for his help investigating the problem > and helping with the fixes. > > Signed-off-by: Breno Leitao > --- > Changes in v2: > - Replaced the TODO comment to clarify why the lock is being released. > - Link to v1: https://patch.msgid.link/20260116-tegra_xfer-v1-0-02d96c790619@debian.org > > --- > Breno Leitao (6): > spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer > spi: tegra210-quad: Move curr_xfer read inside spinlock > spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one > spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer > spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer > spi: tegra210-quad: Protect curr_xfer check in IRQ handler > > drivers/spi/spi-tegra210-quad.c | 56 ++++++++++++++++++++++++++++++++++++++--- > 1 file changed, 52 insertions(+), 4 deletions(-) For the series: Acked-by: Thierry Reding