public inbox for linux-tegra@vger.kernel.org
 help / color / mirror / Atom feed
From: Thierry Reding <thierry.reding@kernel.org>
To: Breno Leitao <leitao@debian.org>
Cc: Thierry Reding <thierry.reding@gmail.com>,
	 Jonathan Hunter <jonathanh@nvidia.com>,
	Sowjanya Komatineni <skomatineni@nvidia.com>,
	 Laxman Dewangan <ldewangan@nvidia.com>,
	Mark Brown <broonie@kernel.org>, Vishwaroop A <va@nvidia.com>,
	 Thierry Reding <treding@nvidia.com>,
	linux-tegra@vger.kernel.org, linux-spi@vger.kernel.org,
	 linux-kernel@vger.kernel.org, kernel-team@meta.com,
	soto@nvidia.com
Subject: Re: [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ
Date: Fri, 30 Jan 2026 11:16:12 +0100	[thread overview]
Message-ID: <aXyE1kfP4GeOdYs5@orome> (raw)
In-Reply-To: <20260126-tegra_xfer-v2-0-6d2115e4f387@debian.org>

[-- Attachment #1: Type: text/plain, Size: 5409 bytes --]

On Mon, Jan 26, 2026 at 09:50:25AM -0800, Breno Leitao wrote:
> The tegra-quad-spi driver is crashing on some hosts. Analysis revealed
> the following failure sequence:
> 
> 1) After running for a while, the interrupt gets marked as spurious:
> 
>     irq 63: nobody cared (try booting with the "irqpoll" option)
>     Disabling IRQ #63
> 
> 2) The IRQ handler (tegra_qspi_isr_thread->handle_cpu_based_xfer) is
>    responsible for signaling xfer_completion.
>    Once the interrupt is disabled, xfer_completion is never completed, causing
>    transfers to hit the timeout:
> 
>     WARNING: CPU: 64 PID: 844224 at drivers/spi/spi-tegra210-quad.c:1222 tegra_qspi_transfer_one_message+0x7a0/0x9b0
> 
> 3) The timeout handler completes the transfer:
> 
>     tegra-qspi NVDA1513:00: QSPI interrupt timeout, but transfer complete
> 
> 4) Later, the ISR thread finally runs and crashes trying to dereference
>    curr_xfer which the timeout handler already set to NULL:
> 
>     Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
>     pc : handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad]
>     lr : tegra_qspi_handle_timeout+0xb4/0xf0 [spi_tegra210_quad]
>     Call trace:
>       handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad] (P)
> 
> Root cause analysis identified three issues:
> 
> 1) Race condition on tqspi->curr_xfer
> 
>    The curr_xfer pointer can change during ISR execution without proper
>    synchronization. The timeout path clears curr_xfer while the ISR
>    thread may still be accessing it.
> 
>    This is trivially reproducible by decreasing QSPI_DMA_TIMEOUT and
>    adding instrumentation to tegra_qspi_isr_thread() to check curr_xfer
>    at entry and exit - the value changes mid-execution. I've used the
>    following test to reproduce this issue:
> 
>    https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh
> 
>    The existing comment in the ISR acknowledges this race but the
>    protection is insufficient:
> 
>        /*
>         * Occasionally the IRQ thread takes a long time to wake up (usually
>         * when the CPU that it's running on is excessively busy) and we have
>         * already reached the timeout before and cleaned up the timed out
>         * transfer. Avoid any processing in that case and bail out early.
>         */
> 
>    This is bad because tqspi->curr_xfer can just get NULLed
> 
> 2) Incorrect IRQ_NONE return causing spurious IRQ detection
> 
>    When the timeout handler processes a transfer before the ISR thread
>    runs, tegra_qspi_isr_thread() returns IRQ_NONE.
> 
>    After enough IRQ_NONE returns, the kernel marks the interrupt as spurious
>    and disables it - but these were legitimate interrupts that happened to be
>    processed by the timeout path first.
> 
>    Interrupt handlers shouldn't return IRQ_NONE, if the driver somehow handled
>    the interrupt (!?)
> 
> 3) Complex locking makes full protection difficult
> 
>    Ideally the entire tqspi structure would be protected by tqspi->lock,
>    but handle_dma_based_xfer() calls wait_for_completion_interruptible_timeout()
>    which can sleep, preventing the lock from being held across the entire
>    ISR execution.
> 
>    Usama Arif has some ideas here, and he can share more.
> 
> This patchset addresses these issues:
> 
> Return IRQ_HANDLED instead of IRQ_NONE when the timeout path has
> already processed the transfer. Use the QSPI_RDY bit in
> QSPI_TRANS_STATUS (same approach as tegra_qspi_handle_timeout()) to
> distinguish real interrupts from truly spurious ones.
> 
> Protect curr_xfer access with spinlock everywhere in the code, given
> Interrupt handling can run in parallel with timeout and transfer.
> This prevents the NULL pointer dereference by ensuring curr_xfer cannot
> be cleared while being checked.
> 
> While this may not provide complete protection for all tqspi fields
> (which might be necessary?!), it fixes the observed crashes and prevents
> the spurious IRQ detection that was disabling the interrupt entirely.
> 
> This was tested with a simple TPM application, where the TPM lives
> behind the tegra qspi driver:
> 
> https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh
> 
> A special thanks for Usama Arif for his help investigating the problem
> and helping with the fixes.
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> Changes in v2:
> - Replaced the TODO comment to clarify why the lock is being released.
> - Link to v1: https://patch.msgid.link/20260116-tegra_xfer-v1-0-02d96c790619@debian.org
> 
> ---
> Breno Leitao (6):
>       spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
>       spi: tegra210-quad: Move curr_xfer read inside spinlock
>       spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one
>       spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer
>       spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer
>       spi: tegra210-quad: Protect curr_xfer check in IRQ handler
> 
>  drivers/spi/spi-tegra210-quad.c | 56 ++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 52 insertions(+), 4 deletions(-)

For the series:

Acked-by: Thierry Reding <treding@nvidia.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  parent reply	other threads:[~2026-01-30 10:16 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-26 17:50 [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
2026-01-26 17:50 ` [PATCH v2 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer Breno Leitao
2026-01-26 17:50 ` [PATCH v2 2/6] spi: tegra210-quad: Move curr_xfer read inside spinlock Breno Leitao
2026-01-26 17:50 ` [PATCH v2 3/6] spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one Breno Leitao
2026-01-26 17:50 ` [PATCH v2 4/6] spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer Breno Leitao
2026-01-26 17:50 ` [PATCH v2 5/6] spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer Breno Leitao
2026-01-26 17:50 ` [PATCH v2 6/6] spi: tegra210-quad: Protect curr_xfer check in IRQ handler Breno Leitao
2026-01-30 10:16 ` Thierry Reding [this message]
2026-01-30 13:41   ` [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Jon Hunter
2026-01-30 17:07 ` Mark Brown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aXyE1kfP4GeOdYs5@orome \
    --to=thierry.reding@kernel.org \
    --cc=broonie@kernel.org \
    --cc=jonathanh@nvidia.com \
    --cc=kernel-team@meta.com \
    --cc=ldewangan@nvidia.com \
    --cc=leitao@debian.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-spi@vger.kernel.org \
    --cc=linux-tegra@vger.kernel.org \
    --cc=skomatineni@nvidia.com \
    --cc=soto@nvidia.com \
    --cc=thierry.reding@gmail.com \
    --cc=treding@nvidia.com \
    --cc=va@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox