All of lore.kernel.org
 help / color / mirror / Atom feed
From: Thierry Reding <thierry.reding@kernel.org>
To: Breno Leitao <leitao@debian.org>
Cc: Thierry Reding <thierry.reding@gmail.com>,
	 Jonathan Hunter <jonathanh@nvidia.com>,
	Sowjanya Komatineni <skomatineni@nvidia.com>,
	 Laxman Dewangan <ldewangan@nvidia.com>,
	Mark Brown <broonie@kernel.org>, Vishwaroop A <va@nvidia.com>,
	 Thierry Reding <treding@nvidia.com>,
	linux-tegra@vger.kernel.org, linux-spi@vger.kernel.org,
	 linux-kernel@vger.kernel.org, kernel-team@meta.com,
	soto@nvidia.com
Subject: Re: [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ
Date: Fri, 30 Jan 2026 11:16:12 +0100	[thread overview]
Message-ID: <aXyE1kfP4GeOdYs5@orome> (raw)
In-Reply-To: <20260126-tegra_xfer-v2-0-6d2115e4f387@debian.org>

[-- Attachment #1: Type: text/plain, Size: 5409 bytes --]

On Mon, Jan 26, 2026 at 09:50:25AM -0800, Breno Leitao wrote:
> The tegra-quad-spi driver is crashing on some hosts. Analysis revealed
> the following failure sequence:
> 
> 1) After running for a while, the interrupt gets marked as spurious:
> 
>     irq 63: nobody cared (try booting with the "irqpoll" option)
>     Disabling IRQ #63
> 
> 2) The IRQ handler (tegra_qspi_isr_thread->handle_cpu_based_xfer) is
>    responsible for signaling xfer_completion.
>    Once the interrupt is disabled, xfer_completion is never completed, causing
>    transfers to hit the timeout:
> 
>     WARNING: CPU: 64 PID: 844224 at drivers/spi/spi-tegra210-quad.c:1222 tegra_qspi_transfer_one_message+0x7a0/0x9b0
> 
> 3) The timeout handler completes the transfer:
> 
>     tegra-qspi NVDA1513:00: QSPI interrupt timeout, but transfer complete
> 
> 4) Later, the ISR thread finally runs and crashes trying to dereference
>    curr_xfer which the timeout handler already set to NULL:
> 
>     Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
>     pc : handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad]
>     lr : tegra_qspi_handle_timeout+0xb4/0xf0 [spi_tegra210_quad]
>     Call trace:
>       handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad] (P)
> 
> Root cause analysis identified three issues:
> 
> 1) Race condition on tqspi->curr_xfer
> 
>    The curr_xfer pointer can change during ISR execution without proper
>    synchronization. The timeout path clears curr_xfer while the ISR
>    thread may still be accessing it.
> 
>    This is trivially reproducible by decreasing QSPI_DMA_TIMEOUT and
>    adding instrumentation to tegra_qspi_isr_thread() to check curr_xfer
>    at entry and exit - the value changes mid-execution. I've used the
>    following test to reproduce this issue:
> 
>    https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh
> 
>    The existing comment in the ISR acknowledges this race but the
>    protection is insufficient:
> 
>        /*
>         * Occasionally the IRQ thread takes a long time to wake up (usually
>         * when the CPU that it's running on is excessively busy) and we have
>         * already reached the timeout before and cleaned up the timed out
>         * transfer. Avoid any processing in that case and bail out early.
>         */
> 
>    This is bad because tqspi->curr_xfer can just get NULLed
> 
> 2) Incorrect IRQ_NONE return causing spurious IRQ detection
> 
>    When the timeout handler processes a transfer before the ISR thread
>    runs, tegra_qspi_isr_thread() returns IRQ_NONE.
> 
>    After enough IRQ_NONE returns, the kernel marks the interrupt as spurious
>    and disables it - but these were legitimate interrupts that happened to be
>    processed by the timeout path first.
> 
>    Interrupt handlers shouldn't return IRQ_NONE, if the driver somehow handled
>    the interrupt (!?)
> 
> 3) Complex locking makes full protection difficult
> 
>    Ideally the entire tqspi structure would be protected by tqspi->lock,
>    but handle_dma_based_xfer() calls wait_for_completion_interruptible_timeout()
>    which can sleep, preventing the lock from being held across the entire
>    ISR execution.
> 
>    Usama Arif has some ideas here, and he can share more.
> 
> This patchset addresses these issues:
> 
> Return IRQ_HANDLED instead of IRQ_NONE when the timeout path has
> already processed the transfer. Use the QSPI_RDY bit in
> QSPI_TRANS_STATUS (same approach as tegra_qspi_handle_timeout()) to
> distinguish real interrupts from truly spurious ones.
> 
> Protect curr_xfer access with spinlock everywhere in the code, given
> Interrupt handling can run in parallel with timeout and transfer.
> This prevents the NULL pointer dereference by ensuring curr_xfer cannot
> be cleared while being checked.
> 
> While this may not provide complete protection for all tqspi fields
> (which might be necessary?!), it fixes the observed crashes and prevents
> the spurious IRQ detection that was disabling the interrupt entirely.
> 
> This was tested with a simple TPM application, where the TPM lives
> behind the tegra qspi driver:
> 
> https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh
> 
> A special thanks for Usama Arif for his help investigating the problem
> and helping with the fixes.
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> Changes in v2:
> - Replaced the TODO comment to clarify why the lock is being released.
> - Link to v1: https://patch.msgid.link/20260116-tegra_xfer-v1-0-02d96c790619@debian.org
> 
> ---
> Breno Leitao (6):
>       spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
>       spi: tegra210-quad: Move curr_xfer read inside spinlock
>       spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one
>       spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer
>       spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer
>       spi: tegra210-quad: Protect curr_xfer check in IRQ handler
> 
>  drivers/spi/spi-tegra210-quad.c | 56 ++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 52 insertions(+), 4 deletions(-)

For the series:

Acked-by: Thierry Reding <treding@nvidia.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  parent reply	other threads:[~2026-01-30 10:16 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-26 17:50 [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
2026-01-26 17:50 ` [PATCH v2 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer Breno Leitao
2026-01-26 17:50 ` [PATCH v2 2/6] spi: tegra210-quad: Move curr_xfer read inside spinlock Breno Leitao
2026-01-26 17:50 ` [PATCH v2 3/6] spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one Breno Leitao
2026-01-26 17:50 ` [PATCH v2 4/6] spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer Breno Leitao
2026-01-26 17:50 ` [PATCH v2 5/6] spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer Breno Leitao
2026-01-26 17:50 ` [PATCH v2 6/6] spi: tegra210-quad: Protect curr_xfer check in IRQ handler Breno Leitao
2026-01-30 10:16 ` Thierry Reding [this message]
2026-01-30 13:41   ` [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Jon Hunter
2026-01-30 17:07 ` Mark Brown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aXyE1kfP4GeOdYs5@orome \
    --to=thierry.reding@kernel.org \
    --cc=broonie@kernel.org \
    --cc=jonathanh@nvidia.com \
    --cc=kernel-team@meta.com \
    --cc=ldewangan@nvidia.com \
    --cc=leitao@debian.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-spi@vger.kernel.org \
    --cc=linux-tegra@vger.kernel.org \
    --cc=skomatineni@nvidia.com \
    --cc=soto@nvidia.com \
    --cc=thierry.reding@gmail.com \
    --cc=treding@nvidia.com \
    --cc=va@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.