public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Jon Hunter <jonathanh@nvidia.com>
To: Thierry Reding <thierry.reding@kernel.org>,
	Breno Leitao <leitao@debian.org>
Cc: Thierry Reding <thierry.reding@gmail.com>,
	Sowjanya Komatineni <skomatineni@nvidia.com>,
	Laxman Dewangan <ldewangan@nvidia.com>,
	Mark Brown <broonie@kernel.org>, Vishwaroop A <va@nvidia.com>,
	Thierry Reding <treding@nvidia.com>,
	linux-tegra@vger.kernel.org, linux-spi@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@meta.com,
	soto@nvidia.com
Subject: Re: [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ
Date: Fri, 30 Jan 2026 13:41:44 +0000	[thread overview]
Message-ID: <57a63922-e194-4882-aa52-9447255ea1a3@nvidia.com> (raw)
In-Reply-To: <aXyE1kfP4GeOdYs5@orome>


On 30/01/2026 10:16, Thierry Reding wrote:
> On Mon, Jan 26, 2026 at 09:50:25AM -0800, Breno Leitao wrote:
>> The tegra-quad-spi driver is crashing on some hosts. Analysis revealed
>> the following failure sequence:
>>
>> 1) After running for a while, the interrupt gets marked as spurious:
>>
>>      irq 63: nobody cared (try booting with the "irqpoll" option)
>>      Disabling IRQ #63
>>
>> 2) The IRQ handler (tegra_qspi_isr_thread->handle_cpu_based_xfer) is
>>     responsible for signaling xfer_completion.
>>     Once the interrupt is disabled, xfer_completion is never completed, causing
>>     transfers to hit the timeout:
>>
>>      WARNING: CPU: 64 PID: 844224 at drivers/spi/spi-tegra210-quad.c:1222 tegra_qspi_transfer_one_message+0x7a0/0x9b0
>>
>> 3) The timeout handler completes the transfer:
>>
>>      tegra-qspi NVDA1513:00: QSPI interrupt timeout, but transfer complete
>>
>> 4) Later, the ISR thread finally runs and crashes trying to dereference
>>     curr_xfer which the timeout handler already set to NULL:
>>
>>      Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
>>      pc : handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad]
>>      lr : tegra_qspi_handle_timeout+0xb4/0xf0 [spi_tegra210_quad]
>>      Call trace:
>>        handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad] (P)
>>
>> Root cause analysis identified three issues:
>>
>> 1) Race condition on tqspi->curr_xfer
>>
>>     The curr_xfer pointer can change during ISR execution without proper
>>     synchronization. The timeout path clears curr_xfer while the ISR
>>     thread may still be accessing it.
>>
>>     This is trivially reproducible by decreasing QSPI_DMA_TIMEOUT and
>>     adding instrumentation to tegra_qspi_isr_thread() to check curr_xfer
>>     at entry and exit - the value changes mid-execution. I've used the
>>     following test to reproduce this issue:
>>
>>     https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh
>>
>>     The existing comment in the ISR acknowledges this race but the
>>     protection is insufficient:
>>
>>         /*
>>          * Occasionally the IRQ thread takes a long time to wake up (usually
>>          * when the CPU that it's running on is excessively busy) and we have
>>          * already reached the timeout before and cleaned up the timed out
>>          * transfer. Avoid any processing in that case and bail out early.
>>          */
>>
>>     This is bad because tqspi->curr_xfer can just get NULLed
>>
>> 2) Incorrect IRQ_NONE return causing spurious IRQ detection
>>
>>     When the timeout handler processes a transfer before the ISR thread
>>     runs, tegra_qspi_isr_thread() returns IRQ_NONE.
>>
>>     After enough IRQ_NONE returns, the kernel marks the interrupt as spurious
>>     and disables it - but these were legitimate interrupts that happened to be
>>     processed by the timeout path first.
>>
>>     Interrupt handlers shouldn't return IRQ_NONE, if the driver somehow handled
>>     the interrupt (!?)
>>
>> 3) Complex locking makes full protection difficult
>>
>>     Ideally the entire tqspi structure would be protected by tqspi->lock,
>>     but handle_dma_based_xfer() calls wait_for_completion_interruptible_timeout()
>>     which can sleep, preventing the lock from being held across the entire
>>     ISR execution.
>>
>>     Usama Arif has some ideas here, and he can share more.
>>
>> This patchset addresses these issues:
>>
>> Return IRQ_HANDLED instead of IRQ_NONE when the timeout path has
>> already processed the transfer. Use the QSPI_RDY bit in
>> QSPI_TRANS_STATUS (same approach as tegra_qspi_handle_timeout()) to
>> distinguish real interrupts from truly spurious ones.
>>
>> Protect curr_xfer access with spinlock everywhere in the code, given
>> Interrupt handling can run in parallel with timeout and transfer.
>> This prevents the NULL pointer dereference by ensuring curr_xfer cannot
>> be cleared while being checked.
>>
>> While this may not provide complete protection for all tqspi fields
>> (which might be necessary?!), it fixes the observed crashes and prevents
>> the spurious IRQ detection that was disabling the interrupt entirely.
>>
>> This was tested with a simple TPM application, where the TPM lives
>> behind the tegra qspi driver:
>>
>> https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh
>>
>> A special thanks for Usama Arif for his help investigating the problem
>> and helping with the fixes.
>>
>> Signed-off-by: Breno Leitao <leitao@debian.org>
>> ---
>> Changes in v2:
>> - Replaced the TODO comment to clarify why the lock is being released.
>> - Link to v1: https://patch.msgid.link/20260116-tegra_xfer-v1-0-02d96c790619@debian.org
>>
>> ---
>> Breno Leitao (6):
>>        spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
>>        spi: tegra210-quad: Move curr_xfer read inside spinlock
>>        spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one
>>        spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer
>>        spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer
>>        spi: tegra210-quad: Protect curr_xfer check in IRQ handler
>>
>>   drivers/spi/spi-tegra210-quad.c | 56 ++++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 52 insertions(+), 4 deletions(-)
> 
> For the series:
> 
> Acked-by: Thierry Reding <treding@nvidia.com>

This also resolves a NULL pointer deference I see on Tegra194 Jetson 
Xavier NX and so ...

Tested-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Jon Hunter <jonathanh@nvidia.com>

Thanks!
Jon

-- 
nvpublic


  reply	other threads:[~2026-01-30 13:41 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-26 17:50 [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
2026-01-26 17:50 ` [PATCH v2 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer Breno Leitao
2026-01-26 17:50 ` [PATCH v2 2/6] spi: tegra210-quad: Move curr_xfer read inside spinlock Breno Leitao
2026-01-26 17:50 ` [PATCH v2 3/6] spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one Breno Leitao
2026-01-26 17:50 ` [PATCH v2 4/6] spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer Breno Leitao
2026-01-26 17:50 ` [PATCH v2 5/6] spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer Breno Leitao
2026-01-26 17:50 ` [PATCH v2 6/6] spi: tegra210-quad: Protect curr_xfer check in IRQ handler Breno Leitao
2026-01-30 10:16 ` [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Thierry Reding
2026-01-30 13:41   ` Jon Hunter [this message]
2026-01-30 17:07 ` Mark Brown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=57a63922-e194-4882-aa52-9447255ea1a3@nvidia.com \
    --to=jonathanh@nvidia.com \
    --cc=broonie@kernel.org \
    --cc=kernel-team@meta.com \
    --cc=ldewangan@nvidia.com \
    --cc=leitao@debian.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-spi@vger.kernel.org \
    --cc=linux-tegra@vger.kernel.org \
    --cc=skomatineni@nvidia.com \
    --cc=soto@nvidia.com \
    --cc=thierry.reding@gmail.com \
    --cc=thierry.reding@kernel.org \
    --cc=treding@nvidia.com \
    --cc=va@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox