Re: [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Breno Leitao <leitao@debian.org>
To: Vishwaroop A <va@nvidia.com>
Cc: thierry.reding@gmail.com, treding@nvidia.com,
	jonathanh@nvidia.com,  skomatineni@nvidia.com,
	ldewangan@nvidia.com, broonie@kernel.org,
	 linux-spi@vger.kernel.org, linux-tegra@vger.kernel.org,
	linux-kernel@vger.kernel.org,  kernel-team@meta.com,
	puranjay@kernel.org, usamaarif642@gmail.com
Subject: Re: [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ
Date: Tue, 20 Jan 2026 08:49:23 -0800	[thread overview]
Message-ID: <aW-seUXIJv4Lz7bK@gmail.com> (raw)
In-Reply-To: <20260120112242.3766700-1-va@nvidia.com>

Hello Vishwaroop,

First of all, thanks for you answer and help,  

On Tue, Jan 20, 2026 at 11:22:42AM +0000, Vishwaroop A wrote:
> Thanks for posting this series. I've been working closely with our team at 
> NVIDIA on this exact issue after the reports from Meta's fleet.
> 
> Your analysis is correct, I have some technical feedback on the series:
> 
> **Patches 1-2 (IRQ_HANDLED + spinlock for curr_xfer):
> 
> These directly address the race we identified. The spinlock-protected read of 
> curr_xfer in Patch 2 mirrors the fix we developed internally and resolves the 
> TOCTOU race between the timeout handler and the delayed ISR thread.
> 
> **Patch 3 (Spinlock in tegra_qspi_setup_transfer_one): Improves formal correctness**
> 
> While our original position was that this lock isn't strictly required due to 
> call ordering (setup completes before HW start, so no IRQ can fire yet), I 
> agree that explicit locking here improves maintainability. It makes the 
> synchronization model clearer for future readers and removes any dependency on 
> implicit ordering guarantees.
> 
> **Patch 4 (device_reset_optional):
> 
> Your observation about ACPI platforms lacking _RST is accurate. On server 
> platforms, device_reset() fails and logs unnecessary errors. 
> device_reset_optional() is the right semantic here. I'd suggest coordinating 
> with the device core maintainers (Greg KH, Rafael Wysocki) to ensure this API 
> addition gets proper review—it's useful beyond just QSPI.

We are counting patch numbers differently. Patch 4 in this patchset is called
"[PATCH 4/6] spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer",
 and basically protect the curr_xfer clearing at the exit path of
tegra_qspi_combined_seq_xfer() with the spinlock to prevent a race with the
interrupt handler that reads this field.

https://lore.kernel.org/all/20260116-tegra_xfer-v1-4-02d96c790619@debian.org/

The discussion about device_reset_optional() happened a while ago (March 2025),
but it never landed. Do you want me to include in this patchset?

Here is the discussion link:
https://lore.kernel.org/all/20250317-tegra-v1-1-78474efc0386@debian.org/

> **Patch 6 (Timeout error path): Logic issue—needs rework**

I understand you are talking about patch "[PATCH 1/6] spi:
tegra210-quad: Return IRQ_HANDLED when timeout already processed
transfer", right?

https://lore.kernel.org/all/20260116-tegra_xfer-v1-1-02d96c790619@debian.org/

> I see a problem here. If QSPI_RDY is not set (hardware actually timed out), 
> you return success (0) from tegra_qspi_handle_timeout(). This causes 
> __spi_transfer_message_noqueue() to call spi_finalize_current_message() with 
> status = 0, signaling success to the client when the transfer actually failed.
> 
> The correct behavior should be:
> - If QSPI_RDY is set: Hardware completed, recover the data, return success (0)
> - If QSPI_RDY is NOT set: True timeout, reset hardware, return error (-ETIMEDOUT)

Thanks for the heads-up.

> The current logic inverts this. I'd suggest:
> 
>   if (fifo_status & QSPI_RDY) {
>       /* HW completed, recover */
>       handle_cpu/dma_based_xfer();
>       return 0;
>   }
>   /* True timeout */
>   dev_err(..., "Transfer timeout");
>   tegra_qspi_handle_error(tqspi);
>   return -ETIMEDOUT;

I am not sure we can return -ETIMEDOUT here, given its return function
is irqreturn, which is:

  /**
   * enum irqreturn - irqreturn type values
   * @IRQ_NONE:           interrupt was not from this device or was not handled
   * @IRQ_HANDLED:        interrupt was handled by this device
   * @IRQ_WAKE_THREAD:    handler requests to wake the handler thread
   */
  enum irqreturn {
          IRQ_NONE                = (0 << 0),
          IRQ_HANDLED             = (1 << 0),
          IRQ_WAKE_THREAD         = (1 << 1),
  };


What about something as:

diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index cdc3cb7c01f9b..028b0b0cfdb25 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -1552,15 +1552,30 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
 static irqreturn_t tegra_qspi_isr_thread(int irq, void *context_data)
 {
        struct tegra_qspi *tqspi = context_data;
+       u32 status;
+
+       /*
+        * Read transfer status to check if interrupt was triggered by transfer
+        * completion
+        */
+       status = tegra_qspi_readl(tqspi, QSPI_TRANS_STATUS);

        /*
         * Occasionally the IRQ thread takes a long time to wake up (usually
         * when the CPU that it's running on is excessively busy) and we have
         * already reached the timeout before and cleaned up the timed out
         * transfer. Avoid any processing in that case and bail out early.
+        *
+        * If no transfer is in progress, check if this was a real interrupt
+        * that the timeout handler already processed, or a spurious one.
         */
-       if (!tqspi->curr_xfer)
+       if (!tqspi->curr_xfer) {
+               /* Spurious interrupt - transfer not ready */
+               if (!(status & QSPI_RDY))
+                       return IRQ_HANDLED;
+               /* Real interrupt, already handled by timeout path */
                return IRQ_NONE;
+       }

        tqspi->status_reg = tegra_qspi_readl(tqspi, QSPI_FIFO_STATUS);


> I saw your tpm_torture_test.sh in the cover letter. We haven't been able to 
> reproduce the issue locally yet—it seems to require the specific TPM device 
> configuration and CPU load patterns present in Meta's fleet. 

You can try to simulate a timeout using something like:

t a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index cdc3cb7c01f9..7116c876c62b 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -159,7 +159,8 @@
 #define DATA_DIR_TX                            BIT(0)
 #define DATA_DIR_RX                            BIT(1)

-#define QSPI_DMA_TIMEOUT                       (msecs_to_jiffies(1000))
+/* INSTRUMENTATION: Reduced from 1000ms to 20ms to trigger timeouts */
+#define QSPI_DMA_TIMEOUT                       (msecs_to_jiffies(20))
 #define DEFAULT_QSPI_DMA_BUF_LEN               SZ_64K

 enum tegra_qspi_transfer_type {
@@ -1552,6 +1553,10 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
 static irqreturn_t tegra_qspi_isr_thread(int irq, void *context_data)
 {
        struct tegra_qspi *tqspi = context_data;
+       mdelay(10);

        /*
         * Occasionally the IRQ thread takes a long time to wake up (usually

I've also uploaded another stress test in
https://github.com/leitao/debug/tree/main/arm/tegra. Those are the  
stress test I am using to reproduce this issue.

> If you have 
> additional details on the reproduction environment (TPM vendor/model, specific 
> workload characteristics, CPU affinity settings), that would help us validate 
> the fix on our end.

I am not doing anything special, this is a bit more about my host:

  # dmesg | grep tegra
  [    8.903792] tegra-qspi NVDA1513:00: cannot use DMA: -19
  [    8.903806] tegra-qspi NVDA1513:00: falling back to PIO
  [    8.903811] tegra-qspi NVDA1513:00: device reset failed

  # udevadm info -a /dev/tpm0  | cat
  looking at device '/devices/platform/NVDA1513:00/spi_master/spi0/spi-PRP0001:01/tpm/tpm0':
    KERNEL=="tpm0"
    SUBSYSTEM=="tpm"
    DRIVER==""
    ATTR{null_name}=="000bb4c148f37daef5c917ebdd9267edad857263b872d9a9cd05f2af96c060212e6f"
    ATTR{pcr-sha256/0}=="9CC1646906FD362B5F6D182CBADE226057CD32A6F5D6C6197A49AA78B4241929"
    ATTR{pcr-sha256/1}=="BFB3D60EA045EE30E924F239C812E11D7D02D380D94B1A53CDF8E336F4BD5EFF"
    ATTR{pcr-sha256/10}=="0000000000000000000000000000000000000000000000000000000000000000"
    ATTR{pcr-sha256/11}=="0000000000000000000000000000000000000000000000000000000000000000"
    ATTR{pcr-sha256/12}=="0000000000000000000000000000000000000000000000000000000000000000"
    ATTR{pcr-sha256/13}=="0000000000000000000000000000000000000000000000000000000000000000"
    ATTR{pcr-sha256/14}=="0000000000000000000000000000000000000000000000000000000000000000"
    ATTR{pcr-sha256/15}=="0000000000000000000000000000000000000000000000000000000000000000"
    ATTR{pcr-sha256/16}=="0000000000000000000000000000000000000000000000000000000000000000"
    ATTR{pcr-sha256/17}=="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
    ATTR{pcr-sha256/18}=="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
    ATTR{pcr-sha256/19}=="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
    ATTR{pcr-sha256/2}=="ECE24999889A3DBA6BDC392ED64CA0B6A968DFCAF46D8DBDDAD57A7A0423AD2C"
    ATTR{pcr-sha256/20}=="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
    ATTR{pcr-sha256/21}=="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
    ATTR{pcr-sha256/22}=="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
    ATTR{pcr-sha256/23}=="0000000000000000000000000000000000000000000000000000000000000000"
    ATTR{pcr-sha256/3}=="3D458CFE55CC03EA1F443F1562BEEC8DF51C75E14A9FCF9A7234A13F198E7969"
    ATTR{pcr-sha256/4}=="7E7811C1F6A74895706FFFBED8180452AABA032209D708A7B08066B16F38524F"
    ATTR{pcr-sha256/5}=="D679AB040FA69F51A392DC467BA09AC0A5040FAAD386A9DA99A23C465DB0BCE1"
    ATTR{pcr-sha256/6}=="3D458CFE55CC03EA1F443F1562BEEC8DF51C75E14A9FCF9A7234A13F198E7969"
    ATTR{pcr-sha256/7}=="65CAF8DD1E0EA7A6347B635D2B379C93B9A1351EDC2AFC3ECDA700E534EB3068"
    ATTR{pcr-sha256/8}=="0000000000000000000000000000000000000000000000000000000000000000"
    ATTR{pcr-sha256/9}=="94BACCCA95B21E689C38511D4FE26D2DCB1E2C20CD5CECC5C4F2FC99C28452BF"
    ATTR{power/control}=="auto"
    ATTR{power/runtime_active_time}=="0"
    ATTR{power/runtime_status}=="unsupported"
    ATTR{power/runtime_suspended_time}=="0"
    ATTR{tpm_version_major}=="2"

  looking at parent device '/devices/platform/NVDA1513:00/spi_master/spi0/spi-PRP0001:01':
    KERNELS=="spi-PRP0001:01"
    SUBSYSTEMS=="spi"
    DRIVERS=="tpm_tis_spi"
    ATTRS{driver_override}==""
    ATTRS{power/control}=="auto"
    ATTRS{power/runtime_active_time}=="0"
    ATTRS{power/runtime_status}=="unsupported"
    ATTRS{power/runtime_suspended_time}=="0"
    ATTRS{statistics/bytes}=="0"
    ATTRS{statistics/bytes_rx}=="0"
    ATTRS{statistics/bytes_tx}=="0"
    ATTRS{statistics/errors}=="0"
    ATTRS{statistics/messages}=="0"
    ATTRS{statistics/spi_async}=="0"
    ATTRS{statistics/spi_sync}=="3542838"
    ATTRS{statistics/spi_sync_immediate}=="3542838"
    ATTRS{statistics/timedout}=="0"
    ATTRS{statistics/transfer_bytes_histo_0-1}=="0"
    ATTRS{statistics/transfer_bytes_histo_1024-2047}=="0"
    ATTRS{statistics/transfer_bytes_histo_128-255}=="0"
    ATTRS{statistics/transfer_bytes_histo_16-31}=="0"
    ATTRS{statistics/transfer_bytes_histo_16384-32767}=="0"
    ATTRS{statistics/transfer_bytes_histo_2-3}=="0"
    ATTRS{statistics/transfer_bytes_histo_2048-4095}=="0"
    ATTRS{statistics/transfer_bytes_histo_256-511}=="0"
    ATTRS{statistics/transfer_bytes_histo_32-63}=="0"
    ATTRS{statistics/transfer_bytes_histo_32768-65535}=="0"
    ATTRS{statistics/transfer_bytes_histo_4-7}=="0"
    ATTRS{statistics/transfer_bytes_histo_4096-8191}=="0"
    ATTRS{statistics/transfer_bytes_histo_512-1023}=="0"
    ATTRS{statistics/transfer_bytes_histo_64-127}=="0"
    ATTRS{statistics/transfer_bytes_histo_65536+}=="0"
    ATTRS{statistics/transfer_bytes_histo_8-15}=="0"
    ATTRS{statistics/transfer_bytes_histo_8192-16383}=="0"
    ATTRS{statistics/transfers}=="0"
    ATTRS{statistics/transfers_split_maxsize}=="0"

  looking at parent device '/devices/platform/NVDA1513:00/spi_master/spi0':
    KERNELS=="spi0"
    SUBSYSTEMS=="spi_master"
    DRIVERS==""
    ATTRS{power/control}=="auto"
    ATTRS{power/runtime_active_time}=="0"
    ATTRS{power/runtime_status}=="unsupported"
    ATTRS{power/runtime_suspended_time}=="0"
    ATTRS{statistics/bytes}=="0"
    ATTRS{statistics/bytes_rx}=="0"
    ATTRS{statistics/bytes_tx}=="0"
    ATTRS{statistics/errors}=="0"
    ATTRS{statistics/messages}=="0"
    ATTRS{statistics/spi_async}=="0"
    ATTRS{statistics/spi_sync}=="3542838"
    ATTRS{statistics/spi_sync_immediate}=="3542838"
    ATTRS{statistics/timedout}=="0"
    ATTRS{statistics/transfer_bytes_histo_0-1}=="0"
    ATTRS{statistics/transfer_bytes_histo_1024-2047}=="0"
    ATTRS{statistics/transfer_bytes_histo_128-255}=="0"
    ATTRS{statistics/transfer_bytes_histo_16-31}=="0"
    ATTRS{statistics/transfer_bytes_histo_16384-32767}=="0"
    ATTRS{statistics/transfer_bytes_histo_2-3}=="0"
    ATTRS{statistics/transfer_bytes_histo_2048-4095}=="0"
    ATTRS{statistics/transfer_bytes_histo_256-511}=="0"
    ATTRS{statistics/transfer_bytes_histo_32-63}=="0"
    ATTRS{statistics/transfer_bytes_histo_32768-65535}=="0"
    ATTRS{statistics/transfer_bytes_histo_4-7}=="0"
    ATTRS{statistics/transfer_bytes_histo_4096-8191}=="0"
    ATTRS{statistics/transfer_bytes_histo_512-1023}=="0"
    ATTRS{statistics/transfer_bytes_histo_64-127}=="0"
    ATTRS{statistics/transfer_bytes_histo_65536+}=="0"
    ATTRS{statistics/transfer_bytes_histo_8-15}=="0"
    ATTRS{statistics/transfer_bytes_histo_8192-16383}=="0"
    ATTRS{statistics/transfers}=="0"
    ATTRS{statistics/transfers_split_maxsize}=="0"

  looking at parent device '/devices/platform/NVDA1513:00':
    KERNELS=="NVDA1513:00"
    SUBSYSTEMS=="platform"
    DRIVERS=="tegra-qspi"
    ATTRS{driver_override}=="(null)"
    ATTRS{power/control}=="auto"
    ATTRS{power/runtime_active_time}=="86751"
    ATTRS{power/runtime_status}=="suspended"
    ATTRS{power/runtime_suspended_time}=="340179486"

  looking at parent device '/devices/platform':
    KERNELS=="platform"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{power/control}=="auto"
    ATTRS{power/runtime_active_time}=="0"
    ATTRS{power/runtime_status}=="unsupported"
    ATTRS{power/runtime_suspended_time}=="0"

> I'm happy to:
> - Test this series on our hardware platforms

Please test with the patch above (reduced QSPI_DMA_TIMEOUT and the mdelay). 

> - Collaborate on v2 with the fixes above

Thanks. I understand that the only concern for v1 is that QSPI_RDY if inverted,
and I can fix it and send a v2. is there anything else you want fixed in v2?

> - I will work on hard IRQ handler follow-up that Thierry suggested for 
>   long-term robustness

This is awesome, appreciate!

Thanks for you support,
--breno

next prev parent reply	other threads:[~2026-01-20 16:49 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-16 10:41 [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Breno Leitao
2026-01-16 10:41 ` [PATCH 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer Breno Leitao
2026-01-16 11:33   ` Thierry Reding
2026-01-16 11:48   ` Usama Arif
2026-01-16 12:06     ` Thierry Reding
2026-01-16 10:41 ` [PATCH 2/6] spi: tegra210-quad: Move curr_xfer read inside spinlock Breno Leitao
2026-01-16 11:38   ` Thierry Reding
2026-01-16 10:41 ` [PATCH 3/6] spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one Breno Leitao
2026-01-16 10:41 ` [PATCH 4/6] spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer Breno Leitao
2026-01-16 10:41 ` [PATCH 5/6] spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer Breno Leitao
2026-01-16 10:41 ` [PATCH 6/6] spi: tegra210-quad: Protect curr_xfer check in IRQ handler Breno Leitao
2026-01-20 11:22 ` [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ Vishwaroop A
2026-01-20 16:49   ` Breno Leitao [this message]
2026-01-20 16:52     ` Mark Brown
2026-01-21 17:56     ` Vishwaroop A
2026-01-22 17:04       ` Breno Leitao

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:cdc3cb7c01f9 dfblob:028b0b0cfdb2 dfblob:cdc3cb7c01f
dfblob:7116c876c62 )
 OR (
bs:"Re: [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aW-seUXIJv4Lz7bK@gmail.com \
    --to=leitao@debian.org \
    --cc=broonie@kernel.org \
    --cc=jonathanh@nvidia.com \
    --cc=kernel-team@meta.com \
    --cc=ldewangan@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-spi@vger.kernel.org \
    --cc=linux-tegra@vger.kernel.org \
    --cc=puranjay@kernel.org \
    --cc=skomatineni@nvidia.com \
    --cc=thierry.reding@gmail.com \
    --cc=treding@nvidia.com \
    --cc=usamaarif642@gmail.com \
    --cc=va@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.