Linux SPI subsystem development
 help / color / mirror / Atom feed
* [PATCH v4 0/3] spi: tegra210-quad: Improve interrupt handling for loaded systems
@ 2026-06-10  6:23 Vishwaroop A
  2026-06-10  6:24 ` [PATCH v4 1/3] spi: tegra210-quad: Convert to hard IRQ with high-priority workqueue Vishwaroop A
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Vishwaroop A @ 2026-06-10  6:23 UTC (permalink / raw)
  To: Thierry Reding, Jon Hunter, Mark Brown
  Cc: Vishwaroop A, Laxman Dewangan, Sowjanya Komatineni, Breno Leitao,
	Suresh Mangipudi, Krishna Yarlagadda, linux-tegra, linux-spi,
	linux-kernel

The current threaded IRQ implementation in spi-tegra210-quad suffers from
scheduler-induced latency on heavily loaded systems. Because threaded IRQ
handlers are subject to CFS scheduling, they can be delayed long enough to
trigger transfer timeouts even though hardware completes in microseconds.
This results in false timeout errors and WARN_ON splats during normal
operation.

This series addresses the problem in three steps:

1. Convert the threaded IRQ handler to a hard IRQ + high-priority unbound
   workqueue model. The hard IRQ does the minimum: capture FIFO status,
   mask and clear the controller IRQ, then schedule the bottom half. The
   workqueue handler runs in process context (can sleep for DMA
   completion) and can execute on any CPU, avoiding the CPU0 bottleneck
   inherent in threaded IRQs.

2. Cache QSPI_TRANS_STATUS in the ISR before clearing it. This lets the
   timeout handler distinguish between a real hardware timeout (QSPI_RDY
   not set) and a delayed workqueue (QSPI_RDY set), preventing false
   timeout errors when hardware has already completed. Pair the cache
   publication with smp_store_release()/smp_load_acquire() so the timeout
   handler observes a coherent set of cached fields on weakly-ordered
   architectures.

3. Process small PIO transfers (those that complete the whole spi_transfer
   in a single chunk) directly in hard IRQ context, eliminating workqueue
   scheduling latency for TPM-style short reads.

Changes since v3:

  Patch 1 ("Convert to hard IRQ with high-priority workqueue"):
    - Dropped IRQF_SHARED. Tegra QSPI uses a dedicated GIC SPI line on
      every SoC that uses this driver, so the ISR does not need a
      runtime PM reference. Addresses Mark Brown's "Since we now have
      IRQF_SHARED we need to take a runtime PM reference here" comment.
    - Switched from devm_request_irq() to plain request_irq() in
      probe() and added explicit free_irq() in remove(), in the order:
      spi_unregister_controller -> free_irq -> destroy_workqueue ->
      pm_runtime_dont_use_autosuspend -> pm_runtime_force_suspend ->
      tegra_qspi_deinit_dma. Addresses Mark's "devm + non-devm mix
      seems likely to be racy" comments on probe() and remove().
    - Removed the tegra_qspi_unmask_irq() call from the work_handler
      NULL-bail path. Addresses Mark's "unmask after dropping the lock
      feels like it opens up races" comment. The next transfer's setup
      path re-arms the IRQ.
    - Snapshot tqspi->curr_xfer under tqspi->lock at the top of
      handle_dma_based_xfer() so the (potentially long) DMA waits
      operate on a stable transfer pointer even if the timeout path
      clears curr_xfer concurrently. Integrates cleanly with Breno
      Leitao's recently merged protect-curr_xfer series.

  Patch 2 ("Cache TRANS_STATUS in ISR for timeout handler"):
    - Cached status_reg / tx_status / rx_status / trans_status are now
      published with WRITE_ONCE() and smp_store_release() in the ISR
      and consumed with smp_load_acquire() in
      tegra_qspi_handle_timeout(). This keeps KCSAN clean across the
      hard-IRQ / process-context boundary.
    - Reset the cached trans_status to zero in
      tegra_qspi_setup_transfer_one() (under tqspi->lock) before
      unmasking the IRQ for the new transfer, so a stale RDY bit from
      the previous transfer cannot mislead the timeout handler.
    - Dropped the WARN_ON() / WARN_ON_ONCE() wrappers around the
      wait_for_completion_timeout() calls in
      tegra_qspi_{combined,non_combined}_seq_xfer(). With the new
      recovery path in place a timeout is no longer a programming
      error; the existing dev_warn_ratelimited("QSPI interrupt
      timeout, but transfer complete") and dev_err("transfer timeout")
      diagnostics are preserved.

  Patch 3 ("Process small PIO transfers in hard IRQ context"):
    - Replaced the previous "curr_dma_words <= QSPI_FIFO_DEPTH" check
      with a tqspi->is_last_pio_chunk scalar computed in
      tegra_qspi_start_cpu_based_transfer() before it unmasks the IRQ.
      Addresses Mark Brown's "Is cur_dma_words always in the same
      units as QSPI_FIFO_DEPTH - I see there's packed transfer support
      in the driver?" comment: the scalar comparison is units-correct
      (cur_pos + curr_dma_words * bytes_per_word >= t->len) and the
      hard-IRQ fastpath no longer dereferences the spi_transfer
      object, so it stays safe against any teardown race that could
      clear curr_xfer concurrently.
    - Fastpath additionally gates on tx_status == 0 && rx_status == 0
      because handle_cpu_based_xfer()'s error path calls
      tegra_qspi_handle_error() -> device_reset(), which can sleep and
      must not run from hard IRQ context.
    - is_curr_dma_xfer and is_last_pio_chunk are written from process
      context (the transfer-start functions) and read lock-free from
      the hard IRQ handler and the workqueue handler, so the writers
      use WRITE_ONCE() and the readers use READ_ONCE() to prevent
      compiler tearing and silence KCSAN.

Changes since v2:
  - Added cancel_work_sync() in remove to flush pending work before
    devm tore down the workqueue (Jon Hunter). v4 has replaced devm
    altogether per Mark Brown's comment, so the explicit teardown now
    relies on free_irq() preventing new work being queued, followed by
    destroy_workqueue() draining what is in-flight.
  - Rewrote patch 2 commit message to describe the race in terms of
    the workqueue model rather than referencing the old threaded IRQ
    (Jon).
  - s/NULLed/cleared/ in code comment (Jon).

Changes since v1:
  - Switched to devm_alloc_workqueue() and devm_request_irq() for
    resource management (Jon Hunter). v4 has since reverted to
    non-devm for the IRQ and workqueue per Mark Brown's review, so
    teardown order can be made explicit.
  - Improved patch 2 commit message to explain the timeout race
    scenario and clarify that the issue pre-exists the workqueue
    conversion (Jon).
  - Removed unnecessary local variable in tegra_qspi_handle_timeout
    (Jon).
  - Moved "workqueue was delayed" comment updates from patch 2 to
    patch 1, since patch 1 introduces the workqueue (Jon).

Tested on TH500 with TPM and QSPI flash devices under sustained CPU
stress (stress-ng), and on Tegra234 (P3960 / P3970) under similar
load. With the v4 series applied, transfers complete correctly without
spurious timeout errors or WARN_ON splats.

Vishwaroop A (3):
  spi: tegra210-quad: Convert to hard IRQ with high-priority workqueue
  spi: tegra210-quad: Cache TRANS_STATUS in ISR for timeout handler
  spi: tegra210-quad: Process small PIO transfers in hard IRQ context

 drivers/spi/spi-tegra210-quad.c | 280 +++++++++++++++++++++++++-------
 1 file changed, 221 insertions(+), 59 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH v4 1/3] spi: tegra210-quad: Convert to hard IRQ with high-priority workqueue
  2026-06-10  6:23 [PATCH v4 0/3] spi: tegra210-quad: Improve interrupt handling for loaded systems Vishwaroop A
@ 2026-06-10  6:24 ` Vishwaroop A
  2026-06-10  6:24 ` [PATCH v4 2/3] spi: tegra210-quad: Cache TRANS_STATUS in ISR for timeout handler Vishwaroop A
  2026-06-10  6:24 ` [PATCH v4 3/3] spi: tegra210-quad: Process small PIO transfers in hard IRQ context Vishwaroop A
  2 siblings, 0 replies; 4+ messages in thread
From: Vishwaroop A @ 2026-06-10  6:24 UTC (permalink / raw)
  To: Thierry Reding, Jon Hunter, Mark Brown
  Cc: Vishwaroop A, Laxman Dewangan, Sowjanya Komatineni, Breno Leitao,
	Suresh Mangipudi, Krishna Yarlagadda, linux-tegra, linux-spi,
	linux-kernel

Threaded IRQ handlers suffer from scheduler latency on heavily loaded
systems, causing false transfer timeouts. Convert to a hard IRQ handler
that schedules work on a high-priority unbound workqueue.

The hard IRQ handler captures FIFO status, masks and clears the
controller IRQ, then schedules the bottom half. The workqueue handler
runs in process context (can sleep for DMA completion) and can execute
on any CPU, avoiding the CPU0 bottleneck that a threaded IRQ pinned to
the GIC affinity creates.

The work handler only touches QSPI MMIO when curr_xfer is non-NULL.
curr_xfer is cleared only after the transfer thread has processed the
completion, and while it is set the transfer thread is blocked in
wait_for_completion_timeout() with the SPI core's runtime PM reference
held, so the clocks are guaranteed on.

The ISR returns IRQ_HANDLED unconditionally. Tegra QSPI has a dedicated,
non-shared GIC SPI line on every SoC that uses this driver, so any
spurious / late IRQ (e.g. after the timeout path has cleared curr_xfer)
must still be acked and re-masked here; otherwise the level-triggered
line could stay asserted and trip the kernel spurious-IRQ detector into
disabling the line ("nobody cared, try to disable"). The lock-free
curr_xfer NULL check lets the ISR bail without touching FIFO/status
when there is no transfer to drive forward.

handle_dma_based_xfer() snapshots curr_xfer under the spinlock at
function entry and bails immediately when the timeout path has already
cleared it. This avoids waiting up to QSPI_DMA_TIMEOUT on a DMA
completion that belongs to a transfer the synchronous path has already
torn down, and keeps the subsequent dma_unmap/FIFO-drain operations
consistent with the transfer that actually started.

Resources are allocated and torn down manually so that remove() can
stop the controller, free the IRQ (preventing new work from being
queued), then destroy the workqueue (which drains any already-queued
work while the clocks are still on) before runtime PM is disabled.

Signed-off-by: Vishwaroop A <va@nvidia.com>
---
 drivers/spi/spi-tegra210-quad.c | 162 ++++++++++++++++++++++----------
 1 file changed, 114 insertions(+), 48 deletions(-)

diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index 588a929a9785..dd5d40e0dcbc 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -191,6 +191,8 @@ struct tegra_qspi {
 	void __iomem				*base;
 	phys_addr_t				phys;
 	unsigned int				irq;
+	struct work_struct			irq_work;
+	struct workqueue_struct			*wq;
 
 	u32					cur_speed;
 	unsigned int				cur_pos;
@@ -1232,9 +1234,9 @@ static int tegra_qspi_combined_seq_xfer(struct tegra_qspi *tqspi,
 
 			if (ret == 0) {
 				/*
-				 * Check if hardware completed the transfer
-				 * even though interrupt was lost or delayed.
-				 * If so, process the completion and continue.
+				 * Check if hardware completed the transfer even though
+				 * workqueue was delayed. If so, process completion and
+				 * continue.
 				 */
 				ret = tegra_qspi_handle_timeout(tqspi);
 				if (ret < 0) {
@@ -1351,8 +1353,8 @@ static int tegra_qspi_non_combined_seq_xfer(struct tegra_qspi *tqspi,
 		if (ret == 0) {
 			/*
 			 * Check if hardware completed the transfer even though
-			 * interrupt was lost or delayed. If so, process the
-			 * completion and continue.
+			 * workqueue was delayed. If so, process completion and
+			 * continue.
 			 */
 			ret = tegra_qspi_handle_timeout(tqspi);
 			if (ret < 0) {
@@ -1506,6 +1508,19 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
 	long wait_status;
 	int num_errors = 0;
 
+	/*
+	 * Snapshot curr_xfer under the lock before the (potentially long)
+	 * DMA waits below. The timeout path can clear tqspi->curr_xfer
+	 * concurrently; using the local copy keeps the subsequent dma_unmap
+	 * and FIFO-drain steps consistent with the transfer that actually
+	 * started, and lets us bail safely if cleanup already happened.
+	 */
+	spin_lock_irqsave(&tqspi->lock, flags);
+	t = tqspi->curr_xfer;
+	spin_unlock_irqrestore(&tqspi->lock, flags);
+	if (!t)
+		return IRQ_HANDLED;
+
 	if (tqspi->cur_direction & DATA_DIR_TX) {
 		if (tqspi->tx_status) {
 			if (tqspi->tx_dma_chan)
@@ -1539,12 +1554,6 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
 	}
 
 	spin_lock_irqsave(&tqspi->lock, flags);
-	t = tqspi->curr_xfer;
-
-	if (!t) {
-		spin_unlock_irqrestore(&tqspi->lock, flags);
-		return IRQ_HANDLED;
-	}
 
 	if (num_errors) {
 		tegra_qspi_dma_unmap_xfer(tqspi, t);
@@ -1581,46 +1590,39 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
 	return IRQ_HANDLED;
 }
 
-static irqreturn_t tegra_qspi_isr_thread(int irq, void *context_data)
+/**
+ * tegra_qspi_work_handler - Workqueue handler for interrupt bottom-half
+ * @work: work_struct embedded in tegra_qspi
+ *
+ * Runs in process context and can sleep (needed for DMA completion waits).
+ * Can run on any available CPU, avoiding CPU0 bottleneck that occurs with
+ * threaded IRQ handlers which are pinned to the IRQ's CPU.
+ *
+ * The hard IRQ handler has already:
+ * - Verified this is our interrupt (QSPI_RDY was set)
+ * - Cached FIFO status in tqspi->status_reg
+ * - Parsed tx_status / rx_status from FIFO status
+ * - Masked further interrupts
+ */
+static void tegra_qspi_work_handler(struct work_struct *work)
 {
-	struct tegra_qspi *tqspi = context_data;
+	struct tegra_qspi *tqspi = container_of(work, struct tegra_qspi, irq_work);
 	unsigned long flags;
-	u32 status;
 
-	/*
-	 * Read transfer status to check if interrupt was triggered by transfer
-	 * completion
-	 */
-	status = tegra_qspi_readl(tqspi, QSPI_TRANS_STATUS);
+	spin_lock_irqsave(&tqspi->lock, flags);
 
 	/*
-	 * Occasionally the IRQ thread takes a long time to wake up (usually
-	 * when the CPU that it's running on is excessively busy) and we have
-	 * already reached the timeout before and cleaned up the timed out
-	 * transfer. Avoid any processing in that case and bail out early.
-	 *
-	 * If no transfer is in progress, check if this was a real interrupt
-	 * that the timeout handler already processed, or a spurious one.
+	 * Check if the timeout handler already processed this transfer.
+	 * Can happen if the workqueue was delayed and the timeout fired
+	 * first. In that case there is nothing to do: tegra_qspi_start_
+	 * {cpu,dma}_based_transfer() at the start of the next transfer
+	 * (or the next message) re-enables interrupts.
 	 */
-	spin_lock_irqsave(&tqspi->lock, flags);
 	if (!tqspi->curr_xfer) {
 		spin_unlock_irqrestore(&tqspi->lock, flags);
-		/* Spurious interrupt - transfer not ready */
-		if (!(status & QSPI_RDY))
-			return IRQ_NONE;
-		/* Real interrupt, already handled by timeout path */
-		return IRQ_HANDLED;
+		return;
 	}
 
-	tqspi->status_reg = tegra_qspi_readl(tqspi, QSPI_FIFO_STATUS);
-
-	if (tqspi->cur_direction & DATA_DIR_TX)
-		tqspi->tx_status = tqspi->status_reg & (QSPI_TX_FIFO_UNF | QSPI_TX_FIFO_OVF);
-
-	if (tqspi->cur_direction & DATA_DIR_RX)
-		tqspi->rx_status = tqspi->status_reg & (QSPI_RX_FIFO_OVF | QSPI_RX_FIFO_UNF);
-
-	tegra_qspi_mask_clear_irq(tqspi);
 	spin_unlock_irqrestore(&tqspi->lock, flags);
 
 	/*
@@ -1630,9 +1632,55 @@ static irqreturn_t tegra_qspi_isr_thread(int irq, void *context_data)
 	 * cannot be done while holding spinlock.
 	 */
 	if (!tqspi->is_curr_dma_xfer)
-		return handle_cpu_based_xfer(tqspi);
+		handle_cpu_based_xfer(tqspi);
+	else
+		handle_dma_based_xfer(tqspi);
+}
+
+/**
+ * tegra_qspi_isr - Hard IRQ handler
+ * @irq: IRQ number
+ * @context_data: QSPI controller instance
+ *
+ * Runs in hard IRQ context with minimal latency. Cannot sleep.
+ *
+ * Tegra QSPI uses a dedicated, non-shared GIC SPI line on every SoC that
+ * uses this driver. The handler always returns IRQ_HANDLED and always
+ * acknowledges/re-masks the controller IRQ, so the level-triggered line
+ * cannot stay asserted and trip the kernel spurious-IRQ detector into
+ * disabling the line. On a stray IRQ where curr_xfer is NULL (e.g. the
+ * timeout path has already torn the transfer down) the FIFO/status
+ * processing and bottom-half scheduling are skipped because there is no
+ * transfer to drive forward.
+ *
+ * Return: IRQ_HANDLED.
+ */
+static irqreturn_t tegra_qspi_isr(int irq, void *context_data)
+{
+	struct tegra_qspi *tqspi = context_data;
+
+	if (!READ_ONCE(tqspi->curr_xfer)) {
+		tegra_qspi_mask_clear_irq(tqspi);
+		return IRQ_HANDLED;
+	}
+
+	spin_lock(&tqspi->lock);
+	tqspi->status_reg = tegra_qspi_readl(tqspi, QSPI_FIFO_STATUS);
+	tegra_qspi_mask_clear_irq(tqspi);
+
+	if (tqspi->cur_direction & DATA_DIR_TX)
+		tqspi->tx_status = tqspi->status_reg &
+				    (QSPI_TX_FIFO_UNF | QSPI_TX_FIFO_OVF);
+
+	if (tqspi->cur_direction & DATA_DIR_RX)
+		tqspi->rx_status = tqspi->status_reg &
+				    (QSPI_RX_FIFO_OVF | QSPI_RX_FIFO_UNF);
+
+	spin_unlock(&tqspi->lock);
 
-	return handle_dma_based_xfer(tqspi);
+	queue_work(tqspi->wq, &tqspi->irq_work);
+
+	return IRQ_HANDLED;
 }
 
 static struct tegra_qspi_soc_data tegra210_qspi_soc_data = {
@@ -1800,12 +1848,21 @@ static int tegra_qspi_probe(struct platform_device *pdev)
 
 	pm_runtime_put_autosuspend(&pdev->dev);
 
-	ret = request_threaded_irq(tqspi->irq, NULL,
-				   tegra_qspi_isr_thread, IRQF_ONESHOT,
-				   dev_name(&pdev->dev), tqspi);
+	tqspi->wq = alloc_workqueue("%s", WQ_HIGHPRI | WQ_UNBOUND, 0,
+				    dev_name(&pdev->dev));
+	if (!tqspi->wq) {
+		dev_err(&pdev->dev, "failed to allocate workqueue\n");
+		ret = -ENOMEM;
+		goto exit_pm_disable;
+	}
+
+	INIT_WORK(&tqspi->irq_work, tegra_qspi_work_handler);
+
+	ret = request_irq(tqspi->irq, tegra_qspi_isr, 0,
+			  dev_name(&pdev->dev), tqspi);
 	if (ret < 0) {
 		dev_err(&pdev->dev, "failed to request IRQ#%u: %d\n", tqspi->irq, ret);
-		goto exit_pm_disable;
+		goto exit_destroy_wq;
 	}
 
 	ret = spi_register_controller(host);
@@ -1817,7 +1874,9 @@ static int tegra_qspi_probe(struct platform_device *pdev)
 	return 0;
 
 exit_free_irq:
-	free_irq(qspi_irq, tqspi);
+	free_irq(tqspi->irq, tqspi);
+exit_destroy_wq:
+	destroy_workqueue(tqspi->wq);
 exit_pm_disable:
 	pm_runtime_dont_use_autosuspend(&pdev->dev);
 	pm_runtime_force_suspend(&pdev->dev);
@@ -1830,8 +1889,15 @@ static void tegra_qspi_remove(struct platform_device *pdev)
 	struct spi_controller *host = platform_get_drvdata(pdev);
 	struct tegra_qspi *tqspi = spi_controller_get_devdata(host);
 
+	/*
+	 * Tear down in reverse order of probe() so that the controller stops
+	 * accepting transfers before the IRQ is released, no new work can be
+	 * queued after the IRQ is freed, and any work already queued is
+	 * drained while the clocks are still running.
+	 */
 	spi_unregister_controller(host);
 	free_irq(tqspi->irq, tqspi);
+	destroy_workqueue(tqspi->wq);
 	pm_runtime_dont_use_autosuspend(&pdev->dev);
 	pm_runtime_force_suspend(&pdev->dev);
 	tegra_qspi_deinit_dma(tqspi);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH v4 2/3] spi: tegra210-quad: Cache TRANS_STATUS in ISR for timeout handler
  2026-06-10  6:23 [PATCH v4 0/3] spi: tegra210-quad: Improve interrupt handling for loaded systems Vishwaroop A
  2026-06-10  6:24 ` [PATCH v4 1/3] spi: tegra210-quad: Convert to hard IRQ with high-priority workqueue Vishwaroop A
@ 2026-06-10  6:24 ` Vishwaroop A
  2026-06-10  6:24 ` [PATCH v4 3/3] spi: tegra210-quad: Process small PIO transfers in hard IRQ context Vishwaroop A
  2 siblings, 0 replies; 4+ messages in thread
From: Vishwaroop A @ 2026-06-10  6:24 UTC (permalink / raw)
  To: Thierry Reding, Jon Hunter, Mark Brown
  Cc: Vishwaroop A, Laxman Dewangan, Sowjanya Komatineni, Breno Leitao,
	Suresh Mangipudi, Krishna Yarlagadda, linux-tegra, linux-spi,
	linux-kernel

On heavily loaded systems the workqueue bottom half can be delayed
long enough for wait_for_completion_timeout() to expire before the
ISR's queued work actually runs. Reading QSPI_TRANS_STATUS directly
from the controller in the timeout handler races with both the
workqueue handler and the controller itself, and can mis-classify a
transfer that genuinely timed out as having "completed".

Cache the controller status captured by the hard IRQ before it is
acked, and let the timeout handler consume that cache:

  - tegra_qspi_isr() reads QSPI_FIFO_STATUS and QSPI_TRANS_STATUS,
    derives tx_status / rx_status, then publishes them via WRITE_ONCE()
    and the trans_status cache via smp_store_release() before masking
    and acking the controller IRQ.

  - tegra_qspi_handle_timeout() consumes trans_status with a paired
    smp_load_acquire(). The release/acquire pair guarantees that a
    timeout handler which observes a non-zero trans_status also
    observes the matching status_reg / tx_status / rx_status updates.

  - tegra_qspi_setup_transfer_one() clears the cache with
    smp_store_release() under the spinlock before unmasking the IRQ
    for the new transfer, so a stale RDY bit from the previous
    transfer cannot fool the timeout handler.

  - If the cache is zero at timeout time the ISR genuinely never ran
    (lost IRQ), so the handler falls back to a direct controller read
    so we do not unconditionally surface a timeout when the hardware
    has actually finished.

curr_xfer is re-checked under the spinlock so that a workqueue handler
which races with the timeout path cannot double-complete the transfer.

Signed-off-by: Vishwaroop A <va@nvidia.com>
---
 drivers/spi/spi-tegra210-quad.c | 65 ++++++++++++++++++++++++++-------
 1 file changed, 51 insertions(+), 14 deletions(-)

diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index dd5d40e0dcbc..f0b15d13e433 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -214,6 +214,7 @@ struct tegra_qspi {
 	u32					tx_status;
 	u32					rx_status;
 	u32					status_reg;
+	u32					trans_status;
 	bool					is_packed;
 	bool					use_dma;
 
@@ -861,6 +862,13 @@ static u32 tegra_qspi_setup_transfer_one(struct spi_device *spi, struct spi_tran
 	tqspi->cur_rx_pos = 0;
 	tqspi->cur_tx_pos = 0;
 	tqspi->curr_xfer = t;
+	/*
+	 * Pairs with smp_load_acquire() in tegra_qspi_handle_timeout().
+	 * Clearing the cached trans_status before unmasking the IRQ for
+	 * the new transfer prevents a stale RDY bit from the previous
+	 * transfer fooling the timeout handler into a false recovery.
+	 */
+	smp_store_release(&tqspi->trans_status, 0);
 	spin_unlock_irqrestore(&tqspi->lock, flags);
 
 	if (is_first_of_msg) {
@@ -1075,26 +1083,43 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi);
  */
 static int tegra_qspi_handle_timeout(struct tegra_qspi *tqspi)
 {
+	unsigned long flags;
 	irqreturn_t ret;
 	u32 status;
 
-	/* Check if hardware actually completed the transfer */
-	status = tegra_qspi_readl(tqspi, QSPI_TRANS_STATUS);
+	/*
+	 * Pairs with smp_store_release() in tegra_qspi_isr(). If the ISR
+	 * managed to run before wait_for_completion_timeout() expired, the
+	 * cached value lets us classify the timeout without racing with a
+	 * concurrent HW write to QSPI_TRANS_STATUS.
+	 *
+	 * A zero cache means the ISR never ran for this transfer (the IRQ
+	 * was lost entirely). In that case fall back to reading the
+	 * controller directly so we do not unconditionally surface a
+	 * timeout when the hardware has actually finished.
+	 */
+	status = smp_load_acquire(&tqspi->trans_status);
+	if (!status)
+		status = tegra_qspi_readl(tqspi, QSPI_TRANS_STATUS);
+
 	if (!(status & QSPI_RDY))
 		return -ETIMEDOUT;
 
+	spin_lock_irqsave(&tqspi->lock, flags);
 	/*
-	 * Hardware completed but interrupt was lost/delayed. Manually
-	 * process the completion by calling the appropriate handler.
+	 * ISR or workqueue may have already completed the transfer
+	 * and cleared curr_xfer between the completion timeout and now.
 	 */
+	if (!tqspi->curr_xfer) {
+		spin_unlock_irqrestore(&tqspi->lock, flags);
+		return 0;
+	}
+
+	spin_unlock_irqrestore(&tqspi->lock, flags);
+
 	dev_warn_ratelimited(tqspi->dev,
 			     "QSPI interrupt timeout, but transfer complete\n");
 
-	/* Clear the transfer status */
-	status = tegra_qspi_readl(tqspi, QSPI_TRANS_STATUS);
-	tegra_qspi_writel(tqspi, status, QSPI_TRANS_STATUS);
-
-	/* Manually trigger completion handler */
 	if (!tqspi->is_curr_dma_xfer)
 		ret = handle_cpu_based_xfer(tqspi);
 	else
@@ -1658,6 +1683,7 @@ static void tegra_qspi_work_handler(struct work_struct *work)
 static irqreturn_t tegra_qspi_isr(int irq, void *context_data)
 {
 	struct tegra_qspi *tqspi = context_data;
+	u32 status_reg, trans_status;
 
 	if (!READ_ONCE(tqspi->curr_xfer)) {
 		tegra_qspi_mask_clear_irq(tqspi);
@@ -1665,16 +1691,27 @@ static irqreturn_t tegra_qspi_isr(int irq, void *context_data)
 	}
 
 	spin_lock(&tqspi->lock);
-	tqspi->status_reg = tegra_qspi_readl(tqspi, QSPI_FIFO_STATUS);
+	status_reg = tegra_qspi_readl(tqspi, QSPI_FIFO_STATUS);
+	trans_status = tegra_qspi_readl(tqspi, QSPI_TRANS_STATUS);
 	tegra_qspi_mask_clear_irq(tqspi);
 
 	if (tqspi->cur_direction & DATA_DIR_TX)
-		tqspi->tx_status = tqspi->status_reg &
-				    (QSPI_TX_FIFO_UNF | QSPI_TX_FIFO_OVF);
+		WRITE_ONCE(tqspi->tx_status,
+			   status_reg & (QSPI_TX_FIFO_UNF | QSPI_TX_FIFO_OVF));
 
 	if (tqspi->cur_direction & DATA_DIR_RX)
-		tqspi->rx_status = tqspi->status_reg &
-				    (QSPI_RX_FIFO_OVF | QSPI_RX_FIFO_UNF);
+		WRITE_ONCE(tqspi->rx_status,
+			   status_reg & (QSPI_RX_FIFO_OVF | QSPI_RX_FIFO_UNF));
+
+	WRITE_ONCE(tqspi->status_reg, status_reg);
+	/*
+	 * smp_store_release() pairs with smp_load_acquire() in
+	 * tegra_qspi_handle_timeout(). Publishing trans_status last with
+	 * release semantics guarantees that a timeout handler which sees
+	 * a non-zero trans_status also observes the WRITE_ONCE() updates
+	 * to status_reg / tx_status / rx_status above.
+	 */
+	smp_store_release(&tqspi->trans_status, trans_status);
 
 	spin_unlock(&tqspi->lock);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH v4 3/3] spi: tegra210-quad: Process small PIO transfers in hard IRQ context
  2026-06-10  6:23 [PATCH v4 0/3] spi: tegra210-quad: Improve interrupt handling for loaded systems Vishwaroop A
  2026-06-10  6:24 ` [PATCH v4 1/3] spi: tegra210-quad: Convert to hard IRQ with high-priority workqueue Vishwaroop A
  2026-06-10  6:24 ` [PATCH v4 2/3] spi: tegra210-quad: Cache TRANS_STATUS in ISR for timeout handler Vishwaroop A
@ 2026-06-10  6:24 ` Vishwaroop A
  2 siblings, 0 replies; 4+ messages in thread
From: Vishwaroop A @ 2026-06-10  6:24 UTC (permalink / raw)
  To: Thierry Reding, Jon Hunter, Mark Brown
  Cc: Vishwaroop A, Laxman Dewangan, Sowjanya Komatineni, Breno Leitao,
	Suresh Mangipudi, Krishna Yarlagadda, linux-tegra, linux-spi,
	linux-kernel

On heavily loaded systems, workqueue scheduling delays can exceed
transfer timeouts even for high-priority queues, causing false
timeouts for latency-sensitive devices like TPM despite hardware
completing in microseconds.

Process small PIO transfers (those that complete the whole spi_transfer
in a single chunk) directly in hard IRQ context instead of deferring to
the workqueue. This reduces completion latency from 1000ms+ to
microseconds and matches the pattern used by other SPI drivers.

To avoid touching the spi_transfer object from hard IRQ context (which
would race with the synchronous teardown path that clears curr_xfer on
timeout), tegra_qspi_start_cpu_based_transfer() caches the "this PIO
chunk completes the whole transfer" decision into a scalar
tqspi->is_last_pio_chunk *before* unmasking the IRQ. The hard-IRQ
fastpath consumes that scalar with READ_ONCE() and never dereferences
curr_xfer or any spi_transfer fields. Multi-chunk PIO transfers are
intentionally kept on the workqueue (only the final chunk sets the
flag) so the fastpath can never recurse into
tegra_qspi_start_cpu_based_transfer() from hard IRQ context, and DMA
transfers always go through the workqueue because their completion
path sleeps on the DMA engine.

The fastpath also gates on the per-IRQ tx_status / rx_status locals
being zero, because handle_cpu_based_xfer()'s error path calls
tegra_qspi_reset() -> device_reset(), which can sleep and must not run
from hard IRQ context.

is_curr_dma_xfer and is_last_pio_chunk are written from process
context (the transfer-start functions) and read lock-free from the
hard IRQ handler and the workqueue handler, so the writes use
WRITE_ONCE() and the reads use READ_ONCE() to prevent compiler tearing
and silence KCSAN data-race warnings.

Signed-off-by: Vishwaroop A <va@nvidia.com>
---
 drivers/spi/spi-tegra210-quad.c | 77 +++++++++++++++++++++++++++++----
 1 file changed, 68 insertions(+), 9 deletions(-)

diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index f0b15d13e433..e7611275734a 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -207,6 +207,18 @@ struct tegra_qspi {
 	unsigned int				dma_buf_size;
 	unsigned int				max_buf_size;
 	bool					is_curr_dma_xfer;
+	/*
+	 * Cached "this PIO chunk completes the whole transfer" decision,
+	 * computed by tegra_qspi_start_cpu_based_transfer() before it
+	 * unmasks the IRQ. Used by the hard IRQ small-PIO fastpath in
+	 * place of dereferencing curr_xfer->len, so the ISR cannot touch
+	 * the spi_transfer object even on a late IRQ that races with the
+	 * synchronous teardown path. Multi-chunk PIO transfers always go
+	 * through the workqueue (this flag is only set on the final
+	 * chunk), so the fastpath cannot recurse into
+	 * tegra_qspi_start_cpu_based_transfer() from hard IRQ context.
+	 */
+	bool					is_last_pio_chunk;
 
 	struct completion			rx_dma_complete;
 	struct completion			tx_dma_complete;
@@ -716,7 +728,13 @@ static int tegra_qspi_start_dma_based_transfer(struct tegra_qspi *tqspi, struct
 
 	tegra_qspi_writel(tqspi, tqspi->command1_reg, QSPI_COMMAND1);
 
-	tqspi->is_curr_dma_xfer = true;
+	/*
+	 * WRITE_ONCE() pairs with READ_ONCE() in tegra_qspi_isr() and
+	 * tegra_qspi_work_handler(); the flag is read lock-free across
+	 * the hard-IRQ / process-context boundary so the annotation
+	 * prevents compiler tearing and silences KCSAN.
+	 */
+	WRITE_ONCE(tqspi->is_curr_dma_xfer, true);
 	tqspi->dma_control_reg = val;
 	val |= QSPI_DMA_EN;
 	tegra_qspi_writel(tqspi, val, QSPI_DMA_CTL);
@@ -737,9 +755,23 @@ static int tegra_qspi_start_cpu_based_transfer(struct tegra_qspi *qspi, struct s
 	val = QSPI_DMA_BLK_SET(cur_words - 1);
 	tegra_qspi_writel(qspi, val, QSPI_DMA_BLK);
 
+	/*
+	 * Snapshot whether this PIO chunk completes the whole transfer
+	 * before unmasking the IRQ, so the hard IRQ small-PIO fastpath
+	 * can decide whether to drain inline without dereferencing the
+	 * spi_transfer object. cur_pos / curr_dma_words / bytes_per_word
+	 * are stable here: they are written by
+	 * tegra_qspi_calculate_curr_xfer_param() earlier in this code
+	 * path. The IRQ cannot fire until the QSPI_COMMAND1 write below
+	 * kicks the transfer off, so this store happens-before any ISR
+	 * that observes the unmask.
+	 */
+	WRITE_ONCE(qspi->is_last_pio_chunk,
+		   qspi->cur_pos + qspi->curr_dma_words * qspi->bytes_per_word >= t->len);
+
 	tegra_qspi_unmask_irq(qspi);
 
-	qspi->is_curr_dma_xfer = false;
+	WRITE_ONCE(qspi->is_curr_dma_xfer, false);
 	val = qspi->command1_reg;
 	val |= QSPI_PIO;
 	tegra_qspi_writel(qspi, val, QSPI_COMMAND1);
@@ -1656,7 +1688,7 @@ static void tegra_qspi_work_handler(struct work_struct *work)
 	 * DMA handler also needs to sleep in wait_for_completion_*(), which
 	 * cannot be done while holding spinlock.
 	 */
-	if (!tqspi->is_curr_dma_xfer)
+	if (!READ_ONCE(tqspi->is_curr_dma_xfer))
 		handle_cpu_based_xfer(tqspi);
 	else
 		handle_dma_based_xfer(tqspi);
@@ -1684,6 +1716,7 @@ static irqreturn_t tegra_qspi_isr(int irq, void *context_data)
 {
 	struct tegra_qspi *tqspi = context_data;
 	u32 status_reg, trans_status;
+	u32 tx_status = 0, rx_status = 0;
 
 	if (!READ_ONCE(tqspi->curr_xfer)) {
 		tegra_qspi_mask_clear_irq(tqspi);
@@ -1695,13 +1728,15 @@ static irqreturn_t tegra_qspi_isr(int irq, void *context_data)
 	trans_status = tegra_qspi_readl(tqspi, QSPI_TRANS_STATUS);
 	tegra_qspi_mask_clear_irq(tqspi);
 
-	if (tqspi->cur_direction & DATA_DIR_TX)
-		WRITE_ONCE(tqspi->tx_status,
-			   status_reg & (QSPI_TX_FIFO_UNF | QSPI_TX_FIFO_OVF));
+	if (tqspi->cur_direction & DATA_DIR_TX) {
+		tx_status = status_reg & (QSPI_TX_FIFO_UNF | QSPI_TX_FIFO_OVF);
+		WRITE_ONCE(tqspi->tx_status, tx_status);
+	}
 
-	if (tqspi->cur_direction & DATA_DIR_RX)
-		WRITE_ONCE(tqspi->rx_status,
-			   status_reg & (QSPI_RX_FIFO_OVF | QSPI_RX_FIFO_UNF));
+	if (tqspi->cur_direction & DATA_DIR_RX) {
+		rx_status = status_reg & (QSPI_RX_FIFO_OVF | QSPI_RX_FIFO_UNF);
+		WRITE_ONCE(tqspi->rx_status, rx_status);
+	}
 
 	WRITE_ONCE(tqspi->status_reg, status_reg);
 	/*
@@ -1715,6 +1750,30 @@ static irqreturn_t tegra_qspi_isr(int irq, void *context_data)
 
 	spin_unlock(&tqspi->lock);
 
+	/*
+	 * Small-PIO fastpath: drain the FIFO inline only when this chunk
+	 * completes the entire outstanding transfer and no error bit was
+	 * latched, to avoid workqueue scheduling latency for TPM-style
+	 * short reads.
+	 *
+	 * The "last chunk" decision is computed and cached as a scalar by
+	 * tegra_qspi_start_cpu_based_transfer() before it unmasks the IRQ,
+	 * so the hard-IRQ fastpath never dereferences the spi_transfer
+	 * pointer here. That keeps the ISR safe against any teardown race
+	 * where the synchronous path could clear curr_xfer concurrently.
+	 *
+	 * Multi-chunk PIO continuation stays on the workqueue so that
+	 * tegra_qspi_start_cpu_based_transfer() can re-arm the IRQ from
+	 * process context. DMA transfers also stay on the workqueue
+	 * because their completion path sleeps on the DMA engine.
+	 * tegra_qspi_handle_error() -> device_reset() can sleep, so the
+	 * fastpath only runs when both status words are clean.
+	 */
+	if (!READ_ONCE(tqspi->is_curr_dma_xfer) &&
+	    READ_ONCE(tqspi->is_last_pio_chunk) &&
+	    !tx_status && !rx_status)
+		return handle_cpu_based_xfer(tqspi);
+
 	queue_work(tqspi->wq, &tqspi->irq_work);
 
 	return IRQ_HANDLED;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2026-06-10  6:24 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-10  6:23 [PATCH v4 0/3] spi: tegra210-quad: Improve interrupt handling for loaded systems Vishwaroop A
2026-06-10  6:24 ` [PATCH v4 1/3] spi: tegra210-quad: Convert to hard IRQ with high-priority workqueue Vishwaroop A
2026-06-10  6:24 ` [PATCH v4 2/3] spi: tegra210-quad: Cache TRANS_STATUS in ISR for timeout handler Vishwaroop A
2026-06-10  6:24 ` [PATCH v4 3/3] spi: tegra210-quad: Process small PIO transfers in hard IRQ context Vishwaroop A

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox