* [PATCH v5 0/3] spi: tegra210-quad: Improve timeout handling under high system load
@ 2025-10-28 15:57 Vishwaroop A
2025-10-28 15:57 ` [PATCH v5 1/3] spi: tegra210-quad: Fix timeout handling Vishwaroop A
` (4 more replies)
0 siblings, 5 replies; 9+ messages in thread
From: Vishwaroop A @ 2025-10-28 15:57 UTC (permalink / raw)
To: Mark Brown, Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, smangipudi, kyarlagadda
Cc: Vishwaroop A, linux-spi, linux-tegra, linux-kernel
Hi,
This patch series addresses timeout handling issues in the Tegra QSPI driver
that occur under high system load conditions. We've observed that when CPUs
are saturated (due to error injection, RAS firmware activity, or general CPU
contention), QSPI interrupt handlers can be delayed, causing spurious transfer
failures even though the hardware completed the operation successfully.
Patch 1 fixes a stale pointer issue by ensuring curr_xfer is cleared on timeout
and checked when the IRQ thread finally runs. It also ensures interrupts are
properly cleared on failure paths.
Patch 2 refactors the timeout cleanup code into dedicated helper functions
(tegra_qspi_reset, tegra_qspi_dma_stop, tegra_qspi_pio_stop) to improve code
readability and maintainability. This is purely a code reorganization with no
functional changes.
Patch 3 adds hardware status checking on timeout. Before failing a transfer,
the driver now reads QSPI_TRANS_STATUS to verify if the hardware actually
completed the operation. If so, it manually invokes the completion handler
instead of failing the transfer. This distinguishes genuine hardware timeouts
from delayed/lost interrupts.
These changes have been tested in production environments under various high
load scenarios including RAS testing and CPU saturation workloads.
Changes in v5:
- No code changes, rebased to resolve conflicts
Changes in v4:
- Removed Change-Id from commit messages
Changes in v3:
- Added missing tqspi->curr_xfer = NULL assignment in handle_cpu_based_xfer()
- Split the previous patch 2/2 into two separate patches (now 2/3 and 3/3)
- Patch 2/3: New patch - refactoring only, no functional changes
- Patch 3/3: Functional changes to add hardware timeout checking
Changes in v2:
- Fixed indentation in patch 1/2: The "Reset controller if timeout happens"
block now has correct indentation (inside the WARN_ON_ONCE block)
- No functional changes
Thierry Reding (1):
spi: tegra210-quad: Fix timeout handling
Vishwaroop A (2):
spi: tegra210-quad: Refactor error handling into helper functions
spi: tegra210-quad: Check hardware status on timeout
drivers/spi/spi-tegra210-quad.c | 174 +++++++++++++++++++++++---------
1 file changed, 128 insertions(+), 46 deletions(-)
--
2.17.1
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v5 1/3] spi: tegra210-quad: Fix timeout handling
2025-10-28 15:57 [PATCH v5 0/3] spi: tegra210-quad: Improve timeout handling under high system load Vishwaroop A
@ 2025-10-28 15:57 ` Vishwaroop A
2025-11-12 14:39 ` Breno Leitao
2025-10-28 15:57 ` [PATCH v5 2/3] spi: tegra210-quad: Refactor error handling into helper functions Vishwaroop A
` (3 subsequent siblings)
4 siblings, 1 reply; 9+ messages in thread
From: Vishwaroop A @ 2025-10-28 15:57 UTC (permalink / raw)
To: Mark Brown, Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, smangipudi, kyarlagadda
Cc: Vishwaroop A, linux-spi, linux-tegra, linux-kernel,
Thierry Reding
When the CPU that the QSPI interrupt handler runs on (typically CPU 0)
is excessively busy, it can lead to rare cases of the IRQ thread not
running before the transfer timeout is reached.
While handling the timeouts, any pending transfers are cleaned up and
the message that they correspond to is marked as failed, which leaves
the curr_xfer field pointing at stale memory.
To avoid this, clear curr_xfer to NULL upon timeout and check for this
condition when the IRQ thread is finally run.
While at it, also make sure to clear interrupts on failure so that new
interrupts can be run.
A better, more involved, fix would move the interrupt clearing into a
hard IRQ handler. Ideally we would also want to signal that the IRQ
thread no longer needs to be run after the timeout is hit to avoid the
extra check for a valid transfer.
Fixes: 921fc1838fb0 ("spi: tegra210-quad: Add support for Tegra210 QSPI controller")
Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Vishwaroop A <va@nvidia.com>
---
drivers/spi/spi-tegra210-quad.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index 3be7499db21e..d9ca3d7b082f 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -1024,8 +1024,10 @@ static void tegra_qspi_handle_error(struct tegra_qspi *tqspi)
dev_err(tqspi->dev, "error in transfer, fifo status 0x%08x\n", tqspi->status_reg);
tegra_qspi_dump_regs(tqspi);
tegra_qspi_flush_fifos(tqspi, true);
- if (device_reset(tqspi->dev) < 0)
+ if (device_reset(tqspi->dev) < 0) {
dev_warn_once(tqspi->dev, "device reset failed\n");
+ tegra_qspi_mask_clear_irq(tqspi);
+ }
}
static void tegra_qspi_transfer_end(struct spi_device *spi)
@@ -1176,9 +1178,11 @@ static int tegra_qspi_combined_seq_xfer(struct tegra_qspi *tqspi,
}
/* Reset controller if timeout happens */
- if (device_reset(tqspi->dev) < 0)
+ if (device_reset(tqspi->dev) < 0) {
dev_warn_once(tqspi->dev,
"device reset failed\n");
+ tegra_qspi_mask_clear_irq(tqspi);
+ }
ret = -EIO;
goto exit;
}
@@ -1200,11 +1204,13 @@ static int tegra_qspi_combined_seq_xfer(struct tegra_qspi *tqspi,
tegra_qspi_transfer_end(spi);
spi_transfer_delay_exec(xfer);
}
+ tqspi->curr_xfer = NULL;
transfer_phase++;
}
ret = 0;
exit:
+ tqspi->curr_xfer = NULL;
msg->status = ret;
return ret;
@@ -1290,6 +1296,8 @@ static int tegra_qspi_non_combined_seq_xfer(struct tegra_qspi *tqspi,
msg->actual_length += xfer->len + dummy_bytes;
complete_xfer:
+ tqspi->curr_xfer = NULL;
+
if (ret < 0) {
tegra_qspi_transfer_end(spi);
spi_transfer_delay_exec(xfer);
@@ -1395,6 +1403,7 @@ static irqreturn_t handle_cpu_based_xfer(struct tegra_qspi *tqspi)
tegra_qspi_calculate_curr_xfer_param(tqspi, t);
tegra_qspi_start_cpu_based_transfer(tqspi, t);
exit:
+ tqspi->curr_xfer = NULL;
spin_unlock_irqrestore(&tqspi->lock, flags);
return IRQ_HANDLED;
}
@@ -1480,6 +1489,15 @@ static irqreturn_t tegra_qspi_isr_thread(int irq, void *context_data)
{
struct tegra_qspi *tqspi = context_data;
+ /*
+ * Occasionally the IRQ thread takes a long time to wake up (usually
+ * when the CPU that it's running on is excessively busy) and we have
+ * already reached the timeout before and cleaned up the timed out
+ * transfer. Avoid any processing in that case and bail out early.
+ */
+ if (!tqspi->curr_xfer)
+ return IRQ_NONE;
+
tqspi->status_reg = tegra_qspi_readl(tqspi, QSPI_FIFO_STATUS);
if (tqspi->cur_direction & DATA_DIR_TX)
--
2.17.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v5 2/3] spi: tegra210-quad: Refactor error handling into helper functions
2025-10-28 15:57 [PATCH v5 0/3] spi: tegra210-quad: Improve timeout handling under high system load Vishwaroop A
2025-10-28 15:57 ` [PATCH v5 1/3] spi: tegra210-quad: Fix timeout handling Vishwaroop A
@ 2025-10-28 15:57 ` Vishwaroop A
2025-11-03 14:15 ` Thierry Reding
2025-10-28 15:57 ` [PATCH v5 3/3] spi: tegra210-quad: Check hardware status on timeout Vishwaroop A
` (2 subsequent siblings)
4 siblings, 1 reply; 9+ messages in thread
From: Vishwaroop A @ 2025-10-28 15:57 UTC (permalink / raw)
To: Mark Brown, Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, smangipudi, kyarlagadda
Cc: Vishwaroop A, linux-spi, linux-tegra, linux-kernel
Extract common cleanup code into dedicated helper functions to simplify
the code and improve readability. This refactoring includes:
- tegra_qspi_reset(): Device reset and interrupt cleanup
- tegra_qspi_dma_stop(): DMA termination and disable
- tegra_qspi_pio_stop(): PIO mode disable
No functional changes. This is purely a code reorganization to prepare
for improved timeout handling in subsequent patches.
Signed-off-by: Vishwaroop A <va@nvidia.com>
---
drivers/spi/spi-tegra210-quad.c | 84 +++++++++++++++++----------------
1 file changed, 44 insertions(+), 40 deletions(-)
diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index d9ca3d7b082f..69defb4ffe49 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -1019,17 +1019,22 @@ static void tegra_qspi_dump_regs(struct tegra_qspi *tqspi)
tegra_qspi_readl(tqspi, QSPI_FIFO_STATUS));
}
-static void tegra_qspi_handle_error(struct tegra_qspi *tqspi)
+static void tegra_qspi_reset(struct tegra_qspi *tqspi)
{
- dev_err(tqspi->dev, "error in transfer, fifo status 0x%08x\n", tqspi->status_reg);
- tegra_qspi_dump_regs(tqspi);
- tegra_qspi_flush_fifos(tqspi, true);
if (device_reset(tqspi->dev) < 0) {
dev_warn_once(tqspi->dev, "device reset failed\n");
tegra_qspi_mask_clear_irq(tqspi);
}
}
+static void tegra_qspi_handle_error(struct tegra_qspi *tqspi)
+{
+ dev_err(tqspi->dev, "error in transfer, fifo status 0x%08x\n", tqspi->status_reg);
+ tegra_qspi_dump_regs(tqspi);
+ tegra_qspi_flush_fifos(tqspi, true);
+ tegra_qspi_reset(tqspi);
+}
+
static void tegra_qspi_transfer_end(struct spi_device *spi)
{
struct tegra_qspi *tqspi = spi_controller_get_devdata(spi->controller);
@@ -1074,6 +1079,30 @@ static u32 tegra_qspi_addr_config(bool is_ddr, u8 bus_width, u8 len)
return addr_config;
}
+static void tegra_qspi_dma_stop(struct tegra_qspi *tqspi)
+{
+ u32 value;
+
+ if ((tqspi->cur_direction & DATA_DIR_TX) && tqspi->tx_dma_chan)
+ dmaengine_terminate_all(tqspi->tx_dma_chan);
+
+ if ((tqspi->cur_direction & DATA_DIR_RX) && tqspi->rx_dma_chan)
+ dmaengine_terminate_all(tqspi->rx_dma_chan);
+
+ value = tegra_qspi_readl(tqspi, QSPI_DMA_CTL);
+ value &= ~QSPI_DMA_EN;
+ tegra_qspi_writel(tqspi, value, QSPI_DMA_CTL);
+}
+
+static void tegra_qspi_pio_stop(struct tegra_qspi *tqspi)
+{
+ u32 value;
+
+ value = tegra_qspi_readl(tqspi, QSPI_COMMAND1);
+ value &= ~QSPI_PIO;
+ tegra_qspi_writel(tqspi, value, QSPI_COMMAND1);
+}
+
static int tegra_qspi_combined_seq_xfer(struct tegra_qspi *tqspi,
struct spi_message *msg)
{
@@ -1081,7 +1110,7 @@ static int tegra_qspi_combined_seq_xfer(struct tegra_qspi *tqspi,
struct spi_transfer *xfer;
struct spi_device *spi = msg->spi;
u8 transfer_phase = 0;
- u32 cmd1 = 0, dma_ctl = 0;
+ u32 cmd1 = 0;
int ret = 0;
u32 address_value = 0;
u32 cmd_config = 0, addr_config = 0;
@@ -1150,39 +1179,16 @@ static int tegra_qspi_combined_seq_xfer(struct tegra_qspi *tqspi,
if (WARN_ON_ONCE(ret == 0)) {
dev_err_ratelimited(tqspi->dev,
"QSPI Transfer failed with timeout\n");
- if (tqspi->is_curr_dma_xfer) {
- if ((tqspi->cur_direction & DATA_DIR_TX) &&
- tqspi->tx_dma_chan)
- dmaengine_terminate_all(tqspi->tx_dma_chan);
- if ((tqspi->cur_direction & DATA_DIR_RX) &&
- tqspi->rx_dma_chan)
- dmaengine_terminate_all(tqspi->rx_dma_chan);
- }
/* Abort transfer by resetting pio/dma bit */
- if (!tqspi->is_curr_dma_xfer) {
- cmd1 = tegra_qspi_readl
- (tqspi,
- QSPI_COMMAND1);
- cmd1 &= ~QSPI_PIO;
- tegra_qspi_writel
- (tqspi, cmd1,
- QSPI_COMMAND1);
- } else {
- dma_ctl = tegra_qspi_readl
- (tqspi,
- QSPI_DMA_CTL);
- dma_ctl &= ~QSPI_DMA_EN;
- tegra_qspi_writel(tqspi, dma_ctl,
- QSPI_DMA_CTL);
- }
+ if (tqspi->is_curr_dma_xfer)
+ tegra_qspi_dma_stop(tqspi);
+ else
+ tegra_qspi_pio_stop(tqspi);
/* Reset controller if timeout happens */
- if (device_reset(tqspi->dev) < 0) {
- dev_warn_once(tqspi->dev,
- "device reset failed\n");
- tegra_qspi_mask_clear_irq(tqspi);
- }
+ tegra_qspi_reset(tqspi);
+
ret = -EIO;
goto exit;
}
@@ -1276,12 +1282,10 @@ static int tegra_qspi_non_combined_seq_xfer(struct tegra_qspi *tqspi,
QSPI_DMA_TIMEOUT);
if (WARN_ON(ret == 0)) {
dev_err(tqspi->dev, "transfer timeout\n");
- if (tqspi->is_curr_dma_xfer) {
- if ((tqspi->cur_direction & DATA_DIR_TX) && tqspi->tx_dma_chan)
- dmaengine_terminate_all(tqspi->tx_dma_chan);
- if ((tqspi->cur_direction & DATA_DIR_RX) && tqspi->rx_dma_chan)
- dmaengine_terminate_all(tqspi->rx_dma_chan);
- }
+
+ if (tqspi->is_curr_dma_xfer)
+ tegra_qspi_dma_stop(tqspi);
+
tegra_qspi_handle_error(tqspi);
ret = -EIO;
goto complete_xfer;
--
2.17.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v5 3/3] spi: tegra210-quad: Check hardware status on timeout
2025-10-28 15:57 [PATCH v5 0/3] spi: tegra210-quad: Improve timeout handling under high system load Vishwaroop A
2025-10-28 15:57 ` [PATCH v5 1/3] spi: tegra210-quad: Fix timeout handling Vishwaroop A
2025-10-28 15:57 ` [PATCH v5 2/3] spi: tegra210-quad: Refactor error handling into helper functions Vishwaroop A
@ 2025-10-28 15:57 ` Vishwaroop A
2025-11-03 14:16 ` Thierry Reding
2025-11-06 10:06 ` [PATCH v5 0/3] spi: tegra210-quad: Improve timeout handling under high system load Jon Hunter
2025-11-06 11:34 ` Mark Brown
4 siblings, 1 reply; 9+ messages in thread
From: Vishwaroop A @ 2025-10-28 15:57 UTC (permalink / raw)
To: Mark Brown, Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, smangipudi, kyarlagadda
Cc: Vishwaroop A, linux-spi, linux-tegra, linux-kernel
Under high system load, QSPI interrupts can be delayed or blocked on the
target CPU, causing wait_for_completion_timeout() to report failure even
though the hardware successfully completed the transfer.
When a timeout occurs, check the QSPI_RDY bit in QSPI_TRANS_STATUS to
determine if the hardware actually completed the transfer. If so, manually
invoke the completion handler to process the transfer successfully instead
of failing it.
This distinguishes lost/delayed interrupts from real hardware timeouts,
preventing unnecessary failures of transfers that completed successfully.
Signed-off-by: Vishwaroop A <va@nvidia.com>
---
drivers/spi/spi-tegra210-quad.c | 100 +++++++++++++++++++++++++-------
1 file changed, 80 insertions(+), 20 deletions(-)
diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index 69defb4ffe49..cdc3cb7c01f9 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -1048,6 +1048,49 @@ static void tegra_qspi_transfer_end(struct spi_device *spi)
tegra_qspi_writel(tqspi, tqspi->def_command1_reg, QSPI_COMMAND1);
}
+static irqreturn_t handle_cpu_based_xfer(struct tegra_qspi *tqspi);
+static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi);
+
+/**
+ * tegra_qspi_handle_timeout - Handle transfer timeout with hardware check
+ * @tqspi: QSPI controller instance
+ *
+ * When a timeout occurs but hardware has completed the transfer (interrupt
+ * was lost or delayed), manually trigger transfer completion processing.
+ * This avoids failing transfers that actually succeeded.
+ *
+ * Returns: 0 if transfer was completed, -ETIMEDOUT if real timeout
+ */
+static int tegra_qspi_handle_timeout(struct tegra_qspi *tqspi)
+{
+ irqreturn_t ret;
+ u32 status;
+
+ /* Check if hardware actually completed the transfer */
+ status = tegra_qspi_readl(tqspi, QSPI_TRANS_STATUS);
+ if (!(status & QSPI_RDY))
+ return -ETIMEDOUT;
+
+ /*
+ * Hardware completed but interrupt was lost/delayed. Manually
+ * process the completion by calling the appropriate handler.
+ */
+ dev_warn_ratelimited(tqspi->dev,
+ "QSPI interrupt timeout, but transfer complete\n");
+
+ /* Clear the transfer status */
+ status = tegra_qspi_readl(tqspi, QSPI_TRANS_STATUS);
+ tegra_qspi_writel(tqspi, status, QSPI_TRANS_STATUS);
+
+ /* Manually trigger completion handler */
+ if (!tqspi->is_curr_dma_xfer)
+ ret = handle_cpu_based_xfer(tqspi);
+ else
+ ret = handle_dma_based_xfer(tqspi);
+
+ return (ret == IRQ_HANDLED) ? 0 : -EIO;
+}
+
static u32 tegra_qspi_cmd_config(bool is_ddr, u8 bus_width, u8 len)
{
u32 cmd_config = 0;
@@ -1177,20 +1220,28 @@ static int tegra_qspi_combined_seq_xfer(struct tegra_qspi *tqspi,
QSPI_DMA_TIMEOUT);
if (WARN_ON_ONCE(ret == 0)) {
- dev_err_ratelimited(tqspi->dev,
- "QSPI Transfer failed with timeout\n");
-
- /* Abort transfer by resetting pio/dma bit */
- if (tqspi->is_curr_dma_xfer)
- tegra_qspi_dma_stop(tqspi);
- else
- tegra_qspi_pio_stop(tqspi);
-
- /* Reset controller if timeout happens */
- tegra_qspi_reset(tqspi);
-
- ret = -EIO;
- goto exit;
+ /*
+ * Check if hardware completed the transfer
+ * even though interrupt was lost or delayed.
+ * If so, process the completion and continue.
+ */
+ ret = tegra_qspi_handle_timeout(tqspi);
+ if (ret < 0) {
+ /* Real timeout - clean up and fail */
+ dev_err(tqspi->dev, "transfer timeout\n");
+
+ /* Abort transfer by resetting pio/dma bit */
+ if (tqspi->is_curr_dma_xfer)
+ tegra_qspi_dma_stop(tqspi);
+ else
+ tegra_qspi_pio_stop(tqspi);
+
+ /* Reset controller if timeout happens */
+ tegra_qspi_reset(tqspi);
+
+ ret = -EIO;
+ goto exit;
+ }
}
if (tqspi->tx_status || tqspi->rx_status) {
@@ -1281,14 +1332,23 @@ static int tegra_qspi_non_combined_seq_xfer(struct tegra_qspi *tqspi,
ret = wait_for_completion_timeout(&tqspi->xfer_completion,
QSPI_DMA_TIMEOUT);
if (WARN_ON(ret == 0)) {
- dev_err(tqspi->dev, "transfer timeout\n");
+ /*
+ * Check if hardware completed the transfer even though
+ * interrupt was lost or delayed. If so, process the
+ * completion and continue.
+ */
+ ret = tegra_qspi_handle_timeout(tqspi);
+ if (ret < 0) {
+ /* Real timeout - clean up and fail */
+ dev_err(tqspi->dev, "transfer timeout\n");
- if (tqspi->is_curr_dma_xfer)
- tegra_qspi_dma_stop(tqspi);
+ if (tqspi->is_curr_dma_xfer)
+ tegra_qspi_dma_stop(tqspi);
- tegra_qspi_handle_error(tqspi);
- ret = -EIO;
- goto complete_xfer;
+ tegra_qspi_handle_error(tqspi);
+ ret = -EIO;
+ goto complete_xfer;
+ }
}
if (tqspi->tx_status || tqspi->rx_status) {
--
2.17.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v5 2/3] spi: tegra210-quad: Refactor error handling into helper functions
2025-10-28 15:57 ` [PATCH v5 2/3] spi: tegra210-quad: Refactor error handling into helper functions Vishwaroop A
@ 2025-11-03 14:15 ` Thierry Reding
0 siblings, 0 replies; 9+ messages in thread
From: Thierry Reding @ 2025-11-03 14:15 UTC (permalink / raw)
To: Vishwaroop A
Cc: Mark Brown, Jonathan Hunter, Sowjanya Komatineni, Laxman Dewangan,
smangipudi, kyarlagadda, linux-spi, linux-tegra, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 737 bytes --]
On Tue, Oct 28, 2025 at 03:57:02PM +0000, Vishwaroop A wrote:
> Extract common cleanup code into dedicated helper functions to simplify
> the code and improve readability. This refactoring includes:
>
> - tegra_qspi_reset(): Device reset and interrupt cleanup
> - tegra_qspi_dma_stop(): DMA termination and disable
> - tegra_qspi_pio_stop(): PIO mode disable
>
> No functional changes. This is purely a code reorganization to prepare
> for improved timeout handling in subsequent patches.
>
> Signed-off-by: Vishwaroop A <va@nvidia.com>
> ---
> drivers/spi/spi-tegra210-quad.c | 84 +++++++++++++++++----------------
> 1 file changed, 44 insertions(+), 40 deletions(-)
Acked-by: Thierry Reding <treding@nvidia.com>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v5 3/3] spi: tegra210-quad: Check hardware status on timeout
2025-10-28 15:57 ` [PATCH v5 3/3] spi: tegra210-quad: Check hardware status on timeout Vishwaroop A
@ 2025-11-03 14:16 ` Thierry Reding
0 siblings, 0 replies; 9+ messages in thread
From: Thierry Reding @ 2025-11-03 14:16 UTC (permalink / raw)
To: Vishwaroop A
Cc: Mark Brown, Jonathan Hunter, Sowjanya Komatineni, Laxman Dewangan,
smangipudi, kyarlagadda, linux-spi, linux-tegra, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 918 bytes --]
On Tue, Oct 28, 2025 at 03:57:03PM +0000, Vishwaroop A wrote:
> Under high system load, QSPI interrupts can be delayed or blocked on the
> target CPU, causing wait_for_completion_timeout() to report failure even
> though the hardware successfully completed the transfer.
>
> When a timeout occurs, check the QSPI_RDY bit in QSPI_TRANS_STATUS to
> determine if the hardware actually completed the transfer. If so, manually
> invoke the completion handler to process the transfer successfully instead
> of failing it.
>
> This distinguishes lost/delayed interrupts from real hardware timeouts,
> preventing unnecessary failures of transfers that completed successfully.
>
> Signed-off-by: Vishwaroop A <va@nvidia.com>
> ---
> drivers/spi/spi-tegra210-quad.c | 100 +++++++++++++++++++++++++-------
> 1 file changed, 80 insertions(+), 20 deletions(-)
Acked-by: Thierry Reding <treding@nvidia.com>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v5 0/3] spi: tegra210-quad: Improve timeout handling under high system load
2025-10-28 15:57 [PATCH v5 0/3] spi: tegra210-quad: Improve timeout handling under high system load Vishwaroop A
` (2 preceding siblings ...)
2025-10-28 15:57 ` [PATCH v5 3/3] spi: tegra210-quad: Check hardware status on timeout Vishwaroop A
@ 2025-11-06 10:06 ` Jon Hunter
2025-11-06 11:34 ` Mark Brown
4 siblings, 0 replies; 9+ messages in thread
From: Jon Hunter @ 2025-11-06 10:06 UTC (permalink / raw)
To: Vishwaroop A, Mark Brown, Thierry Reding, Sowjanya Komatineni,
Laxman Dewangan, smangipudi, kyarlagadda
Cc: linux-spi, linux-tegra, linux-kernel
On 28/10/2025 15:57, Vishwaroop A wrote:
> Hi,
>
> This patch series addresses timeout handling issues in the Tegra QSPI driver
> that occur under high system load conditions. We've observed that when CPUs
> are saturated (due to error injection, RAS firmware activity, or general CPU
> contention), QSPI interrupt handlers can be delayed, causing spurious transfer
> failures even though the hardware completed the operation successfully.
>
> Patch 1 fixes a stale pointer issue by ensuring curr_xfer is cleared on timeout
> and checked when the IRQ thread finally runs. It also ensures interrupts are
> properly cleared on failure paths.
>
> Patch 2 refactors the timeout cleanup code into dedicated helper functions
> (tegra_qspi_reset, tegra_qspi_dma_stop, tegra_qspi_pio_stop) to improve code
> readability and maintainability. This is purely a code reorganization with no
> functional changes.
>
> Patch 3 adds hardware status checking on timeout. Before failing a transfer,
> the driver now reads QSPI_TRANS_STATUS to verify if the hardware actually
> completed the operation. If so, it manually invokes the completion handler
> instead of failing the transfer. This distinguishes genuine hardware timeouts
> from delayed/lost interrupts.
>
> These changes have been tested in production environments under various high
> load scenarios including RAS testing and CPU saturation workloads.
For the series ...
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Thanks
Jon
--
nvpublic
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v5 0/3] spi: tegra210-quad: Improve timeout handling under high system load
2025-10-28 15:57 [PATCH v5 0/3] spi: tegra210-quad: Improve timeout handling under high system load Vishwaroop A
` (3 preceding siblings ...)
2025-11-06 10:06 ` [PATCH v5 0/3] spi: tegra210-quad: Improve timeout handling under high system load Jon Hunter
@ 2025-11-06 11:34 ` Mark Brown
4 siblings, 0 replies; 9+ messages in thread
From: Mark Brown @ 2025-11-06 11:34 UTC (permalink / raw)
To: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, smangipudi, kyarlagadda, Vishwaroop A
Cc: linux-spi, linux-tegra, linux-kernel
On Tue, 28 Oct 2025 15:57:00 +0000, Vishwaroop A wrote:
> This patch series addresses timeout handling issues in the Tegra QSPI driver
> that occur under high system load conditions. We've observed that when CPUs
> are saturated (due to error injection, RAS firmware activity, or general CPU
> contention), QSPI interrupt handlers can be delayed, causing spurious transfer
> failures even though the hardware completed the operation successfully.
>
> Patch 1 fixes a stale pointer issue by ensuring curr_xfer is cleared on timeout
> and checked when the IRQ thread finally runs. It also ensures interrupts are
> properly cleared on failure paths.
>
> [...]
Applied to
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-next
Thanks!
[1/3] spi: tegra210-quad: Fix timeout handling
commit: b4e002d8a7cee3b1d70efad0e222567f92a73000
[2/3] spi: tegra210-quad: Refactor error handling into helper functions
commit: 6022eacdda8b0b06a2e1d4122e5268099b62ff5d
[3/3] spi: tegra210-quad: Check hardware status on timeout
commit: 380fd29d57abe6679d87ec56babe65ddc5873a37
All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.
You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.
If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.
Please add any relevant lists and maintainers to the CCs when replying
to this mail.
Thanks,
Mark
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v5 1/3] spi: tegra210-quad: Fix timeout handling
2025-10-28 15:57 ` [PATCH v5 1/3] spi: tegra210-quad: Fix timeout handling Vishwaroop A
@ 2025-11-12 14:39 ` Breno Leitao
0 siblings, 0 replies; 9+ messages in thread
From: Breno Leitao @ 2025-11-12 14:39 UTC (permalink / raw)
To: Vishwaroop A
Cc: Mark Brown, Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
Laxman Dewangan, smangipudi, kyarlagadda, linux-spi, linux-tegra,
linux-kernel, Thierry Reding
On Tue, Oct 28, 2025 at 03:57:01PM +0000, Vishwaroop A wrote:
> When the CPU that the QSPI interrupt handler runs on (typically CPU 0)
> is excessively busy, it can lead to rare cases of the IRQ thread not
> running before the transfer timeout is reached.
>
> While handling the timeouts, any pending transfers are cleaned up and
> the message that they correspond to is marked as failed, which leaves
> the curr_xfer field pointing at stale memory.
I saw something similar on one of my hosts, and I debugged it, and it
seemed similar to what you are fixing in here.
Just sharing what I got while debugging this, in case this is useful:
UBSAN: shift-out-of-bounds in drivers/spi/spi-tegra210-quad.c:385:25
shift exponent 198 is too large for 32-bit type 'u32' (aka 'unsigned int')
CPU: 0 UID: 0 PID: 883 Comm: irq/43-NVDA1513 Tainted: G W E N 6.16.1 #1 PREEMPT(none)
Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
Hardware name: Quanta JAVA ISLAND PVT 29F0EMAZ049/Java Island, BIOS F0EJ3A14 09/02/2025
Call trace:
show_stack+0x1c/0x30 (C)
dump_stack_lvl+0x38/0xb0
dump_stack+0x14/0x1c
__ubsan_handle_shift_out_of_bounds+0x24c/0x2c0
tegra_qspi_isr_thread+0x1cc8/0x1e60 [spi_tegra210_quad]
irq_thread_fn+0x80/0x108
irq_thread+0x158/0x258
kthread+0x3fc/0x530
ret_from_fork+0x10/0x20
---[ end trace ]---
------------[ cut here ]------------
UBSAN: shift-out-of-bounds in drivers/spi/spi-tegra210-quad.c:397:20
shift exponent 32 is too large for 32-bit type 'u32' (aka 'unsigned int')
CPU: 0 UID: 0 PID: 883 Comm: irq/43-NVDA1513 Tainted: G W E N 6.16.1 #1 PREEMPT(none)
Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
Hardware name: Quanta JAVA ISLAND PVT 29F0EMAZ049/Java Island, BIOS F0EJ3A14 09/02/2025
Call trace:
show_stack+0x1c/0x30 (C)
dump_stack_lvl+0x38/0xb0
dump_stack+0x14/0x1c
__ubsan_handle_shift_out_of_bounds+0x24c/0x2c0
tegra_qspi_isr_thread+0xc90/0x1e60 [spi_tegra210_quad]
irq_thread_fn+0x80/0x108
irq_thread+0x158/0x258
kthread+0x3fc/0x530
ret_from_fork+0x10/0x20
---[ end trace ]---
and then KASAN and a kernel crash.
BUG: KASAN: vmalloc-out-of-bounds in tegra_qspi_isr_thread+0xce8/0x1e60 [spi_tegra210_quad]
Write of size 1 at addr ffff8000db950000 by task irq/43-NVDA1513/883
CPU: 0 UID: 0 PID: 883 Comm: irq/43-NVDA1513 Tainted: G W E N 6.16.1-0_fbk0_debug_rc20_0_g977c20cb5846 #1 PREEMPT(none)
Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
Hardware name: Quanta JAVA ISLAND PVT 29F0EMAZ049/Java Island, BIOS F0EJ3A14 09/02/2025
Call trace:
show_stack+0x1c/0x30 (C)
dump_stack_lvl+0x38/0xb0
print_report+0x164/0x6d8
kasan_report+0xcc/0x128
__asan_report_store1_noabort+0x1c/0x28
tegra_qspi_isr_thread+0xce8/0x1e60 [spi_tegra210_quad]
irq_thread_fn+0x80/0x108
irq_thread+0x158/0x258
kthread+0x3fc/0x530
ret_from_fork+0x10/0x20
The buggy address belongs to a 1-page vmalloc region starting at 0xffff8000db940000 allocated at copy_process+0x258/0x28d8
Memory state around the buggy address:
ffff8000db94ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff8000db94ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff8000db950000: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
^
ffff8000db950080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
ffff8000db950100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
==================================================================
Unable to handle kernel paging request at virtual address ffff8000db950000
KASAN: probably user-memory-access in range [0x00000006dca80000-0x00000006dca80007]
Mem abort info:
ESR = 0x0000000096000047
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x07: level 3 translation fault
Data abort info:
ISV = 0, ISS = 0x00000047, ISS2 = 0x00000000
CM = 0, WnR = 1, TnD = 0, TagAccess = 0
GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
pstate: 234010c9 (nzCv daIF +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
pc : tegra_qspi_isr_thread+0xcc0/0x1e60 [spi_tegra210_quad]
lr : tegra_qspi_isr_thread+0xce8/0x1e60 [spi_tegra210_quad]
x26: 0000000000000001 x25: 0000000000000028 x24: ffff8000db94ffff
x23: ffff0000d16b0918 x22: 0000000000000040 x21: 000000000000003a
x20: ffff8000db94ffff x19: ffff0000d16b08c0 x18: 0000000000000001
x17: 3d3d3d3d3d3b2d2c x16: 3d3d3d3d3d3b2d2c x15: 0000000000000001
x14: 1ffff00010bfce80 x13: 0000000000000000 x12: 0000000000000000
x11: ffff700010bfce81 x10: 0000000000000019 x9 : 0000000000000000
x8 : 0000000000000000 x7 : 0000000000000001 x6 : 0000000000000001
x5 : ffff8000b49cf8e0 x4 : ffff800084e7b140 x3 : ffff8000801bbd8c
x2 : 0000000000000001 x1 : 0000000000000008 x0 : 0000000000000001
Call trace:
tegra_qspi_isr_thread+0xcc0/0x1e60 [spi_tegra210_quad] (P)
irq_thread_fn+0x80/0x108
irq_thread+0x158/0x258
kthread+0x3fc/0x530
ret_from_fork+0x10/0x20
Code: 540001aa 1ad92768 f85f83aa 6b1a039f (383a6b08)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops: Fatal exception
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x2000,000003c0,62534ca1,5467fea7
Memory Limit: none
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2025-11-12 14:39 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-28 15:57 [PATCH v5 0/3] spi: tegra210-quad: Improve timeout handling under high system load Vishwaroop A
2025-10-28 15:57 ` [PATCH v5 1/3] spi: tegra210-quad: Fix timeout handling Vishwaroop A
2025-11-12 14:39 ` Breno Leitao
2025-10-28 15:57 ` [PATCH v5 2/3] spi: tegra210-quad: Refactor error handling into helper functions Vishwaroop A
2025-11-03 14:15 ` Thierry Reding
2025-10-28 15:57 ` [PATCH v5 3/3] spi: tegra210-quad: Check hardware status on timeout Vishwaroop A
2025-11-03 14:16 ` Thierry Reding
2025-11-06 10:06 ` [PATCH v5 0/3] spi: tegra210-quad: Improve timeout handling under high system load Jon Hunter
2025-11-06 11:34 ` Mark Brown
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).