[PATCH v5 0/3] spi: tegra210-quad: Improve timeout handling under high system load

Vishwaroop A posted 3 patches 1 month, 3 weeks ago
drivers/spi/spi-tegra210-quad.c | 174 +++++++++++++++++++++++---------
1 file changed, 128 insertions(+), 46 deletions(-)
Posted by Vishwaroop A 1 month, 3 weeks ago
Hi,

This patch series addresses timeout handling issues in the Tegra QSPI driver
that occur under high system load conditions. We've observed that when CPUs
are saturated (due to error injection, RAS firmware activity, or general CPU
contention), QSPI interrupt handlers can be delayed, causing spurious transfer
failures even though the hardware completed the operation successfully.

Patch 1 fixes a stale pointer issue by ensuring curr_xfer is cleared on timeout
and checked when the IRQ thread finally runs. It also ensures interrupts are
properly cleared on failure paths.
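
For readers who want the idea behind patch 1 in isolation, here is a minimal
userspace sketch. It is not the driver code: struct qspi, struct xfer,
qspi_timeout() and qspi_irq_thread() are hypothetical stand-ins, and all
locking is omitted because the model is single-threaded. It only shows the
pattern of clearing curr_xfer on timeout and checking it when the delayed
IRQ thread finally runs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's structures (illustration only). */
struct xfer {
	int len;
	bool done;
};

struct qspi {
	struct xfer *curr_xfer;   /* transfer the IRQ thread should complete */
	unsigned int irq_status;  /* models a pending interrupt status word */
};

/* Timeout path: clear the pending status and drop the transfer pointer. */
static void qspi_timeout(struct qspi *q)
{
	q->irq_status = 0;
	q->curr_xfer = NULL;
}

/*
 * IRQ thread: returns true if it completed a transfer, false if there was
 * nothing left to do because the transfer had already timed out.
 */
static bool qspi_irq_thread(struct qspi *q)
{
	struct xfer *t = q->curr_xfer;

	if (!t)
		return false;   /* late interrupt for a timed-out transfer */

	t->done = true;
	q->curr_xfer = NULL;
	return true;
}
```

With this shape, a late interrupt that arrives after qspi_timeout() sees a
NULL pointer and bails out instead of touching a transfer that has already
been failed.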

Patch 2 refactors the timeout cleanup code into dedicated helper functions
(tegra_qspi_reset, tegra_qspi_dma_stop, tegra_qspi_pio_stop) to improve code
readability and maintainability. This is purely a code reorganization with no
functional changes.
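
As a rough picture of what such a refactor can look like, the sketch below
models a timeout path that delegates to small helpers. Everything here is an
assumption made for illustration: the model_* functions merely count calls and
do not reflect the bodies of tegra_qspi_reset, tegra_qspi_dma_stop or
tegra_qspi_pio_stop, and the stop-then-reset ordering and the is_dma flag are
guesses rather than something stated in this cover letter.

```c
#include <stdbool.h>

/* Hypothetical controller model; the counters stand in for register writes. */
struct qspi_ctl {
	bool is_dma;    /* assumed flag: current transfer uses DMA */
	int dma_stops;
	int pio_stops;
	int resets;
};

/* Placeholder helpers: each would hold one piece of the cleanup logic. */
static void model_dma_stop(struct qspi_ctl *q) { q->dma_stops++; }
static void model_pio_stop(struct qspi_ctl *q) { q->pio_stops++; }
static void model_reset(struct qspi_ctl *q)    { q->resets++; }

/* Timeout cleanup expressed through the helpers instead of inline code. */
static void model_handle_timeout(struct qspi_ctl *q)
{
	if (q->is_dma)
		model_dma_stop(q);
	else
		model_pio_stop(q);

	model_reset(q);
}
```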

Patch 3 adds hardware status checking on timeout. Before failing a transfer,
the driver now reads QSPI_TRANS_STATUS to verify if the hardware actually
completed the operation. If so, it manually invokes the completion handler
instead of failing the transfer. This distinguishes genuine hardware timeouts
from delayed/lost interrupts.
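
The decision described for patch 3 can be sketched in userspace as follows.
This is a simplified model, not the driver implementation: struct qspi_model,
finish_wait(), complete_transfer() and the enum are hypothetical names, and
the TRANS_STATUS_RDY bit position is an assumption chosen for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define TRANS_STATUS_RDY (1u << 30)   /* assumed "transfer ready" bit */

enum wait_result { XFER_OK, XFER_RECOVERED, XFER_TIMED_OUT };

struct qspi_model {
	uint32_t trans_status;   /* models the QSPI_TRANS_STATUS register */
	bool irq_ran;            /* true if the interrupt path completed in time */
	int handler_calls;       /* how often completion was invoked manually */
};

/* Stand-in for the completion handler the driver would invoke by hand. */
static void complete_transfer(struct qspi_model *q)
{
	q->handler_calls++;
	q->trans_status &= ~TRANS_STATUS_RDY;   /* acknowledge the ready bit */
}

/* Called once the wait for completion has ended, in time or not. */
static enum wait_result finish_wait(struct qspi_model *q)
{
	if (q->irq_ran)
		return XFER_OK;               /* normal interrupt-driven path */

	if (q->trans_status & TRANS_STATUS_RDY) {
		/* Hardware finished; only the interrupt was delayed or lost. */
		complete_transfer(q);
		return XFER_RECOVERED;
	}

	return XFER_TIMED_OUT;                /* genuine hardware timeout */
}
```

Only the XFER_TIMED_OUT case would go on to the reset and error path; the
recovered case is treated as a successful transfer.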

These changes have been tested in production environments under various high
load scenarios including RAS testing and CPU saturation workloads.

Changes in v5:
- No code changes, rebased to resolve conflicts

Changes in v4:
- Removed Change-Id from commit messages

Changes in v3:
- Added missing tqspi->curr_xfer = NULL assignment in handle_cpu_based_xfer()
- Split the previous patch 2/2 into two separate patches (now 2/3 and 3/3)
- Patch 2/3: New patch - refactoring only, no functional changes
- Patch 3/3: Functional changes to add hardware timeout checking

Changes in v2:
- Fixed indentation in patch 1/2: The "Reset controller if timeout happens"
  block now has correct indentation (inside the WARN_ON_ONCE block)
- No functional changes

Thierry Reding (1):
  spi: tegra210-quad: Fix timeout handling

Vishwaroop A (2):
  spi: tegra210-quad: Refactor error handling into helper functions
  spi: tegra210-quad: Check hardware status on timeout

 drivers/spi/spi-tegra210-quad.c | 174 +++++++++++++++++++++++---------
 1 file changed, 128 insertions(+), 46 deletions(-)

-- 
2.17.1
Re: [PATCH v5 0/3] spi: tegra210-quad: Improve timeout handling under high system load
Posted by Mark Brown 1 month, 1 week ago
On Tue, 28 Oct 2025 15:57:00 +0000, Vishwaroop A wrote:
> This patch series addresses timeout handling issues in the Tegra QSPI driver
> that occur under high system load conditions. We've observed that when CPUs
> are saturated (due to error injection, RAS firmware activity, or general CPU
> contention), QSPI interrupt handlers can be delayed, causing spurious transfer
> failures even though the hardware completed the operation successfully.
> 
> Patch 1 fixes a stale pointer issue by ensuring curr_xfer is cleared on timeout
> and checked when the IRQ thread finally runs. It also ensures interrupts are
> properly cleared on failure paths.
> 
> [...]

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-next

Thanks!

[1/3] spi: tegra210-quad: Fix timeout handling
      commit: b4e002d8a7cee3b1d70efad0e222567f92a73000
[2/3] spi: tegra210-quad: Refactor error handling into helper functions
      commit: 6022eacdda8b0b06a2e1d4122e5268099b62ff5d
[3/3] spi: tegra210-quad: Check hardware status on timeout
      commit: 380fd29d57abe6679d87ec56babe65ddc5873a37

All being well, this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix). However, if
problems are discovered, the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree. Please engage with people reporting problems and
send follow-up patches addressing any issues that are reported, if needed.

If any updates are required or you are submitting further changes, they
should be sent as incremental updates against current git; existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark
Re: [PATCH v5 0/3] spi: tegra210-quad: Improve timeout handling under high system load
Posted by Jon Hunter 1 month, 1 week ago
On 28/10/2025 15:57, Vishwaroop A wrote:
> Hi,
> 
> This patch series addresses timeout handling issues in the Tegra QSPI driver
> that occur under high system load conditions. We've observed that when CPUs
> are saturated (due to error injection, RAS firmware activity, or general CPU
> contention), QSPI interrupt handlers can be delayed, causing spurious transfer
> failures even though the hardware completed the operation successfully.
> 
> Patch 1 fixes a stale pointer issue by ensuring curr_xfer is cleared on timeout
> and checked when the IRQ thread finally runs. It also ensures interrupts are
> properly cleared on failure paths.
> 
> Patch 2 refactors the timeout cleanup code into dedicated helper functions
> (tegra_qspi_reset, tegra_qspi_dma_stop, tegra_qspi_pio_stop) to improve code
> readability and maintainability. This is purely a code reorganization with no
> functional changes.
> 
> Patch 3 adds hardware status checking on timeout. Before failing a transfer,
> the driver now reads QSPI_TRANS_STATUS to verify if the hardware actually
> completed the operation. If so, it manually invokes the completion handler
> instead of failing the transfer. This distinguishes genuine hardware timeouts
> from delayed/lost interrupts.
> 
> These changes have been tested in production environments under various high
> load scenarios including RAS testing and CPU saturation workloads.


For the series ...

Tested-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>

Thanks
Jon

-- 
nvpublic