[v5] spi: tegra210-quad: Improve timeout handling under high system load

[PATCH v5 1/3] spi: tegra210-quad: Fix timeout handling

Posted by Vishwaroop A 3 months, 1 week ago

When the CPU that the QSPI interrupt handler runs on (typically CPU 0)
is excessively busy, it can lead to rare cases of the IRQ thread not
running before the transfer timeout is reached.

While handling the timeouts, any pending transfers are cleaned up and
the message that they correspond to is marked as failed, which leaves
the curr_xfer field pointing at stale memory.

To avoid this, clear curr_xfer to NULL upon timeout and check for this
condition when the IRQ thread is finally run.
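
As an illustration only, the pattern described above can be modeled in a few lines of user-space C. This is a minimal sketch, not driver code: every `sketch_*` name is hypothetical, locking is deliberately left out, and the functions only model the order of events (submit, timeout, late handler).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's types; not the real definitions. */
struct sketch_xfer {
	unsigned int len;
};

struct sketch_qspi {
	struct sketch_xfer *curr_xfer; /* NULL means "no transfer in flight" */
	unsigned int handled;          /* interrupts that were actually processed */
};

enum sketch_irqreturn { SKETCH_IRQ_NONE, SKETCH_IRQ_HANDLED };

static void sketch_init(struct sketch_qspi *q)
{
	q->curr_xfer = NULL;
	q->handled = 0;
}

/* Submission path: publish the transfer the handler will work on. */
static void sketch_start(struct sketch_qspi *q, struct sketch_xfer *x)
{
	assert(q->curr_xfer == NULL); /* one transfer at a time in this model */
	q->curr_xfer = x;
}

/* Timeout path: the transfer is abandoned, so drop the pointer to it. */
static void sketch_timeout(struct sketch_qspi *q)
{
	q->curr_xfer = NULL;
}

/*
 * Handler that may run late. If the timeout path has already cleaned up,
 * there is nothing valid to touch, so bail out before any processing.
 */
static enum sketch_irqreturn sketch_irq_thread(struct sketch_qspi *q)
{
	if (!q->curr_xfer)
		return SKETCH_IRQ_NONE;

	q->handled++;
	q->curr_xfer = NULL; /* transfer finished in this model */
	return SKETCH_IRQ_HANDLED;
}
```

Calling sketch_start(), then sketch_timeout(), then sketch_irq_thread() returns SKETCH_IRQ_NONE and leaves the abandoned transfer untouched. Without the intervening timeout, the handler processes the transfer and returns SKETCH_IRQ_HANDLED.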

While at it, also mask and clear interrupts when the device reset fails
so that new interrupts can be delivered.

A better, more involved, fix would move the interrupt clearing into a
hard IRQ handler. Ideally we would also want to signal that the IRQ
thread no longer needs to be run after the timeout is hit to avoid the
extra check for a valid transfer.

Fixes: 921fc1838fb0 ("spi: tegra210-quad: Add support for Tegra210 QSPI controller")
Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Vishwaroop A <va@nvidia.com>
---
 drivers/spi/spi-tegra210-quad.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index 3be7499db21e..d9ca3d7b082f 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -1024,8 +1024,10 @@ static void tegra_qspi_handle_error(struct tegra_qspi *tqspi)
 	dev_err(tqspi->dev, "error in transfer, fifo status 0x%08x\n", tqspi->status_reg);
 	tegra_qspi_dump_regs(tqspi);
 	tegra_qspi_flush_fifos(tqspi, true);
-	if (device_reset(tqspi->dev) < 0)
+	if (device_reset(tqspi->dev) < 0) {
 		dev_warn_once(tqspi->dev, "device reset failed\n");
+		tegra_qspi_mask_clear_irq(tqspi);
+	}
 }
 
 static void tegra_qspi_transfer_end(struct spi_device *spi)
@@ -1176,9 +1178,11 @@ static int tegra_qspi_combined_seq_xfer(struct tegra_qspi *tqspi,
 				}
 
 				/* Reset controller if timeout happens */
-				if (device_reset(tqspi->dev) < 0)
+				if (device_reset(tqspi->dev) < 0) {
 					dev_warn_once(tqspi->dev,
 						      "device reset failed\n");
+					tegra_qspi_mask_clear_irq(tqspi);
+				}
 				ret = -EIO;
 				goto exit;
 			}
@@ -1200,11 +1204,13 @@ static int tegra_qspi_combined_seq_xfer(struct tegra_qspi *tqspi,
 			tegra_qspi_transfer_end(spi);
 			spi_transfer_delay_exec(xfer);
 		}
+		tqspi->curr_xfer = NULL;
 		transfer_phase++;
 	}
 	ret = 0;
 
 exit:
+	tqspi->curr_xfer = NULL;
 	msg->status = ret;
 
 	return ret;
@@ -1290,6 +1296,8 @@ static int tegra_qspi_non_combined_seq_xfer(struct tegra_qspi *tqspi,
 		msg->actual_length += xfer->len + dummy_bytes;
 
 complete_xfer:
+		tqspi->curr_xfer = NULL;
+
 		if (ret < 0) {
 			tegra_qspi_transfer_end(spi);
 			spi_transfer_delay_exec(xfer);
@@ -1395,6 +1403,7 @@ static irqreturn_t handle_cpu_based_xfer(struct tegra_qspi *tqspi)
 	tegra_qspi_calculate_curr_xfer_param(tqspi, t);
 	tegra_qspi_start_cpu_based_transfer(tqspi, t);
 exit:
+	tqspi->curr_xfer = NULL;
 	spin_unlock_irqrestore(&tqspi->lock, flags);
 	return IRQ_HANDLED;
 }
@@ -1480,6 +1489,15 @@ static irqreturn_t tegra_qspi_isr_thread(int irq, void *context_data)
 {
 	struct tegra_qspi *tqspi = context_data;
 
+	/*
+	 * Occasionally the IRQ thread takes a long time to wake up (usually
+	 * when the CPU that it's running on is excessively busy) and we have
+	 * already reached the timeout before and cleaned up the timed out
+	 * transfer. Avoid any processing in that case and bail out early.
+	 */
+	if (!tqspi->curr_xfer)
+		return IRQ_NONE;
+
 	tqspi->status_reg = tegra_qspi_readl(tqspi, QSPI_FIFO_STATUS);
 
 	if (tqspi->cur_direction & DATA_DIR_TX)
-- 
2.17.1

Re: [PATCH v5 1/3] spi: tegra210-quad: Fix timeout handling

Posted by Breno Leitao 2 months, 3 weeks ago

On Tue, Oct 28, 2025 at 03:57:01PM +0000, Vishwaroop A wrote:
> When the CPU that the QSPI interrupt handler runs on (typically CPU 0)
> is excessively busy, it can lead to rare cases of the IRQ thread not
> running before the transfer timeout is reached.
> 
> While handling the timeouts, any pending transfers are cleaned up and
> the message that they correspond to is marked as failed, which leaves
> the curr_xfer field pointing at stale memory.

I saw something similar on one of my hosts. After debugging it, it looks
like the same problem you are fixing here.

Just sharing what I got while debugging this, in case this is useful:

	UBSAN: shift-out-of-bounds in drivers/spi/spi-tegra210-quad.c:385:25
	shift exponent 198 is too large for 32-bit type 'u32' (aka 'unsigned int')
	CPU: 0 UID: 0 PID: 883 Comm: irq/43-NVDA1513 Tainted: G        W   E    N  6.16.1 #1 PREEMPT(none)
	Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
	Hardware name: Quanta JAVA ISLAND PVT 29F0EMAZ049/Java Island, BIOS F0EJ3A14 09/02/2025
	Call trace:
	show_stack+0x1c/0x30 (C)
	dump_stack_lvl+0x38/0xb0
	dump_stack+0x14/0x1c
	__ubsan_handle_shift_out_of_bounds+0x24c/0x2c0
	tegra_qspi_isr_thread+0x1cc8/0x1e60 [spi_tegra210_quad]
	irq_thread_fn+0x80/0x108
	irq_thread+0x158/0x258
	kthread+0x3fc/0x530
	ret_from_fork+0x10/0x20
	---[ end trace ]---

	------------[ cut here ]------------
	UBSAN: shift-out-of-bounds in drivers/spi/spi-tegra210-quad.c:397:20
	shift exponent 32 is too large for 32-bit type 'u32' (aka 'unsigned int')
	CPU: 0 UID: 0 PID: 883 Comm: irq/43-NVDA1513 Tainted: G        W   E    N  6.16.1 #1 PREEMPT(none)
	Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
	Hardware name: Quanta JAVA ISLAND PVT 29F0EMAZ049/Java Island, BIOS F0EJ3A14 09/02/2025
	Call trace:
	show_stack+0x1c/0x30 (C)
	dump_stack_lvl+0x38/0xb0
	dump_stack+0x14/0x1c
	__ubsan_handle_shift_out_of_bounds+0x24c/0x2c0
	tegra_qspi_isr_thread+0xc90/0x1e60 [spi_tegra210_quad]
	irq_thread_fn+0x80/0x108
	irq_thread+0x158/0x258
	kthread+0x3fc/0x530
	ret_from_fork+0x10/0x20

	---[ end trace ]---

followed by a KASAN report and a kernel crash:

	BUG: KASAN: vmalloc-out-of-bounds in tegra_qspi_isr_thread+0xce8/0x1e60 [spi_tegra210_quad]
	Write of size 1 at addr ffff8000db950000 by task irq/43-NVDA1513/883

	CPU: 0 UID: 0 PID: 883 Comm: irq/43-NVDA1513 Tainted: G        W   E    N  6.16.1-0_fbk0_debug_rc20_0_g977c20cb5846 #1 PREEMPT(none)
	Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
	Hardware name: Quanta JAVA ISLAND PVT 29F0EMAZ049/Java Island, BIOS F0EJ3A14 09/02/2025
	Call trace:
	show_stack+0x1c/0x30 (C)
	dump_stack_lvl+0x38/0xb0
	print_report+0x164/0x6d8
	kasan_report+0xcc/0x128
	__asan_report_store1_noabort+0x1c/0x28
	tegra_qspi_isr_thread+0xce8/0x1e60 [spi_tegra210_quad]
	irq_thread_fn+0x80/0x108
	irq_thread+0x158/0x258
	kthread+0x3fc/0x530
	ret_from_fork+0x10/0x20

	The buggy address belongs to a 1-page vmalloc region starting at 0xffff8000db940000 allocated at copy_process+0x258/0x28d8
	Memory state around the buggy address:
	ffff8000db94ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
	ffff8000db94ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
	>ffff8000db950000: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
			^
	ffff8000db950080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
	ffff8000db950100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
	==================================================================
	Unable to handle kernel paging request at virtual address ffff8000db950000
	KASAN: probably user-memory-access in range [0x00000006dca80000-0x00000006dca80007]
	Mem abort info:
	ESR = 0x0000000096000047
	EC = 0x25: DABT (current EL), IL = 32 bits
	SET = 0, FnV = 0
	EA = 0, S1PTW = 0
	FSC = 0x07: level 3 translation fault
	Data abort info:
	ISV = 0, ISS = 0x00000047, ISS2 = 0x00000000
	CM = 0, WnR = 1, TnD = 0, TagAccess = 0
	GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0


	 pstate: 234010c9 (nzCv daIF +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
	 pc : tegra_qspi_isr_thread+0xcc0/0x1e60 [spi_tegra210_quad]
	 lr : tegra_qspi_isr_thread+0xce8/0x1e60 [spi_tegra210_quad]
	 x26: 0000000000000001 x25: 0000000000000028 x24: ffff8000db94ffff
	 x23: ffff0000d16b0918 x22: 0000000000000040 x21: 000000000000003a
	 x20: ffff8000db94ffff x19: ffff0000d16b08c0 x18: 0000000000000001
	 x17: 3d3d3d3d3d3b2d2c x16: 3d3d3d3d3d3b2d2c x15: 0000000000000001
	 x14: 1ffff00010bfce80 x13: 0000000000000000 x12: 0000000000000000
	 x11: ffff700010bfce81 x10: 0000000000000019 x9 : 0000000000000000
	 x8 : 0000000000000000 x7 : 0000000000000001 x6 : 0000000000000001
	 x5 : ffff8000b49cf8e0 x4 : ffff800084e7b140 x3 : ffff8000801bbd8c
	 x2 : 0000000000000001 x1 : 0000000000000008 x0 : 0000000000000001
	 Call trace:
	  tegra_qspi_isr_thread+0xcc0/0x1e60 [spi_tegra210_quad] (P)
	  irq_thread_fn+0x80/0x108
	  irq_thread+0x158/0x258
	  kthread+0x3fc/0x530
	  ret_from_fork+0x10/0x20
	 Code: 540001aa 1ad92768 f85f83aa 6b1a039f (383a6b08)
	 ---[ end trace 0000000000000000 ]---
	 Kernel panic - not syncing: Oops: Fatal exception
	 SMP: stopping secondary CPUs
	 Kernel Offset: disabled
	 CPU features: 0x2000,000003c0,62534ca1,5467fea7
	 Memory Limit: none
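
For readers unfamiliar with the UBSAN reports above: shifting a 32-bit value by 32 or more is undefined behavior in C, and a shift count derived from stale or garbage state can easily land in that range. The following is a minimal user-space sketch of a bounds-checked byte placement. The helper name `sketch_put_byte` is hypothetical and this is not a claim about what lines 385 or 397 of the driver look like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: OR one byte into byte position idx of a 32-bit word.
 * Returns false, leaving *word untouched, when idx would need a shift of
 * 32 bits or more, which would be undefined behavior for a uint32_t.
 */
static bool sketch_put_byte(uint32_t *word, uint8_t byte, unsigned int idx)
{
	assert(word != NULL);

	if (idx >= sizeof(*word))
		return false;

	/* idx is 0..3 here, so the shift count is 0, 8, 16 or 24. */
	*word |= (uint32_t)byte << (idx * 8);
	return true;
}
```

With idx in the range 0 to 3 the byte is placed as expected. With idx equal to 4, which corresponds to the "shift exponent 32" case, the helper refuses instead of shifting.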

[PATCH v5 1/3] spi: tegra210-quad: Fix timeout handling
[PATCH v5 2/3] spi: tegra210-quad: Refactor error handling into helper functions
[PATCH v5 3/3] spi: tegra210-quad: Check hardware status on timeout