[v2] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ

[PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ

Posted by Breno Leitao 1 week, 5 days ago

The tegra-quad-spi driver is crashing on some hosts. Analysis revealed
the following failure sequence:

1) After running for a while, the interrupt gets marked as spurious:

    irq 63: nobody cared (try booting with the "irqpoll" option)
    Disabling IRQ #63

2) The IRQ handler (tegra_qspi_isr_thread->handle_cpu_based_xfer) is
   responsible for signaling xfer_completion.
   Once the interrupt is disabled, xfer_completion is never completed, causing
   transfers to hit the timeout:

    WARNING: CPU: 64 PID: 844224 at drivers/spi/spi-tegra210-quad.c:1222 tegra_qspi_transfer_one_message+0x7a0/0x9b0

3) The timeout handler completes the transfer:

    tegra-qspi NVDA1513:00: QSPI interrupt timeout, but transfer complete

4) Later, the ISR thread finally runs and crashes trying to dereference
   curr_xfer, which the timeout handler has already set to NULL:

    Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
    pc : handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad]
    lr : tegra_qspi_handle_timeout+0xb4/0xf0 [spi_tegra210_quad]
    Call trace:
      handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad] (P)

Root cause analysis identified three issues:

1) Race condition on tqspi->curr_xfer

   The curr_xfer pointer can change during ISR execution without proper
   synchronization. The timeout path clears curr_xfer while the ISR
   thread may still be accessing it.

   This is trivially reproducible by decreasing QSPI_DMA_TIMEOUT and
   adding instrumentation to tegra_qspi_isr_thread() to check curr_xfer
   at entry and exit - the value changes mid-execution. I've used the
   following test to reproduce this issue:

   https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh

   The existing comment in the ISR acknowledges this race but the
   protection is insufficient:

       /*
        * Occasionally the IRQ thread takes a long time to wake up (usually
        * when the CPU that it's running on is excessively busy) and we have
        * already reached the timeout before and cleaned up the timed out
        * transfer. Avoid any processing in that case and bail out early.
        */

   This is insufficient: the timeout path can still set tqspi->curr_xfer to
   NULL after the ISR thread has checked it, so a later dereference crashes.

2) Incorrect IRQ_NONE return causing spurious IRQ detection

   When the timeout handler processes a transfer before the ISR thread
   runs, tegra_qspi_isr_thread() returns IRQ_NONE.

   After enough IRQ_NONE returns, the kernel marks the interrupt as spurious
   and disables it - but these were legitimate interrupts that happened to be
   processed by the timeout path first.

   An interrupt handler should not return IRQ_NONE if the driver did in fact
   handle the interrupt, even when another path (here, the timeout handler)
   did the work.

3) Complex locking makes full protection difficult

   Ideally the entire tqspi structure would be protected by tqspi->lock,
   but handle_dma_based_xfer() calls wait_for_completion_interruptible_timeout()
   which can sleep, preventing the lock from being held across the entire
   ISR execution.

   Usama Arif has some ideas here, and he can share more.
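
To illustrate the constraint in 3), here is a minimal userspace sketch of
one common pattern: snapshot curr_xfer under the lock, release the lock
before anything that may sleep, then re-take it and re-validate. This is not
code from the series. All demo_* names are hypothetical, the C11 atomic_flag
spinlock stands in for tqspi->lock, and wait_fn stands in for a sleeping
wait such as wait_for_completion_interruptible_timeout().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's transfer and controller state. */
struct demo_xfer { int id; };

struct demo_qspi {
	atomic_flag lock;            /* stands in for tqspi->lock */
	struct demo_xfer *curr_xfer; /* only touched with lock held */
};

static void demo_lock(struct demo_qspi *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		; /* spin until the flag is released */
}

static void demo_unlock(struct demo_qspi *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/*
 * wait_fn runs with the lock released, because a real wait may sleep.
 * Returns 1 if the transfer seen before the wait is still current
 * afterwards, 0 if it was cleared (or there was none to begin with).
 */
static int demo_dma_step(struct demo_qspi *q,
			 void (*wait_fn)(struct demo_qspi *))
{
	struct demo_xfer *snap;
	int still_current;

	assert(q != NULL && wait_fn != NULL);

	demo_lock(q);
	snap = q->curr_xfer;         /* snapshot under the lock */
	demo_unlock(q);
	if (!snap)
		return 0;

	wait_fn(q);                  /* lock is NOT held here */

	demo_lock(q);
	still_current = (q->curr_xfer == snap); /* re-validate */
	demo_unlock(q);
	return still_current;
}

/* A wait during which nothing else touches the transfer. */
static void demo_wait_quiet(struct demo_qspi *q)
{
	(void)q;
}

/* A wait during which the timeout path clears the transfer. */
static void demo_wait_timeout(struct demo_qspi *q)
{
	demo_lock(q);
	q->curr_xfer = NULL;
	demo_unlock(q);
}
```

The point of the sketch is only that any state read before the wait has to
be treated as stale once the lock is re-taken.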

This patchset addresses these issues:

Return IRQ_HANDLED instead of IRQ_NONE when the timeout path has
already processed the transfer. Use the QSPI_RDY bit in
QSPI_TRANS_STATUS (same approach as tegra_qspi_handle_timeout()) to
distinguish real interrupts from truly spurious ones.
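
The decision described above can be sketched as a small pure function. This
is an illustration under stated assumptions, not the code from patch 1/6:
the enum values stand in for the kernel's IRQ_NONE and IRQ_HANDLED,
SKETCH_QSPI_RDY is an assumed bit position for QSPI_RDY, and classify_irq()
is a hypothetical helper that takes the already-read QSPI_TRANS_STATUS value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's irqreturn_t values. */
enum sketch_irq_ret { SKETCH_IRQ_NONE = 0, SKETCH_IRQ_HANDLED = 1 };

/* Assumed bit position; the real QSPI_RDY definition lives in the driver. */
#define SKETCH_QSPI_RDY (1u << 30)

/*
 * curr_xfer:    current transfer pointer, NULL if the timeout path has
 *               already cleaned it up.
 * trans_status: value read from QSPI_TRANS_STATUS.
 */
static enum sketch_irq_ret classify_irq(const void *curr_xfer,
					uint32_t trans_status)
{
	if (curr_xfer)
		return SKETCH_IRQ_HANDLED; /* normal case: ISR owns the transfer */

	if (trans_status & SKETCH_QSPI_RDY)
		return SKETCH_IRQ_HANDLED; /* real IRQ, timeout path got there first */

	return SKETCH_IRQ_NONE;            /* no transfer, no ready bit: spurious */
}

/* Usage example covering the three cases. */
static void classify_irq_selfcheck(void)
{
	int xfer = 0;

	assert(classify_irq(&xfer, 0) == SKETCH_IRQ_HANDLED);
	assert(classify_irq(NULL, SKETCH_QSPI_RDY) == SKETCH_IRQ_HANDLED);
	assert(classify_irq(NULL, 0) == SKETCH_IRQ_NONE);
}
```

Only the last case counts toward the kernel's spurious-interrupt accounting.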

Protect curr_xfer access with the spinlock everywhere in the code, given
that interrupt handling can run in parallel with the timeout and transfer
paths. This prevents the NULL pointer dereference by ensuring curr_xfer
cannot be cleared while it is being checked.
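
As a rough userspace illustration of that rule (not the driver code), the
sketch below has both the timeout path and the ISR path take the same lock
around every curr_xfer access, so the NULL check and the dereference happen
as one step. The sketch_* names are hypothetical and a C11 atomic_flag
spinlock stands in for tqspi->lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's transfer and controller state. */
struct sketch_xfer { unsigned int len; };

struct sketch_qspi {
	atomic_flag lock;              /* stands in for tqspi->lock */
	struct sketch_xfer *curr_xfer; /* only touched with lock held */
};

static void sketch_lock(struct sketch_qspi *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		; /* spin until the flag is released */
}

static void sketch_unlock(struct sketch_qspi *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Timeout path: clear curr_xfer, but only with the lock held. */
static void sketch_timeout_clear(struct sketch_qspi *q)
{
	assert(q != NULL);
	sketch_lock(q);
	q->curr_xfer = NULL;
	sketch_unlock(q);
}

/*
 * ISR path: check and dereference curr_xfer inside one critical section.
 * Returns 1 and stores the transfer length if a transfer was current,
 * 0 if the timeout path had already cleaned it up.
 */
static int sketch_isr(struct sketch_qspi *q, unsigned int *len_out)
{
	int processed = 0;

	assert(q != NULL && len_out != NULL);
	sketch_lock(q);
	if (q->curr_xfer) {
		*len_out = q->curr_xfer->len; /* safe: cannot be cleared here */
		processed = 1;
	}
	sketch_unlock(q);
	return processed;
}
```

With the lock in place, the ISR either sees a valid transfer for the whole
critical section or sees NULL and bails out; it never sees the pointer
change between the check and the use.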

While this may not provide complete protection for all tqspi fields
(which might be necessary?!), it fixes the observed crashes and prevents
the spurious IRQ detection that was disabling the interrupt entirely.

This was tested with a simple TPM application, where the TPM lives
behind the tegra qspi driver:

https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh

Special thanks to Usama Arif for his help investigating the problem
and working out the fixes.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v2:
- Replaced the TODO comment to clarify why the lock is being released.
- Link to v1: https://patch.msgid.link/20260116-tegra_xfer-v1-0-02d96c790619@debian.org

---
Breno Leitao (6):
      spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
      spi: tegra210-quad: Move curr_xfer read inside spinlock
      spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one
      spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer
      spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer
      spi: tegra210-quad: Protect curr_xfer check in IRQ handler

 drivers/spi/spi-tegra210-quad.c | 56 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 52 insertions(+), 4 deletions(-)
---
base-commit: 9b7977f9e39b7768c70c2aa497f04e7569fd3e00
change-id: 20260112-tegra_xfer-6acb30a6720f

Best regards,
--  
Breno Leitao <leitao@debian.org>

Re: [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ

Posted by Mark Brown 1 week, 1 day ago

On Mon, 26 Jan 2026 09:50:25 -0800, Breno Leitao wrote:
> The tegra-quad-spi driver is crashing on some hosts. Analysis revealed
> the following failure sequence:
> 
> 1) After running for a while, the interrupt gets marked as spurious:
> 
>     irq 63: nobody cared (try booting with the "irqpoll" option)
>     Disabling IRQ #63
> 
> [...]

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-next

Thanks!

[1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
      commit: aabd8ea0aa253d40cf5f20a609fc3d6f61e38299
[2/6] spi: tegra210-quad: Move curr_xfer read inside spinlock
      commit: ef13ba357656451d6371940d8414e3e271df97e3
[3/6] spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one
      commit: f5a4d7f5e32ba163cff893493ec1cbb0fd2fb0d5
[4/6] spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer
      commit: bf4528ab28e2bf112c3a2cdef44fd13f007781cd
[5/6] spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer
      commit: 6d7723e8161f3c3f14125557e19dd080e9d882be
[6/6] spi: tegra210-quad: Protect curr_xfer check in IRQ handler
      commit: edf9088b6e1d6d88982db7eb5e736a0e4fbcc09e

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

Re: [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ

Posted by Thierry Reding 1 week, 1 day ago

On Mon, Jan 26, 2026 at 09:50:25AM -0800, Breno Leitao wrote:
> The tegra-quad-spi driver is crashing on some hosts. Analysis revealed
> the following failure sequence:
>
> [...]

For the series:

Acked-by: Thierry Reding <treding@nvidia.com>

Re: [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ

Posted by Jon Hunter 1 week, 1 day ago

On 30/01/2026 10:16, Thierry Reding wrote:
> On Mon, Jan 26, 2026 at 09:50:25AM -0800, Breno Leitao wrote:
>> The tegra-quad-spi driver is crashing on some hosts. Analysis revealed
>> the following failure sequence:
>>
>> [...]
> 
> For the series:
> 
> Acked-by: Thierry Reding <treding@nvidia.com>

This also resolves a NULL pointer dereference I see on Tegra194 Jetson
Xavier NX and so ...

Tested-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Jon Hunter <jonathanh@nvidia.com>

Thanks!
Jon

-- 
nvpublic