spi: tegra210-quad: Improve reset and tx failures

[PATCH 1/3] spi: tegra210-quad: use device_reset_optional() instead of device_reset()

Posted by Breno Leitao 10 months, 4 weeks ago

My UEFI machines with tegra210-quad consistently report "device reset
failed". Investigation showed this isn't an actual failure
- __device_reset() returns -ENOENT because ACPI has no "*_RST" method.

Replace device_reset() with device_reset_optional() to prevent
errors when the reset method doesn't exist. With this change, the
function only fails if the actual device reset operation fails when
called.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 drivers/spi/spi-tegra210-quad.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index 08e49a8768943..9027f995a6669 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -999,7 +999,7 @@ static void tegra_qspi_handle_error(struct tegra_qspi *tqspi)
 	dev_err(tqspi->dev, "error in transfer, fifo status 0x%08x\n", tqspi->status_reg);
 	tegra_qspi_dump_regs(tqspi);
 	tegra_qspi_flush_fifos(tqspi, true);
-	if (device_reset(tqspi->dev) < 0)
+	if (device_reset_optional(tqspi->dev) < 0)
 		dev_warn_once(tqspi->dev, "device reset failed\n");
 }
 
@@ -1149,7 +1149,7 @@ static int tegra_qspi_combined_seq_xfer(struct tegra_qspi *tqspi,
 				}
 
 				/* Reset controller if timeout happens */
-				if (device_reset(tqspi->dev) < 0)
+				if (device_reset_optional(tqspi->dev) < 0)
 					dev_warn_once(tqspi->dev,
 						      "device reset failed\n");
 				ret = -EIO;
@@ -1606,7 +1606,7 @@ static int tegra_qspi_probe(struct platform_device *pdev)
 		goto exit_pm_disable;
 	}
 
-	if (device_reset(tqspi->dev) < 0)
+	if (device_reset_optional(tqspi->dev) < 0)
 		dev_warn_once(tqspi->dev, "device reset failed\n");
 
 	tqspi->def_command1_reg = QSPI_M_S | QSPI_CS_SW_HW |  QSPI_CS_SW_VAL;

-- 
2.47.1

Re: [PATCH 1/3] spi: tegra210-quad: use device_reset_optional() instead of device_reset()

Posted by Mark Brown 10 months, 4 weeks ago

On Mon, Mar 17, 2025 at 08:44:01AM -0700, Breno Leitao wrote:
> My UEFI machines with tegra210-quad consistently report "device reset
> failed". Investigation showed this isn't an actual failure
> - __device_reset() returns -ENOENT because ACPI has no "*_RST" method.

That's not the case, it's returning an error because there is no reset
controller discoverable via any mechanism.  There's no specific handling
for ACPI here.  It's also not clear that this is a false positive, the
driver did indeed fail to reset the device and especially for the error
handling case that seems like relevant information.  At the very least
the changelog should be clarified.

Re: [PATCH 1/3] spi: tegra210-quad: use device_reset_optional() instead of device_reset()

Posted by Breno Leitao 10 months, 4 weeks ago

Hello Mark,

On Mon, Mar 17, 2025 at 04:45:31PM +0000, Mark Brown wrote:
> On Mon, Mar 17, 2025 at 08:44:01AM -0700, Breno Leitao wrote:
> > My UEFI machines with tegra210-quad consistently report "device reset
> > failed". Investigation showed this isn't an actual failure
> > - __device_reset() returns -ENOENT because ACPI has no "*_RST" method.
> 
> That's not the case, it's returning an error because there is no reset
> controller discoverable via any mechanism. 
>

Sorry, I was not very familiar with this subsystem, but I chase down
__device_reset(), and I found the return was coming from:

	int __device_reset(struct device *dev, bool optional)
	{
		acpi_handle handle = ACPI_HANDLE(dev);
		if (handle) {
			if (!acpi_has_method(handle, "_RST"))
				return optional ? 0 : -ENOENT;

> There's no specific handling for ACPI here.  

Do you mean no _RST method as stated above?

> It's also not clear that this is a false positive, the
> driver did indeed fail to reset the device and especially for the error
> handling case that seems like relevant information.

If the driver failed to reset the device, then device_reset_optional()
it will return an error code, but it will not return an error code if
the RST method is not found, right?

Sorry, if I am mis-understading the code here.

> At the very least the changelog should be clarified.

What would you add to the changelog to make this clear?

Thanks for the quick review!
--breno

Re: [PATCH 1/3] spi: tegra210-quad: use device_reset_optional() instead of device_reset()

Posted by Mark Brown 10 months, 4 weeks ago

On Mon, Mar 17, 2025 at 09:56:43AM -0700, Breno Leitao wrote:
> Hello Mark,
> 
> On Mon, Mar 17, 2025 at 04:45:31PM +0000, Mark Brown wrote:
> > On Mon, Mar 17, 2025 at 08:44:01AM -0700, Breno Leitao wrote:
> > > My UEFI machines with tegra210-quad consistently report "device reset
> > > failed". Investigation showed this isn't an actual failure
> > > - __device_reset() returns -ENOENT because ACPI has no "*_RST" method.

> > That's not the case, it's returning an error because there is no reset
> > controller discoverable via any mechanism. 

> Sorry, I was not very familiar with this subsystem, but I chase down
> __device_reset(), and I found the return was coming from:

> 	int __device_reset(struct device *dev, bool optional)
> 	{
> 		acpi_handle handle = ACPI_HANDLE(dev);
> 		if (handle) {
> 			if (!acpi_has_method(handle, "_RST"))
> 				return optional ? 0 : -ENOENT;

> > There's no specific handling for ACPI here.  

> Do you mean no _RST method as stated above?

That's only happening in the case where the device has an ACPI handle,
the SPI driver has no idea why the reset API failed to look up a reset
controller.  Your change is to the SPI driver, not the reset framework.

> > It's also not clear that this is a false positive, the
> > driver did indeed fail to reset the device and especially for the error
> > handling case that seems like relevant information.

> If the driver failed to reset the device, then device_reset_optional()
> it will return an error code, but it will not return an error code if
> the RST method is not found, right?

> Sorry, if I am mis-understading the code here.

Clearly if no reset controller is available then the driver will have
been unable to reset the hardware.  That seems like something it
actually wanted to do, especially in the error handling case - it's a
lot less likely that we'll recover things without the reset happening.
During probe it's possibly not so urgent but at other times it seems
more relevant.

> > At the very least the changelog should be clarified.

> What would you add to the changelog to make this clear?

For starters the mention of ACPI is irrelevant to what the SPI driver is
doing.  This sounds like a change specific to ACPI but it affects all
users.

Re: [PATCH 1/3] spi: tegra210-quad: use device_reset_optional() instead of device_reset()

Posted by Breno Leitao 10 months, 3 weeks ago

Hello Mark,

On Mon, Mar 17, 2025 at 05:24:24PM +0000, Mark Brown wrote:
> > > There's no specific handling for ACPI here.
>
> > Do you mean no _RST method as stated above?
>
> That's only happening in the case where the device has an ACPI handle,
> the SPI driver has no idea why the reset API failed to look up a reset
> controller.  Your change is to the SPI driver, not the reset framework.
>
> > > It's also not clear that this is a false positive, the
> > > driver did indeed fail to reset the device and especially for the error
> > > handling case that seems like relevant information.
>
> > If the driver failed to reset the device, then device_reset_optional()
> > it will return an error code, but it will not return an error code if
> > the RST method is not found, right?
>
> > Sorry, if I am mis-understading the code here.
>
> Clearly if no reset controller is available then the driver will have
> been unable to reset the hardware.  That seems like something it
> actually wanted to do, especially in the error handling case - it's a
> lot less likely that we'll recover things without the reset happening.
> During probe it's possibly not so urgent but at other times it seems
> more relevant.

Thanks for your answer! Let me backup and explain how I am understanding
this issue, and my possible wrong assumptions:

  1) The SPI controller is reseted in the driver in a few cases:
  	a) At probe time
  	b) At transmission side (when there is a timeout to the controller)

  2) On the machines I have, I understand that  the controller failed to
     reset on both cases:
  	a) At boot time with "tegra-qspi NVDA1513:00: device reset failed"
  	b) At error handling with, the message below:

  	   tegra-qspi NVDA1513:00: QSPI Transfer failed with timeout: 0
             spi_master spi0: failed to transfer one message from queue
             spi_master spi0: noqueue transfer failed
  	   WARNING: CPU: 1 PID: 1221 at drivers/spi/spi-tegra210-quad.c:1120 tegra_qspi_transfer_one_message+0x780/0x918 [spi_tegra210_quad]

  	   Full log at: https://paste.debian.net/1363773/

  	c) I don't see the "device reset failed" in this case
  	transmission side. But the device doesn't recover also.

  3) These device fail to reset at probe because there is no ACPI method
     related resetting then (aka "_RST" methods), thus, device_reset() will
     return -ENOENT;

  4) Not being able to reset the driver seems to be the root cause of
     the WARNING flood I am seeing.


My assumptions, now:

  1) This controller doesn't have _RST ACPI method by design.

  2) It is OK to not have reset methods (!?)

  3) There are two helpers to reset the driver device_reset_optional() and
     device_reset().
  	a) For device_reset(), the helper will fail if the device
  	doesn't reset, thus, for ACPI systems, the _RST method needs
  	to exist and return successful, otherwise it will return
  	a ERRNO.

  	b) device_reset_optional() only fails if the reset fail (either
  	in ACPI or not), but, doesn't fail (aka returning 0) if reset
  	methods (aka _RST in ACPI) is not available.

  	c) Given assumption #1, device_reset_optional() is more
  	appropriate given that this method does not exist anyway.

  	d) This should be a no-op for systems that have proper reset
  	methods.

Thanks for helping me with this issue,
--breno

Re: [PATCH 1/3] spi: tegra210-quad: use device_reset_optional() instead of device_reset()

Posted by Mark Brown 10 months, 3 weeks ago

On Tue, Mar 18, 2025 at 03:36:54AM -0700, Breno Leitao wrote:

> My assumptions, now:

>   1) This controller doesn't have _RST ACPI method by design.

>   2) It is OK to not have reset methods (!?)

Well, that's not clear to me.  It seems likely to work a lot of the time
on probe but I don't know how well it handles a warm reboot for example.
Like I say the error handling case seems more likely to be at least less
effective without a reset controller so it'd be worth logging.  In the
DT the reset controller is a required property which suggests the driver
might be assuming it's got the hardware into a known state.

Re: [PATCH 1/3] spi: tegra210-quad: use device_reset_optional() instead of device_reset()

Posted by Breno Leitao 10 months, 3 weeks ago

On Tue, Mar 18, 2025 at 12:48:13PM +0000, Mark Brown wrote:

> Well, that's not clear to me.  It seems likely to work a lot of the time
> on probe but I don't know how well it handles a warm reboot for example.
> Like I say the error handling case seems more likely to be at least less
> effective without a reset controller so it'd be worth logging.  In the
> DT the reset controller is a required property which suggests the driver
> might be assuming it's got the hardware into a known state.

Makes sense. Another question, for platforms like this one that doesn't
have the device reset methods, what can we do to stop the bleed?

Basically every message that is sent to the SPI controller will fail,
which will trigger the device_reet() which is a no-op, but the device
will continue to be online. Should we disable the device after some
point?

Regarding this patchset, I understand that patch #1 is not ideal as
discussed above, what about patch 2 and 3?

Thanks
--breno

Re: [PATCH 1/3] spi: tegra210-quad: use device_reset_optional() instead of device_reset()

Posted by Mark Brown 10 months, 3 weeks ago

On Tue, Mar 18, 2025 at 10:02:47AM -0700, Breno Leitao wrote:

> Makes sense. Another question, for platforms like this one that doesn't
> have the device reset methods, what can we do to stop the bleed?

> Basically every message that is sent to the SPI controller will fail,
> which will trigger the device_reet() which is a no-op, but the device
> will continue to be online. Should we disable the device after some
> point?

The SPI controller is only going to be doing something because some
driver for an attached SPI device is trying to do something.  Presumably
whatever driver that is won't be having a good time and can hopefully
figure something out, though given that SPI is simple and not
hotpluggable this isn't really something that comes up a lot in
production so I'd be unsurprised to see things just keep on retrying.
I'd expect to see any substantial error handling in the driver for the
device rather than in the controller.

Obviously there's something wrong with the device description here which
is upsetting the controller driver.

> Regarding this patchset, I understand that patch #1 is not ideal as
> discussed above, what about patch 2 and 3?

If I didn't say anything they're probably fine.

Re: [PATCH 1/3] spi: tegra210-quad: use device_reset_optional() instead of device_reset()

Posted by Breno Leitao 10 months, 3 weeks ago

On Tue, Mar 18, 2025 at 05:34:55PM +0000, Mark Brown wrote:
> On Tue, Mar 18, 2025 at 10:02:47AM -0700, Breno Leitao wrote:
> 
> > Makes sense. Another question, for platforms like this one that doesn't
> > have the device reset methods, what can we do to stop the bleed?
> 
> > Basically every message that is sent to the SPI controller will fail,
> > which will trigger the device_reet() which is a no-op, but the device
> > will continue to be online. Should we disable the device after some
> > point?
> 
> The SPI controller is only going to be doing something because some
> driver for an attached SPI device is trying to do something.  Presumably
> whatever driver that is won't be having a good time and can hopefully
> figure something out, though given that SPI is simple and not
> hotpluggable this isn't really something that comes up a lot in
> production so I'd be unsurprised to see things just keep on retrying.
> I'd expect to see any substantial error handling in the driver for the
> device rather than in the controller.

Good point. In my specific case, this is coming from tpm_tis,
which is not aware that the device is totally dead, and continues to ask
for random numbers:

            tegra_qspi_transfer_one_message
            __spi_pump_transfer_message
            __spi_sync
            spi_sync
            tpm_tis_spi_transfer
            tpm_tis_spi_read_bytes
            tpm_tis_request_locality
            tpm_chip_start
            tpm_try_get_ops
            tpm_find_get_ops
            tpm_get_random
            tpm_hwrng_read
            hwrng_fillfn
            kthread
            ret_from_fork

Looking at tpm_tis, it seems it doesn't care if the the SPI is dead, and
just forward through the requests, which never complete. Adding Arnd to
see if he has any idea about this.

Arnd,

Summary of the proiblem: tpm_tis is trying to read random numbers
through a dead SPI controller. That causes infinite amounts of warnings
on the kernel, given that the controller is WARNing on time outs (which
is being fixed in one of the patches in this patchset).

Question: Should tpm_tis be aware that the underneath SPI controller is
dead, and eventually get unplugged?

> Obviously there's something wrong with the device description here which
> is upsetting the controller driver.
> 
> > Regarding this patchset, I understand that patch #1 is not ideal as
> > discussed above, what about patch 2 and 3?
> 
> If I didn't say anything they're probably fine.

Do you want me to resend those two separately, or, is this thread
enough?

Thanks again,
--breno

Re: [PATCH 1/3] spi: tegra210-quad: use device_reset_optional() instead of device_reset()

Posted by Mark Brown 10 months, 3 weeks ago

On Tue, Mar 18, 2025 at 11:29:26AM -0700, Breno Leitao wrote:
> On Tue, Mar 18, 2025 at 05:34:55PM +0000, Mark Brown wrote:
> > On Tue, Mar 18, 2025 at 10:02:47AM -0700, Breno Leitao wrote:

> > > Regarding this patchset, I understand that patch #1 is not ideal as
> > > discussed above, what about patch 2 and 3?

> > If I didn't say anything they're probably fine.

> Do you want me to resend those two separately, or, is this thread
> enough?

Please resend.  I think I was anticipating a new version of this patch
with a clarified changelog and some rework to tone down the logging
that's generated similar to the other patches rather than just silently
ignoring the lack of a reset controller.

Re: [PATCH 1/3] spi: tegra210-quad: use device_reset_optional() instead of device_reset()

Posted by Breno Leitao 10 months, 3 weeks ago

On Tue, Mar 18, 2025 at 06:35:18PM +0000, Mark Brown wrote:

> > Do you want me to resend those two separately, or, is this thread
> > enough?
> 
> Please resend.  I think I was anticipating a new version of this patch
> with a clarified changelog and some rework to tone down the logging
> that's generated similar to the other patches rather than just silently
> ignoring the lack of a reset controller.

Sorry, I am more than happy to change it the way you prefer, but, the
warnings coming from "device reset failed" are already printed once:

Here are the instances of calls to device_reset(), all of them with
`dev_warn_once()`:

	if (device_reset(tqspi->dev) < 0)
		dev_warn_once(tqspi->dev, "device reset failed\n");

and

	/* Reset controller if timeout happens */
	if (device_reset(tqspi->dev) < 0)
		dev_warn_once(tqspi->dev,
			"device reset failed\n");

So, this one is not very noisy. Should I change anything?

On the other side, I see some other messages that are very noise, being
displayed at every message that is failing to go through. They are:

           spi_master spi0: failed to transfer one message from queue
           spi_master spi0: noqueue transfer failed

I will rate limit those as well.

Thanks for your direction,
--breno

Re: [PATCH 1/3] spi: tegra210-quad: use device_reset_optional() instead of device_reset()

Posted by Mark Brown 10 months, 3 weeks ago

On Tue, Mar 18, 2025 at 12:08:56PM -0700, Breno Leitao wrote:

> Sorry, I am more than happy to change it the way you prefer, but, the
> warnings coming from "device reset failed" are already printed once:

> Here are the instances of calls to device_reset(), all of them with
> `dev_warn_once()`:

Oh, in that case I guess just drop it (or based on Arnd's suggestion
change the one in probe() to be fatal).