[PATCH] tpm: Disable hwrng for TPM 1 if PM_SLEEP is enabled

Jason A. Donenfeld posted 1 patch 2 years, 8 months ago
drivers/char/tpm/tpm-chip.c | 8 ++++++++
1 file changed, 8 insertions(+)
[PATCH] tpm: Disable hwrng for TPM 1 if PM_SLEEP is enabled
Posted by Jason A. Donenfeld 2 years, 8 months ago
TPM 1's support for its hardware RNG is broken across system suspends,
due to races or locking issues or something else that haven't been
diagnosed or fixed yet. These issues prevent the system from actually
suspending. So disable the driver in this case. Later, when this is
fixed properly, we can remove this.

Current breakage amounts to something like:

  tpm tpm0: A TPM error (28) occurred continue selftest
  ...
  tpm tpm0: A TPM error (28) occurred attempting get random
  ...
  tpm tpm0: Error (28) sending savestate before suspend
  tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
  tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
  tpm_tis 00:08: PM: failed to suspend: error 28
  PM: Some devices failed to suspend, or early wake event detected

This issue was partially fixed by 23393c646142 ("char: tpm: Protect
tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
directly because the TPM maintainers weren't available. However, it
seems like this just addresses the most common cases of the bug, rather
than addressing it entirely. So there are more things to fix still,
apparently.

The hwrng driver appears already to be occasionally disabled due to
other conditions, so this shouldn't be too large of a surprise.

Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/
Cc: stable@vger.kernel.org # 6.1+
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 drivers/char/tpm/tpm-chip.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/char/tpm/tpm-chip.c b/drivers/char/tpm/tpm-chip.c
index 741d8f3e8fb3..eed67ea8d3a7 100644
--- a/drivers/char/tpm/tpm-chip.c
+++ b/drivers/char/tpm/tpm-chip.c
@@ -524,6 +524,14 @@ static int tpm_add_hwrng(struct tpm_chip *chip)
 	if (!IS_ENABLED(CONFIG_HW_RANDOM_TPM) || tpm_is_firmware_upgrade(chip))
 		return 0;
 
+	/*
+	 * This driver's support for using the RNG across suspend is broken on
+	 * TPM1. Until somebody fixes this, just stop registering a HWRNG in
+	 * that case.
+	 */
+	if (!(chip->flags & TPM_CHIP_FLAG_TPM2) && IS_ENABLED(CONFIG_PM_SLEEP))
+		return 0;
+
 	snprintf(chip->hwrng_name, sizeof(chip->hwrng_name),
 		 "tpm-rng-%d", chip->dev_num);
 	chip->hwrng.name = chip->hwrng_name;
-- 
2.39.0
Re: [PATCH] tpm: Disable hwrng for TPM 1 if PM_SLEEP is enabled
Posted by Linus Torvalds 2 years, 8 months ago
On Thu, Jan 5, 2023 at 6:48 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> TPM 1's support for its hardware RNG is broken across system suspends,
> due to races or locking issues or something else that haven't been
> diagnosed or fixed yet. These issues prevent the system from actually
> suspending. So disable the driver in this case. Later, when this is
> fixed properly, we can remove this.

How about just keeping it enabled, but not making it a fatal error if
the TPM saving doesn't work? IOW, just print the warning, and then
"return 0" from the suspend function.

I doubt anybody cares, but your patch disables that TPM device just
because PM is *enabled*. That's basically "all the time".

Imagine being on a desktop with a distro kernel that enables suspend -
because that kernel obviously is expected to work on laptops too.
You're never actually going to suspend things on that machine, but
maybe you still want to register it as a source of hw random data?

          Linus
Re: [PATCH] tpm: Disable hwrng for TPM 1 if PM_SLEEP is enabled
Posted by Jason A. Donenfeld 2 years, 8 months ago
On Thu, Jan 05, 2023 at 01:58:48PM -0800, Linus Torvalds wrote:
> On Thu, Jan 5, 2023 at 6:48 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > TPM 1's support for its hardware RNG is broken across system suspends,
> > due to races or locking issues or something else that haven't been
> > diagnosed or fixed yet. These issues prevent the system from actually
> > suspending. So disable the driver in this case. Later, when this is
> > fixed properly, we can remove this.
> 
> How about just keeping it enabled, but not making it a fatal error if
> the TPM saving doesn't work? IOW, just print the warning, and then
> "return 0" from the suspend function.

You're right that returning 0 from the pm notifier would make the
problem that users actually care about -- laptop doesn't sleep when you
close the lid -- go away.

From a random.c perspective, the RNG is already initialized when the
driver loads, which will be before suspend bricks the driver. So even if
the behavior afterwards is a buggy driver handing all zeros to random.c,
it won't really matter much; random.c can deal with that
cryptographically. I have no idea if this is actually the case with the
driver's error condition. But if it is, it's good that it doesn't
matter.

So okay, I'll roll a patch to do that when I get home. I'm writing on my
phone now, but from memory it's just changing a 'return rc;' into
'return 0;'.

Then the TPM folks can fix the underlying issue at their leisure
whenever.

Jason
[PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails
Posted by Jason A. Donenfeld 2 years, 8 months ago
TPM 1 is sometimes broken across system suspends, due to races or
locking issues or something else that haven't been diagnosed or fixed
yet, most likely having to do with concurrent reads from the TPM's
hardware random number generator driver. These issues prevent the system
from actually suspending, with errors like:

  tpm tpm0: A TPM error (28) occurred continue selftest
  ...
  tpm tpm0: A TPM error (28) occurred attempting get random
  ...
  tpm tpm0: Error (28) sending savestate before suspend
  tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
  tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
  tpm_tis 00:08: PM: failed to suspend: error 28
  PM: Some devices failed to suspend, or early wake event detected

This issue was partially fixed by 23393c646142 ("char: tpm: Protect
tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
directly because the TPM maintainers weren't available. However, it
seems like this just addresses the most common cases of the bug, rather
than addressing it entirely. So there are more things to fix still,
apparently.

In lieu of actually fixing the underlying bug, just allow system suspend
to continue, so that laptops still go to sleep fine. Later, this can be
reverted when the real bug is fixed.

Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/
Cc: stable@vger.kernel.org # 6.1+
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
This is basically untested and I haven't worked out if there are any
awful implications of letting the system sleep when TPM suspend fails.
Maybe some PCRs get cleared and that will make everything explode on
resume? Maybe it doesn't matter? Somebody well versed in TPMology should
probably [n]ack this approach.

 drivers/char/tpm/tpm-interface.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
index d69905233aff..6df9067ef7f9 100644
--- a/drivers/char/tpm/tpm-interface.c
+++ b/drivers/char/tpm/tpm-interface.c
@@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
 	}
 
 suspended:
-	return rc;
+	if (rc)
+		pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
+		       chip->dev_num, rc);
+	return 0;
 }
 EXPORT_SYMBOL_GPL(tpm_pm_suspend);
 
-- 
2.39.0
Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails
Posted by Jarkko Sakkinen 2 years, 8 months ago
On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
> TPM 1 is sometimes broken across system suspends, due to races or
> locking issues or something else that haven't been diagnosed or fixed
> yet, most likely having to do with concurrent reads from the TPM's
> hardware random number generator driver. These issues prevent the system
> from actually suspending, with errors like:
> 
>   tpm tpm0: A TPM error (28) occurred continue selftest
>   ...

<REMOVE>

>   tpm tpm0: A TPM error (28) occurred attempting get random
>   ...
>   tpm tpm0: Error (28) sending savestate before suspend
>   tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
>   tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
>   tpm_tis 00:08: PM: failed to suspend: error 28
>   PM: Some devices failed to suspend, or early wake event detected

</REMOVE>

Unrelated to thix particular fix.

> This issue was partially fixed by 23393c646142 ("char: tpm: Protect
> tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
> directly because the TPM maintainers weren't available. However, it
> seems like this just addresses the most common cases of the bug, rather
> than addressing it entirely. So there are more things to fix still,
> apparently.
> 
> In lieu of actually fixing the underlying bug, just allow system suspend
> to continue, so that laptops still go to sleep fine. Later, this can be
> reverted when the real bug is fixed.
> 
> Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/
> Cc: stable@vger.kernel.org # 6.1+
> Reported-by: Vlastimil Babka <vbabka@suse.cz>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
> This is basically untested and I haven't worked out if there are any
> awful implications of letting the system sleep when TPM suspend fails.
> Maybe some PCRs get cleared and that will make everything explode on
> resume? Maybe it doesn't matter? Somebody well versed in TPMology should
> probably [n]ack this approach.
> 
>  drivers/char/tpm/tpm-interface.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
> index d69905233aff..6df9067ef7f9 100644
> --- a/drivers/char/tpm/tpm-interface.c
> +++ b/drivers/char/tpm/tpm-interface.c
> @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
>  	}
>  
>  suspended:
> -	return rc;
> +	if (rc)
> +		pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> +		       chip->dev_num, rc);
> +	return 0;
>  }
>  EXPORT_SYMBOL_GPL(tpm_pm_suspend);
>  
> -- 
> 2.39.0
> 

This tpm_tis local issue, nothing to do with tpm_pm_suspend(). Executing
the selftest as part of wake up, is TPM 1.2 dTPM specific requirement, and
the call is located in tpm_tis_resume() [*].

[*] https://lore.kernel.org/lkml/Y8U1QxA4GYvPWDky@kernel.org/

BR, Jarkko
Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails
Posted by Vlastimil Babka 2 years, 8 months ago
On 1/16/23 12:44, Jarkko Sakkinen wrote:
> On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
>> TPM 1 is sometimes broken across system suspends, due to races or
>> locking issues or something else that haven't been diagnosed or fixed
>> yet, most likely having to do with concurrent reads from the TPM's
>> hardware random number generator driver. These issues prevent the system
>> from actually suspending, with errors like:
>> 
>>   tpm tpm0: A TPM error (28) occurred continue selftest
>>   ...
> 
> <REMOVE>
> 
>>   tpm tpm0: A TPM error (28) occurred attempting get random
>>   ...
>>   tpm tpm0: Error (28) sending savestate before suspend
>>   tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
>>   tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
>>   tpm_tis 00:08: PM: failed to suspend: error 28
>>   PM: Some devices failed to suspend, or early wake event detected
> 
> </REMOVE>
> 
> Unrelated to thix particular fix.

Not sure I understand.
AFAIK this is not a proper fix, but a workaround for when laptop suspend no
longer works because TPM fails to suspend. The error messages quoted above
are very much related to the problem of suspend not working, and this patch
did work as advertised at least for me. I see errors but they don't prevent
suspend anymore:

https://lore.kernel.org/all/58d7a42c-9e6b-ab2a-617f-d5e373bf63cb@suse.cz/

>> This issue was partially fixed by 23393c646142 ("char: tpm: Protect
>> tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
>> directly because the TPM maintainers weren't available. However, it
>> seems like this just addresses the most common cases of the bug, rather
>> than addressing it entirely. So there are more things to fix still,
>> apparently.
>> 
>> In lieu of actually fixing the underlying bug, just allow system suspend
>> to continue, so that laptops still go to sleep fine. Later, this can be
>> reverted when the real bug is fixed.
>> 
>> Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/
>> Cc: stable@vger.kernel.org # 6.1+
>> Reported-by: Vlastimil Babka <vbabka@suse.cz>
>> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
>> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
>> ---
>> This is basically untested and I haven't worked out if there are any
>> awful implications of letting the system sleep when TPM suspend fails.
>> Maybe some PCRs get cleared and that will make everything explode on
>> resume? Maybe it doesn't matter? Somebody well versed in TPMology should
>> probably [n]ack this approach.
>> 
>>  drivers/char/tpm/tpm-interface.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
>> index d69905233aff..6df9067ef7f9 100644
>> --- a/drivers/char/tpm/tpm-interface.c
>> +++ b/drivers/char/tpm/tpm-interface.c
>> @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
>>  	}
>>  
>>  suspended:
>> -	return rc;
>> +	if (rc)
>> +		pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
>> +		       chip->dev_num, rc);
>> +	return 0;
>>  }
>>  EXPORT_SYMBOL_GPL(tpm_pm_suspend);
>>  
>> -- 
>> 2.39.0
>> 
> 
> This tpm_tis local issue, nothing to do with tpm_pm_suspend(). Executing
> the selftest as part of wake up, is TPM 1.2 dTPM specific requirement, and
> the call is located in tpm_tis_resume() [*].
> 
> [*] https://lore.kernel.org/lkml/Y8U1QxA4GYvPWDky@kernel.org/

Yes the changelog at the top does say "due to races or locking issues or
something else that haven't been diagnosed or fixed yet"

I don't know what causes the TPM to start returning error 28 on resume and
never recover from it. But it didn't happen before hwrng started using the
TPM. Before that, it was probably just the selftest ever doing anything with
the TPM, and on its own I don't recall it ever (before 6.1) failing and
preventing further suspend/resume.

> BR, Jarkko
Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails
Posted by Jarkko Sakkinen 2 years, 7 months ago
On Mon, Jan 16, 2023 at 03:00:03PM +0100, Vlastimil Babka wrote:
> On 1/16/23 12:44, Jarkko Sakkinen wrote:
> > On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
> >> TPM 1 is sometimes broken across system suspends, due to races or
> >> locking issues or something else that haven't been diagnosed or fixed
> >> yet, most likely having to do with concurrent reads from the TPM's
> >> hardware random number generator driver. These issues prevent the system
> >> from actually suspending, with errors like:
> >> 
> >>   tpm tpm0: A TPM error (28) occurred continue selftest
> >>   ...
> > 
> > <REMOVE>
> > 
> >>   tpm tpm0: A TPM error (28) occurred attempting get random
> >>   ...
> >>   tpm tpm0: Error (28) sending savestate before suspend
> >>   tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
> >>   tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
> >>   tpm_tis 00:08: PM: failed to suspend: error 28
> >>   PM: Some devices failed to suspend, or early wake event detected
> > 
> > </REMOVE>
> > 
> > Unrelated to thix particular fix.
> 
> Not sure I understand.
> AFAIK this is not a proper fix, but a workaround for when laptop suspend no
> longer works because TPM fails to suspend. The error messages quoted above
> are very much related to the problem of suspend not working, and this patch
> did work as advertised at least for me. I see errors but they don't prevent
> suspend anymore:
> 
> https://lore.kernel.org/all/58d7a42c-9e6b-ab2a-617f-d5e373bf63cb@suse.cz/
> 
> >> This issue was partially fixed by 23393c646142 ("char: tpm: Protect
> >> tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
> >> directly because the TPM maintainers weren't available. However, it
> >> seems like this just addresses the most common cases of the bug, rather
> >> than addressing it entirely. So there are more things to fix still,
> >> apparently.
> >> 
> >> In lieu of actually fixing the underlying bug, just allow system suspend
> >> to continue, so that laptops still go to sleep fine. Later, this can be
> >> reverted when the real bug is fixed.
> >> 
> >> Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/
> >> Cc: stable@vger.kernel.org # 6.1+
> >> Reported-by: Vlastimil Babka <vbabka@suse.cz>
> >> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> >> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> >> ---
> >> This is basically untested and I haven't worked out if there are any
> >> awful implications of letting the system sleep when TPM suspend fails.
> >> Maybe some PCRs get cleared and that will make everything explode on
> >> resume? Maybe it doesn't matter? Somebody well versed in TPMology should
> >> probably [n]ack this approach.
> >> 
> >>  drivers/char/tpm/tpm-interface.c | 5 ++++-
> >>  1 file changed, 4 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
> >> index d69905233aff..6df9067ef7f9 100644
> >> --- a/drivers/char/tpm/tpm-interface.c
> >> +++ b/drivers/char/tpm/tpm-interface.c
> >> @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
> >>  	}
> >>  
> >>  suspended:
> >> -	return rc;
> >> +	if (rc)
> >> +		pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> >> +		       chip->dev_num, rc);
> >> +	return 0;
> >>  }
> >>  EXPORT_SYMBOL_GPL(tpm_pm_suspend);
> >>  
> >> -- 
> >> 2.39.0
> >> 
> > 
> > This tpm_tis local issue, nothing to do with tpm_pm_suspend(). Executing
> > the selftest as part of wake up, is TPM 1.2 dTPM specific requirement, and
> > the call is located in tpm_tis_resume() [*].
> > 
> > [*] https://lore.kernel.org/lkml/Y8U1QxA4GYvPWDky@kernel.org/
> 
> Yes the changelog at the top does say "due to races or locking issues or
> something else that haven't been diagnosed or fixed yet"
> 
> I don't know what causes the TPM to start returning error 28 on resume and
> never recover from it. But it didn't happen before hwrng started using the
> TPM. Before that, it was probably just the selftest ever doing anything with
> the TPM, and on its own I don't recall it ever (before 6.1) failing and
> preventing further suspend/resume.

Would it be possible to test this theory by commenting out tpm_add_hwrng()
call from tpm_chip_register()?

Since they are called sequentially any sort of concurrency issue can be
probably ruled out.

One thing that I noticed is that probably it would be more safe-play to
move tpm_add_hwrng() call after creating the character device, as there's
no need to do it before anything else.

BR, Jarkko
Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails
Posted by Jarkko Sakkinen 2 years, 8 months ago
On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
> TPM 1 is sometimes broken across system suspends, due to races or
> locking issues or something else that haven't been diagnosed or fixed
> yet, most likely having to do with concurrent reads from the TPM's
> hardware random number generator driver. These issues prevent the system
> from actually suspending, with errors like:
> 
>   tpm tpm0: A TPM error (28) occurred continue selftest
>   ...
>   tpm tpm0: A TPM error (28) occurred attempting get random
>   ...
>   tpm tpm0: Error (28) sending savestate before suspend
>   tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
>   tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
>   tpm_tis 00:08: PM: failed to suspend: error 28
>   PM: Some devices failed to suspend, or early wake event detected
> 
> This issue was partially fixed by 23393c646142 ("char: tpm: Protect
> tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
> directly because the TPM maintainers weren't available. However, it
> seems like this just addresses the most common cases of the bug, rather
> than addressing it entirely. So there are more things to fix still,
> apparently.
> 
> In lieu of actually fixing the underlying bug, just allow system suspend
> to continue, so that laptops still go to sleep fine. Later, this can be
> reverted when the real bug is fixed.
> 
> Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/
> Cc: stable@vger.kernel.org # 6.1+
> Reported-by: Vlastimil Babka <vbabka@suse.cz>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
> This is basically untested and I haven't worked out if there are any
> awful implications of letting the system sleep when TPM suspend fails.
> Maybe some PCRs get cleared and that will make everything explode on
> resume? Maybe it doesn't matter? Somebody well versed in TPMology should
> probably [n]ack this approach.
> 
>  drivers/char/tpm/tpm-interface.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
> index d69905233aff..6df9067ef7f9 100644
> --- a/drivers/char/tpm/tpm-interface.c
> +++ b/drivers/char/tpm/tpm-interface.c
> @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
>  	}
>  
>  suspended:
> -	return rc;
> +	if (rc)
> +		pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> +		       chip->dev_num, rc);
> +	return 0;
>  }
>  EXPORT_SYMBOL_GPL(tpm_pm_suspend);
>  
> -- 
> 2.39.0
> 

Let me read all the threads through starting from the original report. I've
had emails piling up because of getting sick before holiday, and holiday
season after that.

This looks sane

Apologies for the lack of response.

BR, Jarkko
Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails
Posted by Jason A. Donenfeld 2 years, 8 months ago
Hi Jarkko,

On Mon, Jan 16, 2023 at 9:12 AM Jarkko Sakkinen <jarkko@kernel.org> wrote:
> > diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
> > index d69905233aff..6df9067ef7f9 100644
> > --- a/drivers/char/tpm/tpm-interface.c
> > +++ b/drivers/char/tpm/tpm-interface.c
> > @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
> >       }
> >
> >  suspended:
> > -     return rc;
> > +     if (rc)
> > +             pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> > +                    chip->dev_num, rc);
> > +     return 0;
> >  }
> >  EXPORT_SYMBOL_GPL(tpm_pm_suspend);
> >
> > --
> > 2.39.0
> >
>
> Let me read all the threads through starting from the original report. I've
> had emails piling up because of getting sick before holiday, and holiday
> season after that.
>
> This looks sane

No, not really. I mean, it was sane under the circumstances of, "I'm
not going to spend time fixing this for real if the maintainers aren't
around," and it fixed the suspend issue. But it doesn't actually fix
any real tpm issue. The real issue, AFAICT, is there's some sort of
race between the tpm rng read command and either suspend or wakeup or
selftest. One of these is missing some locking. And then commands step
on each other and the tpm gets upset. This is probably something that
should be fixed. I assume the "Fixes: ..." tag will actually go quite
far back, with recent things only unearthing a somewhat old bug. But
just a hunch.

Jason
Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails
Posted by Jarkko Sakkinen 2 years, 7 months ago
On Mon, Jan 16, 2023 at 03:03:17PM +0100, Jason A. Donenfeld wrote:
> Hi Jarkko,
> 
> On Mon, Jan 16, 2023 at 9:12 AM Jarkko Sakkinen <jarkko@kernel.org> wrote:
> > > diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
> > > index d69905233aff..6df9067ef7f9 100644
> > > --- a/drivers/char/tpm/tpm-interface.c
> > > +++ b/drivers/char/tpm/tpm-interface.c
> > > @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
> > >       }
> > >
> > >  suspended:
> > > -     return rc;
> > > +     if (rc)
> > > +             pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> > > +                    chip->dev_num, rc);
> > > +     return 0;
> > >  }
> > >  EXPORT_SYMBOL_GPL(tpm_pm_suspend);
> > >
> > > --
> > > 2.39.0
> > >
> >
> > Let me read all the threads through starting from the original report. I've
> > had emails piling up because of getting sick before holiday, and holiday
> > season after that.
> >
> > This looks sane
> 
> No, not really. I mean, it was sane under the circumstances of, "I'm
> not going to spend time fixing this for real if the maintainers aren't
> around," and it fixed the suspend issue. But it doesn't actually fix
> any real tpm issue. The real issue, AFAICT, is there's some sort of
> race between the tpm rng read command and either suspend or wakeup or
> selftest. One of these is missing some locking. And then commands step
> on each other and the tpm gets upset. This is probably something that
> should be fixed. I assume the "Fixes: ..." tag will actually go quite
> far back, with recent things only unearthing a somewhat old bug. But
> just a hunch.
> 
> Jason

See my response to Vlastimil:

https://lore.kernel.org/linux-integrity/Y8sr7YJ8e8eSpPFv@kernel.org/

Can you try what happens if you do not call tpm_add_hwrng()?

BR, Jarkko
Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails
Posted by Linus Torvalds 2 years, 8 months ago
On Thu, Jan 5, 2023 at 7:02 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> In lieu of actually fixing the underlying bug, just allow system suspend
> to continue, so that laptops still go to sleep fine. Later, this can be
> reverted when the real bug is fixed.

So the patch looks fine to me, but since there's still the ChromeOS
discussion pending I'll wait for that to finish.

Perhaps re-send or at least remind me if/when it does?

Also, a query about the printout:

> +       if (rc)
> +               pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> +                      chip->dev_num, rc);

so I suspect that 99% of the time the dev_num isn't actually all that
useful, but what *might* be useful is which tpm driver it is.

Just comparing the error dmesg output you had:

  ..
  tpm tpm0: Error (28) sending savestate before suspend
  tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
  ..

that "tpm tpm0" output is kind of useless compared to the "tpm_tis 00:08" one.

So I think "dev_err(dev, ...)" would be more useful here.

Finally - and maybe I'm just being difficult here, I will note here
again that TPM2 devices don't have this issue, because the TPM2 path
for suspend doesn't do any of this at all.

It just does

        tpm_transmit_cmd(..);

with a TPM2_CC_SHUTDOWN TPM_SU_STATE command, and doesn't even check
the return value. In fact, the tpm2 code *used* to have this comment:

        /* In places where shutdown command is sent there's no much we can do
         * except print the error code on a system failure.
         */
        if (rc < 0 && rc != -EPIPE)
                dev_warn(&chip->dev, "transmit returned %d while
stopping the TPM",
                         rc);

but it was summarily removed when doing some re-organization around
buffer handling.

So just by looking at what tpm2 does, I'm not 100% convinced that tpm1
should do this dance at all.

But having a dev_err() is probably a good idea at least as a transitional thing.

                  Linus
Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails
Posted by Luigi Semenzato 2 years, 8 months ago
I think it's fine to go ahead with your change, for multiple reasons.

1. I doubt that any of the ChromeOS devices using TPM 1.2 are still
being updated.
2. If the SAVESTATE command fails, it is probably better to continue
the transition to S3, and fail at resume, than to block the suspend.
The suspend is often triggered by closing the lid, so users would not
see what's going on and might put their running laptop in a backpack,
where it could overheat.
3. I don't recall bugs due to failures of TPM suspend, and I didn't
find any such bug in our database.  Many (most?) ChromeOS devices left
the TPM powered on in S3, so didn't use the suspend/resume path.

Thank you for asking!


On Fri, Jan 6, 2023 at 11:00 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Jan 5, 2023 at 7:02 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > In lieu of actually fixing the underlying bug, just allow system suspend
> > to continue, so that laptops still go to sleep fine. Later, this can be
> > reverted when the real bug is fixed.
>
> So the patch looks fine to me, but since there's still the ChromeOS
> discussion pending I'll wait for that to finish.
>
> Perhaps re-send or at least remind me if/when it does?
>
> Also, a query about the printout:
>
> > +       if (rc)
> > +               pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> > +                      chip->dev_num, rc);
>
> so I suspect that 99% of the time the dev_num isn't actually all that
> useful, but what *might* be useful is which tpm driver it is.
>
> Just comparing the error dmesg output you had:
>
>   ..
>   tpm tpm0: Error (28) sending savestate before suspend
>   tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
>   ..
>
> that "tpm tpm0" output is kind of useless compared to the "tpm_tis 00:08" one.
>
> So I think "dev_err(dev, ...)" would be more useful here.
>
> Finally - and maybe I'm just being difficult here, I will note here
> again that TPM2 devices don't have this issue, because the TPM2 path
> for suspend doesn't do any of this at all.
>
> It just does
>
>         tpm_transmit_cmd(..);
>
> with a TPM2_CC_SHUTDOWN TPM_SU_STATE command, and doesn't even check
> the return value. In fact, the tpm2 code *used* to have this comment:
>
>         /* In places where shutdown command is sent there's no much we can do
>          * except print the error code on a system failure.
>          */
>         if (rc < 0 && rc != -EPIPE)
>                 dev_warn(&chip->dev, "transmit returned %d while
> stopping the TPM",
>                          rc);
>
> but it was summarily removed when doing some re-organization around
> buffer handling.
>
> So just by looking at what tpm2 does, I'm not 100% convinced that tpm1
> should do this dance at all.
>
> But having a dev_err() is probably a good idea at least as a transitional thing.
>
>                   Linus
Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails
Posted by Linus Torvalds 2 years, 8 months ago
On Fri, Jan 6, 2023 at 12:04 PM Luigi Semenzato <semenzato@chromium.org> wrote:
>
> I think it's fine to go ahead with your change, for multiple reasons.

Ok, I've applied the patch (although I did end up editing it to use
dev_err() before doing that just to make myself happier about the
printout).

            Linus
Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails
Posted by Jason A. Donenfeld 2 years, 8 months ago
On Fri, Jan 06, 2023 at 02:28:00PM -0800, Linus Torvalds wrote:
> On Fri, Jan 6, 2023 at 12:04 PM Luigi Semenzato <semenzato@chromium.org> wrote:
> >
> > I think it's fine to go ahead with your change, for multiple reasons.
> 
> Ok, I've applied the patch (although I did end up editing it to use
> dev_err() before doing that just to make myself happier about the
> printout).

Thanks for fixing it up to use dev_err(). Final commit looks good to me.

Jason
Re: [PATCH v2] tpm: Allow system suspend to continue when TPM suspend fails
Posted by Jason A. Donenfeld 2 years, 8 months ago
Hi Todd & ChromeOS folks,

On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
> TPM 1 is sometimes broken across system suspends, due to races or
> locking issues or something else that haven't been diagnosed or fixed
> yet, most likely having to do with concurrent reads from the TPM's
> hardware random number generator driver. These issues prevent the system
> from actually suspending, with errors like:
> 
>   tpm tpm0: A TPM error (28) occurred continue selftest
>   ...
>   tpm tpm0: A TPM error (28) occurred attempting get random
>   ...
>   tpm tpm0: Error (28) sending savestate before suspend
>   tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
>   tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
>   tpm_tis 00:08: PM: failed to suspend: error 28
>   PM: Some devices failed to suspend, or early wake event detected
> 
> This issue was partially fixed by 23393c646142 ("char: tpm: Protect
> tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
> directly because the TPM maintainers weren't available. However, it
> seems like this just addresses the most common cases of the bug, rather
> than addressing it entirely. So there are more things to fix still,
> apparently.
> 
> In lieu of actually fixing the underlying bug, just allow system suspend
> to continue, so that laptops still go to sleep fine. Later, this can be
> reverted when the real bug is fixed.
> 
> Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/
> Cc: stable@vger.kernel.org # 6.1+
> Reported-by: Vlastimil Babka <vbabka@suse.cz>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
> This is basically untested and I haven't worked out if there are any
> awful implications of letting the system sleep when TPM suspend fails.
> Maybe some PCRs get cleared and that will make everything explode on
> resume? Maybe it doesn't matter? Somebody well versed in TPMology should
> probably [n]ack this approach.

When idling scrolling on my telephone to try to see what the
implications of skipping TPM_ORD_SAVESTATE could be, I stumbled across
some ChromeOS commits related to it, and realized that, ah-hah, finally
there's an obvious group of stakeholders who make heavy use of the TPM
and have likely amassed some expertise on it.

So I was wondering if you'd take a look at this patch briefly to make
sure it won't break ChromeOS laptops.

Jason
Re: [PATCH] tpm: Disable hwrng for TPM 1 if PM_SLEEP is enabled
Posted by Jason A. Donenfeld 2 years, 8 months ago
On Thu, Jan 05, 2023 at 03:47:42PM +0100, Jason A. Donenfeld wrote:
> TPM 1's support for its hardware RNG is broken across system suspends,
> due to races or locking issues or something else that haven't been
> diagnosed or fixed yet. These issues prevent the system from actually
> suspending. So disable the driver in this case. Later, when this is
> fixed properly, we can remove this.
> 
> Current breakage amounts to something like:
> 
>   tpm tpm0: A TPM error (28) occurred continue selftest
>   ...
>   tpm tpm0: A TPM error (28) occurred attempting get random
>   ...
>   tpm tpm0: Error (28) sending savestate before suspend
>   tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
>   tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
>   tpm_tis 00:08: PM: failed to suspend: error 28
>   PM: Some devices failed to suspend, or early wake event detected
> 
> This issue was partially fixed by 23393c646142 ("char: tpm: Protect
> tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
> directly because the TPM maintainers weren't available. However, it
> seems like this just addresses the most common cases of the bug, rather
> than addressing it entirely. So there are more things to fix still,
> apparently.
> 
> The hwrng driver appears already to be occasionally disabled due to
> other conditions, so this shouldn't be too large of a surprise.
> 
> Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/
> Cc: stable@vger.kernel.org # 6.1+
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Quoting from my previous email:

| I spent a long time working through the TPM code when this came up
| during 6.1. I set up the TPM emulator with QEMU and reproduced this and
| had a whole test setup and S3 fuzzer. It took a long time, and when I was
| done, I paged it all out of my brain. When I found that patch from Jan
| that fixed the problem most of the time (but not all the time), I wasted
| tons of time trying to get the TPM maintainers to take the patch and
| send it to Linus as part of rc7 or rc8. But they all ignored me, and
| eventually Linus just took that patch directly.
| 
| So I don't think I want to go down another rabbit hole here, having
| experienced the TPM maintainers not really caring much, and that sucking
| away the remaining energy I had before to keep looking at the issue and
| its edge cases not handled by Jan's patch.
| 
| On the contrary, it'd make a big difference if the TPM maintainers could
| actually help analyze the code that they're most familiar with, so that
| we can get to the bottom of this. That's a lot better than some random
| drive-by patches from a non-TPM person like me; before the 6.1 bug, I'd
| never even looked at these drivers.
| 
| My plan is to therefore be available to help and analyze and test and
| maybe even write some code, if the TPM maintainers take the lead on
| getting to the bottom of this. But if this hits neglect again like last
| time, I'll just send a `depends on BROKEN if PM` patch to the TPM
| hw_random driver and see what happens... That's a really awful solution
| though, so I hope the maintainers will wake up this cycle.

Seeing as there's still no life from the TPM maintainers, here's the
patch to make the problem go away until they wake up. When they do wake
up, though, I will be available to start looking into this again in
whatever capacity I might be useful.

Jason