REGRESSION WITH BISECT: v6.5-rc6 TPM patch breaks S3 on some Intel systems

Posted by Todd Brandt 2 years, 5 months ago

While testing S3 on 6.5.0-rc6 we've found that 5 systems are seeing a
crash and reboot situation when S3 suspend is initiated. To reproduce
it, this call is all that's required "sudo sleepgraph -m mem -rtcwake
15".

I've created a Bugzilla to track this issue here:
https://bugzilla.kernel.org/show_bug.cgi?id=217804

I've bisected the issue to this patch:

commit 554b841d470338a3b1d6335b14ee1cd0c8f5d754
Author: Mario Limonciello <mario.limonciello@amd.com>
Date:   Wed Aug 2 07:25:33 2023 -0500

    tpm: Disable RNG for all AMD fTPMs
    
    The TPM RNG functionality is not necessary for entropy when the CPU
    already supports the RDRAND instruction. The TPM RNG functionality
    was previously disabled on a subset of AMD fTPM series, but reports
    continue to show problems on some systems causing stutter root caused
    to TPM RNG functionality.
    
    Expand disabling TPM RNG use for all AMD fTPMs whether they have versions
    that claim to have fixed or not. To accomplish this, move the detection
    into part of the TPM CRB registration and add a flag indicating that
    the TPM should opt-out of registration to hwrng.

By reverting this patch in 6.5.0-rc6 the problem goes away, so it's
pretty clear that this commit is at fault. I've done further debugging
and I've found that if I simply comment out these lines in 6.5.0-rc6
the problem goes away. So the "crb_check_flags" call is the root cause.

diff --git a/drivers/char/tpm/tpm_crb.c b/drivers/char/tpm/tpm_crb.c
index 9eb1a1859012..20ce8102e6bd 100644
--- a/drivers/char/tpm/tpm_crb.c
+++ b/drivers/char/tpm/tpm_crb.c
@@ -826,9 +826,9 @@ static int crb_acpi_add(struct acpi_device *device)
        if (rc)
                goto out;
 
-       rc = crb_check_flags(chip);
-       if (rc)
-               goto out;
+//     rc = crb_check_flags(chip);
+//     if (rc)
+//             goto out;
 
        rc = tpm_chip_register(chip);

Re: REGRESSION WITH BISECT: v6.5-rc6 TPM patch breaks S3 on some Intel systems

Posted by Bagas Sanjaya 2 years, 5 months ago

On Thu, Aug 17, 2023 at 02:09:00PM -0700, Todd Brandt wrote:
> While testing S3 on 6.5.0-rc6 we've found that 5 systems are seeing a
> crash and reboot situation when S3 suspend is initiated. To reproduce
> it, this call is all that's required "sudo sleepgraph -m mem -rtcwake
> 15".
> 
> I've created a Bugzilla to track this issue here:
> https://bugzilla.kernel.org/show_bug.cgi?id=217804
> 
> I've bisected the issue to this patch:
> 
> commit 554b841d470338a3b1d6335b14ee1cd0c8f5d754
> Author: Mario Limonciello <mario.limonciello@amd.com>
> Date:   Wed Aug 2 07:25:33 2023 -0500
> 
>     tpm: Disable RNG for all AMD fTPMs
>     

Thanks for the regression report. I'm adding it to regzbot:

#regzbot ^introduced: 554b841d470338
#regzbot title: Disabling RNG for AMD fTPMs breaks S3 on some Intel systems

-- 
An old man doll... just what I always wanted! - Clara

Re: REGRESSION WITH BISECT: v6.5-rc6 TPM patch breaks S3 on some Intel systems

Posted by Jarkko Sakkinen 2 years, 5 months ago

On Fri Aug 18, 2023 at 12:09 AM EEST, Todd Brandt wrote:
> While testing S3 on 6.5.0-rc6 we've found that 5 systems are seeing a
> crash and reboot situation when S3 suspend is initiated. To reproduce
> it, this call is all that's required "sudo sleepgraph -m mem -rtcwake
> 15".

1. Are there logs available?
2. Is this the test case: https://pypi.org/project/sleepgraph/ (never used it before).

I'll see if I can repeat it with QEMU + swtpm.

> I've created a Bugzilla to track this issue here:
> https://bugzilla.kernel.org/show_bug.cgi?id=217804

Thank you for reporting this.

BR, Jarkko

Re: REGRESSION WITH BISECT: v6.5-rc6 TPM patch breaks S3 on some Intel systems

Posted by Todd Brandt 2 years, 5 months ago

On Fri, 2023-08-18 at 00:47 +0300, Jarkko Sakkinen wrote:
> On Fri Aug 18, 2023 at 12:09 AM EEST, Todd Brandt wrote:
> > While testing S3 on 6.5.0-rc6 we've found that 5 systems are seeing a
> > crash and reboot situation when S3 suspend is initiated. To reproduce
> > it, this call is all that's required "sudo sleepgraph -m mem -rtcwake
> > 15".
> 
> 1. Are there logs available?
> 2. Is this the test case: https://pypi.org/project/sleepgraph/ (never
> used it before).

There are no dmesg logs because the S3 crash wipes them out. Sleepgraph
isn't actually necessary to trigger it; a plain S3 suspend via "echo mem >
/sys/power/state" is enough.
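
For anyone trying this without sleepgraph, here is a minimal sketch of the
same reproduction as a shell helper. It is not what sleepgraph does
internally. The function names and the SYSFS_POWER override are hypothetical
and only there so the checks can be dry-run against a scratch directory. It
assumes util-linux rtcwake is installed and that "deep" in
/sys/power/mem_sleep corresponds to S3 on the machine under test.

```shell
#!/bin/sh
# Hypothetical override: point this at a scratch directory to dry-run the checks.
SYSFS_POWER="${SYSFS_POWER:-/sys/power}"

# Succeeds if the kernel lists "mem" as a supported sleep state.
supports_mem() {
    grep -qw mem "$SYSFS_POWER/state"
}

# Succeeds if "deep" (S3) is the currently selected mem_sleep mode.
# The selected mode is the one shown in square brackets.
deep_selected() {
    grep -q '\[deep\]' "$SYSFS_POWER/mem_sleep"
}

# Arms an RTC wake alarm, then requests suspend-to-RAM. Must run as root.
# $1 is the wake delay in seconds (default 15, matching "-rtcwake 15").
suspend_s3() {
    wake_secs="${1:-15}"
    supports_mem || { echo "mem is not listed in $SYSFS_POWER/state" >&2; return 1; }
    deep_selected || { echo "deep (S3) is not selected in mem_sleep" >&2; return 1; }
    # "-m no" only programs the alarm; it does not suspend by itself.
    rtcwake -m no -s "$wake_secs" || return 1
    echo mem > "$SYSFS_POWER/state"
}
```

Sourcing the file and running "suspend_s3 15" as root should be roughly
equivalent to the echo above plus a 15 second wake alarm. If mem_sleep shows
"[s2idle]" instead, the write would enter suspend-to-idle rather than S3, so
the helper refuses to continue in that case.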

So far it appears to only have affected test systems, not production
hardware, and none of them have TPM chips, so I'm beginning to wonder
if this patch just inadvertently activated a bug somewhere else in the
kernel that happens to affect test hardware.

I'll continue to debug it. This isn't an emergency, as so far I haven't
seen it on production hardware.

> 
> I'll see if I can repeat it with QEMU + swtpm.
> 
> > I've created a Bugzilla to track this issue here:
> > https://bugzilla.kernel.org/show_bug.cgi?id=217804
> 
> Thank you for reporting this.
> 
> BR, Jarkko

Re: REGRESSION WITH BISECT: v6.5-rc6 TPM patch breaks S3 on some Intel systems

Posted by Jarkko Sakkinen 2 years, 5 months ago

On Fri Aug 18, 2023 at 1:25 AM EEST, Todd Brandt wrote:
> On Fri, 2023-08-18 at 00:47 +0300, Jarkko Sakkinen wrote:
> > On Fri Aug 18, 2023 at 12:09 AM EEST, Todd Brandt wrote:
> > > While testing S3 on 6.5.0-rc6 we've found that 5 systems are seeing a
> > > crash and reboot situation when S3 suspend is initiated. To reproduce
> > > it, this call is all that's required "sudo sleepgraph -m mem -rtcwake
> > > 15".
> > 
> > 1. Are there logs available?
> > 2. Is this the test case: https://pypi.org/project/sleepgraph/ (never
> > used it before).
>
> There are no dmesg logs because the S3 crash wipes them out. Sleepgraph
> isn't actually necessary to trigger it; a plain S3 suspend via "echo mem >
> /sys/power/state" is enough.
>
> So far it appears to only have affected test systems, not production
> hardware, and none of them have TPM chips, so I'm beginning to wonder
> if this patch just inadvertently activated a bug somewhere else in the
> kernel that happens to affect test hardware.
>
> I'll continue to debug it. This isn't an emergency, as so far I haven't
> seen it on production hardware.

OK, I'll still see if I can reproduce it, just in case.

BR, Jarkko