[PATCH] x86/CPU/AMD: Prevent reset reasons from being retained among boots

Rong Zhang posted 1 patch 4 months, 3 weeks ago
There is a newer version of this series
Documentation/arch/x86/amd-debugging.rst |  3 +++
arch/x86/kernel/cpu/amd.c                | 25 +++++++++++++++++++++---
2 files changed, 25 insertions(+), 3 deletions(-)
[PATCH] x86/CPU/AMD: Prevent reset reasons from being retained among boots
Posted by Rong Zhang 4 months, 3 weeks ago
The S5_RESET_STATUS register is parsed on boot and printed to kmsg.
However, this could sometimes be misleading and lead to users wasting a
lot of time on meaningless debugging for two reasons:

* Some bits are never cleared by hardware. It's the software's
responsibility to clear them as per the Processor Programming Reference
(see Link:).

* Some rare hardware-initiated platform resets do not update the
register at all.

In both cases, a previous reboot could leave its trace in the register,
resulting in users seeing unrelated reboot reasons while debugging
random reboots afterward.

Clearing all reason bits solves the issue. Since all reason bits are
write-1-to-clear and we must preserve all other bits, this is done by
writing the read value back to the register.

A debug message with the cleared register value is also printed to help
distinguish between non-reason bits set and present reserved values
defined in the future.

Fixes: ab8131028710 ("x86/CPU/AMD: Print the reason for the last reset")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537#attach_303991
Signed-off-by: Rong Zhang <i@rong.moe>
---
 Documentation/arch/x86/amd-debugging.rst |  3 +++
 arch/x86/kernel/cpu/amd.c                | 25 +++++++++++++++++++++---
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/Documentation/arch/x86/amd-debugging.rst b/Documentation/arch/x86/amd-debugging.rst
index d92bf59d62c77..4723ad32ddcd3 100644
--- a/Documentation/arch/x86/amd-debugging.rst
+++ b/Documentation/arch/x86/amd-debugging.rst
@@ -366,3 +366,6 @@ There are 6 classes of reasons for the reboot:
 This information is read by the kernel at bootup and printed into
 the syslog. When a random reboot occurs this message can be helpful
 to determine the next component to debug.
+
+To prevent unrelated reboot reasons from being retained among boots,
+the kernel clears all reason bits once reading the register.
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 5398db4dedb4a..c2e3925eb6855 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1344,7 +1344,7 @@ static const char * const s5_reset_reason_txt[] = {
 static __init int print_s5_reset_status_mmio(void)
 {
 	void __iomem *addr;
-	u32 value;
+	u32 value, cleared_value;
 	int i;
 
 	if (!cpu_feature_enabled(X86_FEATURE_ZEN))
@@ -1355,11 +1355,25 @@ static __init int print_s5_reset_status_mmio(void)
 		return 0;
 
 	value = ioread32(addr);
-	iounmap(addr);
 
 	/* Value with "all bits set" is an error response and should be ignored. */
-	if (value == U32_MAX)
+	if (value == U32_MAX) {
+		iounmap(addr);
 		return 0;
+	}
+
+	/*
+	 * Clear all reason bits so they won't be retained if the next reset
+	 * does not update the register. Besides, some bits are never cleared by
+	 * hardware so it's software's responsibility to clear them.
+	 *
+	 * Writing the value back effectively clears all reason bits as they are
+	 * write-1-to-clear.
+	 */
+	iowrite32(value, addr);
+	cleared_value = ioread32(addr);
+
+	iounmap(addr);
 
 	for (i = 0; i < ARRAY_SIZE(s5_reset_reason_txt); i++) {
 		if (!(value & BIT(i)))
@@ -1371,6 +1385,11 @@ static __init int print_s5_reset_status_mmio(void)
 		}
 	}
 
+	if (cleared_value != value) {
+		pr_debug("x86/amd: Cleared system reset reasons [0x%08x => 0x%08x]\n",
+			 value, cleared_value);
+	}
+
 	return 0;
 }
 late_initcall(print_s5_reset_status_mmio);

base-commit: f7a6ef198ded30b63810efdc923b919606ea65c8
-- 
2.51.0
Re: [PATCH] x86/CPU/AMD: Prevent reset reasons from being retained among boots
Posted by Borislav Petkov 4 months, 3 weeks ago
On Sat, Sep 13, 2025 at 10:42:45PM +0800, Rong Zhang wrote:
> The S5_RESET_STATUS register is parsed on boot and printed to kmsg.
> However, this could sometimes be misleading and lead to users wasting a
> lot of time on meaningless debugging for two reasons:
> 
> * Some bits are never cleared by hardware. It's the software's
> responsibility to clear them as per the Processor Programming Reference
> (see Link:).
> 
> * Some rare hardware-initiated platform resets do not update the
> register at all.
> 
> In both cases, a previous reboot could leave its trace in the register,
> resulting in users seeing unrelated reboot reasons while debugging
> random reboots afterward.

Just a heads-up: we're figuring out internally what the right thing to do here
would be.

Stay tuned.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH] x86/CPU/AMD: Prevent reset reasons from being retained among boots
Posted by Mario Limonciello (AMD) (kernel.org) 4 months, 1 week ago

On 9/16/2025 9:02 AM, Borislav Petkov wrote:
> On Sat, Sep 13, 2025 at 10:42:45PM +0800, Rong Zhang wrote:
>> The S5_RESET_STATUS register is parsed on boot and printed to kmsg.
>> However, this could sometimes be misleading and lead to users wasting a
>> lot of time on meaningless debugging for two reasons:
>>
>> * Some bits are never cleared by hardware. It's the software's
>> responsibility to clear them as per the Processor Programming Reference
>> (see Link:).
>>
>> * Some rare hardware-initiated platform resets do not update the
>> register at all.
>>
>> In both cases, a previous reboot could leave its trace in the register,
>> resulting in users seeing unrelated reboot reasons while debugging
>> random reboots afterward.
> 
> Just a heads-up: we're figuring out internally what the right thing to do here
> would be.
> 
> Stay tuned.
> 
> Thx.
> 

The internal conversation points in the direction of your patch makes sense.

But I don't really see a lot of value in re-reading and printing a debug 
message about what was cleared and what's still there.  Do you see a 
reason to keep that around?
Re: [PATCH] x86/CPU/AMD: Prevent reset reasons from being retained among boots
Posted by Rong Zhang 4 months, 1 week ago
Hi Mario,

On Sun, 2025-09-28 at 23:01 -0500, Mario Limonciello (AMD) (kernel.org)
wrote:
> 
> On 9/16/2025 9:02 AM, Borislav Petkov wrote:
> > On Sat, Sep 13, 2025 at 10:42:45PM +0800, Rong Zhang wrote:
> > > The S5_RESET_STATUS register is parsed on boot and printed to kmsg.
> > > However, this could sometimes be misleading and lead to users wasting a
> > > lot of time on meaningless debugging for two reasons:
> > > 
> > > * Some bits are never cleared by hardware. It's the software's
> > > responsibility to clear them as per the Processor Programming Reference
> > > (see Link:).
> > > 
> > > * Some rare hardware-initiated platform resets do not update the
> > > register at all.
> > > 
> > > In both cases, a previous reboot could leave its trace in the register,
> > > resulting in users seeing unrelated reboot reasons while debugging
> > > random reboots afterward.
> > 
> > Just a heads-up: we're figuring out internally what the right thing to do here
> > would be.
> > 
> > Stay tuned.
> > 
> > Thx.
> > 
> 
> The internal conversation points in the direction of your patch makes sense.

Thanks for your effort!

> But I don't really see a lot of value in re-reading and printing a debug 
> message about what was cleared and what's still there.  Do you see a 
> reason to keep that around?

In order that users don't need an up-to-date documentation, or even
don't need a documentation, to distinguish between reason bits and non-
reason ones.

Let's consider two examples.

(a)

   Previous system reset reason [0x08000800]: an uncorrected error...
                                    ^   ^

A user may feel confused: Two bits are set, but only one reason is
reported. Hmm... Is there a hidden failure?

Unless the user has read the PPR, it's hard to realize BIT(11) is
already set in the reset value. The debug message is here to help:

   Cleared system reset reasons [0x08000800 => 0x00000800]
                                    ^   ^         ^   ^

Now the user realizes that BIT(11) has nothing to do with reboot
reasons.

This was literally the confusion I experienced. I had to take some time
looking for an appropriate public PPR and reading the PPR before
realizing this fact.

(b)

Suppose BIT(7) is defined in the future.

   (nothing is printed to kmsg)

The debug message is here to help:

   Cleared system reset reasons [0x00000880 => 0x00000800]
                                        ^^            ^^

Now the user realizes that BIT(7) represents a new reboot reason.

Thanks,
Rong
Re: [PATCH] x86/CPU/AMD: Prevent reset reasons from being retained among boots
Posted by Borislav Petkov 4 months ago
On Tue, Sep 30, 2025 at 06:18:30PM +0800, Rong Zhang wrote:
> A user may feel confused: Two bits are set, but only one reason is
> reported. Hmm... Is there a hidden failure?

Why would you assume that 1 set bit == 1 failure reason?

> Unless the user has read the PPR, it's hard to realize BIT(11) is
> already set in the reset value. The debug message is here to help:
> 
>    Cleared system reset reasons [0x08000800 => 0x00000800]
>                                     ^   ^         ^   ^
> 
> Now the user realizes that BIT(11) has nothing to do with reboot
> reasons.
> 
> This was literally the confusion I experienced. I had to take some time
> looking for an appropriate public PPR and reading the PPR before
> realizing this fact.

Please explain in detail what confusion you were experiencing so that we can
address it properly.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH] x86/CPU/AMD: Prevent reset reasons from being retained among boots
Posted by Rong Zhang 4 months ago
Hi Borislav,

On Mon, 2025-10-06 at 15:31 +0200, Borislav Petkov wrote:
> On Tue, Sep 30, 2025 at 06:18:30PM +0800, Rong Zhang wrote:
> > A user may feel confused: Two bits are set, but only one reason is
> > reported. Hmm... Is there a hidden failure?
> 
> Why would you assume that 1 set bit == 1 failure reason?

When I first saw the for-loop, I noticed it didn't break once a reason
was found.

And Documentation/arch/x86/amd-debugging.rst says:

   When a random reboot occurs, the high-level reason for the reboot is
   stored in a register that will persist onto the next boot.

These made me assume the only purpose of the register was to store
reboot reasons, which further made me assume 1 set bit == 1 reason.

Yeah, I knew I made a mistake here once I read the PPR. Perhaps we
could improve the wording in the documentation? I.e., mentioning that
non-listed bits have nothing to do with reboot reasons.

> > Unless the user has read the PPR, it's hard to realize BIT(11) is
> > already set in the reset value. The debug message is here to help:
> > 
> >    Cleared system reset reasons [0x08000800 => 0x00000800]
> >                                     ^   ^         ^   ^
> > 
> > Now the user realizes that BIT(11) has nothing to do with reboot
> > reasons.
> > 
> > This was literally the confusion I experienced. I had to take some time
> > looking for an appropriate public PPR and reading the PPR before
> > realizing this fact.
> 
> Please explain in detail what confusion you were experiencing so that we can
> address it properly.

See above. I assumed BIT(11) was an undocumented reboot reason at that
time. I no longer felt confused once I read the PPR, which says "Reset:
0000_0800h."

That said, I am OK to remove the debug message printing the cleared
value. If we decide that it's better to remove it, I will send a [PATCH
v2].

> Thx.

Thanks,
Rong
Re: [PATCH] x86/CPU/AMD: Prevent reset reasons from being retained among boots
Posted by Borislav Petkov 4 months ago
On Tue, Oct 07, 2025 at 01:07:45AM +0800, Rong Zhang wrote:
> Perhaps we could improve the wording in the documentation? I.e., mentioning
> that non-listed bits have nothing to do with reboot reasons.

You don't need to document in that detail. That's in the code itself:
s5_reset_reason_txt is sparse and that already tells you that not every bit in
the corresponding register is a reboot reason. It could just as well be junk
for all we know...

> That said, I am OK to remove the debug message printing the cleared
> value. If we decide that it's better to remove it, I will send a [PATCH
> v2].

The fix should be to simply write back the value read before we return from
that function.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette