drivers/acpi/apei/ghes.c | 1 + 1 file changed, 1 insertion(+)
There is a problem report that when debugging a hard-to-reproduce panic
issue, user wanted the kernel to not reboot by adding "panic=0" in
kernel cmdline, so that the panic context could be kept, say the panic
was caught randomly in the mid-night, and user hoped to check it in
the morning. GHES panic handler may overwrite that user setting and
force to reboot after 'ghes_panic_timeout'(30) seconds.
Make 'ghes_panic_timeout' a parameter can provide user some flexibility
to change the timeout on demand, without changing current behavior.
Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
---
drivers/acpi/apei/ghes.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 07789f0b59bc..a8a6310e476a 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -174,6 +174,7 @@ static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_
static atomic_t ghes_estatus_cache_alloced;
static int ghes_panic_timeout __read_mostly = 30;
+module_param(ghes_panic_timeout, int, 0644);
static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx)
{
--
2.39.5 (Apple Git-154)
On December 27, 2024 10:54:22 AM GMT+01:00, Feng Tang <feng.tang@linux.alibaba.com> wrote: >There is a problem report that when debugging a hard-to-reproduce panic >issue, user wanted the kernel to not reboot by adding "panic=0" in >kernel cmdline, so that the panic context could be kept, say the panic >was caught randomly in the mid-night, and user hoped to check it in >the morning. GHES panic handler may overwrite that user setting and >force to reboot after 'ghes_panic_timeout'(30) seconds. Why doesn't the ghes panic handler honor a panic=0 setting? -- Sent from a small device: formatting sucks and brevity is inevitable.
Hi, Boris, Borislav Petkov <bp@alien8.de> writes: > On December 27, 2024 10:54:22 AM GMT+01:00, Feng Tang <feng.tang@linux.alibaba.com> wrote: >>There is a problem report that when debugging a hard-to-reproduce panic >>issue, user wanted the kernel to not reboot by adding "panic=0" in >>kernel cmdline, so that the panic context could be kept, say the panic >>was caught randomly in the mid-night, and user hoped to check it in >>the morning. GHES panic handler may overwrite that user setting and >>force to reboot after 'ghes_panic_timeout'(30) seconds. > > Why doesn't the ghes panic handler honor a panic=0 setting? It appears that I introduced the ghes_panic_timeout originally. panic() is used for software errors, while ghes is used for hardware errors. They may have different requirements. For example, it may be OK to wait forever for a software error, but it may be better to reboot the system to contain the influence of the hardware error for some hardware errors. So, we introduced another knob for that. --- Best Regards, Huang, Ying
On Mon, Dec 30, 2024 at 01:54:36PM +0800, Huang, Ying wrote:
> For example, it may be OK to wait forever for a software error, but it may
> be better to reboot the system to contain the influence of the hardware
> error for some hardware errors.
A default panic timeout of 30 seconds for hw errors?! You do realize that 30
seconds for a machine is an eternity and by that time your hardware error has
long propagated and corrupted results, right?
So your timeout is not even trying to do what you want.
So unless I'm missing something, this ghes timeout needs to go - if you want
to "contain the influence" you need to panic *immediately*! And not even that
would help in some cases - hw has its own protections there so the OS
panicking is meh. At least on x86, that is.
> So, we introduced another knob for that.
No, that another knob is piling more of the silly ontop.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
Borislav Petkov <bp@alien8.de> writes: > On Mon, Dec 30, 2024 at 01:54:36PM +0800, Huang, Ying wrote: >> For example, it may be OK to wait forever for a software error, but it may >> be better to reboot the system to contain the influence of the hardware >> error for some hardware errors. > > A default panic timeout of 30 seconds for hw errors?! You do realize that 30 > seconds for a machine is an eternity and by that time your hardware error has > long propagated and corrupted results, right? > > So your timeout is not even trying to do what you want. > > So unless I'm missing something, this ghes timeout needs to go - if you want > to "contain the influence" you need to panic *immediately*! And not even that > would help in some cases - hw has its own protections there so the OS > panicking is meh. At least on x86, that is. OK. 30 seconds isn't good enough for hw errors. Another possible benefit of ghes_panic_timeout is, rebooting instead of waiting forever can help us to log/report the hardware errors earlier. For example, the hardware errors could be logged in some simple non-volatile storage (such as EFI variables) during panic or kdump, etc. Then, after reboot, the new kernel could report the hardware errors in some way. >> So, we introduced another knob for that. > > No, that another knob is piling more of the silly ontop. --- Best Regards, Huang, Ying
On Mon, Dec 30, 2024 at 07:04:59PM +0800, Huang, Ying wrote:
> Another possible benefit of ghes_panic_timeout is,
>
> rebooting instead of waiting forever can help us to log/report the
> hardware errors earlier.
So you rip out ghes_panic_timeout and set the panic_timeout in the ghes code,
when you get a hw error which requires reboot.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Dec 30, 2024 at 12:26:08PM +0100, Borislav Petkov wrote: > On Mon, Dec 30, 2024 at 07:04:59PM +0800, Huang, Ying wrote: > > Another possible benefit of ghes_panic_timeout is, > > > > rebooting instead of waiting forever can help us to log/report the > > hardware errors earlier. > > So you rip out ghes_panic_timeout and set the panic_timeout in the ghes code, > when you get a hw error which requires reboot. Still we may need to honor the user setting, say if user specifically set "panic=XXX" in the cmdline, we should detect that case and skip overwritting the panic_timeout? Thanks, Feng > > -- > Regards/Gruss, > Boris. > > https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Dec 30, 2024 at 07:40:18PM +0800, Feng Tang wrote:
> Still we may need to honor the user setting, say if user specifically set
> "panic=XXX" in the cmdline, we should detect that case and skip overwritting
> the panic_timeout?
So the user has set panic timeout to something, machine encounters a hw error
which you want to log/report earlier but user's panic setting prevents you
from doing that.
So what do you do?
What has higher prio?
I guess we're something like this:
https://youtu.be/gLFQystE8vU?t=71
:-P
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Dec 30, 2024 at 01:10:09PM +0100, Borislav Petkov wrote: > On Mon, Dec 30, 2024 at 07:40:18PM +0800, Feng Tang wrote: > > Still we may need to honor the user setting, say if user specifically set > > "panic=XXX" in the cmdline, we should detect that case and skip overwritting > > the panic_timeout? > > So the user has set panic timeout to something, machine encounters a hw error > which you want to log/report earlier but user's panic setting prevents you > from doing that. > > So what do you do? > > What has higher prio? > > I guess we're something like this: > > https://youtu.be/gLFQystE8vU?t=71 Hah! As per kernel config, most ARCH has 'panic_timeout' as 0 by default, so need to set the kcmdline. For the case in my commit log, where user had clear requirement for not-reboot and wait, the manually set 'panic=0' should take priority here? Thanks, Feng > > :-P > > -- > Regards/Gruss, > Boris. > > https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Dec 30, 2024 at 09:04:11PM +0800, Feng Tang wrote:
> As per kernel config, most ARCH has 'panic_timeout' as 0 by default, so
> need to set the kcmdline. For the case in my commit log, where user had
> clear requirement for not-reboot and wait, the manually set 'panic=0'
> should take priority here?
I think so.
I'm not convinced that lets-log-the-hw-error-faster should override the panic
timeout setting but I'm open to real-life example scenarios...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Dec 30, 2024 at 02:24:03PM +0100, Borislav Petkov wrote:
> On Mon, Dec 30, 2024 at 09:04:11PM +0800, Feng Tang wrote:
> > As per kernel config, most ARCH has 'panic_timeout' as 0 by default, so
> > need to set the kcmdline. For the case in my commit log, where user had
> > clear requirement for not-reboot and wait, the manually set 'panic=0'
> > should take priority here?
>
> I think so.
>
> I'm not convinced that lets-log-the-hw-error-faster should override the panic
> timeout setting but I'm open to real-life example scenarios...
>
I see. Upon the discussion so far, how about the change below?
---
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 07789f0b59bc..113471b76d8d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -173,8 +173,6 @@ static struct gen_pool *ghes_estatus_pool;
static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
static atomic_t ghes_estatus_cache_alloced;
-static int ghes_panic_timeout __read_mostly = 30;
-
static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx)
{
phys_addr_t paddr;
@@ -987,9 +985,10 @@ static void __ghes_panic(struct ghes *ghes,
ghes_clear_estatus(ghes, estatus, buf_paddr, fixmap_idx);
- /* reboot to log the error! */
- if (!panic_timeout)
- panic_timeout = ghes_panic_timeout;
+ /* If user hasn't specifically set panic timeout, reboot to log the error! */
+ if (!panic_timeout && !strstr(saved_command_line, "panic="))
+ panic_timeout = 30;
+
panic("Fatal hardware error!");
}
Or we want to stick the orignal patch, which doesn't change the
original flow?
Thanks,
Feng
On Tue, Dec 31, 2024 at 02:44:48PM +0800, Feng Tang wrote:
> + /* If user hasn't specifically set panic timeout, reboot to log the error! */
> + if (!panic_timeout && !strstr(saved_command_line, "panic="))
And you want to scan saved_command_line because?
Hint: look at how other code checks panic_timeout.
> Or we want to stick the orignal patch, which doesn't change the
> original flow?
And pile more broken stuff ontop?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Dec 31, 2024 at 10:23:58AM +0100, Borislav Petkov wrote: > On Tue, Dec 31, 2024 at 02:44:48PM +0800, Feng Tang wrote: > > + /* If user hasn't specifically set panic timeout, reboot to log the error! */ > > + if (!panic_timeout && !strstr(saved_command_line, "panic=")) > > And you want to scan saved_command_line because? > > Hint: look at how other code checks panic_timeout. Thanks for the hint! IIUC, you are mentioning the set_arch_panic_timeout(). One thing is, most ARCHs' default timeout is 0, while in our case, the user will also set 'panic=0' :), so we can't easily detect if the 0 is the user-set value or the OS default one. Originally I even thought about adding a flag of 'timeout_user_changed'. Any suggestion? > > Or we want to stick the orignal patch, which doesn't change the > > original flow? > > And pile more broken stuff ontop? OK, will skip this. Thanks, Feng
On Tue, Dec 31, 2024 at 06:15:59PM +0800, Feng Tang wrote:
> Thanks for the hint! IIUC, you are mentioning the set_arch_panic_timeout().
> One thing is, most ARCHs' default timeout is 0, while in our case, the user
> will also set 'panic=0' :), so we can't easily detect if the 0 is the user-set
> value or the OS default one. Originally I even thought about adding a flag
> of 'timeout_user_changed'. Any suggestion?
Ok, enough talking. Let's get concrete:
From: "Borislav Petkov (AMD)" <bp@alien8.de>
Date: Tue, 31 Dec 2024 12:03:55 +0100
Subject: [PATCH] APEI: GHES: Have GHES honor the panic= setting
The GHES driver overrides the panic= setting by rebooting the system
after a fatal hw error has been reported. The intent being that such an
error would be hopefully written out faster on non-volatile storage for
later inspection.
However, this is not optimal when a hard-to-debug issue requires long
time to reproduce and when that happens, the box will get rebooted after
30 seconds and thus destroy the whole hw context of when the error
happened.
So rip out the default GHES panic timeout and honor the global one.
In the panic disabled (panic=0) case, the error will still be logged to
dmesg for later inspection and if panic after a hw error is really
required, then that can be controlled the usual way - use panic= on the
cmdline or set it in the kernel .config's CONFIG_PANIC_TIMEOUT.
Reported-by: Feng Tang <feng.tang@linux.alibaba.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
---
drivers/acpi/apei/ghes.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 07789f0b59bc..b72772494655 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -173,8 +173,6 @@ static struct gen_pool *ghes_estatus_pool;
static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
static atomic_t ghes_estatus_cache_alloced;
-static int ghes_panic_timeout __read_mostly = 30;
-
static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx)
{
phys_addr_t paddr;
@@ -983,14 +981,16 @@ static void __ghes_panic(struct ghes *ghes,
struct acpi_hest_generic_status *estatus,
u64 buf_paddr, enum fixed_addresses fixmap_idx)
{
+ const char *msg = GHES_PFX "Fatal hardware error";
+
__ghes_print_estatus(KERN_EMERG, ghes->generic, estatus);
ghes_clear_estatus(ghes, estatus, buf_paddr, fixmap_idx);
- /* reboot to log the error! */
if (!panic_timeout)
- panic_timeout = ghes_panic_timeout;
- panic("Fatal hardware error!");
+ pr_emerg("%s but panic disabled\n", msg);
+
+ panic(msg);
}
static int ghes_proc(struct ghes *ghes)
--
2.43.0
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Dec 31, 2024 at 12:13:14PM +0100, Borislav Petkov wrote: [...] > From: "Borislav Petkov (AMD)" <bp@alien8.de> > Date: Tue, 31 Dec 2024 12:03:55 +0100 > Subject: [PATCH] APEI: GHES: Have GHES honor the panic= setting > > The GHES driver overrides the panic= setting by rebooting the system > after a fatal hw error has been reported. The intent being that such an > error would be hopefully written out faster on non-volatile storage for > later inspection. > > However, this is not optimal when a hard-to-debug issue requires long > time to reproduce and when that happens, the box will get rebooted after > 30 seconds and thus destroy the whole hw context of when the error > happened. > > So rip out the default GHES panic timeout and honor the global one. > > In the panic disabled (panic=0) case, the error will still be logged to > dmesg for later inspection and if panic after a hw error is really > required, then that can be controlled the usual way - use panic= on the > cmdline or set it in the kernel .config's CONFIG_PANIC_TIMEOUT. Also, is it worth including this patch into stable tree? Thanks, Feng > > Reported-by: Feng Tang <feng.tang@linux.alibaba.com> > Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> > --- > drivers/acpi/apei/ghes.c | 10 +++++----- > 1 file changed, 5 insertions(+), 5 deletions(-)
On Thu, Jan 09, 2025 at 09:47:57PM +0800, Feng Tang wrote:
> Also, is it worth including this patch into stable tree?
Only if this fixes a regression. Which this is not, AFAICT.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
Borislav Petkov <bp@alien8.de> writes:
> On Tue, Dec 31, 2024 at 06:15:59PM +0800, Feng Tang wrote:
>> Thanks for the hint! IIUC, you are mentioning the set_arch_panic_timeout().
>> One thing is, most ARCHs' default timeout is 0, while in our case, the user
>> will also set 'panic=0' :), so we can't easily detect if the 0 is the user-set
>> value or the OS default one. Originally I even thought about adding a flag
>> of 'timeout_user_changed'. Any suggestion?
>
> Ok, enough talking. Let's get concrete:
>
> From: "Borislav Petkov (AMD)" <bp@alien8.de>
> Date: Tue, 31 Dec 2024 12:03:55 +0100
> Subject: [PATCH] APEI: GHES: Have GHES honor the panic= setting
>
> The GHES driver overrides the panic= setting by rebooting the system
> after a fatal hw error has been reported. The intent being that such an
> error would be hopefully written out faster on non-volatile storage for
> later inspection.
IIUC, the hardware error will be written out on non-volatile storage at
the same time with or without ghes_panic_timeout overriding. The
difference is that after rebooting, the error information in
non-volatile storage can be extracted and reported via UI, SNMP, or
MCTP earlier.
> However, this is not optimal when a hard-to-debug issue requires long
> time to reproduce and when that happens, the box will get rebooted after
> 30 seconds and thus destroy the whole hw context of when the error
> happened.
>
> So rip out the default GHES panic timeout and honor the global one.
>
> In the panic disabled (panic=0) case, the error will still be logged to
> dmesg for later inspection and if panic after a hw error is really
> required, then that can be controlled the usual way - use panic= on the
> cmdline or set it in the kernel .config's CONFIG_PANIC_TIMEOUT.
>
> Reported-by: Feng Tang <feng.tang@linux.alibaba.com>
> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
> ---
> drivers/acpi/apei/ghes.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 07789f0b59bc..b72772494655 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -173,8 +173,6 @@ static struct gen_pool *ghes_estatus_pool;
> static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
> static atomic_t ghes_estatus_cache_alloced;
>
> -static int ghes_panic_timeout __read_mostly = 30;
> -
> static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx)
> {
> phys_addr_t paddr;
> @@ -983,14 +981,16 @@ static void __ghes_panic(struct ghes *ghes,
> struct acpi_hest_generic_status *estatus,
> u64 buf_paddr, enum fixed_addresses fixmap_idx)
> {
> + const char *msg = GHES_PFX "Fatal hardware error";
> +
> __ghes_print_estatus(KERN_EMERG, ghes->generic, estatus);
>
> ghes_clear_estatus(ghes, estatus, buf_paddr, fixmap_idx);
>
> - /* reboot to log the error! */
> if (!panic_timeout)
> - panic_timeout = ghes_panic_timeout;
> - panic("Fatal hardware error!");
> + pr_emerg("%s but panic disabled\n", msg);
> +
> + panic(msg);
> }
>
> static int ghes_proc(struct ghes *ghes)
---
Best Regards,
Huang, Ying
On Thu, Jan 02, 2025 at 10:46:54AM +0800, Huang, Ying wrote:
> IIUC, the hardware error will be written out on non-volatile storage at
> the same time with or without ghes_panic_timeout overriding. The
> difference is that after rebooting, the error information in
> non-volatile storage can be extracted and reported via UI, SNMP, or
> MCTP earlier.
... which basically makes this forceful rebooting even worse. The user can
reboot whenever she sees fit - not when by some questionable decision done by
kernel devs.
One more reason to whack this thing.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
The GHES driver overrides the panic= setting by force-rebooting the
system after a fatal hw error has been reported. The intent being that
such an error would be reported earlier.
However, this is not optimal when a hard-to-debug issue requires long
time to reproduce and when that happens, the box will get rebooted after
30 seconds and thus destroy the whole hw context of when the error
happened.
So rip out the default GHES panic timeout and honor the global one.
In the panic disabled (panic=0) case, the error will still be logged to
dmesg for later inspection and if panic after a hw error is really
required, then that can be controlled the usual way - use panic= on the
cmdline or set it in the kernel .config's CONFIG_PANIC_TIMEOUT.
Reported-by: Feng Tang <feng.tang@linux.alibaba.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com>
---
drivers/acpi/apei/ghes.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 07789f0b59bc..b72772494655 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -173,8 +173,6 @@ static struct gen_pool *ghes_estatus_pool;
static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
static atomic_t ghes_estatus_cache_alloced;
-static int ghes_panic_timeout __read_mostly = 30;
-
static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx)
{
phys_addr_t paddr;
@@ -983,14 +981,16 @@ static void __ghes_panic(struct ghes *ghes,
struct acpi_hest_generic_status *estatus,
u64 buf_paddr, enum fixed_addresses fixmap_idx)
{
+ const char *msg = GHES_PFX "Fatal hardware error";
+
__ghes_print_estatus(KERN_EMERG, ghes->generic, estatus);
ghes_clear_estatus(ghes, estatus, buf_paddr, fixmap_idx);
- /* reboot to log the error! */
if (!panic_timeout)
- panic_timeout = ghes_panic_timeout;
- panic("Fatal hardware error!");
+ pr_emerg("%s but panic disabled\n", msg);
+
+ panic(msg);
}
static int ghes_proc(struct ghes *ghes)
--
2.43.0
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
Borislav Petkov wrote: > The GHES driver overrides the panic= setting by force-rebooting the > system after a fatal hw error has been reported. The intent being that > such an error would be reported earlier. > > However, this is not optimal when a hard-to-debug issue requires long > time to reproduce and when that happens, the box will get rebooted after > 30 seconds and thus destroy the whole hw context of when the error > happened. > > So rip out the default GHES panic timeout and honor the global one. > > In the panic disabled (panic=0) case, the error will still be logged to > dmesg for later inspection and if panic after a hw error is really > required, then that can be controlled the usual way - use panic= on the > cmdline or set it in the kernel .config's CONFIG_PANIC_TIMEOUT. > > Reported-by: Feng Tang <feng.tang@linux.alibaba.com> > Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> > Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
On Mon, Jan 13, 2025 at 9:08 PM Ira Weiny <ira.weiny@intel.com> wrote: > > Borislav Petkov wrote: > > The GHES driver overrides the panic= setting by force-rebooting the > > system after a fatal hw error has been reported. The intent being that > > such an error would be reported earlier. > > > > However, this is not optimal when a hard-to-debug issue requires long > > time to reproduce and when that happens, the box will get rebooted after > > 30 seconds and thus destroy the whole hw context of when the error > > happened. > > > > So rip out the default GHES panic timeout and honor the global one. > > > > In the panic disabled (panic=0) case, the error will still be logged to > > dmesg for later inspection and if panic after a hw error is really > > required, then that can be controlled the usual way - use panic= on the > > cmdline or set it in the kernel .config's CONFIG_PANIC_TIMEOUT. > > > > Reported-by: Feng Tang <feng.tang@linux.alibaba.com> > > Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> > > Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com> > > Reviewed-by: Ira Weiny <ira.weiny@intel.com> Applied as 6.14 material, thanks!
On Tue, Dec 31, 2024 at 12:13:14PM +0100, Borislav Petkov wrote:
> On Tue, Dec 31, 2024 at 06:15:59PM +0800, Feng Tang wrote:
> > Thanks for the hint! IIUC, you are mentioning the set_arch_panic_timeout().
> > One thing is, most ARCHs' default timeout is 0, while in our case, the user
> > will also set 'panic=0' :), so we can't easily detect if the 0 is the user-set
> > value or the OS default one. Originally I even thought about adding a flag
> > of 'timeout_user_changed'. Any suggestion?
>
> Ok, enough talking. Let's get concrete:
>
> From: "Borislav Petkov (AMD)" <bp@alien8.de>
> Date: Tue, 31 Dec 2024 12:03:55 +0100
> Subject: [PATCH] APEI: GHES: Have GHES honor the panic= setting
>
> The GHES driver overrides the panic= setting by rebooting the system
> after a fatal hw error has been reported. The intent being that such an
> error would be hopefully written out faster on non-volatile storage for
> later inspection.
>
> However, this is not optimal when a hard-to-debug issue requires long
> time to reproduce and when that happens, the box will get rebooted after
> 30 seconds and thus destroy the whole hw context of when the error
> happened.
>
> So rip out the default GHES panic timeout and honor the global one.
>
> In the panic disabled (panic=0) case, the error will still be logged to
> dmesg for later inspection and if panic after a hw error is really
> required, then that can be controlled the usual way - use panic= on the
> cmdline or set it in the kernel .config's CONFIG_PANIC_TIMEOUT.
>
> Reported-by: Feng Tang <feng.tang@linux.alibaba.com>
> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Looks good to me, thanks!
Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com>
> ---
> drivers/acpi/apei/ghes.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 07789f0b59bc..b72772494655 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -173,8 +173,6 @@ static struct gen_pool *ghes_estatus_pool;
> static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
> static atomic_t ghes_estatus_cache_alloced;
>
> -static int ghes_panic_timeout __read_mostly = 30;
> -
> static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx)
> {
> phys_addr_t paddr;
> @@ -983,14 +981,16 @@ static void __ghes_panic(struct ghes *ghes,
> struct acpi_hest_generic_status *estatus,
> u64 buf_paddr, enum fixed_addresses fixmap_idx)
> {
> + const char *msg = GHES_PFX "Fatal hardware error";
> +
> __ghes_print_estatus(KERN_EMERG, ghes->generic, estatus);
>
> ghes_clear_estatus(ghes, estatus, buf_paddr, fixmap_idx);
>
> - /* reboot to log the error! */
> if (!panic_timeout)
> - panic_timeout = ghes_panic_timeout;
> - panic("Fatal hardware error!");
> + pr_emerg("%s but panic disabled\n", msg);
> +
> + panic(msg);
> }
>
> static int ghes_proc(struct ghes *ghes)
> --
> 2.43.0
>
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
© 2016 - 2026 Red Hat, Inc.