From: Carlos Bilbao <carlos.bilbao@kernel.org>

Provide a priority-based mechanism to set the behavior of the kernel at
the post-panic stage -- the current default is a waste of CPU except for
cases with a console that generates insightful output.

In the v1 cover letter [1], I illustrated the potential to reduce
unnecessary CPU consumption with an experiment on VMs, cutting CPU usage
by more than 70%. The main delta of v2 [2] was that, instead of a weak
function that archs can overwrite, we provided a flexible priority-based
mechanism (following suggestions by Sean Christopherson),
panic_set_handling().

Compared to v2 [2], the main changes in this third version are that (1) we
don't set a default function for panic_halt() and (2) we provide a comment
for the x86 implementation describing the check for console output.

[1] https://lore.kernel.org/all/20250326151204.67898-1-carlos.bilbao@kernel.org/
[2] https://lore.kernel.org/all/20250428215952.1332985-1-carlos.bilbao@kernel.org/

Carlos Bilbao (2):
  panic: Allow for dynamic custom behavior after panic
  x86/panic: Add x86_panic_handler as default post-panic behavior

---
 arch/x86/kernel/setup.c | 17 +++++++++++++++++
 include/linux/panic.h   |  2 ++
 kernel/panic.c          | 22 ++++++++++++++++++++++
 3 files changed, 41 insertions(+)
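To illustrate the registration scheme the cover letter describes, here is a self-contained user-space sketch (all names are illustrative and mirror panic_set_handling() only in spirit; this is not the actual kernel implementation) of a priority-based handler registry where a later registration only wins if its priority is strictly higher:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch of a priority-based post-panic handler registry.
 * Names mirror the cover letter's panic_set_handling(), but this is a
 * hypothetical user-space model, not the proposed kernel code. */

typedef void (*panic_handler_t)(void);

static panic_handler_t panic_handler;   /* currently installed handler */
static int panic_handler_prio = -1;     /* -1: nothing registered yet  */

/* Install fn as the post-panic handler if prio beats the current one.
 * Returns 1 when fn was installed, 0 when it was rejected. */
static int panic_set_handling(panic_handler_t fn, int prio)
{
    if (fn == NULL || prio <= panic_handler_prio)
        return 0;
    panic_handler = fn;
    panic_handler_prio = prio;
    return 1;
}

/* Called at the tail of panic(): run the registered handler, or fall
 * back to the legacy behavior when nothing was registered. */
static void panic_run_handler(void)
{
    if (panic_handler)
        panic_handler();
    /* else: the legacy for(;;) { mdelay(100); } loop would run here */
}

/* Two dummy handlers standing in for an arch default and a panic
 * device override; the counter just records which one ran. */
static int halt_calls;
static void arch_halt_handler(void) { halt_calls += 1; }
static void device_handler(void)    { halt_calls += 100; }
```

The intended usage pattern is that an architecture registers a low-priority default at boot, and a panic device driver can later override it by registering with a higher priority.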
(cc more x86 people)

On Tue, 29 Apr 2025 10:06:36 -0500 carlos.bilbao@kernel.org wrote:

> From: Carlos Bilbao <carlos.bilbao@kernel.org>
>
> Provide a priority-based mechanism to set the behavior of the kernel at
> the post-panic stage -- the current default is a waste of CPU except for
> cases with console that generate insightful output.
>
> In v1 cover letter [1], I illustrated the potential to reduce unnecessary
> CPU resources with an experiment with VMs, reducing more than 70% of CPU
> usage. The main delta of v2 [2] was that, instead of a weak function that
> archs can overwrite, we provided a flexible priority-based mechanism
> (following suggestions by Sean Christopherson), panic_set_handling().

An effect of this is that the blinky light will never again occur on
any x86, I think? I don't know what the effects of changing such
longstanding behavior might be.

Also, why was the `priority' feature added? It has no effect in this
patchset.
Hey Andrew,

On 4/29/25 15:39, Andrew Morton wrote:
> (cc more x86 people)
>
> On Tue, 29 Apr 2025 10:06:36 -0500 carlos.bilbao@kernel.org wrote:
>
>> From: Carlos Bilbao <carlos.bilbao@kernel.org>
>>
>> Provide a priority-based mechanism to set the behavior of the kernel at
>> the post-panic stage -- the current default is a waste of CPU except for
>> cases with console that generate insightful output.
>>
>> In v1 cover letter [1], I illustrated the potential to reduce unnecessary
>> CPU resources with an experiment with VMs, reducing more than 70% of CPU
>> usage. The main delta of v2 [2] was that, instead of a weak function that
>> archs can overwrite, we provided a flexible priority-based mechanism
>> (following suggestions by Sean Christopherson), panic_set_handling().
>
> An effect of this is that the blinky light will never again occur on
> any x86, I think? I don't know what might the effects of changing such
> longstanding behavior.

Yep, someone pointed this out before. I don't think it's super relevant?
Also, in the second patch, I added a check to see that there's no console
output left to be flushed.

> Also, why was the `priority' feature added? It has no effect in this
> patchset.

This was done to allow for flexibility; for example, so that panic devices
can override the default panic behavior. Other benefits of such flexibility
(as opposed to, for example, a weak function that archs can override) were
outlined by Sean here:

https://lore.kernel.org/lkml/20250326151204.67898-1-carlos.bilbao@kernel.org/T/#m93704ff5cb32ade8b8187764aab56403bbd2b331

Thanks,
Carlos
On Tue, 29 Apr 2025 15:17:33 -0500 Carlos Bilbao <bilbao@vt.edu> wrote:

> Hey Andrew,
>
> On 4/29/25 15:39, Andrew Morton wrote:
> > (cc more x86 people)
> >
> > On Tue, 29 Apr 2025 10:06:36 -0500 carlos.bilbao@kernel.org wrote:
> >
> >> From: Carlos Bilbao <carlos.bilbao@kernel.org>
> >>
> >> Provide a priority-based mechanism to set the behavior of the kernel at
> >> the post-panic stage -- the current default is a waste of CPU except for
> >> cases with console that generate insightful output.
> >>
> >> In v1 cover letter [1], I illustrated the potential to reduce unnecessary
> >> CPU resources with an experiment with VMs, reducing more than 70% of CPU
> >> usage. The main delta of v2 [2] was that, instead of a weak function that
> >> archs can overwrite, we provided a flexible priority-based mechanism
> >> (following suggestions by Sean Christopherson), panic_set_handling().
> >
> > An effect of this is that the blinky light will never again occur on
> > any x86, I think? I don't know what might the effects of changing such
> > longstanding behavior.
>
> Yep, someone pointed this out before. I don't think it's super relevant?

Why not? It's an alteration in very longstanding behavior - nobody
knows who will be affected by this and how they will be affected.

> Also, in the second patch, I added a check to see that there's no console
> output left to be flushed.

It's unclear how this affects such considerations. Please fully
changelog all these things.

> > Also, why was the `priority' feature added? It has no effect in this
> > patchset.
>
> This was done to allow for flexibility, for example, if panic devices
> wish to override the default panic behavior.

There are no such callers. We can add this feature later, if a need is
demonstrated.

> Other benefits of such flexibility (as opposed to, for example, a weak
> function that archs can override) were outlined by Sean here:
>
> https://lore.kernel.org/lkml/20250326151204.67898-1-carlos.bilbao@kernel.org/T/#m93704ff5cb32ade8b8187764aab56403bbd2b331

Again, please fully describe these matters in changelogging and code
comments.
Hey Andrew,

On 4/29/25 17:53, Andrew Morton wrote:
> On Tue, 29 Apr 2025 15:17:33 -0500 Carlos Bilbao <bilbao@vt.edu> wrote:
>
>> Hey Andrew,
>>
>> On 4/29/25 15:39, Andrew Morton wrote:
>>> (cc more x86 people)
>>>
>>> On Tue, 29 Apr 2025 10:06:36 -0500 carlos.bilbao@kernel.org wrote:
>>>
>>>> From: Carlos Bilbao <carlos.bilbao@kernel.org>
>>>>
>>>> Provide a priority-based mechanism to set the behavior of the kernel at
>>>> the post-panic stage -- the current default is a waste of CPU except for
>>>> cases with console that generate insightful output.
>>>>
>>>> In v1 cover letter [1], I illustrated the potential to reduce unnecessary
>>>> CPU resources with an experiment with VMs, reducing more than 70% of CPU
>>>> usage. The main delta of v2 [2] was that, instead of a weak function that
>>>> archs can overwrite, we provided a flexible priority-based mechanism
>>>> (following suggestions by Sean Christopherson), panic_set_handling().
>>>
>>> An effect of this is that the blinky light will never again occur on
>>> any x86, I think? I don't know what might the effects of changing such
>>> longstanding behavior.
>>
>> Yep, someone pointed this out before. I don't think it's super relevant?
>
> Why not? It's an alteration in very longstanding behavior - nobody
> knows who will be affected by this and how they will be affected.

It’s difficult for me to imagine how someone might be negatively impacted,
but I understand that it could happen.

>> Also, in the second patch, I added a check to see that there's no console
>> output left to be flushed.
>
> It's unclear how this affects such considerations. Please fully
> changelog all these things.
>
>>> Also, why was the `priority' feature added? It has no effect in this
>>> patchset.
>>
>> This was done to allow for flexibility, for example, if panic devices
>> wish to override the default panic behavior.
>
> There are no such callers. We can add this feature later, if a need is
> demonstrated.

I think you'd then prefer what I originally proposed:

https://lore.kernel.org/lkml/20250326151204.67898-1-carlos.bilbao@kernel.org/T/

IMHO it's true that this feature might not be necessary ATM, but as Sean
pointed out, it could be useful in the future. I don't have strong
preferences either way. Would you be happier with the current v3 approach
if we add comments to the code explaining the purpose of the priority
feature?

>> Other benefits of such flexibility (as opposed to, for example, a weak
>> function that archs can override) were outlined by Sean here:
>>
>> https://lore.kernel.org/lkml/20250326151204.67898-1-carlos.bilbao@kernel.org/T/#m93704ff5cb32ade8b8187764aab56403bbd2b331
>
> Again, please fully describe these matters in changelogging and code
> comments.

Thanks,
Carlos
On Tue, Apr 29, 2025 at 01:39:41PM -0700, Andrew Morton wrote:
> (cc more x86 people)
>
> On Tue, 29 Apr 2025 10:06:36 -0500 carlos.bilbao@kernel.org wrote:
>
> > From: Carlos Bilbao <carlos.bilbao@kernel.org>
> >
> > Provide a priority-based mechanism to set the behavior of the kernel at
> > the post-panic stage -- the current default is a waste of CPU except for
> > cases with console that generate insightful output.
> >
> > In v1 cover letter [1], I illustrated the potential to reduce unnecessary
> > CPU resources with an experiment with VMs, reducing more than 70% of CPU
> > usage. The main delta of v2 [2] was that, instead of a weak function that
> > archs can overwrite, we provided a flexible priority-based mechanism
> > (following suggestions by Sean Christopherson), panic_set_handling().
>
> An effect of this is that the blinky light will never again occur on
> any x86, I think? I don't know what might the effects of changing such
> longstanding behavior.
>
> Also, why was the `priority' feature added? It has no effect in this
> patchset.

It does what now, and why?

Not being copied on anything, the first reaction is, it's panic, your
machine is dead, who cares about power etc..
Hey Peter,

On 4/29/25 16:06, Peter Zijlstra wrote:
> On Tue, Apr 29, 2025 at 01:39:41PM -0700, Andrew Morton wrote:
>> (cc more x86 people)
>>
>> On Tue, 29 Apr 2025 10:06:36 -0500 carlos.bilbao@kernel.org wrote:
>>
>>> From: Carlos Bilbao <carlos.bilbao@kernel.org>
>>>
>>> Provide a priority-based mechanism to set the behavior of the kernel at
>>> the post-panic stage -- the current default is a waste of CPU except for
>>> cases with console that generate insightful output.
>>>
>>> In v1 cover letter [1], I illustrated the potential to reduce unnecessary
>>> CPU resources with an experiment with VMs, reducing more than 70% of CPU
>>> usage. The main delta of v2 [2] was that, instead of a weak function that
>>> archs can overwrite, we provided a flexible priority-based mechanism
>>> (following suggestions by Sean Christopherson), panic_set_handling().
>>
>> An effect of this is that the blinky light will never again occur on
>> any x86, I think? I don't know what might the effects of changing such
>> longstanding behavior.
>>
>> Also, why was the `priority' feature added? It has no effect in this
>> patchset.
>
> It does what now, and why?
>
> Not being copied on anything, the first reaction is, its panic, your
> machine is dead, who cares about power etc..

Thanks for taking the time to look into my patch set! Yes, the machine is
effectively dead, but as things stand today, it's still drawing resources
unnecessarily.

Who cares? An example, as mentioned in the cover letter, is Linux running
in VMs. Imagine a scenario where customers are billed based on CPU usage --
having panicked VMs spinning in useless loops wastes their money. In shared
envs, those wasted cycles could be used by other processes/VMs. But this is
as much about the cloud as it is about laptops/embedded/anywhere -- Linux
should avoid wasting resources wherever possible.

Thanks,
Carlos
On Tue, Apr 29, 2025 at 03:32:56PM -0500, Carlos Bilbao wrote:

> Yes, the machine is effectively dead, but as things stand today,
> it's still drawing resources unnecessarily.
>
> Who cares? An example, as mentioned in the cover letter, is Linux running

Ah, see, I didn't have no cover letter, only akpm's reply.

> in VMs. Imagine a scenario where customers are billed based on CPU usage --
> having panicked VMs spinning in useless loops wastes their money. In shared
> envs, those wasted cycles could be used by other processes/VMs. But this
> is as much about the cloud as it is for laptops/embedded/anywhere -- Linux
> should avoid wasting resources wherever possible.

So I don't really buy the laptop and embedded case, people tend to look
at laptops when open, and get very impatient when they don't respond.
Embedded things really should have a watchdog.

Also, should you not be using panic_timeout to auto reboot your machine
in all these cases?

In any case, the VM nonsense, do they not have a virtual watchdog to
'reap' crashed VMs or something?
Hello,

On 4/29/25 17:10, Peter Zijlstra wrote:
> On Tue, Apr 29, 2025 at 03:32:56PM -0500, Carlos Bilbao wrote:
>
>> Yes, the machine is effectively dead, but as things stand today,
>> it's still drawing resources unnecessarily.
>>
>> Who cares? An example, as mentioned in the cover letter, is Linux running
>
> Ah, see, I didn't have no cover letter, only akpm's reply.
>
>> in VMs. Imagine a scenario where customers are billed based on CPU usage --
>> having panicked VMs spinning in useless loops wastes their money. In shared
>> envs, those wasted cycles could be used by other processes/VMs. But this
>> is as much about the cloud as it is for laptops/embedded/anywhere -- Linux
>> should avoid wasting resources wherever possible.
>
> So I don't really buy the laptop and embedded case, people tend to look
> at laptops when open, and get very impatient when they don't respond.
> Embedded things really should have a watchdog.
>
> Also, should you not be using panic_timeout to auto reboot your machine
> in all these cases?
>
> In any case, the VM nonsense, do they not have a virtual watchdog to
> 'reap' crashed VMs or something?

The key word here is "should." Should embedded systems have a watchdog?
Maybe. Should I have auto-reboot set? Maybe. Perhaps I don’t want to reboot
until I’ve root-caused the crash.

But my patch set isn’t about “shoulds.” What I’m discussing here is (1) the
default Linux behavior, and (2) providing people with the flexibility to do
what THEY think they should do, not what you think they should do.

Thanks,
Carlos
On Tue, Apr 29, 2025 at 03:52:05PM -0500, Carlos Bilbao wrote:
> Hello,
>
> On 4/29/25 17:10, Peter Zijlstra wrote:
> > On Tue, Apr 29, 2025 at 03:32:56PM -0500, Carlos Bilbao wrote:
> >
> >> Yes, the machine is effectively dead, but as things stand today,
> >> it's still drawing resources unnecessarily.
> >>
> >> Who cares? An example, as mentioned in the cover letter, is Linux running
> >
> > Ah, see, I didn't have no cover letter, only akpm's reply.
> >
> >> in VMs. Imagine a scenario where customers are billed based on CPU usage --
> >> having panicked VMs spinning in useless loops wastes their money. In shared
> >> envs, those wasted cycles could be used by other processes/VMs. But this
> >> is as much about the cloud as it is for laptops/embedded/anywhere -- Linux
> >> should avoid wasting resources wherever possible.
> >
> > So I don't really buy the laptop and embedded case, people tend to look
> > at laptops when open, and get very impatient when they don't respond.
> > Embedded things really should have a watchdog.
> >
> > Also, should you not be using panic_timeout to auto reboot your machine
> > in all these cases?
> >
> > In any case, the VM nonsense, do they not have a virtual watchdog to
> > 'reap' crashed VMs or something?
>
> The key word here is "should." Should embedded systems have a watchdog?
> Maybe. Should I've auto reboot set? Maybe. Perhaps I don’t want to reboot
> until I’ve root-caused the crash.
Install a kdump kernel, or log your serial line :-)
> But my patch set isn’t about “shoulds.”
> What I’m discussing here is (1) the default Linux behavior,
Well, the default behaviour works for the 'your own physical machine'
thing just fine -- and that has always been the default use-case.
Nobody is going to be sitting there staring at a panic screen for ages.
All the other weirdo cases like embedded and VMs, they're just that,
weirdos and they can keep their pieces :-)
> and (2)
> providing people with the flexibility to do what THEY think they should do,
> not what you think they should do.
Well, there are a ton of options already. Like said, we have watchdogs,
reboots, crash kernels and all sorts. Why do we need more?
All that said... the default more or less does for(;;) { mdelay(100) },
if you have a modern chip that should not end up using much power at
all. That should end up in delay_halt_tpause() or delay_halt_mwaitx()
(depending on you being on Intel or AMD). And spend most its time in
deep idle states.
Is something not working?
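For context, the post-panic loop Peter describes behaves roughly like the sketch below (a user-space approximation with illustrative names; the real kernel code uses mdelay() and runs forever, so a bounded iteration count is used here so the sketch can terminate):

```c
#include <assert.h>
#include <time.h>

/* User-space approximation of the default post-panic busy-wait loop:
 * the panicked CPU repeatedly waits ~100ms instead of halting.
 * max_iters is an artificial bound; the kernel loop never exits. */
static long post_panic_spin(long max_iters)
{
    struct timespec delay = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };
    long iterations = 0;

    while (iterations < max_iters) {
        nanosleep(&delay, NULL);   /* stands in for mdelay(100) */
        iterations++;
    }
    return iterations;
}
```

As Peter notes, on real hardware each mdelay() lands in the arch delay implementation, which on recent x86 may use TPAUSE or MWAITX rather than a pure spin; inside a VM, by contrast, the guest keeps burning host CPU unless the loop reaches an instruction (such as HLT) that causes a VMEXIT.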
Hello,
On 4/30/25 03:48, Peter Zijlstra wrote:
> On Tue, Apr 29, 2025 at 03:52:05PM -0500, Carlos Bilbao wrote:
>> Hello,
>>
>> On 4/29/25 17:10, Peter Zijlstra wrote:
>>> On Tue, Apr 29, 2025 at 03:32:56PM -0500, Carlos Bilbao wrote:
>>>
>>>> Yes, the machine is effectively dead, but as things stand today,
>>>> it's still drawing resources unnecessarily.
>>>>
>>>> Who cares? An example, as mentioned in the cover letter, is Linux running
>>>
>>> Ah, see, I didn't have no cover letter, only akpm's reply.
>>>
>>>> in VMs. Imagine a scenario where customers are billed based on CPU usage --
>>>> having panicked VMs spinning in useless loops wastes their money. In shared
>>>> envs, those wasted cycles could be used by other processes/VMs. But this
>>>> is as much about the cloud as it is for laptops/embedded/anywhere -- Linux
>>>> should avoid wasting resources wherever possible.
>>>
>>> So I don't really buy the laptop and embedded case, people tend to look
>>> at laptops when open, and get very impatient when they don't respond.
>>> Embedded things really should have a watchdog.
>>>
>>> Also, should you not be using panic_timeout to auto reboot your machine
>>> in all these cases?
>>>
>>> In any case, the VM nonsense, do they not have a virtual watchdog to
>>> 'reap' crashed VMs or something?
>>
>> The key word here is "should." Should embedded systems have a watchdog?
>> Maybe. Should I've auto reboot set? Maybe. Perhaps I don’t want to reboot
>> until I’ve root-caused the crash.
>
> Install a kdump kernel, or log your serial line :-)
>
>> But my patch set isn’t about “shoulds.”
>> What I’m discussing here is (1) the default Linux behavior,
>
> Well, the default behaviour works for the 'your own physical machine'
> thing just fine -- and that has always been the default use-case.
>
> Nobody is going to be sitting there staring at a panic screen for ages.
>
> All the other weirdo cases like embedded and VMs, they're just that,
> weirdos and they can keep their pieces :-)
>
>> and (2)
>> providing people with the flexibility to do what THEY think they should do,
>> not what you think they should do.
>
> Well, there are a ton of options already. Like said, we have watchdogs,
> reboots, crash kernels and all sorts. Why do we need more?
>
> All that said... the default more or less does for(;;) { mdelay(100) },
> if you have a modern chip that should not end up using much power at
> all. That should end up in delay_halt_tpause() or delay_halt_mwaitx()
> (depending on you being on Intel or AMD). And spend most its time in
> deep idle states.
>
> Is something not working?
Well, in my experiments, that’s not what happened -- halting the CPU in VMs
reduced CPU usage by around 70%.
How would folks feel about adding something like
/proc/sys/kernel/halt_after_panic, disabled by default? It would help in
the Linux use cases I care about (e.g., virtualized environments), without
affecting others.
Thanks,
Carlos
On Wed, Apr 30, 2025 at 01:54:11PM -0500, Carlos Bilbao wrote:
> > All that said... the default more or less does for(;;) { mdelay(100) },
> > if you have a modern chip that should not end up using much power at
> > all. That should end up in delay_halt_tpause() or delay_halt_mwaitx()
> > (depending on you being on Intel or AMD). And spend most its time in
> > deep idle states.
> >
> > Is something not working?
>
> Well, in my experiments, that’s not what happened -- halting the CPU in VMs
> reduced CPU usage by around 70%.
Because you're doing VMs, and VMs create problems where there weren't
any before. IOW you get to keep the pieces.
Specifically, VMs do VMEXIT on HLT and this is what's working for you.
On real hardware though, HLT gets you C1, while both TPAUSE and MWAITX
can probably get you deeper C states. As such, HLT is probably a
regression on power.
> How would folks feel about adding something like
> /proc/sys/kernel/halt_after_panic, disabled by default? It would help in
> the Linux use cases I care about (e.g., virtualized environments), without
> affecting others.
What's wrong with any of the existing options? Fact remains you need to
configure your VMs properly.
Hello Peter,
On 5/1/25 03:55, Peter Zijlstra wrote:
> On Wed, Apr 30, 2025 at 01:54:11PM -0500, Carlos Bilbao wrote:
>
>>> All that said... the default more or less does for(;;) { mdelay(100) },
>>> if you have a modern chip that should not end up using much power at
>>> all. That should end up in delay_halt_tpause() or delay_halt_mwaitx()
>>> (depending on you being on Intel or AMD). And spend most its time in
>>> deep idle states.
>>>
>>> Is something not working?
>> Well, in my experiments, that’s not what happened -- halting the CPU in VMs
>> reduced CPU usage by around 70%.
> Because you're doing VMs, and VMs create problems where there weren't
> any before. IOW you get to keep the pieces.
>
> Specifically, VMs do VMEXIT on HLT and this is what's working for you.
>
> On real hardware though, HLT gets you C1, while both TPAUSE and MWAITX
> can probably get you deeper C states. As such, HLT is probably a
> regression on power.
That's a good point -- wouldn't TPAUSE achieve what I was trying to
accomplish with HLT? Assuming there's support and it wouldn't just #UD.
>
>> How would folks feel about adding something like
>> /proc/sys/kernel/halt_after_panic, disabled by default? It would help in
>> the Linux use cases I care about (e.g., virtualized environments), without
>> affecting others.
> What's wrong with any of the existing options? Fact remains you need to
> configure your VMs properly.
See, that's the problem -- it's not _my_VMs. It's the VMs of cloud users,
who are ultimately responsible for configuring their kernels however they
want. We can try to educate them, as some maintainers have suggested to me,
but many people either don't know what the kernel is or don't care -- they
just trust that Linux will have sensible defaults. I get your point that
VM-specific problems shouldn't burden the broader kernel ecosystem, but I’d
still like to think about whether there's something we can do to improve the
situation for VMs post-panic without negatively impacting other use cases.
Thanks,
Carlos
On Wed, Apr 30, 2025, Peter Zijlstra wrote:
> All that said... the default more or less does for(;;) { mdelay(100) },
> if you have a modern chip that should not end up using much power at
> all. That should end up in delay_halt_tpause() or delay_halt_mwaitx()
> (depending on you being on Intel or AMD). And spend most its time in
> deep idle states.
>
> Is something not working?
The motivation is to coerce vCPUs into yielding the physical CPU so that a
different vCPU can be scheduled in when the host is oversubscribed. IMO, that's
firmly a "host" problem to solve, where the solution might involve educating
customers for their own benefit[*].
I am indifferent as to whether or not the kernels halts during panic(), my
suggestions/feedback in earlier versions were purely to not make any behavior
specific to VMs. I.e. I am strongly opposed to implementing behavior that kicks
in only when running as a guest.
[*] from https://lore.kernel.org/all/Z_lDzyXJ8JKqOyzs@google.com:
: On Fri, Apr 11, 2025 at 9:31 AM Sean Christopherson <seanjc@google.com> wrote:
: > > On Wed 2025-03-26 10:12:03, carlos.bilbao@kernel.org wrote:
: > > > After handling a panic, the kernel enters a busy-wait loop, unnecessarily
: > > > consuming CPU and potentially impacting other workloads including other
: > > > guest VMs in the case of virtualized setups.
: >
: > Impacting other guests isn't the guest kernel's problem. If the host has heavily
: > overcommited CPUs and can't meet SLOs because VMs are panicking and not rebooting,
: > that's a host problem.
: >
: > This could become a customer problem if they're getting billed based on CPU usage,
: > but I don't know that simply doing HLT is the best solution. E.g. advising the
: > customer to configure their kernels to kexec into a kdump kernel or to reboot
: > on panic, seems like it would provide a better overall experience for most.
: >
: > QEMU (assuming y'all use QEMU) also supports a pvpanic device, so unless the VM
: > and/or customer is using a funky setup, the host should already know the guest
: > has panicked. At that point, the host can make appropriate scheduling decisions,
: > e.g. userspace can simply stop running the VM after a certain timeout, throttle
: > it, jail it, etc.