[PATCH v2 00/13] Dynamic Kernel Stacks

David Stevens posted 13 patches 1 month, 3 weeks ago
[PATCH v2 00/13] Dynamic Kernel Stacks
Posted by David Stevens 1 month, 3 weeks ago
This RFC is a continuation of Pasha Tatashin's original RFC [1], and is
based on Linus Walleij's rebased version of the patches [2]. My focus
was x86_64 devices, so I didn't include his arm64 WIP patches.

The impetus for reviving this RFC is kernel stack usage on Android. On
regular Android (i.e. non-wear/automotive), system processes typically
have 2000-3000 threads. When adding threads from app processes, this
means that systems with 4GB of memory are using 1-2% of total memory for
kernel thread stacks. Dynamic kernel stacks reduce this by 65%-70%.
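
As a rough check on the figures above, the arithmetic can be sketched in C. This is only an illustration: the 16 KiB stack size is an assumption (it matches the 16k figure that comes up later in this thread), and the function names are hypothetical, not part of the series.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Share of RAM taken by kernel thread stacks, in basis points
 * (1 bp = 0.01%). Integer division truncates the result.
 */
uint64_t stack_share_bp(uint64_t threads, uint64_t stack_kib,
			uint64_t ram_mib)
{
	assert(ram_mib > 0);
	return threads * stack_kib * 10000 / (ram_mib * 1024);
}

/* Stack footprint in KiB that remains after saving saved_pct percent. */
uint64_t footprint_after_savings_kib(uint64_t threads, uint64_t stack_kib,
				     unsigned int saved_pct)
{
	assert(saved_pct <= 100);
	return threads * stack_kib * (100 - saved_pct) / 100;
}
```

With 3000 threads, 16 KiB stacks and 4096 MiB of RAM, stack_share_bp() gives 114 basis points, or about 1.1%, for the system processes alone. App threads come on top of that, which is consistent with the quoted 1-2% range. A 65% saving on those 3000 stacks leaves 16800 KiB out of the original 48000 KiB.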

The main change compared to Pasha's v1 RFC is how x86_64 handles kernel
stack faults. On systems where FRED is available, it handles kernel page
faults on stack level 1. When FRED isn't available, it uses a dedicated
IST stack for page faults. In both cases, page faults which aren't
dynamic stack faults are moved back onto the regular kernel stack. This
does introduce some overhead for page faults on user memory that
originate in the kernel (note that non-FRED systems already needed to
bounce userspace page faults through the entry stack), but such faults
aren't as hot a path as regular user page faults. There are certainly
systems where the memory savings are worth the overhead. That said, the
config could be made optional to give systems the option to pay the
memory cost to avoid the CPU overhead.

The biggest open issue is how to deal with reliability. This series uses
GFP_ATOMIC when refilling the per-CPU magazines during context switch,
which is necessary to avoid deadlock. This of course raises concerns
about allocation failure. If a magazine got depleted, then refilling the
magazine failed due to atomic reserve depletion, and then another thread
triggered a dynamic stack fault, that would trigger a fatal page fault.
There is also a secondary concern about additional pressure on the
memory reserves causing allocation failures at other atomic call sites.
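
The failure sequence described above can be modelled in plain userspace C. This is a minimal sketch and not the code from the series: the struct, the function names, and the magazine capacity of 4 are all hypothetical, and the "reserve" counter merely stands in for the atomic reserves that a GFP_ATOMIC allocation draws from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAG_CAPACITY 4	/* hypothetical; the real size is not stated here */

struct stack_magazine {
	size_t nr_pages;	/* pre-allocated pages ready for stack faults */
};

/*
 * Refill during context switch. The refill must not sleep, so it takes
 * whatever the reserve can give and reports failure if it falls short.
 */
bool magazine_refill(struct stack_magazine *mag, size_t *reserve)
{
	assert(mag != NULL && reserve != NULL);
	while (mag->nr_pages < MAG_CAPACITY) {
		if (*reserve == 0)
			return false;	/* atomic reserves depleted */
		(*reserve)--;
		mag->nr_pages++;
	}
	return true;
}

/*
 * Dynamic stack fault. It cannot allocate, so it can only consume a
 * page from the magazine. Returning false models the fatal page fault.
 */
bool magazine_take_page(struct stack_magazine *mag)
{
	assert(mag != NULL);
	if (mag->nr_pages == 0)
		return false;
	mag->nr_pages--;
	return true;
}
```

In this model the fatal case needs three things to line up: the magazine is drained, a refill returns false, and a further stack fault arrives before any later refill succeeds.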

The question is then: is this approach something that is fundamentally
untenable in the kernel, or are there compromises that would allow it to
be merged? One obvious compromise is to make the feature optional. Both
kernel stack faults and running out of memory reserves are rare events.
I've never seen this failure in my testing, although I don't have field
data to back that up at this point. Some sysadmins may view it as low
enough risk to be worth the memory savings. There are also additional
measures that could be taken to reduce the likelihood of failure (e.g.
magazine management on kernel entry/exit, tunable magazine sizes, adding
best-effort trylock reclaim or oom kill).

This series was developed and tested on devices running 6.18 kernels. It
has been rebased onto 7.0, with minimal smoke testing after rebasing.

[1] https://lore.kernel.org/all/20240311164638.2015063-1-pasha.tatashin@soleen.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-integrator.git/log/?h=b4/aarch64-dynamic-kernel-stacks-v6.18-rc1

David Stevens (7):
  fork: Don't assume fully populated stack during reuse
  fork: Move vm_stack to the beginning of the stack
  fork: Move vmap stack freeing to work queue
  fork: Store task pointer in unpopulated stack ptes
  x86/entry/fred: encode frame pointer on entry
  x86: Add support for dynamic kernel stacks via FRED
  x86: Add support for dynamic kernel stacks via IST

Pasha Tatashin (6):
  fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE
  fork: separate vmap stack allocation and free calls
  mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range() public
    functions
  fork: Dynamic Kernel Stacks
  task_stack.h: Add stack_not_used() support for dynamic stack
  fork: Dynamic Kernel Stack accounting

 arch/Kconfig                          |  38 ++
 arch/x86/Kconfig                      |   1 +
 arch/x86/entry/entry_64.S             |  49 ++-
 arch/x86/entry/entry_64_fred.S        |  57 +++
 arch/x86/include/asm/cpu_entry_area.h |  18 +
 arch/x86/include/asm/idtentry.h       |  38 +-
 arch/x86/include/asm/page_64_types.h  |  10 +-
 arch/x86/include/asm/pgtable_64.h     |  36 ++
 arch/x86/include/asm/processor.h      |   6 +
 arch/x86/include/asm/traps.h          |   5 +
 arch/x86/kernel/cpu/common.c          |  11 +
 arch/x86/kernel/dumpstack_64.c        |  10 +-
 arch/x86/kernel/fred.c                |  20 +-
 arch/x86/kernel/idt.c                 |  57 +--
 arch/x86/kernel/nmi.c                 |   9 +
 arch/x86/lib/usercopy.c               |   9 +
 arch/x86/mm/cpu_entry_area.c          |  17 +
 arch/x86/mm/dump_pagetables.c         |  14 +-
 arch/x86/mm/fault.c                   | 101 +++++-
 include/linux/mmzone.h                |   3 +
 include/linux/sched.h                 |  11 +-
 include/linux/sched/task_stack.h      |  48 ++-
 include/linux/vmalloc.h               |  14 +
 init/init_task.c                      |   4 +
 kernel/exit.c                         |  22 ++
 kernel/fork.c                         | 481 ++++++++++++++++++++++++--
 kernel/sched/core.c                   |   1 +
 mm/memcontrol.c                       |  10 +
 mm/vmalloc.c                          |  27 +-
 mm/vmstat.c                           |   3 +
 30 files changed, 1049 insertions(+), 81 deletions(-)


base-commit: 028ef9c96e96197026887c0f092424679298aae8
-- 
2.54.0.rc2.544.gc7ae2d5bb8-goog
Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Posted by Dave Hansen 1 month, 3 weeks ago
On 4/24/26 12:14, David Stevens wrote:
> The question is then: is this approach something that is fundamentally
> untenable in the kernel

Yes. Fundamentally untenable.

Not allowing stack faults has been a wonderful simplification. It's one
of those things that just plain makes the kernel easier to maintain.
Saving low single digits of system memory is not exactly making me eager
to go back to the harder-to-maintain days.

I seriously doubt that this 1% is the lowest hanging fruit for memory
bloat on these systems. ;)
Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Posted by H. Peter Anvin 1 month, 3 weeks ago
On 2026-04-24 12:41, Dave Hansen wrote:
> On 4/24/26 12:14, David Stevens wrote:
>> The question is then: is this approach something that is fundamentally
>> untenable in the kernel
> 
> Yes. Fundamentally untenable.
> 
> Not allowing stack faults has been a wonderful simplification. It's one
> of those things that just plain makes the kernel easier to maintain.
> Saving low single digits of system memory is not exactly making me eager
> to go back to the harder-to-maintain days.
> 
> I seriously doubt that this 1% is the lowest hanging fruit for memory
> bloat on these systems. ;)

It is worth noting that this was one of the VERY early design decisions that
has shaped Linux from the beginning:

- No swapping of kernel memory
- Kernel stacks are statically allocated
- Physical RAM is mapped into the kernel at all times
- A "monolithic" kernel using function calls, not message passing
- A kernel interface that closely maps to the low-level application API
  (e.g. each user space thread is a kernel thread.)
- Kernel ABIs and APIs are subject to evolution; stability is only guaranteed
  in user space.

Those design decisions are, by and large, what has made Linux Linux: a
relatively simple, highly performant, and reliable system.

	-hpa
Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Posted by Pasha Tatashin 1 month, 3 weeks ago
On 04-25 02:19, H. Peter Anvin wrote:
> On 2026-04-24 12:41, Dave Hansen wrote:
> > On 4/24/26 12:14, David Stevens wrote:
> >> The question is then: is this approach something that is fundamentally
> >> untenable in the kernel
> > 
> > Yes. Fundamentally untenable.
> > 
> > Not allowing stack faults has been a wonderful simplification. It's one
> > of those things that just plain makes the kernel easier to maintain.
> > Saving low single digits of system memory is not exactly making me eager
> > to go back to the harder-to-maintain days.
> > 
> > I seriously doubt that this 1% is the lowest hanging fruit for memory
> > bloat on these systems. ;)
> 
> It is worth noting that this was one of the VERY early design decisions that
> has shaped Linux from the beginning:
> 
> - No swapping of kernel memory
> - Kernel stacks are statically allocated
> - Physical RAM is mapped into the kernel at all times
> - A "monolithic" kernel using function calls, not message passing
> - A kernel interface that closely maps to the low-level application API
>   (e.g. each user space thread is a kernel thread.)
> - Kernel ABIs and APIs are subject to evolution; stability is only guaranteed
>   in user space.
> 
> Those design decisions are, by and large, what has made Linux Linux: a
> relatively simple, highly performant, and reliable system.

I think there is a bit of survivorship bias in that list. Originally,
there were many other foundational assumptions that have since evolved
as hardware and requirements scaled.

For example, there were assumptions about no dynamic hardware
reconfiguration (no memory/CPU hot-plug), uniform memory access (no
NUMA), and fixed page sizes (no THP or HugeTLB). All of those have
changed, and you, better than most, know of many other such examples.

A more recent example is PREEMPT_RT: the Linux kernel was originally
designed to be non-preemptible.

Even the assumptions in your list, such as "physical RAM is mapped into
the kernel at all times," are evolving: emulated pmem is not mapped, and
guestmemfd plans to allow unmapping memory from the direct map for
security reasons.

Aside from trying our best not to break user space and allowing the
internal kernel API to evolve, the other items are architectural
decisions that can and should adapt to new requirements.

We now have machines with thousands of hardware threads. Running
millions of software threads on such machines is a practical reality,
and at fleet scales, statically allocating kernel stacks for all of them
wastes a massive amount of memory.

The proposed solution won't affect Linux as a whole. It can be
optionally enabled for targeted configurations. Additionally, the max
stack size is still statically set; it simply isn't populated until
actually used.

Pasha
Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Posted by Dave Hansen 1 month, 3 weeks ago
On 4/25/26 02:19, H. Peter Anvin wrote:
> It is worth noting that this was one of the VERY early design decisions that
> has shaped Linux from the beginning:
> 
> - No swapping of kernel memory
> - Kernel stacks are statically allocated
...

One other bit to add here: In the past, kernel faults on kernel memory
have been allowed, like to populate vmalloc() page table entries into
the parts of the page tables that are not shared across processes. Even
*that* turned out to be too much of a pain even though it didn't involve
allocation, and the kernel has been moving away from that.
Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Posted by Zach O'Keefe 1 day ago
On Mon, Apr 27, 2026 at 9:22 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/25/26 02:19, H. Peter Anvin wrote:
> > It is worth noting that this was one of the VERY early design decisions that
> > has shaped Linux from the beginning:
> >
> > - No swapping of kernel memory
> > - Kernel stacks are statically allocated
> ...
>
> One other bit to add here: In the past, kernel faults on kernel memory
> have been allowed, like to populate vmalloc() page table entries into
> the parts of the page tables that are not shared across processes. Even
> *that* turned out to be too much of a pain even though it didn't involve
> allocation, and the kernel has been moving away from that.
>

Dave,

Necroing this thread, as the potential aggregate savings continue to
stand out for us on the datacenter side (whereas David is motivated
separately, from the consumer device side).

I certainly empathize with your position, and hesitation to give up
such a nice simplification just to invite new headaches.

However, I'd still like to work with you to understand what feasible
path forward you see, hoping you can proactively steer us away from
some of the bigger headaches.

I think we are fine being forward-looking, and only supporting this
for FRED (which is on our doorstep). That said, understanding the
issues you foresee with the IST approach would still be valuable, as
it might save us internal trouble should we choose to carry it
temporarily to bridge the gap with FRED.

Overall, are there any particular painpoints you'd like to see flushed
out, first? How would you like to proceed? Would explicitly marking
this as an experimental config, in the interim, be more attractive?

Thanks, and I appreciate any help or guidance here.

Best,
Zach
Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Posted by Dave Hansen 20 hours ago
On 6/18/26 07:50, Zach O'Keefe wrote:
> Overall, are there any particular painpoints you'd like to see flushed
> out, first? 

Handling exceptions in the kernel is hard. Period. That's the pain point.
Just look at NMIs, #VC, #MC and the rest of that mess. Just look at how
we've moved away from ever taking random page faults in the kernel. Or,
heck, randomly taking faults at *all*. We've concentrated them in very
specific places, not in general code.

Now you're arguing that the kernel can pretty much take a fault *AND*
allocate memory reliably at any point*.

I just don't see the collateral in this series to justify that claim.

The NMI entry code is a disaster because NMIs can happen anywhere. The
#VC code is a disaster because #VCs can happen anywhere. Once #PF can
happen anywhere*, why won't #PF become a disaster?

It would be a completely different story if there was a track record of
finding and fixing bugs in the x86 entry code from the authors of this
series. But I don't think I've ever seen a single email from your folks
before this, much less a review tag or a patch. I'd be much happier if
you got Andy L's blessing on this, for example.

> How would you like to proceed? Would explicitly marking this as an
> experimental config, in the interim, be more attractive?

No.

The enemy here is complexity. *Maintenance* complexity. Being able to
compile out some of the complexity helps with debugging. But it doesn't
help maintaining the code.

--

* #PF on stack accesses isn't *quite* as bad as NMI or #VC, I'll give
  you that. But it's still pretty darn bad.
Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Posted by Thomas Gleixner 2 hours ago
On Thu, Jun 18 2026 at 11:53, Dave Hansen wrote:
> On 6/18/26 07:50, Zach O'Keefe wrote:
>> Overall, are there any particular painpoints you'd like to see flushed
>> out, first? 
>
> Handling exceptions in the kernel is hard. Period. That's the pain point.
> Just look at NMIs, #VC, #MC and the rest of that mess. Just look at how
> we've moved away from ever taking random page faults in the kernel. Or,
> heck, randomly taking faults at *all*. We've concentrated them in very
> specific places, not in general code.
>
> Now you're arguing that the kernel can pretty much take a fault *AND*
> allocate memory reliably at any point*.
>
> I just don't see the collateral in this series to justify that claim.

There is none because it's simply impossible to guarantee and when
reading through the series even a CPU hotplug operation happily
continues with success when the stack page cache of the upcoming CPU
can't be filled....

> The NMI entry code is a disaster because NMIs can happen anywhere. The
> #VC code is a disaster because #VCs can happen anywhere. Once #PF can
> happen anywhere*, why won't #PF become a disaster?

It's already a disaster. See kvm_handle_async_pf() and the cute issues
vs. taking a #PF in NMI or some other IST handler.

> It would be a completely different story if there was a track record of
> finding and fixing bugs in the x86 entry code from the authors of this
> series. But I don't think I've ever seen a single email from your folks
> before this, much less a review tag or a patch. I'd be much happier if
> you got Andy L's blessing on this, for example.
>
>> How would you like to proceed? Would explicitly marking this as an
>> experimental config, in the interim, be more attractive?
> No.
>
> The enemy here is complexity. *Maintenance* complexity. Being able to
> compile out some of the complexity helps with debugging. But it doesn't
> help maintaining the code.

Correct.

Aside from that, the part which worries me most is the IDT hackery. That's
fragile as hell and full of unvalidated assumptions. Reading "should not
happen" several times in a changelog doesn't make me more confident.

  "It is possible for #MCE to occur on the #PF IST stack, but the #MCE
   handler shouldn't generate new #PFs. The reentrancy check on the #PF
   stack will trigger if any recoverable #MCEs do generate #PFs - if there
   are actually reports of it happening, we can address it then."

Seriously?

We don't wait until the report comes in because the report won't even
happen in the worst case:

       #PF on IST
         ...
         cmp    0, reentrance
         jne	abort

       #MC
          ...
          #PF rewinds #PF IST
          cmp   0, reentrance
          jne	abort		<- Not taken because #MC happened before
                                   it could be set.

IST is fundamentally not suitable for this and I'm sure there are more
holes in this.
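
The window in the diagram above can be reproduced with a small sequential model in C. This is only a sketch of the check-then-set pattern, not the entry code from the series: all names are hypothetical, and a callback stands in for an #MC arriving at a chosen point in the #PF entry path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct pf_ist_model {
	bool reentrance;	/* "the #PF IST stack is in use" flag */
	int depth;		/* handlers currently live on that stack */
	int max_depth;		/* 2 means the stack was reused while live */
	bool aborted;		/* the reentrancy check fired */
};

/* Models an exception (e.g. #MC) arriving at a given point. */
typedef void (*nested_fn)(struct pf_ist_model *m);

void pf_entry(struct pf_ist_model *m, nested_fn before_set,
	      nested_fn after_set)
{
	assert(m != NULL);
	m->depth++;			/* entry lands on the IST stack */
	if (m->depth > m->max_depth)
		m->max_depth = m->depth;

	if (m->reentrance) {		/* cmp 0, reentrance; jne abort */
		m->aborted = true;
		m->depth--;
		return;
	}
	if (before_set)
		before_set(m);		/* #MC hits between check and set */
	m->reentrance = true;
	if (after_set)
		after_set(m);		/* #MC hits after the flag is set */
	m->reentrance = false;
	m->depth--;
}

/* An #MC handler that itself takes a #PF on the same IST stack. */
void mc_takes_pf(struct pf_ist_model *m)
{
	pf_entry(m, NULL, NULL);
}
```

When the nested #PF arrives after the flag is set, the check aborts as intended. When it arrives before the flag is set, the check passes and both handlers end up on the same stack with no report, which is the silent case the diagram points at.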

I haven't looked at the FRED side of affairs yet in detail, but the
handwavy explanation about external interrupts having to be moved to
stack level 1 and unconditionally bounced back does not really make it
appealing. I agree that chapter 8.3.4 in the SDM volume 3 is not really
helpful, but papering over the problem without understanding the root
cause is not cutting it. If it's a genuine FRED hardware issue, then
this needs to be understood and documented.

The x86 folks have spent a lot of time making the horrific x86
interrupt and exception handling solid and therefore have zero interest
in dealing with the fallout of something based on "shouldn't happen"
assumptions. Either it can prove correctness under all circumstances or
not.

I understand the "save tons of memory across a fleet" argument, but a
large fleet is also a guarantee to trigger all the "should not happen
and improbable" issues which are gracefully handwaved away. That's a
truly bad tradeoff as it ends up in non-decodable bug reports. What's
worse, they have to be handled by the maintainers and not necessarily by
those who implemented it.

Thanks,

        tglx
Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Posted by H. Peter Anvin 16 hours ago
On 2026-06-18 11:53, Dave Hansen wrote:
> On 6/18/26 07:50, Zach O'Keefe wrote:
>> Overall, are there any particular painpoints you'd like to see flushed
>> out, first? 
> 
> Handling exceptions in the kernel is hard. Period. That's the pain point.
> Just look at NMIs, #VC, #MC and the rest of that mess. Just look at how
> we've moved away from ever taking random page faults in the kernel. Or,
> heck, randomly taking faults at *all*. We've concentrated them in very
> specific places, not in general code.
> 
> Now you're arguing that the kernel can pretty much take a fault *AND*
> allocate memory reliably at any point*.
> 
> I just don't see the collateral in this series to justify that claim.
> 

That is most definitely the zeroth-order thing. Extraordinary claims require
extraordinary evidence, and this is certainly an extraordinary claim. In
addition to the *massive* maintainability issue, you also have to consider the
additional overheads you will now have to deal with in order to avoid deadlocks.

Almost every OS that has attempted to swap out kernel stacks has been known
to suffer from deadlocks under very high memory load.


> The NMI entry code is a disaster because NMIs can happen anywhere. The
> #VC code is a disaster because #VCs can happen anywhere. Once #PF can
> happen anywhere*, why won't #PF become a disaster?
> [...]
> * #PF on stack accesses isn't *quite* as bad as NMI or #VC, I'll give
>   you that. But it's still pretty darn bad.

In some ways, they are actually *worse*.

#PFs need to be able to sleep, because the common case for a #PF in the kernel
is that it touched user space. This means #PF needs to be using IST/SL 0.
However, this is obviously incompatible with handling #PFs on the kernel stack
itself, so now it needs a stack switch. In the common case, it will then need
to demote the #PF back onto the normal execution stack, which is complex in
its own right.

Now, if you are on a pre-FRED system, the IST entries don't nest, so you
absolutely have to make sure you can't get there again through any means
whatsoever. With FRED, it isn't quite so dire, but it will still give you lots
of fun if that interrupt is one which would like to be demoted off the IRQ stack.

> It would be a completely different story if there was a track record of
> finding and fixing bugs in the x86 entry code from the authors of this
> series. But I don't think I've ever seen a single email from your folks
> before this, much less a review tag or a patch. I'd be much happier if
> you got Andy L's blessing on this, for example.
> 
>> How would you like to proceed? Would explicitly marking this as an
>> experimental config, in the interim, be more attractive?
> No.
> 
> The enemy here is complexity. *Maintenance* complexity. Being able to
> compile out some of the complexity helps with debugging. But it doesn't
> help maintaining the code.

Indeed. Paravirtualization is a great example of how this works. The PV hooks
in the kernel are still a maintenance nightmare 20 years after they were
introduced, and mostly that cost is not borne by the people who introduced and
benefited from them.

	-hpa
Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Posted by David Stevens 14 hours ago
On Thu, Jun 18, 2026 at 3:28 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 2026-06-18 11:53, Dave Hansen wrote:
> > On 6/18/26 07:50, Zach O'Keefe wrote:
> >> Overall, are there any particular painpoints you'd like to see flushed
> >> out, first?
> >
> > Handling exceptions in the kernel is hard. Period. That's the pain point.
> > Just look at NMIs, #VC, #MC and the rest of that mess. Just look at how
> > we've moved away from ever taking random page faults in the kernel. Or,
> > heck, randomly taking faults at *all*. We've concentrated them in very
> > specific places, not in general code.
> >
> > Now you're arguing that the kernel can pretty much take a fault *AND*
> > allocate memory reliably at any point*.
> >
> > I just don't see the collateral in this series to justify that claim.
> >
>
> That is most definitely the zeroth-order thing. Extraordinary claims require
> extraordinary evidence, and this is certainly an extraordinary claim.

I do acknowledge that there is currently a lack of evidence - this is
an RFC after all. The question is whether it is possible in principle
to produce sufficient evidence. From the Android side of Google, we
are willing to carry the RFC patches downstream for a while to build a
case for merging them upstream. However, there needs to be at least a
possibility of success before we undertake that work. If upstream's
position is that dynamic stacks are no good, full stop, and will
absolutely never happen, then there's no point in us trying to pursue
this avenue further. And I assume those from the datacenter side of
the company are in a similar position.

-David


> In addition to the *massive* maintainability issue, you also have to consider the
> additional overheads you will now have to deal with in order to avoid deadlocks.
>
> Almost every OS that has attempted to swap out kernel stacks has been known
> to suffer from deadlocks under very high memory load.
>
>
> > The NMI entry code is a disaster because NMIs can happen anywhere. The
> > #VC code is a disaster because #VCs can happen anywhere. Once #PF can
> > happen anywhere*, why won't #PF become a disaster?
> > [...]
> > * #PF on stack accesses isn't *quite* as bad as NMI or #VC, I'll give
> >   you that. But it's still pretty darn bad.
>
> In some ways, they are actually *worse*.
>
> #PFs need to be able to sleep, because the common case for a #PF in the kernel
> is that it touched user space. This means #PF needs to be using IST/SL 0.
> However, this is obviously incompatible with handling #PFs on the kernel stack
> itself, so now it needs a stack switch. In the common case, it will then need
> to demote the #PF back onto the normal execution stack, which is complex in
> its own right.
>
> Now, if you are on a pre-FRED system, the IST entries don't nest, so you
> absolutely have to make sure you can't get there again through any means
> whatsoever. With FRED, it isn't quite so dire, but it will still give you lots
> of fun if that interrupt is one which would like to be demoted off the IRQ stack.
>
> > It would be a completely different story if there was a track record of
> > finding and fixing bugs in the x86 entry code from the authors of this
> > series. But I don't think I've ever seen a single email from your folks
> > before this, much less a review tag or a patch. I'd be much happier if
> > you got Andy L's blessing on this, for example.
> >
> >> How would you like to proceed? Would explicitly marking this as an
> >> experimental config, in the interim, be more attractive?
> > No.
> >
> > The enemy here is complexity. *Maintenance* complexity. Being able to
> > compile out some of the complexity helps with debugging. But it doesn't
> > help maintaining the code.
> Indeed. Paravirtualization is a great example of how this works. The PV hooks
> in the kernel are still a maintenance nightmare 20 years after they were
> introduced, and mostly that cost is not borne by the people who introduced and
> benefited from them.
>
>         -hpa
>
Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Posted by H. Peter Anvin 14 hours ago
On 2026-06-18 17:40, David Stevens wrote:
>>
>> That is most definitely the zeroth-order thing. Extraordinary claims require
>> extraordinary evidence, and this is certainly an extraordinary claim.
> 
> I do acknowledge that there is currently a lack of evidence - this is
> an RFC after all. The question is whether it is possible in principle
> to produce sufficient evidence. From the Android side of Google, we
> are willing to carry the RFC patches downstream for a while to build a
> case for merging them upstream. However, there needs to be at least a
> possibility of success before we undertake that work. If upstream's
> position is that dynamic stacks are no good, full stop, and will
> absolutely never happen, then there's no point in us trying to pursue
> this avenue further. And I assume those from the datacenter side of
> the company are in a similar position.
> 

The answer is pretty much that you would have to present *very* 
impressive-looking evidence. There is very little that's completely 
absolute, but you definitely have a tall hill to climb on this one.

Keep also in mind that there are also people who claim that our current 
page sizes are much too small, and that the kernel should be doing 16K 
or 64K pages. At that point this more or less evaporates, too.

	-hpa
Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Posted by Pasha Tatashin 1 month, 3 weeks ago
On 04-24 12:41, Dave Hansen wrote:
> On 4/24/26 12:14, David Stevens wrote:
> > The question is then: is this approach something that is fundamentally
> > untenable in the kernel
> 
> Yes. Fundamentally untenable.
> 
> Not allowing stack faults has been a wonderful simplification. It's one
> of those things that just plain makes the kernel easier to maintain.
> Saving low single digits of system memory is not exactly making me eager
> to go back to the harder-to-maintain days.
> 
> I seriously doubt that this 1% is the lowest hanging fruit for memory
> bloat on these systems. ;)

This is true until, in a fleet of millions of machines, you encounter a
one-in-a-billion chance of a stack overflow. You are then forced to 
double the statically allocated kernel stacks on every machine, paying a 
memory tax even though 99.999..% of threads never exceed 4K. This 
overhead accumulates to petabytes of wasted capacity.

Pasha
Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Posted by David Laight 1 month, 3 weeks ago
On Fri, 24 Apr 2026 21:35:20 +0000
Pasha Tatashin <pasha.tatashin@soleen.com> wrote:

> On 04-24 12:41, Dave Hansen wrote:
> > On 4/24/26 12:14, David Stevens wrote:  
> > > The question is then: is this approach something that is fundamentally
> > > untenable in the kernel  
> > 
> > Yes. Fundamentally untenable.
> > 
> > Not allowing stack faults has been a wonderful simplification. It's one
> > of those things that just plain makes the kernel easier to maintain.
> > Saving low single digits of system memory is not exactly making me eager
> > to go back to the harder-to-maintain days.
> > 
> > I seriously doubt that this 1% is the lowest hanging fruit for memory
> > bloat on these systems. ;)  
> 
> This is true until, in a fleet of millions of machines, you encounter a
> one-in-a-billion chance of a stack overflow. You are then forced to 
> double the statically allocated kernel stacks on every machine, paying a 
> memory tax even though 99.999..% of threads never exceed 4K. This 
> overhead accumulates to petabytes of wasted capacity.

And then you hit a stack fault in some path where you can't sleep and
there isn't any available kernel memory.

An alternative idea is to arrange for some system calls to sleep in
userspace, so when the thread is woken it re-executes the system call.
It then makes sense to assign the kernel stack to the process when
it enters the kernel.
That might mean that you don't need a kernel stack for all the threads
sleeping in futex() - it might even be possible to do the retry in
userspace saving the second kernel entry most of the time.
It is all 'hard and difficult' though.

The easier solution is to rewrite the system code so it doesn't have
1000s of threads :-)

	David



> 
> Pasha
>
Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Posted by Dave Hansen 14 hours ago
On 4/24/26 15:26, David Laight wrote:
>> This is true until, in a fleet of millions of machines, you encounter a
>> one-in-a-billion chance of a stack overflow. You are then forced to 
>> double the statically allocated kernel stacks on every machine, paying a 
>> memory tax even though 99.999..% of threads never exceed 4K. This 
>> overhead accumulates to petabytes of wasted capacity.
> And then you hit a stack fault in some path where you can't sleep and
> there isn't any available kernel memory.
> 
> An alternative idea is to arrange for some system calls to sleep in
> userspace, so when the thread is woken it re-executes the system call.
> It then makes sense to assign the kernel stack to the process when
> it enters the kernel.

There are probably other ways to do this without handling exceptions.

For instance, let's say you always *map* 16k of stack for each process.
But, after context switching out, you take a look at the four 8-byte pte_t's that
were mapping the kernel stack. If the _PAGE_ACCESSED bit is clear, you
can just clear _PAGE_PRESENT and reclaim the page.

If you don't want the overhead in the normal context switch path, you
reclaim in a shrinker, at the cost of needing locking to coordinate with
the scheduler.

A simple rule would be: a thread that ever accesses a page gets to keep
it forever. They're never reclaimed after being accessed, only before.
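
As a rough illustration only, the reclaim pass and the "keep it forever"
rule can be modelled in userspace as below. This is a sketch under stated
assumptions, not kernel code: the PTEs are plain integers, TOY_PRESENT and
TOY_ACCESSED merely stand in for _PAGE_PRESENT and _PAGE_ACCESSED, and
toy_stack and toy_reclaim_after_switch_out are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_PAGES  4          /* 16k of stack mapped as 4 x 4k pages */
#define TOY_PRESENT  (1u << 0)  /* stands in for _PAGE_PRESENT */
#define TOY_ACCESSED (1u << 5)  /* stands in for _PAGE_ACCESSED */

struct toy_stack {
	uint64_t pte[STACK_PAGES]; /* toy PTEs: flag bits only */
};

/*
 * Run after the thread has been switched out. Any page that is present
 * but was never accessed is unmapped and counted as reclaimed. The
 * accessed bit is never cleared here, so a page that was touched once
 * is skipped on every later pass.
 */
static int toy_reclaim_after_switch_out(struct toy_stack *s)
{
	int reclaimed = 0;

	for (int i = 0; i < STACK_PAGES; i++) {
		if (!(s->pte[i] & TOY_PRESENT))
			continue; /* already reclaimed */
		if (s->pte[i] & TOY_ACCESSED)
			continue; /* accessed once: kept forever */
		s->pte[i] &= ~(uint64_t)TOY_PRESENT;
		reclaimed++;
	}
	return reclaimed;
}
```

Because the sketch never clears the accessed bit, the "never reclaimed after
being accessed" rule falls out without any extra per-page state.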

For that, the worst case is that you go to schedule a new thread and
can't allocate memory to fill in the 4 pte_t's. You can't run it until you
or some other CPU goes and does some reclaim.
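
The worst case described above can be sketched as a check made before a
thread is scheduled in. Again this is only a userspace model with
hypothetical names (sim_thread, sim_alloc_page, sim_prepare_switch_in); the
allocator is a simple counter, and keeping partially refilled pages mapped
after a failure is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_STACK_PTES 4
#define SIM_PRESENT   (1u << 0) /* stands in for _PAGE_PRESENT */

struct sim_thread {
	uint64_t pte[NR_STACK_PTES]; /* toy PTEs: flag bits only */
};

/* Hypothetical page allocator: hands out pages until the budget is gone. */
static bool sim_alloc_page(int *free_pages)
{
	assert(*free_pages >= 0);
	if (*free_pages == 0)
		return false;
	(*free_pages)--;
	return true;
}

/*
 * Called before a thread is scheduled in. Returns true if all four PTEs
 * are populated and the thread may run. Returns false if an allocation
 * failed, meaning the thread cannot run until memory has been reclaimed.
 * Pages filled in before the failure stay mapped.
 */
static bool sim_prepare_switch_in(struct sim_thread *t, int *free_pages)
{
	for (int i = 0; i < NR_STACK_PTES; i++) {
		if (t->pte[i] & SIM_PRESENT)
			continue;
		if (!sim_alloc_page(free_pages))
			return false;
		t->pte[i] |= SIM_PRESENT;
	}
	return true;
}
```

The point of the model is that failure shows up as "this thread is not
runnable yet", which the caller can retry, rather than as a fault in the
middle of execution.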

Needing memory in the middle of schedule() is generally a no-go. But it's
a lot better than not being able to continue _execution_ of a kernel
thread at *ALL*, possibly in a non-preemptible context, like when you do
it in a #PF.

Basically, I think there's a way to do this that limits the kernel blast
radius to _mostly_ being a core mm problem.

What else has been considered before the #PF-based mechanism?
Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Posted by Pasha Tatashin 1 month, 3 weeks ago
On 04-24 23:26, David Laight wrote:
> On Fri, 24 Apr 2026 21:35:20 +0000
> Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
> 
> > On 04-24 12:41, Dave Hansen wrote:
> > > On 4/24/26 12:14, David Stevens wrote:  
> > > > The question is then: is this approach something that is fundamentally
> > > > untenable in the kernel  
> > > 
> > > Yes. Fundamentally untenable.
> > > 
> > > Not allowing stack faults has been a wonderful simplification. It's one
> > > of those things that just plain makes the kernel easier to maintain.
> > > Saving low single digits of system memory is not exactly making me eager
> > > to go back to the harder-to-maintain days.
> > > 
> > > I seriously doubt that this 1% is the lowest hanging fruit for memory
> > > bloat on these systems. ;)  
> > 
> > > This is true until, in a fleet of millions of machines, you encounter a 
> > one-in-a-billion chance of a stack overflow. You are then forced to 
> > double the statically allocated kernel stacks on every machine, paying a 
> > memory tax even though 99.999..% of threads never exceed 4K. This 
> > overhead accumulates to petabytes of wasted capacity.
> 
> And then you hit a stack fault in some path where you can't sleep and
> there isn't any available kernel memory.

Well, at least if we hit this rare case, we can simply double a buffer 
of pre-reserved stack memory per CPU. This still saves significant 
memory compared to wasting it on every single thread.
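
To make the comparison concrete, the arithmetic can be sketched as below.
The helper names are hypothetical, and the inputs used in the note that
follows (1,000,000 threads, 200 CPUs, 16 KiB per stack or reserve buffer)
are illustrative assumptions, not measurements.

```c
#include <assert.h>
#include <stdint.h>

/* Extra bytes if every thread's stack grows by stack_bytes. */
static uint64_t extra_bytes_per_thread_doubling(uint64_t nr_threads,
						uint64_t stack_bytes)
{
	/* guard against overflow in the multiplication */
	assert(nr_threads == 0 || stack_bytes <= UINT64_MAX / nr_threads);
	return nr_threads * stack_bytes;
}

/* Extra bytes if only a per-CPU reserve buffer grows by reserve_bytes. */
static uint64_t extra_bytes_per_cpu_doubling(uint64_t nr_cpus,
					     uint64_t reserve_bytes)
{
	/* guard against overflow in the multiplication */
	assert(nr_cpus == 0 || reserve_bytes <= UINT64_MAX / nr_cpus);
	return nr_cpus * reserve_bytes;
}
```

With the assumed inputs, growing every thread stack by 16 KiB costs
1,000,000 * 16384 = 16,384,000,000 bytes, while growing a per-CPU reserve by
the same amount costs 200 * 16384 = 3,276,800 bytes. The ratio is simply the
number of threads divided by the number of CPUs.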

> An alternative idea is to arrange for some system calls to sleep in
> userspace, so when the thread is woken it re-executes the system call.
> It then makes sense to assign the kernel stack to the process when
> it enters the kernel.
> That might mean that you don't need a kernel stack for all the threads
> sleeping in futex() - it might even be possible to do the retry in
> userspace saving the second kernel entry most of the time.
> It is all 'hard and difficult' though.

I was thinking about a similar approach as well, sort of multiplexing the 
kernel stacks. But honestly, when trying to cover all the edge cases, I 
didn't find it to be any better or easier than just using dynamic kernel 
stacks.

An alternative approach, which was proposed at LSFMM by Willy, is to add 
explicit deep stack calls. When we enter a path that we know is 
exceptionally deep, only then do we extend the stack, keeping the 
default (say, 8K) everywhere else.
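
A userspace analogy of an explicit deep stack call, and not the interface
proposed at LSFMM, can be built with the ucontext routines from glibc's
<ucontext.h>. In this sketch call_on_deep_stack, deep_trampoline and
deep_path are hypothetical names, the code is single-threaded only, and the
larger stack comes from malloc() purely for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

struct deep_call {
	void (*fn)(void *);
	void *arg;
};

static struct deep_call pending;       /* single-threaded sketch only */
static ucontext_t caller_ctx, deep_ctx;

/* Entry point on the big stack; returning resumes caller_ctx via uc_link. */
static void deep_trampoline(void)
{
	pending.fn(pending.arg);
}

/* Run fn(arg) on a temporary stack of stack_bytes; 0 on success, -1 on error. */
static int call_on_deep_stack(void (*fn)(void *), void *arg, size_t stack_bytes)
{
	void *stack = malloc(stack_bytes);

	if (!stack)
		return -1;
	if (getcontext(&deep_ctx) != 0) {
		free(stack);
		return -1;
	}
	deep_ctx.uc_stack.ss_sp = stack;
	deep_ctx.uc_stack.ss_size = stack_bytes;
	deep_ctx.uc_link = &caller_ctx;
	pending.fn = fn;
	pending.arg = arg;
	makecontext(&deep_ctx, deep_trampoline, 0);
	if (swapcontext(&caller_ctx, &deep_ctx) != 0) {
		free(stack);
		return -1;
	}
	free(stack); /* back on the original stack here */
	return 0;
}

/* Example of a path known to need a large frame. */
static void deep_path(void *arg)
{
	volatile char frame[32 * 1024];

	frame[0] = 1;
	frame[sizeof(frame) - 1] = 2;
	*(int *)arg = frame[0] + frame[sizeof(frame) - 1];
}
```

A caller would wrap only the known-deep path, for example
call_on_deep_stack(deep_path, &result, 256 * 1024), and run everything else
on its default stack.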

> The easier solution is to rewrite the system code so it doesn't have
> 1000s of threads :-)

That ship sailed in the early 90s of the previous millennium.  Nowadays, 
we have high end workstations with almost 200 hardware threads. 
Rewriting system code to reduce thread counts simply isn't an option for 
our storage machines, which have millions of threads per unit.

+CC Matthew Wilcox
Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Posted by Dave Hansen 1 month, 3 weeks ago
On 4/24/26 14:35, Pasha Tatashin wrote:
> On 04-24 12:41, Dave Hansen wrote:
>> On 4/24/26 12:14, David Stevens wrote:
>>> The question is then: is this approach something that is fundamentally
>>> untenable in the kernel
>> Yes. Fundamentally untenable.
>>
>> Not allowing stack faults has been a wonderful simplification. It's one
>> of those things that just plain makes the kernel easier to maintain.
>> Saving low single digits of system memory is not exactly making me eager
>> to go back to the harder-to-maintain days.
>>
>> I seriously doubt that this 1% is the lowest hanging fruit for memory
>> bloat on these systems. 😉
> This is true until, in a fleet of millions of machines, you encounter a 
> one-in-a-billion chance of a stack overflow. You are then forced to 
> double the statically allocated kernel stacks on every machine, paying a 
> memory tax even though 99.999..% of threads never exceed 4K. This 
> overhead accumulates to petabytes of wasted capacity.

I don't disagree with you. But, at that point, you're picking your
poison: bugs from dynamic kernel stacks versus crashes from stack overflows.

At some point, I might be able to be talked into dynamic stack as a
FRED-only feature. But FRED isn't widespread enough to go to the trouble
today. I'm sure the folks who want this also don't want to wait until
all the devices in the field have FRED because that's even *longer* off.

So maybe this is one of those things that folks just need to deploy
out-of-tree for a couple of years, come back with some data to show us
that we were just paranoid, and we'll look at it again.
Re: [PATCH v2 00/13] Dynamic Kernel Stacks
Posted by David Stevens 1 month, 3 weeks ago
On Fri, Apr 24, 2026 at 3:21 PM Dave Hansen <dave.hansen@intel.com> wrote:
> On 4/24/26 14:35, Pasha Tatashin wrote:
> > On 04-24 12:41, Dave Hansen wrote:
> >> On 4/24/26 12:14, David Stevens wrote:
> >>> The question is then: is this approach something that is fundamentally
> >>> untenable in the kernel
> >> Yes. Fundamentally untenable.
> >>
> >> Not allowing stack faults has been a wonderful simplification. It's one
> >> of those things that just plain makes the kernel easier to maintain.
> >> Saving low single digits of system memory is not exactly making me eager
> >> to go back to the harder-to-maintain days.
> >>
> >> I seriously doubt that this 1% is the lowest hanging fruit for memory
> >> bloat on these systems. 😉
> > This is true until, in a fleet of millions of machines, you encounter a
> > one-in-a-billion chance of a stack overflow. You are then forced to
> > double the statically allocated kernel stacks on every machine, paying a
> > memory tax even though 99.999..% of threads never exceed 4K. This
> > overhead accumulates to petabytes of wasted capacity.
>
> I don't disagree with you. But, at that point, you're picking your
> poison: bugs from dynamic kernel stacks versus crashes from stack overflows.
>
> At some point, I might be able to be talked into dynamic stack as a
> FRED-only feature. But FRED isn't widespread enough to go to the trouble
> today. I'm sure the folks who want this also don't want to wait until
> all the devices in the field have FRED because that's even *longer* off.

Why does this need to be FRED only? True, the lack of reentrancy with
IST stacks complicates a few situations. That adds some complexity
beyond what's needed for FRED-only support, but the additional
complexity doesn't really seem like a hard blocker, at least if we
accept the complexity of kernel stack faults for FRED.

-David