This RFC is a continuation of Pasha Tatashin's original RFC [1], and is
based on Linus Walleij's rebased version of the patches [2]. My focus
was x86_64 devices, so I didn't include his arm64 WIP patches.
The impetus for reviving this RFC is kernel stack usage on Android. On
regular Android (i.e. non-wear/automotive), system processes typically
have 2000-3000 threads. When adding threads from app processes, this
means that systems with 4GB of memory are using 1-2% of total memory for
kernel thread stacks. Dynamic kernel stacks reduce this by 65%-70%.
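As a rough sanity check on the 1-2% figure, the arithmetic can be sketched in C. The 16 KiB stack size is the usual x86_64 THREAD_SIZE without KASAN; the helper names and any thread count passed in are illustrative assumptions, not measurements from this series.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 16 KiB kernel stacks (x86_64 THREAD_SIZE without KASAN). */
#define STACK_BYTES (16ULL * 1024)

/* Bytes pinned by fully populated kernel stacks for nr_threads threads. */
static uint64_t stack_bytes_total(uint64_t nr_threads)
{
	return nr_threads * STACK_BYTES;
}

/* Stack footprint as a share of total memory, in basis points (0.01%). */
static uint64_t stack_share_bp(uint64_t nr_threads, uint64_t total_bytes)
{
	assert(total_bytes > 0);
	return stack_bytes_total(nr_threads) * 10000 / total_bytes;
}

/* Bytes still in use after saving percent_saved percent of bytes. */
static uint64_t bytes_after_savings(uint64_t bytes, unsigned int percent_saved)
{
	assert(percent_saved <= 100);
	return bytes * (100 - percent_saved) / 100;
}
```

With a hypothetical 4000 threads on a 4 GiB device, stack_share_bp() gives about 1.5% of memory, and a 65% reduction leaves roughly 22 MB of the original 65 MB populated.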
The main change compared to Pasha's v1 RFC is how x86_64 handles kernel
stack faults. On systems where FRED is available, it handles kernel page
faults on stack level 1. When FRED isn't available, it uses a dedicated
IST stack for page faults. In both cases, page faults which aren't
dynamic stack faults are moved back onto the regular kernel stack. This
does introduce some overhead for page faults on user memory that
originate in the kernel (note that non-FRED systems already needed to
bounce userspace page faults through the entry stack), but such faults
aren't as hot a path as regular user page faults. There are certainly
systems where the memory savings are worth the overhead. That said, the
config could be made optional to give systems the option to pay the
memory cost to avoid the CPU overhead.
The biggest open issue is how to deal with reliability. This series uses
GFP_ATOMIC when refilling the per-CPU magazines during context switch,
which is necessary to avoid deadlock. This of course raises concerns
about allocation failure. If a magazine got depleted, then refilling the
magazine failed due to atomic reserve depletion, and then another thread
triggered a dynamic stack fault, that would trigger a fatal page fault.
There is also a secondary concern about additional pressure on the
memory reserves causing allocation failures at other atomic call sites.
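The failure sequence described above can be modeled with a small user-space sketch. Everything here is a hypothetical stand-in: the magazine capacity, the struct names, and the counter-based "reserve" are assumptions for illustration and do not mirror the code in the series.

```c
#include <assert.h>

#define MAGAZINE_CAPACITY 4 /* hypothetical size; the cover letter gives none */

struct magazine { int nr_pages; };       /* per-CPU cache of stack pages */
struct atomic_reserve { int nr_pages; }; /* stand-in for GFP_ATOMIC reserves */

enum fault_result { FAULT_HANDLED, FAULT_FATAL };

/* A dynamic stack fault cannot allocate; it can only take a cached page. */
static enum fault_result dynamic_stack_fault(struct magazine *mag)
{
	if (mag->nr_pages == 0)
		return FAULT_FATAL; /* nothing left to map under the stack */
	mag->nr_pages--;
	return FAULT_HANDLED;
}

/* Context-switch refill: best effort, may add fewer pages than needed. */
static int refill_on_context_switch(struct magazine *mag,
				    struct atomic_reserve *res)
{
	int added = 0;

	assert(mag->nr_pages >= 0 && mag->nr_pages <= MAGAZINE_CAPACITY);
	while (mag->nr_pages < MAGAZINE_CAPACITY && res->nr_pages > 0) {
		res->nr_pages--;
		mag->nr_pages++;
		added++;
	}
	return added;
}
```

In this model the fatal case needs three things in a row: a fault that drains the magazine, a refill that returns 0 because the reserve is empty, and another fault before any later refill succeeds.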
The question is then: is this approach something that is fundamentally
untenable in the kernel, or are there compromises that would allow it to
be merged? One obvious compromise is to make the feature optional. Both
kernel stack faults and running out of memory reserves are rare events.
I've never seen this failure in my testing, although I don't have field
data to back that up at this point. Some sysadmins may view it as low
enough risk to be worth the memory savings. There are also additional
measures that could be taken to reduce the likelihood of failure (e.g.
magazine management on kernel entry/exit, tunable magazine sizes, adding
best-effort trylock reclaim or oom kill).
This series was developed and tested on devices running 6.18 kernels. It
has been rebased onto 7.0, with minimal smoke testing after rebasing.
[1] https://lore.kernel.org/all/20240311164638.2015063-1-pasha.tatashin@soleen.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-integrator.git/log/?h=b4/aarch64-dynamic-kernel-stacks-v6.18-rc1
David Stevens (7):
fork: Don't assume fully populated stack during reuse
fork: Move vm_stack to the beginning of the stack
fork: Move vmap stack freeing to work queue
fork: Store task pointer in unpopulated stack ptes
x86/entry/fred: encode frame pointer on entry
x86: Add support for dynamic kernel stacks via FRED
x86: Add support for dynamic kernel stacks via IST
Pasha Tatashin (6):
fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE
fork: separate vmap stack allocation and free calls
mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range() public
functions
fork: Dynamic Kernel Stacks
task_stack.h: Add stack_not_used() support for dynamic stack
fork: Dynamic Kernel Stack accounting
arch/Kconfig | 38 ++
arch/x86/Kconfig | 1 +
arch/x86/entry/entry_64.S | 49 ++-
arch/x86/entry/entry_64_fred.S | 57 +++
arch/x86/include/asm/cpu_entry_area.h | 18 +
arch/x86/include/asm/idtentry.h | 38 +-
arch/x86/include/asm/page_64_types.h | 10 +-
arch/x86/include/asm/pgtable_64.h | 36 ++
arch/x86/include/asm/processor.h | 6 +
arch/x86/include/asm/traps.h | 5 +
arch/x86/kernel/cpu/common.c | 11 +
arch/x86/kernel/dumpstack_64.c | 10 +-
arch/x86/kernel/fred.c | 20 +-
arch/x86/kernel/idt.c | 57 +--
arch/x86/kernel/nmi.c | 9 +
arch/x86/lib/usercopy.c | 9 +
arch/x86/mm/cpu_entry_area.c | 17 +
arch/x86/mm/dump_pagetables.c | 14 +-
arch/x86/mm/fault.c | 101 +++++-
include/linux/mmzone.h | 3 +
include/linux/sched.h | 11 +-
include/linux/sched/task_stack.h | 48 ++-
include/linux/vmalloc.h | 14 +
init/init_task.c | 4 +
kernel/exit.c | 22 ++
kernel/fork.c | 481 ++++++++++++++++++++++++--
kernel/sched/core.c | 1 +
mm/memcontrol.c | 10 +
mm/vmalloc.c | 27 +-
mm/vmstat.c | 3 +
30 files changed, 1049 insertions(+), 81 deletions(-)
base-commit: 028ef9c96e96197026887c0f092424679298aae8
--
2.54.0.rc2.544.gc7ae2d5bb8-goog
On 4/24/26 12:14, David Stevens wrote:
> The question is then: is this approach something that is fundamentally
> untenable in the kernel

Yes. Fundamentally untenable.

Not allowing stack faults has been a wonderful simplification. It's one
of those things that just plain makes the kernel easier to maintain.
Saving low single digits of system memory is not exactly making me eager
to go back to the harder-to-maintain days.

I seriously doubt that this 1% is the lowest hanging fruit for memory
bloat on these systems. ;)
On 2026-04-24 12:41, Dave Hansen wrote:
> On 4/24/26 12:14, David Stevens wrote:
>> The question is then: is this approach something that is fundamentally
>> untenable in the kernel
>
> Yes. Fundamentally untenable.
>
> Not allowing stack faults has been a wonderful simplification. It's one
> of those things that just plain makes the kernel easier to maintain.
> Saving low single digits of system memory is not exactly making me eager
> to go back to the harder-to-maintain days.
>
> I seriously doubt that this 1% is the lowest hanging fruit for memory
> bloat on these systems. ;)

It is worth noting that this was one of the VERY early design decisions that
has shaped Linux from the beginning:

- No swapping of kernel memory
- Kernel stacks are statically allocated
- Physical RAM is mapped into the kernel at all times
- A "monolithic" kernel using function calls, not message passing
- A kernel interface that closely maps to the low-level application API
  (e.g. each user space thread is a kernel thread.)
- Kernel ABIs and APIs are subject to evolution; stability is only guaranteed
  in user space.

Those design decisions are, by and large, what has made Linux Linux: a
relatively simple, highly performant, and reliable system.

-hpa
On 04-25 02:19, H. Peter Anvin wrote:
> On 2026-04-24 12:41, Dave Hansen wrote:
> > On 4/24/26 12:14, David Stevens wrote:
> >> The question is then: is this approach something that is fundamentally
> >> untenable in the kernel
> >
> > Yes. Fundamentally untenable.
> >
> > Not allowing stack faults has been a wonderful simplification. It's one
> > of those things that just plain makes the kernel easier to maintain.
> > Saving low single digits of system memory is not exactly making me eager
> > to go back to the harder-to-maintain days.
> >
> > I seriously doubt that this 1% is the lowest hanging fruit for memory
> > bloat on these systems. ;)
>
> It is worth noting that this was one of the VERY early design decisions that
> has shaped Linux from the beginning:
>
> - No swapping of kernel memory
> - Kernel stacks are statically allocated
> - Physical RAM is mapped into the kernel at all times
> - A "monolithic" kernel using function calls, not message passing
> - A kernel interface that closely maps to the low-level application API
>   (e.g. each user space thread is a kernel thread.)
> - Kernel ABIs and APIs are subject to evolution; stability is only guaranteed
>   in user space.
>
> Those design decisions are, by and large, what has made Linux Linux: a
> relatively simple, highly performant, and reliable system.

I think there is a bit of survivorship bias in that list. Originally,
there were many other foundational assumptions that have since evolved
as hardware and requirements scaled.

For example, there were assumptions about no dynamic hardware
reconfiguration (no memory/CPU hot-plug), uniform memory access (no
NUMA), and fixed page sizes (no THP or HugeTLB). All of those have
changed, and you, better than most, know of many other such examples. A
more recent example is PREEMPT_RT: the Linux kernel was originally
designed to be non-preemptible.

Even the assumptions in your list, such as "physical RAM is mapped into
the kernel at all times," are evolving: emulated pmem is not mapped, and
guestmemfd plans to allow unmapping memory from the direct map for
security reasons.

Aside from trying our best not to break user space and allowing the
internal kernel API to evolve, the other items are architectural
decisions that can and should adapt to new requirements.

We now have machines with thousands of hardware threads. Running
millions of software threads on such machines is a practical reality,
and at fleet scales, statically allocating kernel stacks for all of them
wastes a massive amount of memory.

The proposed solution won't affect Linux as a whole. It can be
optionally enabled for targeted configurations. Additionally, the max
stack size is still statically set; it simply isn't populated until
actually used.

Pasha
On 4/25/26 02:19, H. Peter Anvin wrote:
> It is worth noting that this was one of the VERY early design decisions that
> has shaped Linux from the beginning:
>
> - No swapping of kernel memory
> - Kernel stacks are statically allocated
...

One other bit to add here: In the past, kernel faults on kernel memory
have been allowed, like to populate vmalloc() page table entries into
the parts of the page tables that are not shared across processes. Even
*that* turned out to be too much of a pain even though it didn't involve
allocation, and the kernel has been moving away from that.
On Mon, Apr 27, 2026 at 9:22 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/25/26 02:19, H. Peter Anvin wrote:
> > It is worth noting that this was one of the VERY early design decisions that
> > has shaped Linux from the beginning:
> >
> > - No swapping of kernel memory
> > - Kernel stacks are statically allocated
> ...
>
> One other bit to add here: In the past, kernel faults on kernel memory
> have been allowed, like to populate vmalloc() page table entries into
> the parts of the page tables that are not shared across processes. Even
> *that* turned out to be too much of a pain even though it didn't involve
> allocation, and the kernel has been moving away from that.
>

Dave,

Necroing this thread, as the potential aggregate savings continue to
stand out for us on the datacenter side (whereas David is motivated
separately, from the consumer device side).

I certainly empathize with your position, and hesitation to give up such
a nice simplification just to invite new headaches. However, I'd still
like to work with you to understand what feasible path forward you see,
hoping you can proactively steer us away from some of the bigger
headaches.

I think we are fine being forward-looking, and only supporting this for
FRED (which is on our doorstep). That said, understanding the issues you
foresee with the IST approach would still be valuable, as it might save
us internal trouble should we choose to carry it temporarily to bridge
the gap with FRED.

Overall, are there any particular painpoints you'd like to see flushed
out, first?

How would you like to proceed? Would explicitly marking this as an
experimental config, in the interim, be more attractive?

Thanks, and I appreciate any help or guidance here.

Best,
Zach
On 6/18/26 07:50, Zach O'Keefe wrote:
> Overall, are there any particular painpoints you'd like to see flushed
> out, first?

Handling exceptions in the kernel is hard. Period. That's the pain point.
Just look at NMIs, #VC, #MC and the rest of that mess. Just look at how
we've moved away from ever taking random page faults in the kernel. Or,
heck, randomly taking faults at *all*. We've concentrated them in very
specific places, not in general code.

Now you're arguing that the kernel can pretty much take a fault *AND*
allocate memory reliably at any point*.

I just don't see the collateral in this series to justify that claim.

The NMI entry code is a disaster because NMIs can happen anywhere. The
#VC code is a disaster because #VCs can happen anywhere. Once #PF can
happen anywhere*, why won't #PF become a disaster?

It would be a completely different story if there was a track record of
finding and fixing bugs in the x86 entry code from the authors of this
series. But I don't think I've ever seen a single email from your folks
before this, much less a review tag or a patch. I'd be much happier if
you got Andy L's blessing on this, for example.

> How would you like to proceed? Would explicitly marking this as an
> experimental config, in the interim, be more attractive?

No.

The enemy here is complexity. *Maintenance* complexity. Being able to
compile out some of the complexity helps with debugging. But it doesn't
help maintaining the code.

--
* #PF on stack accesses isn't *quite* as bad as NMI or #VC, I'll give
  you that. But it's still pretty darn bad.
On Thu, Jun 18 2026 at 11:53, Dave Hansen wrote:
> On 6/18/26 07:50, Zach O'Keefe wrote:
>> Overall, are there any particular painpoints you'd like to see flushed
>> out, first?
>
> Handling exceptions in the kernel is hard. Period. That's the pain point.
> Just look at NMIs, #VC, #MC and the rest of that mess. Just look at how
> we've moved away from ever taking random page faults in the kernel. Or,
> heck, randomly taking faults at *all*. We've concentrated them in very
> specific places, not in general code.
>
> Now you're arguing that the kernel can pretty much take a fault *AND*
> allocate memory reliably at any point*.
>
> I just don't see the collateral in this series to justify that claim.
There is none because it's simply impossible to guarantee and when
reading through the series even a CPU hotplug operation happily
continues with success when the stack page cache of the upcoming CPU
can't be filled....
> The NMI entry code is a disaster because NMIs can happen anywhere. The
> #VC code is a disaster because #VCs can happen anywhere. Once #PF can
> happen anywhere*, why won't #PF become a disaster?
It's already a disaster. See kvm_handle_async_pf() and the cute issues
vs. taking a #PF in NMI or some other IST handler.
> It would be a completely different story if there was a track record of
> finding and fixing bugs in the x86 entry code from the authors of this
> series. But I don't think I've ever seen a single email from your folks
> before this, much less a review tag or a patch. I'd be much happier if
> you got Andy L's blessing on this, for example.
>
>> How would you like to proceed? Would explicitly marking this as an
>> experimental config, in the interim, be more attractive?
> No.
>
> The enemy here is complexity. *Maintenance* complexity. Being able to
> compile out some of the complexity helps with debugging. But it doesn't
> help maintaining the code.
Correct.
Aside of that the part which worries me most is the IDT hackery. That's
fragile as hell and full of unvalidated assumptions. Reading "should not
happen" several times in a changelog doesn't make me more confident.
"It is possible for #MCE to occur on the #PF IST stack, but the #MCE
handler shouldn't generate new #PFs. The reentrancy check on the #PF
stack will trigger if any recoverable #MCEs do generate #PFs - if there
are actually reports of it happening, we can address it then."
Seriously?
We don't wait until the report comes in because the report won't even
happen in the worst case:
   #PF on IST
      ...
      cmp 0, reentrance
      jne abort

   #MC
      ...
      #PF rewinds #PF IST
         cmp 0, reentrance
         jne abort     <- Not taken because #MC happened before
                          it could be set.
IST is fundamentally not suitable for this and I'm sure there are more
holes in this.
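The window in that trace can be reproduced with a toy user-space model. This is only a sketch under stated assumptions: the struct, the counters, and the way the #MC is injected are all hypothetical and do not correspond to the entry code in the series. It models the guard as a separate check and store, with the #MC handler's #PF as a nested call.

```c
#include <assert.h>
#include <stdbool.h>

/* Where the simulated #MC lands relative to the guard update. */
enum mc_point { MC_NONE, MC_BEFORE_SET, MC_AFTER_SET };

/* Hypothetical model state; none of these names come from the series. */
struct ist_model {
	int reentrance;          /* guard flag tested on #PF IST entry */
	int live_frames;         /* handlers currently using the IST stack */
	int aborts;              /* reentries the guard caught */
	int undetected_clobbers; /* reentries that passed the guard */
};

/* One #PF entry on the (non-nesting) IST stack. */
static void pf_ist_entry(struct ist_model *m, enum mc_point mc)
{
	/* Hardware always starts at the top of the IST stack. */
	bool clobbers_live_frame = m->live_frames > 0;

	m->live_frames++;

	/* cmp 0, reentrance; jne abort */
	if (m->reentrance != 0) {
		m->aborts++;
		m->live_frames--;
		return;
	}
	if (clobbers_live_frame)
		m->undetected_clobbers++;

	/* #MC whose handler takes a #PF, before the guard is set. */
	if (mc == MC_BEFORE_SET)
		pf_ist_entry(m, MC_NONE);

	m->reentrance = 1;

	/* Same event, but after the guard is armed. */
	if (mc == MC_AFTER_SET)
		pf_ist_entry(m, MC_NONE);

	m->reentrance = 0;
	m->live_frames--;
}

/* Run one scenario from a clean state and return the counters. */
static struct ist_model run_scenario(enum mc_point mc)
{
	struct ist_model m = { 0 };

	pf_ist_entry(&m, mc);
	assert(m.live_frames == 0);
	return m;
}
```

In this model run_scenario(MC_AFTER_SET) records one abort, while run_scenario(MC_BEFORE_SET) records none and one undetected clobber, which is the case where no report would ever be produced.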
I haven't looked at the FRED side of affairs yet in detail, but the
handwavy explanation about external interrupts having to be moved to
stack level 1 and unconditionally bounced back does not really make it
appealing. I agree that chapter 8.3.4 in the SDM volume 3 is not really
helpful, but papering over the problem without understanding the root
cause is not cutting it. If it's a genuine FRED hardware issue, then
this needs to be understood and documented.
The x86 folks have spent a lot of time to make the horrific x86
interrupt and exception handling solid and therefore have zero interest
to deal with the fallout of something based on "shouldn't happen"
assumptions. Either it can prove correctness under all circumstances or
not.
I understand the "save tons of memory across a fleet" argument, but a
large fleet is also a guarantee to trigger all the "should not happen
and improbable" issues which are gracefully handwaved away. That's a
truly bad tradeoff as it ends up in non-decodable bug reports. What's
worse, they have to be handled by the maintainers and not necessarily by
those who implemented it.
Thanks,
tglx
On 2026-06-18 11:53, Dave Hansen wrote:
> On 6/18/26 07:50, Zach O'Keefe wrote:
>> Overall, are there any particular painpoints you'd like to see flushed
>> out, first?
>
> Handling exceptions in the kernel is hard. Period. That's the pain point.
> Just look at NMIs, #VC, #MC and the rest of that mess. Just look at how
> we've moved away from ever taking random page faults in the kernel. Or,
> heck, randomly taking faults at *all*. We've concentrated them in very
> specific places, not in general code.
>
> Now you're arguing that the kernel can pretty much take a fault *AND*
> allocate memory reliably at any point*.
>
> I just don't see the collateral in this series to justify that claim.
>

That is most definitely the zeroth-order thing. Extraordinary claims require
extraordinary evidence, and this is certainly an extraordinary claim.

In addition to the *massive* maintainability issue, you also have to consider the
additional overheads you will now have to deal with in order to avoid deadlocks.

Almost every OS that has attempted to swap out kernel stacks has been known
to suffer from deadlocks under very high memory load.

> The NMI entry code is a disaster because NMIs can happen anywhere. The
> #VC code is a disaster because #VCs can happen anywhere. Once #PF can
> happen anywhere*, why won't #PF become a disaster?
>
[...]
> * #PF on stack accesses isn't *quite* as bad as NMI or #VC, I'll give
>   you that. But it's still pretty darn bad.

In some ways, they are actually *worse*.

#PFs need to be able to sleep, because the common case for a #PF in the kernel
is that it touched user space. This means #PF needs to be using IST/SL 0.
However, this is obviously incompatible with handling #PFs on the kernel stack
itself, so now it needs a stack switch. In the common case, it will then need
to demote the #PF back onto the normal execution stack, which is complex in
its own right.

Now, if you are on a pre-FRED system, the IST entries don't nest, so you
absolutely have to make sure you can't get there again through any means
whatsoever. With FRED, it isn't quite so dire, but it will still give you lots
of fun if that interrupt is one which would like to be demoted off the IRQ stack.

> It would be a completely different story if there was a track record of
> finding and fixing bugs in the x86 entry code from the authors of this
> series. But I don't think I've ever seen a single email from your folks
> before this, much less a review tag or a patch. I'd be much happier if
> you got Andy L's blessing on this, for example.
>
>> How would you like to proceed? Would explicitly marking this as an
>> experimental config, in the interim, be more attractive?
> No.
>
> The enemy here is complexity. *Maintenance* complexity. Being able to
> compile out some of the complexity helps with debugging. But it doesn't
> help maintaining the code.

Indeed. Paravirtualization is a great example of how this works. The PV hooks
in the kernel are still a maintenance nightmare 20 years after they were
introduced, and mostly that cost is not borne by the people who introduced and
benefited from them.

-hpa
On Thu, Jun 18, 2026 at 3:28 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 2026-06-18 11:53, Dave Hansen wrote:
> > On 6/18/26 07:50, Zach O'Keefe wrote:
> >> Overall, are there any particular painpoints you'd like to see flushed
> >> out, first?
> >
> > Handling exceptions in the kernel is hard. Period. That's the pain point.
> > Just look at NMIs, #VC, #MC and the rest of that mess. Just look at how
> > we've moved away from ever taking random page faults in the kernel. Or,
> > heck, randomly taking faults at *all*. We've concentrated them in very
> > specific places, not in general code.
> >
> > Now you're arguing that the kernel can pretty much take a fault *AND*
> > allocate memory reliably at any point*.
> >
> > I just don't see the collateral in this series to justify that claim.
> >
>
> That is most definitely the zeroth-order thing. Extraordinary claims require
> extraordinary evidence, and this is certainly an extraordinary claim.

I do acknowledge that there is currently a lack of evidence - this is
an RFC after all. The question is whether it is possible in principle
to produce sufficient evidence. From the Android side of Google, we
are willing to carry the RFC patches downstream for a while to build a
case for merging them upstream. However, there needs to be at least a
possibility of success before we undertake that work. If upstream's
position is that dynamic stacks are no good, full stop, and will
absolutely never happen, then there's no point in us trying to pursue
this avenue further. And I assume those from the datacenter side of
the company are in a similar position.

-David

> In addition to the *massive* maintainability issue, you also have to consider the
> additional overheads you will now have to deal with in order to avoid deadlocks.
>
> Almost every OS that has attempted to swap out kernel stacks has been known
> to suffer from deadlocks under very high memory load.
>
> > The NMI entry code is a disaster because NMIs can happen anywhere. The
> > #VC code is a disaster because #VCs can happen anywhere. Once #PF can
> > happen anywhere*, why won't #PF become a disaster?
> >
> [...]
> > * #PF on stack accesses isn't *quite* as bad as NMI or #VC, I'll give
> >   you that. But it's still pretty darn bad.
>
> In some ways, they are actually *worse*.
>
> #PFs need to be able to sleep, because the common case for a #PF in the kernel
> is that it touched user space. This means #PF needs to be using IST/SL 0.
> However, this is obviously incompatible with handling #PFs on the kernel stack
> itself, so now it needs a stack switch. In the common case, it will then need
> to demote the #PF back onto the normal execution stack, which is complex in
> its own right.
>
> Now, if you are on a pre-FRED system, the IST entries don't nest, so you
> absolutely have to make sure you can't get there again through any means
> whatsoever. With FRED, it isn't quite so dire, but it will still give you lots
> of fun if that interrupt is one which would like to be demoted off the IRQ stack.
>
> > It would be a completely different story if there was a track record of
> > finding and fixing bugs in the x86 entry code from the authors of this
> > series. But I don't think I've ever seen a single email from your folks
> > before this, much less a review tag or a patch. I'd be much happier if
> > you got Andy L's blessing on this, for example.
> >
> >> How would you like to proceed? Would explicitly marking this as an
> >> experimental config, in the interim, be more attractive?
> > No.
> >
> > The enemy here is complexity. *Maintenance* complexity. Being able to
> > compile out some of the complexity helps with debugging. But it doesn't
> > help maintaining the code.
>
> Indeed. Paravirtualization is a great example of how this works. The PV hooks
> in the kernel are still a maintenance nightmare 20 years after they were
> introduced, and mostly that cost is not borne by the people who introduced and
> benefited from them.
>
> -hpa
>
On 2026-06-18 17:40, David Stevens wrote:
>>
>> That is most definitely the zeroth-order thing. Extraordinary claims require
>> extraordinary evidence, and this is certainly an extraordinary claim.
>
> I do acknowledge that there is currently a lack of evidence - this is
> an RFC after all. The question is whether it is possible in principle
> to produce sufficient evidence. From the Android side of Google, we
> are willing to carry the RFC patches downstream for a while to build a
> case for merging them upstream. However, there needs to be at least a
> possibility of success before we undertake that work. If upstream's
> position is that dynamic stacks are no good, full stop, and will
> absolutely never happen, then there's no point in us trying to pursue
> this avenue further. And I assume those from the datacenter side of
> the company are in a similar position.
>

The answer is pretty much that you would have to present *very*
impressive-looking evidence. There is very little that's completely absolute,
but you definitely have a tall hill to climb on this one.

Keep in mind that there are also people who claim that our current page
sizes are much too small, and that the kernel should be doing 16K or 64K
pages. At that point this more or less evaporates, too.

-hpa
On 04-24 12:41, Dave Hansen wrote:
> On 4/24/26 12:14, David Stevens wrote:
> > The question is then: is this approach something that is fundamentally
> > untenable in the kernel
>
> Yes. Fundamentally untenable.
>
> Not allowing stack faults has been a wonderful simplification. It's one
> of those things that just plain makes the kernel easier to maintain.
> Saving low single digits of system memory is not exactly making me eager
> to go back to the harder-to-maintain days.
>
> I seriously doubt that this 1% is the lowest hanging fruit for memory
> bloat on these systems. ;)

This is true until, in a fleet of millions of machines, you encounter a
one-in-a-billion chance of a stack overflow. You are then forced to
double the statically allocated kernel stacks on every machine, paying a
memory tax even though 99.999..% of threads never exceed 4K. This
overhead accumulates to petabytes of wasted capacity.

Pasha
On Fri, 24 Apr 2026 21:35:20 +0000
Pasha Tatashin <pasha.tatashin@soleen.com> wrote:

> On 04-24 12:41, Dave Hansen wrote:
> > On 4/24/26 12:14, David Stevens wrote:
> > > The question is then: is this approach something that is fundamentally
> > > untenable in the kernel
> >
> > Yes. Fundamentally untenable.
> >
> > Not allowing stack faults has been a wonderful simplification. It's one
> > of those things that just plain makes the kernel easier to maintain.
> > Saving low single digits of system memory is not exactly making me eager
> > to go back to the harder-to-maintain days.
> >
> > I seriously doubt that this 1% is the lowest hanging fruit for memory
> > bloat on these systems. ;)
>
> This is true until, in a fleet of millions of machines, you encounter a
> one-in-a-billion chance of a stack overflow. You are then forced to
> double the statically allocated kernel stacks on every machine, paying a
> memory tax even though 99.999..% of threads never exceed 4K. This
> overhead accumulates to petabytes of wasted capacity.

And then you hit a stack fault in some path where you can't sleep and
there isn't any available kernel memory.

An alternative idea is to arrange for some system calls to sleep in
userspace, so when the thread is woken it re-executes the system call.
It then makes sense to assign the kernel stack to the process when
it enters the kernel.
That might mean that you don't need a kernel stack for all the threads
sleeping in futex() - it might even be possible to do the retry in
userspace saving the second kernel entry most of the time.
It is all 'hard and difficult' though.

The easier solution is to rewrite the system code so it doesn't have
1000s of threads :-)

David

>
> Pasha
>
On 4/24/26 15:26, David Laight wrote:
>> This is true until, in a fleet of millions of machines, you encounter a
>> one-in-a-billion chance of a stack overflow. You are then forced to
>> double the statically allocated kernel stacks on every machine, paying a
>> memory tax even though 99.999..% of threads never exceed 4K. This
>> overhead accumulates to petabytes of wasted capacity.
> And then you hit a stack fault in some path where you can't sleep and
> there isn't any available kernel memory.
>
> An alternative idea is to arrange for some system calls to sleep in
> userspace, so when the thread is woken it re-executes the system call.
> It then makes sense to assign the kernel stack to the process when
> it enters the kernel.

There are probably other ways to do this without handling exceptions.

For instance, let's say you always *map* 16k of stack for each process.
But, after context switching out, you take a look at 4x8b pte_t's that
were mapping the kernel stack. If the _PAGE_ACCESSED bit is clear, you
can just clear _PAGE_PRESENT and reclaim the page.

If you don't want the overhead in the normal context switch path, you
reclaim in a shrinker, at the cost of needing locking to coordinate with
the scheduler.

A simple rule would be: a thread that ever accesses a page gets to keep
it forever. They're never reclaimed after being accessed, only before.

For that, the worst case is that you go to schedule a new thread and
can't allocate memory to fill in the 4 pte_t's. You can't run it until
you or some other CPU goes and does some reclaim.

Needing memory in the middle of schedule() is generally a no-go. But
it's a lot better than not being able to continue _execution_ of a
kernel thread at *ALL*, possibly in a non-preemptible context, like when
you do it in a #PF.

Basically, I think there's a way to do this that limits the kernel blast
radius to _mostly_ being a core mm problem.

What else has been considered before the #PF-based mechanism?
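The accessed-bit scheme from the message above can be sketched as a user-space model. The names, the 32-bit "PTE", and the integer page pool are hypothetical stand-ins; the bit positions mirror x86's _PAGE_PRESENT and _PAGE_ACCESSED but are only illustrative, and no real page-table or locking code is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PRESENT  (1u << 0) /* stand-in for _PAGE_PRESENT */
#define MODEL_ACCESSED (1u << 5) /* stand-in for _PAGE_ACCESSED */
#define STACK_PTES     4         /* 16k of stack mapped with 4k pages */

struct stack_model {
	uint32_t pte[STACK_PTES];
};

/*
 * Switch-out: reclaim pages that were mapped but never touched.
 * Pages with the accessed bit set are kept forever.
 * Returns the number of pages returned to the pool.
 */
static int reclaim_untouched(struct stack_model *s, int *free_pages)
{
	int freed = 0;

	for (int i = 0; i < STACK_PTES; i++) {
		uint32_t pte = s->pte[i];

		if ((pte & MODEL_PRESENT) && !(pte & MODEL_ACCESSED)) {
			s->pte[i] = 0;
			(*free_pages)++;
			freed++;
		}
	}
	return freed;
}

/*
 * Switch-in: every slot must be present before the thread may run.
 * Returns false, leaving everything untouched, if the pool is too small.
 */
static bool populate_before_run(struct stack_model *s, int *free_pages)
{
	int missing = 0;

	for (int i = 0; i < STACK_PTES; i++)
		if (!(s->pte[i] & MODEL_PRESENT))
			missing++;

	if (missing > *free_pages)
		return false; /* thread is not runnable until reclaim helps */

	for (int i = 0; i < STACK_PTES; i++) {
		if (!(s->pte[i] & MODEL_PRESENT)) {
			s->pte[i] = MODEL_PRESENT;
			(*free_pages)--;
		}
	}
	return true;
}
```

The point of the model is where the failure lands: populate_before_run() fails before the thread runs, so the thread simply stays off the CPU, instead of failing in the middle of execution as a #PF-based scheme would.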
On 04-24 23:26, David Laight wrote:
> On Fri, 24 Apr 2026 21:35:20 +0000
> Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> > On 04-24 12:41, Dave Hansen wrote:
> > > On 4/24/26 12:14, David Stevens wrote:
> > > > The question is then: is this approach something that is fundamentally
> > > > untenable in the kernel
> > >
> > > Yes. Fundamentally untenable.
> > >
> > > Not allowing stack faults has been a wonderful simplification. It's one
> > > of those things that just plain makes the kernel easier to maintain.
> > > Saving low single digits of system memory is not exactly making me eager
> > > to go back to the harder-to-maintain days.
> > >
> > > I seriously doubt that this 1% is the lowest hanging fruit for memory
> > > bloat on these systems. ;)
> >
> > This is true until, in a fleet of millions of machines, you encounter a
> > one-in-a-billion chance of a stack overflow. You are then forced to
> > double the statically allocated kernel stacks on every machine, paying a
> > memory tax even though 99.999..% of threads never exceed 4K. This
> > overhead accumulates to petabytes of wasted capacity.
>
> And then you hit a stack fault in some path where you can't sleep and
> there isn't any available kernel memory.

Well, at least if we hit this rare case, we can simply double a buffer
of pre-reserved stack memory per CPU. This still saves significant
memory compared to wasting it on every single thread.

> An alternative idea is to arrange for some system calls to sleep in
> userspace, so when the thread is woken it re-executes the system call.
> It then makes sense to assign the kernel stack to the process when
> it enters the kernel.
> That might mean that you don't need a kernel stack for all the threads
> sleeping in futex() - it might even be possible to do the retry in
> userspace saving the second kernel entry most of the time.
> It is all 'hard and difficult' though.

I was thinking about a similar approach as well, sort of multiplexing
the kernel stacks. But honestly, when trying to cover all the edge
cases, I didn't find it to be any better or easier than just using
dynamic kernel stacks.

An alternative approach, which was proposed at LSFMM by Willy, is to add
explicit deep-stack calls. When we enter a path that we know is
exceptionally deep, only then do we extend the stack, keeping the
default (say, 8K) everywhere else.

> The easier solution is to rewrite the system code so it doesn't have
> 1000s of threads :-)

That ship sailed in the early 90s of the previous millennium. Nowadays,
we have high end workstations with almost 200 hardware threads.
Rewriting system code to reduce thread counts simply isn't an option for
our storage machines, which have millions of threads per unit.

+CC Matthew Wilcox
On 4/24/26 14:35, Pasha Tatashin wrote:
> On 04-24 12:41, Dave Hansen wrote:
>> On 4/24/26 12:14, David Stevens wrote:
>>> The question is then: is this approach something that is fundamentally
>>> untenable in the kernel
>> Yes. Fundamentally untenable.
>>
>> Not allowing stack faults has been a wonderful simplification. It's one
>> of those things that just plain makes the kernel easier to maintain.
>> Saving low single digits of system memory is not exactly making me eager
>> to go back to the harder-to-maintain days.
>>
>> I seriously doubt that this 1% is the lowest hanging fruit for memory
>> bloat on these systems. ;)
> This is true until, in a fleet of millions of machines, you encounter a
> one-in-a-billion chance of a stack overflow. You are then forced to
> double the statically allocated kernel stacks on every machine, paying a
> memory tax even though 99.999..% of threads never exceed 4K. This
> overhead accumulates to petabytes of wasted capacity.

I don't disagree with you. But, at that point, you're picking your
poison: bugs from dynamic kernel stacks versus crashes from stack
overflows.

At some point, I might be able to be talked into dynamic stacks as a
FRED-only feature. But FRED isn't widespread enough to go to the trouble
today. I'm sure the folks who want this also don't want to wait until
all the devices in the field have FRED because that's even *longer* off.

So maybe this is one of those things that folks just need to deploy
out-of-tree for a couple of years, come back with some data to show us
that we were just paranoid, and we'll look at it again.
On Fri, Apr 24, 2026 at 3:21 PM Dave Hansen <dave.hansen@intel.com> wrote:
> On 4/24/26 14:35, Pasha Tatashin wrote:
> > On 04-24 12:41, Dave Hansen wrote:
> >> On 4/24/26 12:14, David Stevens wrote:
> >>> The question is then: is this approach something that is fundamentally
> >>> untenable in the kernel
> >> Yes. Fundamentally untenable.
> >>
> >> Not allowing stack faults has been a wonderful simplification. It's one
> >> of those things that just plain makes the kernel easier to maintain.
> >> Saving low single digits of system memory is not exactly making me eager
> >> to go back to the harder-to-maintain days.
> >>
> >> I seriously doubt that this 1% is the lowest hanging fruit for memory
> >> bloat on these systems. ;)
> > This is true until, in a fleet of millions of machines, you encounter a
> > one-in-a-billion chance of a stack overflow. You are then forced to
> > double the statically allocated kernel stacks on every machine, paying a
> > memory tax even though 99.999..% of threads never exceed 4K. This
> > overhead accumulates to petabytes of wasted capacity.
>
> I don't disagree with you. But, at that point, you're picking your
> poison: bugs from dynamic kernel stacks versus crashes from stack overflows.
>
> At some point, I might be able to be talked into dynamic stacks as a
> FRED-only feature. But FRED isn't widespread enough to go to the trouble
> today. I'm sure the folks who want this also don't want to wait until
> all the devices in the field have FRED because that's even *longer* off.

Why does this need to be FRED-only? True, the lack of reentrancy with
IST stacks complicates a few situations. That adds some complexity
beyond what's needed for FRED-only support, but the additional
complexity doesn't really seem like a hard blocker, at least if we
accept the complexity of kernel stack faults for FRED.

-David