[v2] Dynamic Kernel Stacks

[PATCH v2 00/13] Dynamic Kernel Stacks

Posted by David Stevens 1 month, 3 weeks ago

This RFC is a continuation of Pasha Tatashin's original RFC [1], and is
based on Linus Walleij's rebased version of the patches [2]. My focus
was x86_64 devices, so I didn't include his arm64 WIP patches.

The impetus for reviving this RFC is kernel stack usage on Android. On
regular Android (i.e. non-wear/automotive), system processes typically
have 2000-3000 threads. Once threads from app processes are added,
systems with 4GB of memory spend 1-2% of total memory on kernel thread
stacks. Dynamic kernel stacks reduce that usage by 65%-70%.
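
As a rough check on those numbers, here is a userspace sketch of the
arithmetic. It assumes 16 KiB per kernel stack (the usual x86_64
THREAD_SIZE without KASAN); the function names and the thread counts in
the note below are illustrative, not taken from the series.

```c
#include <stdint.h>

/* Assumption: 16 KiB per kernel stack (x86_64 THREAD_SIZE, no KASAN). */
#define MODEL_STACK_BYTES (16ULL * 1024)

/* Total bytes consumed by fully populated kernel stacks. */
uint64_t stack_bytes_total(uint64_t nr_threads)
{
	return nr_threads * MODEL_STACK_BYTES;
}

/*
 * Share of total memory used by kernel stacks, in basis points
 * (1 bp = 0.01%), rounded down.
 */
uint64_t stack_share_bp(uint64_t nr_threads, uint64_t total_bytes)
{
	return stack_bytes_total(nr_threads) * 10000 / total_bytes;
}
```

Under that assumption, roughly 2600 threads on a 4 GiB system comes out
just under 1% and roughly 5200 threads just under 2%, which matches the
range quoted above once app threads are counted.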

The main change compared to Pasha's v1 RFC is how x86_64 handles kernel
stack faults. On systems where FRED is available, it handles kernel page
faults on stack level 1. When FRED isn't available, it uses a dedicated
IST stack for page faults. In both cases, page faults which aren't
dynamic stack faults are moved back onto the regular kernel stack. This
does introduce some overhead for page faults on user memory that
originate in the kernel (note that non-FRED systems already needed to
bounce userspace page faults through the entry stack), but such faults
aren't as hot a path as regular user page faults. There are certainly
systems where the memory savings are worth the overhead. That said, the
feature could be put behind a config option so that systems can choose
to pay the memory cost and avoid the CPU overhead.
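
The routing decision described above can be modelled in userspace as
follows. This is only a sketch under stated assumptions: the struct, the
enum and the address-range check are hypothetical and stand in for
whatever test the series actually uses to recognise a dynamic stack
fault. It assumes a stack that grows down and is populated from the top.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome of the early page fault triage. */
enum fault_route {
	HANDLE_ON_FAULT_STACK,	/* dynamic stack fault: back the page */
	MOVE_TO_TASK_STACK,	/* anything else: regular fault path */
};

/* Hypothetical model of one task's stack area. */
struct model_stack {
	uintptr_t base;		/* lowest address of the reserved area */
	uintptr_t size;		/* reserved size in bytes */
	uintptr_t populated;	/* bytes already backed, from the top */
};

enum fault_route route_fault(const struct model_stack *s, uintptr_t addr)
{
	uintptr_t top = s->base + s->size;
	uintptr_t lowest_backed = top - s->populated;
	bool in_stack = addr >= s->base && addr < top;

	/* Only a fault in the not-yet-backed part of the stack qualifies. */
	if (in_stack && addr < lowest_backed)
		return HANDLE_ON_FAULT_STACK;

	return MOVE_TO_TASK_STACK;
}
```

The point of the model is that only the first branch needs to run on the
FRED stack level 1 or IST stack; every other fault pays the extra cost
of being moved back to the regular kernel stack.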

The biggest open issue is how to deal with reliability. This series uses
GFP_ATOMIC when refilling the per-CPU magazines during context switch,
which is necessary to avoid deadlock. This of course raises concerns
about allocation failure. The failure sequence would be: a magazine is
depleted, the refill fails because the atomic reserves are exhausted,
and then another thread triggers a dynamic stack fault, which becomes a
fatal page fault.
There is also a secondary concern about additional pressure on the
memory reserves causing allocation failures at other atomic call sites.
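
The failure sequence can be illustrated with a small userspace model.
Everything here is hypothetical: the capacity, the names, and the use of
a simple counter in place of the atomic reserves. It is not the
allocator logic from the series, only the ordering of events.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAG_CAPACITY 4	/* hypothetical magazine size, in pages */

/* Hypothetical per-CPU magazine: just a count of preallocated pages. */
struct magazine {
	size_t count;
};

/*
 * Models the refill at context switch. *reserve stands in for the
 * pages an atomic allocation could still obtain. Returns true if the
 * magazine is full afterwards.
 */
bool magazine_refill(struct magazine *m, size_t *reserve)
{
	while (m->count < MAG_CAPACITY && *reserve > 0) {
		(*reserve)--;
		m->count++;
	}
	return m->count == MAG_CAPACITY;
}

/*
 * Models a dynamic stack fault taking one page from the magazine.
 * Returns false when the magazine is empty, which corresponds to the
 * fatal page fault case.
 */
bool magazine_take(struct magazine *m)
{
	if (m->count == 0)
		return false;
	m->count--;
	return true;
}
```

In this model the fatal case needs all three events in a row: enough
magazine_take() calls to empty the magazine, a magazine_refill() that
finds the reserve at zero, and then one more magazine_take().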

The question is then: is this approach something that is fundamentally
untenable in the kernel, or are there compromises that would allow it to
be merged? One obvious compromise is to make the feature optional. Both
kernel stack faults and running out of memory reserves are rare events.
I've never seen this failure in my testing, although I don't have field
data to back that up at this point. Some sysadmins may view it as low
enough risk to be worth the memory savings. There are also additional
measures that could be taken to reduce the likelihood of failure (e.g.
magazine management on kernel entry/exit, tunable magazine sizes, adding
best-effort trylock reclaim or oom kill).

This series was developed and tested on devices running 6.18 kernels. It
has been rebased onto 7.0, with minimal smoke testing after rebasing.

[1] https://lore.kernel.org/all/20240311164638.2015063-1-pasha.tatashin@soleen.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-integrator.git/log/?h=b4/aarch64-dynamic-kernel-stacks-v6.18-rc1

David Stevens (7):
fork: Don't assume fully populated stack during reuse
fork: Move vm_stack to the beginning of the stack
fork: Move vmap stack freeing to work queue
fork: Store task pointer in unpopulated stack ptes
x86/entry/fred: encode frame pointer on entry
x86: Add support for dynamic kernel stacks via FRED
x86: Add support for dynamic kernel stacks via IST

Pasha Tatashin (6):
fork: Remove assumption that vm_area->nr_pages equals to THREAD_SIZE
fork: separate vmap stack allocation and free calls
mm/vmalloc: Add a get_vm_area_node() and vmap_pages_range() public
functions
fork: Dynamic Kernel Stacks
task_stack.h: Add stack_not_used() support for dynamic stack
fork: Dynamic Kernel Stack accounting

base-commit: 028ef9c96e96197026887c0f092424679298aae8
--
2.54.0.rc2.544.gc7ae2d5bb8-goog