[PATCH] fork: stop ignoring NUMA while handling cached thread stacks

Posted by Mateusz Guzik 2 weeks ago
The node id was mishandled in two ways:
1. the numa parameter was straight up ignored
2. nothing was done to check if the to-be-cached/allocated stack matches
   the local node

The node id remains ignored on free in the case of memoryless nodes.

Note the current caching is already bad as the cache keeps overflowing
and a different solution is needed for the long run, to be worked
out(tm).

Stats collected over a kernel build with the patch with the following
topology:
  NUMA node(s):              2
  NUMA node0 CPU(s):         0-11
  NUMA node1 CPU(s):         12-23

caller's node vs stack backing pages on free:
matching:	50083 (70%)
mismatched:	21492 (30%)

caching efficiency:
cached:		32651 (65.2%)
dropped:	17432 (34.8%)

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---

I lifted page node id checking out of vmalloc, I presume it works(tm).

 kernel/fork.c | 55 +++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 45 insertions(+), 10 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index f1857672426e..9448582737ff 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -208,15 +208,54 @@ struct vm_stack {
 	struct vm_struct *stack_vm_area;
 };
 
+static struct vm_struct *alloc_thread_stack_node_from_cache(struct task_struct *tsk, int node)
+{
+	struct vm_struct *vm_area;
+	unsigned int i;
+
+	/*
+	 * If the node has memory, we are guaranteed the stacks are backed by local pages.
+	 * Otherwise the pages are arbitrary.
+	 *
+	 * Note that depending on cpuset it is possible we will get migrated to a different
+	 * node immediately after allocating here, so this does *not* guarantee locality for
+	 * arbitrary callers.
+	 */
+	scoped_guard(preempt) {
+		if (node != NUMA_NO_NODE && numa_node_id() != node)
+			return NULL;
+
+		for (i = 0; i < NR_CACHED_STACKS; i++) {
+			vm_area = this_cpu_xchg(cached_stacks[i], NULL);
+			if (vm_area)
+				return vm_area;
+		}
+	}
+
+	return NULL;
+}
+
 static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
 {
 	unsigned int i;
+	int nid;
+
+	scoped_guard(preempt) {
+		nid = numa_node_id();
+		if (node_state(nid, N_MEMORY)) {
+			for (i = 0; i < vm_area->nr_pages; i++) {
+				struct page *page = vm_area->pages[i];
+				if (page_to_nid(page) != nid)
+					return false;
+			}
+		}
 
-	for (i = 0; i < NR_CACHED_STACKS; i++) {
-		struct vm_struct *tmp = NULL;
+		for (i = 0; i < NR_CACHED_STACKS; i++) {
+			struct vm_struct *tmp = NULL;
 
-		if (this_cpu_try_cmpxchg(cached_stacks[i], &tmp, vm_area))
-			return true;
+			if (this_cpu_try_cmpxchg(cached_stacks[i], &tmp, vm_area))
+				return true;
+		}
 	}
 	return false;
 }
@@ -283,13 +322,9 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
 	struct vm_struct *vm_area;
 	void *stack;
-	int i;
-
-	for (i = 0; i < NR_CACHED_STACKS; i++) {
-		vm_area = this_cpu_xchg(cached_stacks[i], NULL);
-		if (!vm_area)
-			continue;
 
+	vm_area = alloc_thread_stack_node_from_cache(tsk, node);
+	if (vm_area) {
 		if (memcg_charge_kernel_stack(vm_area)) {
 			vfree(vm_area->addr);
 			return -ENOMEM;
-- 
2.48.1
Re: [PATCH] fork: stop ignoring NUMA while handling cached thread stacks
Posted by Linus Walleij 1 week, 6 days ago
Hi Mateusz,

excellent initiative!

I had this on some TODO-list, really nice to see that you
picked it up.

The patch looks solid, just some questions:

On Mon, Nov 17, 2025 at 3:08 PM Mateusz Guzik <mjguzik@gmail.com> wrote:

> Note the current caching is already bad as the cache keeps overflowing
> and a different solution is needed for the long run, to be worked
> out(tm).

That isn't very strange since we just have 2 stacks in the cache.

The best I can think of is to scale the number of cached stacks as
a function of free physical memory and process fork rate: if we have
much memory (for some definition of "much") and we are forking a lot, we
should keep some more stacks around; if the fork rate goes down,
or we are low on memory compared to the stack size, we should
dynamically scale down the stack cache size. (OTOMH)

> +static struct vm_struct *alloc_thread_stack_node_from_cache(struct task_struct *tsk, int node)
> +{
> +       struct vm_struct *vm_area;
> +       unsigned int i;
> +
> +       /*
> +        * If the node has memory, we are guaranteed the stacks are backed by local pages.
> +        * Otherwise the pages are arbitrary.
> +        *
> +        * Note that depending on cpuset it is possible we will get migrated to a different
> +        * node immediately after allocating here, so this does *not* guarantee locality for
> +        * arbitrary callers.
> +        */
> +       scoped_guard(preempt) {
> +               if (node != NUMA_NO_NODE && numa_node_id() != node)
> +                       return NULL;
> +
> +               for (i = 0; i < NR_CACHED_STACKS; i++) {
> +                       vm_area = this_cpu_xchg(cached_stacks[i], NULL);
> +                       if (vm_area)
> +                               return vm_area;

So we check each stack slot in order to see if we can find one which isn't
NULL, and we can use this_cpu_xchg() because nothing can contest
this here as we are under the preempt guard, so if we get a !NULL
vm_area we know we are good, right?

>  static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
>  {
>         unsigned int i;
> +       int nid;
> +
> +       scoped_guard(preempt) {
> +               nid = numa_node_id();
> +               if (node_state(nid, N_MEMORY)) {
> +                       for (i = 0; i < vm_area->nr_pages; i++) {
> +                               struct page *page = vm_area->pages[i];
> +                               if (page_to_nid(page) != nid)
> +                                       return false;
> +                       }
> +               }

I would maybe add a comment saying:

"if we have node-local memory, don't even bother to cache a stack
if any page of it isn't on the same node, we only want clean local
node stacks"

(I guess that is the semantic you wanted.)

>
> -       for (i = 0; i < NR_CACHED_STACKS; i++) {
> -               struct vm_struct *tmp = NULL;
> +               for (i = 0; i < NR_CACHED_STACKS; i++) {
> +                       struct vm_struct *tmp = NULL;
>
> -               if (this_cpu_try_cmpxchg(cached_stacks[i], &tmp, vm_area))
> -                       return true;
> +                       if (this_cpu_try_cmpxchg(cached_stacks[i], &tmp, vm_area))
> +                               return true;

So since this now is under the preemption guard, this will always
succeed, right? I understand that using this_cpu_try_cmpxchg() is
the idiom, but just asking so I don't miss something else
possibly contesting the stacks here.

If the code should have the same style as alloc_thread_stack_node_from_cache()
I suppose it should be:

for (i = 0; i < NR_CACHED_STACKS; i++) {
        struct vm_struct *tmp = NULL;
        if (!this_cpu_cmpxchg(cached_stacks[i], &tmp, vm_area))
                return true;

Since if it managed to exchange the old value NULL for
the value of vm_area then it is returning NULL on success.
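
For reference, the two per-cpu idioms differ in what they return:
this_cpu_cmpxchg() returns the value the slot held before the operation, while
this_cpu_try_cmpxchg() returns a success bool and writes the observed value
back through its "old" argument on failure. A userspace analogue (plain
loads/stores stand in for the per-cpu ops, which is safe in the kernel version
only because preemption is disabled):

```c
/* Userspace analogues of the two per-cpu idioms being discussed.
 * Plain pointer accesses stand in for the per-cpu primitives. */

struct vm_struct;

/* cmpxchg flavour: returns the OLD value of the slot. */
static struct vm_struct *slot_cmpxchg(struct vm_struct **slot,
				      struct vm_struct *old,
				      struct vm_struct *new)
{
	struct vm_struct *prev = *slot;

	if (prev == old)
		*slot = new;
	return prev;
}

/* try_cmpxchg flavour: returns true on success; on failure it
 * writes the observed value back through @old. */
static _Bool slot_try_cmpxchg(struct vm_struct **slot,
			      struct vm_struct **old,
			      struct vm_struct *new)
{
	struct vm_struct *prev = *slot;

	if (prev == *old) {
		*slot = new;
		return 1;
	}
	*old = prev;
	return 0;
}
```

Caching a stack into an empty slot succeeds when the old value was NULL, so
"try_cmpxchg returned true" and "cmpxchg returned NULL" express the same
condition; mixing the two conventions is what makes a rewrite look like it
flips the return value.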

If I understood correctly +/- the above code style change:
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>

Yours,
Linus Walleij
Re: [PATCH] fork: stop ignoring NUMA while handling cached thread stacks
Posted by Mateusz Guzik 1 week, 5 days ago
On Tue, Nov 18, 2025 at 10:15:04PM +0100, Linus Walleij wrote:
> On Mon, Nov 17, 2025 at 3:08 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> 
> > Note the current caching is already bad as the cache keeps overflowing
> > and a different solution is needed for the long run, to be worked
> > out(tm).
> 
> That isn't very strange since we just have 2 stacks in the cache.
> 
> The best I can think of is to scale the number of cached stacks to
> a function of free physical memory and process fork rate, if we have
> much memory (for some definition of) and we are forking a lot we
> should keep some more stacks around, if the forkrate goes down
> or we are low on memory compared to the stack size we should
> dynamically scale down the stack cache size. (OTOMH)
> 

I mentioned the cache problem when writing the patch $elsewhere and an
idea was floated of implementing vmalloc-level caching. One person
claimed they are going to look into it, but I don't know how serious it
is.

Even so, my take on the ordeal is that per-cpu level caching for
something like thread stacks is a waste of resources.

Stacks are only allocated for threads (go figure). Threads are allocated
using a per-cpu cache and then proceed to globally serialize in 3
different spots at the moment (one can be elided, does not change the
point). One of the locks is tasklist and I don't see anyone removing
that problem in the foreseeable future.

So there is no real win from per-cpu caching for threads to begin with.

Instead, a cache with a granularity of n cpus (say 8) would be more
memory-efficient *and* still not reduce scalability due to
aforementioned bottlenecks.

All that said, I'm not working on it. :)
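
The bucketing suggested above (one cache shared by a group of CPUs rather than
one per CPU) could look roughly like this. Purely a sketch: none of these names
exist in the kernel, the group size of 8 is taken from the mail, and the slots
become atomic because CPUs within a group can now race, unlike the per-cpu
version where disabling preemption suffices:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical per-group stack cache: one bucket shared by
 * CPUS_PER_STACK_CACHE cpus.  Illustrative only. */

#define CPUS_PER_STACK_CACHE	8
#define NR_CACHED_STACKS	2

struct stack_cache {
	_Atomic(void *) slots[NR_CACHED_STACKS];
};

/* Map a CPU id to its shared bucket. */
static unsigned int stack_cache_bucket(unsigned int cpu)
{
	return cpu / CPUS_PER_STACK_CACHE;
}

/* Try to stash a stack in the caller's bucket; false means all
 * slots were taken and the stack should be freed, as today. */
static _Bool stack_cache_put(struct stack_cache *cache, void *stack)
{
	for (unsigned int i = 0; i < NR_CACHED_STACKS; i++) {
		void *expected = NULL;

		if (atomic_compare_exchange_strong(&cache->slots[i],
						   &expected, stack))
			return 1;
	}
	return 0;
}

/* Grab any cached stack from the bucket, or NULL if it is empty. */
static void *stack_cache_get(struct stack_cache *cache)
{
	for (unsigned int i = 0; i < NR_CACHED_STACKS; i++) {
		void *stack = atomic_exchange(&cache->slots[i], NULL);

		if (stack)
			return stack;
	}
	return NULL;
}
```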

> > +static struct vm_struct *alloc_thread_stack_node_from_cache(struct task_struct *tsk, int node)
> > +{
> > +       struct vm_struct *vm_area;
> > +       unsigned int i;
> > +
> > +       /*
> > +        * If the node has memory, we are guaranteed the stacks are backed by local pages.
> > +        * Otherwise the pages are arbitrary.
> > +        *
> > +        * Note that depending on cpuset it is possible we will get migrated to a different
> > +        * node immediately after allocating here, so this does *not* guarantee locality for
> > +        * arbitrary callers.
> > +        */
> > +       scoped_guard(preempt) {
> > +               if (node != NUMA_NO_NODE && numa_node_id() != node)
> > +                       return NULL;
> > +
> > +               for (i = 0; i < NR_CACHED_STACKS; i++) {
> > +                       vm_area = this_cpu_xchg(cached_stacks[i], NULL);
> > +                       if (vm_area)
> > +                               return vm_area;
> 
> So we check each stack slot in order to see if we can find one which isn't
> NULL, and we can use this_cpu_xchg() because nothing can contest
> this here as we are under the preempt guard, so we will get a !NULL
> vm_area then we know we are good, right?
> 

This code is the same as in the original loop.

> >  static bool try_release_thread_stack_to_cache(struct vm_struct *vm_area)
> >  {
> >         unsigned int i;
> > +       int nid;
> > +
> > +       scoped_guard(preempt) {
> > +               nid = numa_node_id();
> > +               if (node_state(nid, N_MEMORY)) {
> > +                       for (i = 0; i < vm_area->nr_pages; i++) {
> > +                               struct page *page = vm_area->pages[i];
> > +                               if (page_to_nid(page) != nid)
> > +                                       return false;
> > +                       }
> > +               }
> 
> I would maybe add a comment saying:
> 
> "if we have node-local memory, don't even bother to cache a stack
> if any page of it isn't on the same node, we only want clean local
> node stacks"
> 
> (I guess that is the semantic you wanted.)
> 

I'll add something to that extent, maybe like this:
/*
 * alloc_thread_stack_node_from_cache() assumes stacks are fully backed
 * by the local node, provided it has memory.
 */

> >
> > -       for (i = 0; i < NR_CACHED_STACKS; i++) {
> > -               struct vm_struct *tmp = NULL;
> > +               for (i = 0; i < NR_CACHED_STACKS; i++) {
> > +                       struct vm_struct *tmp = NULL;
> >
> > -               if (this_cpu_try_cmpxchg(cached_stacks[i], &tmp, vm_area))
> > -                       return true;
> > +                       if (this_cpu_try_cmpxchg(cached_stacks[i], &tmp, vm_area))
> > +                               return true;
> 
> So since this now is under the preemption guard, this will always
> succeed, right? I understand that using this_cpu_try_cmpxchg() is
> the idiom, but just asking so I don't miss something else
> possibly contesting the stacks here.
> 

I think so, but unfortunately the typical expectation is that routines
are callable from any context, which I'm retaining here.

If one were to modify this to drop that behavior, asserts would have to
be added that this is only called from task context.

> If the code should have the same style as alloc_thread_stack_node_from_cache()
> I suppose it should be:
> 
> for (i = 0; i < NR_CACHED_STACKS; i++) {
>         struct vm_struct *tmp = NULL;
>         if (!this_cpu_cmpxchg(cached_stacks[i], &tmp, vm_area))
>                 return true;
> 
> Since if it managed to exchange the old value NULL for
> the value of vm_area then it is returning NULL on success.


This bit I don't follow. Seems like this flips the return value?

My patch aimed to be about as minimal as it gets to damage-control the
numa problem, so I kept everything as close to "as is" as it gets.

> 
> If I understood correctly +/- the above code style change:
> Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
> 
> Yours,
> Linus Walleij