From: Pasha Tatashin
> Sent: 16 March 2024 19:18
...
> Expanding on Matthew's idea of an interface for dynamic kernel stack
> sizes, here's what I'm thinking:
>
> - Kernel Threads: Create all kernel threads with a fully populated
> THREAD_SIZE stack. (i.e. 16K)
> - User Threads: Create all user threads with THREAD_SIZE kernel stack
> but only the top page mapped. (i.e. 4K)
> - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> three additional pages from the per-CPU stack cache. This function is
> called early in kernel entry points.
> - exit_to_user_mode(): Unmap the extra three pages and return them to
> the per-CPU cache. This function is called late in the kernel exit
> path.

Isn't that entirely horrid for TLB use and so will require a lot of IPI?

Remember, if a thread sleeps in 'extra stack' and is then rescheduled
on a different cpu the extra pages get 'pumped' from one cpu to
another.

I also suspect a stack_probe() is likely to end up being a cache miss
and also slow???
So you wouldn't want one on all calls.
I'm not sure you'd want a conditional branch either.

The explicit request for 'more stack' can be required to be allowed
to sleep - removing a lot of issues.
It would also be portable to all architectures.

I'd also suspect that any thread that needs extra stack is likely
to need to again.
So while the memory could be recovered, I'd bet it isn't worth
doing except under memory pressure.

The call could also return 'no' - perhaps useful for (broken) code
that insists on being recursive.

	David
On Sun, Mar 17, 2024 at 2:58 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Pasha Tatashin
> > Sent: 16 March 2024 19:18
> ...
> > Expanding on Matthew's idea of an interface for dynamic kernel stack
> > sizes, here's what I'm thinking:
> >
> > - Kernel Threads: Create all kernel threads with a fully populated
> > THREAD_SIZE stack. (i.e. 16K)
> > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > but only the top page mapped. (i.e. 4K)
> > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > three additional pages from the per-CPU stack cache. This function is
> > called early in kernel entry points.
> > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > the per-CPU cache. This function is called late in the kernel exit
> > path.
>
> Isn't that entirely horrid for TLB use and so will require a lot of IPI?

The TLB load is going to be exactly the same as today, we already use
small pages for VMA mapped stacks. We won't need to have extra
flushing either, the mappings are in the kernel space, and once pages
are removed from the page table, no one is going to access that VA
space until that thread enters the kernel again. We will need to
invalidate the VA range only when the pages are mapped, and only on
the local cpu.

> Remember, if a thread sleeps in 'extra stack' and is then rescheduled
> on a different cpu the extra pages get 'pumped' from one cpu to
> another.

Yes, the per-CPU cache can get unbalanced this way; we can remember
the original CPU where we acquired the pages and return them to the
same place.

> I also suspect a stack_probe() is likely to end up being a cache miss
> and also slow???

Can you please elaborate on this point? I am not aware of
stack_probe() or how it is used.

> So you wouldn't want one on all calls.
> I'm not sure you'd want a conditional branch either.
>
> The explicit request for 'more stack' can be required to be allowed
> to sleep - removing a lot of issues.
> It would also be portable to all architectures.
> I'd also suspect that any thread that needs extra stack is likely
> to need to again.
> So while the memory could be recovered, I'd bet it isn't worth
> doing except under memory pressure.
> The call could also return 'no' - perhaps useful for (broken) code
> that insists on being recursive.

The approach currently under discussion is somewhat different from an
explicit more-stack-request API. I am investigating how feasible it is
to use kernel stack multiplexing, so the same pages can be reused by
many threads while they are actually in use. If the multiplexing
approach doesn't work, I will come back to the explicit more-stack API.
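For illustration, here is a minimal sketch of what the entry-side hook in
this scheme might look like, assuming a per-CPU cache of pre-allocated
stack pages; all dyn_stack_*() names and the cache layout are hypothetical,
not existing kernel interfaces.

/*
 * Hypothetical sketch of the proposed expansion on kernel entry.
 * dyn_stack_map_page() and dyn_stack_flush_local() do not exist; they
 * stand in for "write the PTE" and "invalidate the VA range on this
 * CPU only".
 */
#define DYN_STACK_PAGES	3		/* 16K stack, top 4K page always mapped */

struct dyn_stack_cache {
	struct page	*pages[4 * DYN_STACK_PAGES];	/* small per-CPU pool */
	int		nr;
};
static DEFINE_PER_CPU(struct dyn_stack_cache, dyn_stack_cache);

/* Called early in enter_from_user_mode(); refill/underflow handling omitted. */
static void dyn_stack_expand(void)
{
	struct dyn_stack_cache *cache = this_cpu_ptr(&dyn_stack_cache);
	unsigned long base = (unsigned long)current->stack;
	int i;

	for (i = 0; i < DYN_STACK_PAGES; i++)
		dyn_stack_map_page(base + i * PAGE_SIZE, cache->pages[--cache->nr]);

	/*
	 * Per the argument above, only the local CPU could hold stale
	 * translations for this range, so no broadcast flush is done here.
	 */
	dyn_stack_flush_local(base, base + DYN_STACK_PAGES * PAGE_SIZE);
}

/*
 * exit_to_user_mode() would do the reverse: clear the three PTEs and put
 * the pages back into this CPU's cache.
 */

Presumably the per-CPU cache exists because the entry path runs with
interrupts disabled and cannot sleep to allocate pages on demand.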
On Mon, Mar 18, 2024 at 11:09:47AM -0400, Pasha Tatashin wrote:
> The TLB load is going to be exactly the same as today, we already use
> small pages for VMA mapped stacks. We won't need to have extra
> flushing either, the mappings are in the kernel space, and once pages
> are removed from the page table, no one is going to access that VA
> space until that thread enters the kernel again. We will need to
> invalidate the VA range only when the pages are mapped, and only on
> the local cpu.

No; we can pass pointers to our kernel stack to other threads. The
obvious one is a mutex; we put a mutex_waiter on our own stack and
add its list_head to the mutex's waiter list. I'm sure you can
think of many other places we do this (eg wait queues, poll(), select(),
etc).
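To make the mutex case concrete, a simplified sketch (not the actual
kernel/locking/mutex.c code; struct stack_waiter is a made-up stand-in for
struct mutex_waiter) of how a node living in one task's stack frame ends up
on a list that another CPU will walk:

/* Simplified stand-in for struct mutex_waiter; illustration only. */
struct stack_waiter {
	struct list_head	list;
	struct task_struct	*task;
};

static void contended_mutex_lock(struct mutex *lock)
{
	struct stack_waiter waiter;	/* lives in *our* kernel stack frame */

	raw_spin_lock(&lock->wait_lock);
	waiter.task = current;
	/* A pointer into our kernel stack escapes onto a shared list. */
	list_add_tail(&waiter.list, &lock->wait_list);
	set_current_state(TASK_UNINTERRUPTIBLE);
	raw_spin_unlock(&lock->wait_lock);

	schedule();	/* the owner, possibly on another CPU, walks
			 * lock->wait_list and touches our stack */
}

Until the owner removes the entry and wakes us, our stack pages are live
data for whichever CPU performs the unlock.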
On Mon, Mar 18, 2024 at 11:19 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Mar 18, 2024 at 11:09:47AM -0400, Pasha Tatashin wrote:
> > The TLB load is going to be exactly the same as today, we already use
> > small pages for VMA mapped stacks. We won't need to have extra
> > flushing either, the mappings are in the kernel space, and once pages
> > are removed from the page table, no one is going to access that VA
> > space until that thread enters the kernel again. We will need to
> > invalidate the VA range only when the pages are mapped, and only on
> > the local cpu.
>
> No; we can pass pointers to our kernel stack to other threads. The
> obvious one is a mutex; we put a mutex_waiter on our own stack and
> add its list_head to the mutex's waiter list. I'm sure you can
> think of many other places we do this (eg wait queues, poll(), select(),
> etc).

Hm, that means the thread is sleeping in kernel space with its stack
pages mapped and invalidated only on the local CPU, so access to those
stack pages from a remote CPU would be problematic.

I think we still won't need an IPI, but VA-range invalidation is
actually needed on unmap, and should happen during context switch,
i.e. every time we go off-CPU. Therefore, what Brian/Andy have
suggested makes more sense than the kernel enter/exit paths.

Pasha
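A rough sketch of that variant, with the shrink hooked into the switch-out
path rather than exit_to_user_mode(). The dyn_stack_*() helpers are
hypothetical; flush_tlb_one_kernel() is the x86 local-only primitive, and
the local-only flush is precisely the assumption questioned in the next
reply.

/*
 * Hypothetical: give the dynamic pages back when the task goes off-CPU
 * instead of on every kernel exit.  Only the local TLB is invalidated.
 */
static void dyn_stack_switch_out(struct task_struct *prev)
{
	unsigned long base = (unsigned long)prev->stack;
	int i;

	if (!dyn_stack_is_expanded(prev))	/* hypothetical */
		return;

	dyn_stack_clear_ptes(prev);		/* hypothetical: zap the 3 dynamic PTEs */

	/* Local-only invalidation of the three dynamic stack pages. */
	for (i = 0; i < 3; i++)
		flush_tlb_one_kernel(base + i * PAGE_SIZE);	/* x86, current CPU only */

	dyn_stack_return_pages(prev);		/* hypothetical: back to the per-CPU cache */
}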
From: Pasha Tatashin
> Sent: 18 March 2024 15:31
>
> On Mon, Mar 18, 2024 at 11:19 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Mar 18, 2024 at 11:09:47AM -0400, Pasha Tatashin wrote:
> > > The TLB load is going to be exactly the same as today, we already use
> > > small pages for VMA mapped stacks. We won't need to have extra
> > > flushing either, the mappings are in the kernel space, and once pages
> > > are removed from the page table, no one is going to access that VA
> > > space until that thread enters the kernel again. We will need to
> > > invalidate the VA range only when the pages are mapped, and only on
> > > the local cpu.
> >
> > No; we can pass pointers to our kernel stack to other threads. The
> > obvious one is a mutex; we put a mutex_waiter on our own stack and
> > add its list_head to the mutex's waiter list. I'm sure you can
> > think of many other places we do this (eg wait queues, poll(), select(),
> > etc).
>
> Hm, that means the thread is sleeping in kernel space with its stack
> pages mapped and invalidated only on the local CPU, so access to those
> stack pages from a remote CPU would be problematic.
>
> I think we still won't need an IPI, but VA-range invalidation is
> actually needed on unmap, and should happen during context switch,
> i.e. every time we go off-CPU. Therefore, what Brian/Andy have
> suggested makes more sense than the kernel enter/exit paths.

I think you'll need to broadcast an invalidate.
Consider:
CPU A: task allocates extra pages and adds something to some list.
CPU B: accesses that data and maybe modifies it.
       A page-table walk sets up the TLB entry.
CPU A: task detects the modify, removes the item from the list,
       collapses back the stack and sleeps.
       Stack pages freed.
CPU A: task wakes up (on the same cpu for simplicity).
       Goes down a deep stack and puts an item on a list.
       Different physical pages are allocated.
CPU B: accesses the associated KVA.
       It had better not have a cached TLB entry.

Doesn't that need an IPI?

Freeing the pages is much harder than allocating them.

	David
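If the freed pages can later back a different mapping while a remote CPU
still caches a translation, the unmap path has to behave like vunmap() and
use the broadcasting flush, and that broadcast is where the IPIs come from.
In the sketch below, flush_tlb_kernel_range() is the real interface vmalloc
teardown uses; the dyn_stack_*() helpers remain hypothetical.

static void dyn_stack_unmap_and_free(struct task_struct *tsk)
{
	unsigned long start = (unsigned long)tsk->stack;
	unsigned long end = start + 3 * PAGE_SIZE;	/* the three dynamic pages */

	dyn_stack_clear_ptes(tsk);		/* hypothetical */

	/*
	 * Because another CPU may have walked these PTEs (mutex waiter,
	 * wait queue entry, ...), the flush must be global.  On x86 this
	 * sends IPIs to the other CPUs - the cost being debated here.
	 */
	flush_tlb_kernel_range(start, end);

	dyn_stack_return_pages(tsk);		/* hypothetical: back to the per-CPU cache */
}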
> I think you'll need to broadcast an invalidate.
> Consider:
> CPU A: task allocates extra pages and adds something to some list.
> CPU B: accesses that data and maybe modifies it.
>        A page-table walk sets up the TLB entry.
> CPU A: task detects the modify, removes the item from the list,
>        collapses back the stack and sleeps.
>        Stack pages freed.
> CPU A: task wakes up (on the same cpu for simplicity).
>        Goes down a deep stack and puts an item on a list.
>        Different physical pages are allocated.
> CPU B: accesses the associated KVA.
>        It had better not have a cached TLB entry.
>
> Doesn't that need an IPI?

Yes, this is annoying. If we share a stack with another CPU, then get
a new stack, and share it again with another CPU, we get in trouble.

Yet, an IPI during context switch would kill performance :-\

I wonder if there is a way to optimize this scenario, like doing the
IPI invalidation only after the stack has been shared?

Pasha
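One possible shape for that optimization, purely as a sketch of the idea
rather than anything proposed in this thread: treat "slept in the kernel
while expanded" as a conservative proxy for "another CPU may have seen this
stack", and only pay for the broadcast flush in that case. The
stack_maybe_shared flag and the dyn_stack_*() helpers are invented for
illustration.

static void dyn_stack_shrink(struct task_struct *tsk)
{
	unsigned long start = (unsigned long)tsk->stack;
	unsigned long end = start + 3 * PAGE_SIZE;	/* the three dynamic pages */

	dyn_stack_clear_ptes(tsk);			/* hypothetical */

	if (tsk->stack_maybe_shared) {			/* hypothetical flag, set on schedule-out */
		/* Remote CPUs may cache these translations: broadcast (IPIs). */
		flush_tlb_kernel_range(start, end);
		tsk->stack_maybe_shared = false;
	} else {
		/* Never visible off-CPU: a local flush is enough. */
		dyn_stack_flush_local(start, end);	/* hypothetical */
	}

	dyn_stack_return_pages(tsk);			/* hypothetical */
}

Setting the flag correctly is the hard part; sleeping while expanded
over-approximates actual sharing, but it would at least keep the IPIs off
the common syscall path.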
On Mon, Mar 18, 2024 at 11:09 AM Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On Sun, Mar 17, 2024 at 2:58 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Pasha Tatashin
> > > Sent: 16 March 2024 19:18
> > ...
> > > Expanding on Matthew's idea of an interface for dynamic kernel stack
> > > sizes, here's what I'm thinking:
> > >
> > > - Kernel Threads: Create all kernel threads with a fully populated
> > > THREAD_SIZE stack. (i.e. 16K)
> > > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > > but only the top page mapped. (i.e. 4K)
> > > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > > three additional pages from the per-CPU stack cache. This function is
> > > called early in kernel entry points.
> > > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > > the per-CPU cache. This function is called late in the kernel exit
> > > path.
> >
> > Isn't that entirely horrid for TLB use and so will require a lot of IPI?
>
> The TLB load is going to be exactly the same as today, we already use
> small pages for VMA mapped stacks. We won't need to have extra
> flushing either, the mappings are in the kernel space, and once pages
> are removed from the page table, no one is going to access that VA
> space until that thread enters the kernel again. We will need to
> invalidate the VA range only when the pages are mapped, and only on
> the local cpu.

The TLB miss rate is going to increase slightly, but only very
slightly, because stacks are small: 4 pages, of which only 3 are
dynamic, so at most 2-3 new misses per syscall, and only for the
complicated, deep syscalls. Therefore, I suspect it won't affect
real-world performance.

> > Remember, if a thread sleeps in 'extra stack' and is then rescheduled
> > on a different cpu the extra pages get 'pumped' from one cpu to
> > another.
>
> Yes, the per-CPU cache can get unbalanced this way; we can remember
> the original CPU where we acquired the pages and return them to the
> same place.
>
> > I also suspect a stack_probe() is likely to end up being a cache miss
> > and also slow???
>
> Can you please elaborate on this point? I am not aware of
> stack_probe() or how it is used.
>
> > So you wouldn't want one on all calls.
> > I'm not sure you'd want a conditional branch either.
> >
> > The explicit request for 'more stack' can be required to be allowed
> > to sleep - removing a lot of issues.
> > It would also be portable to all architectures.
> > I'd also suspect that any thread that needs extra stack is likely
> > to need to again.
> > So while the memory could be recovered, I'd bet it isn't worth
> > doing except under memory pressure.
> > The call could also return 'no' - perhaps useful for (broken) code
> > that insists on being recursive.
>
> The approach currently under discussion is somewhat different from an
> explicit more-stack-request API. I am investigating how feasible it is
> to use kernel stack multiplexing, so the same pages can be reused by
> many threads while they are actually in use. If the multiplexing
> approach doesn't work, I will come back to the explicit more-stack API.