From: Pasha Tatashin
> Sent: 16 March 2024 19:18
...
> Expanding on Matthew's idea of an interface for dynamic kernel stack
> sizes, here's what I'm thinking:
>
> - Kernel Threads: Create all kernel threads with a fully populated
> THREAD_SIZE stack. (i.e. 16K)
> - User Threads: Create all user threads with THREAD_SIZE kernel stack
> but only the top page mapped. (i.e. 4K)
> - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> three additional pages from the per-CPU stack cache. This function is
> called early in kernel entry points.
> - exit_to_user_mode(): Unmap the extra three pages and return them to
> the per-CPU cache. This function is called late in the kernel exit
> path.

Isn't that entirely horrid for TLB use and so will require a lot of IPI?

Remember, if a thread sleeps in 'extra stack' and is then rescheduled
on a different cpu the extra pages get 'pumped' from one cpu to
another.

I also suspect a stack_probe() is likely to end up being a cache miss
and also slow???
So you wouldn't want one on all calls.
I'm not sure you'd want a conditional branch either.

The explicit request for 'more stack' can be required to be allowed
to sleep - removing a lot of issues.
It would also be portable to all architectures.

I'd also suspect that any thread that needs extra stack is likely
to need to again.
So while the memory could be recovered, I'd bet it isn't worth
doing except under memory pressure.

The call could also return 'no' - perhaps useful for (broken) code
that insists on being recursive.

	David
On Sun, Mar 17, 2024 at 2:58 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Pasha Tatashin
> > Sent: 16 March 2024 19:18
> ...
> > Expanding on Matthew's idea of an interface for dynamic kernel stack
> > sizes, here's what I'm thinking:
> >
> > - Kernel Threads: Create all kernel threads with a fully populated
> > THREAD_SIZE stack. (i.e. 16K)
> > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > but only the top page mapped. (i.e. 4K)
> > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > three additional pages from the per-CPU stack cache. This function is
> > called early in kernel entry points.
> > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > the per-CPU cache. This function is called late in the kernel exit
> > path.
>
> Isn't that entirely horrid for TLB use and so will require a lot of IPI?

The TLB load is going to be exactly the same as today, we already use
small pages for VMA mapped stacks. We won't need to have extra
flushing either, the mappings are in the kernel space, and once pages
are removed from the page table, no one is going to access that VA
space until that thread enters the kernel again. We will need to
invalidate the VA range only when the pages are mapped, and only on
the local cpu.

> Remember, if a thread sleeps in 'extra stack' and is then rescheduled
> on a different cpu the extra pages get 'pumped' from one cpu to
> another.

Yes, the per-CPU cache can get unbalanced this way; we can remember
the original CPU where we acquired the pages and return them to the
same place.

> I also suspect a stack_probe() is likely to end up being a cache miss
> and also slow???

Can you please elaborate on this point? I am not aware of
stack_probe() or how it is used.

> So you wouldn't want one on all calls.
> I'm not sure you'd want a conditional branch either.
>
> The explicit request for 'more stack' can be required to be allowed
> to sleep - removing a lot of issues.
> It would also be portable to all architectures.
> I'd also suspect that any thread that needs extra stack is likely
> to need to again.
> So while the memory could be recovered, I'd bet it isn't worth
> doing except under memory pressure.
> The call could also return 'no' - perhaps useful for (broken) code
> that insists on being recursive.

The approach currently under discussion is somewhat different from an
explicit more-stack-request API. I am investigating how feasible it is
to use kernel stack multiplexing, so the same pages can be reused by
many threads while they are actually in use. If the multiplexing
approach doesn't work, I will come back to the explicit more-stack API.
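For illustration, here is a minimal sketch of what the entry-side hook in
this scheme might look like, assuming a per-CPU cache of pre-allocated
stack pages; all dyn_stack_*() names and the cache layout are hypothetical,
not existing kernel interfaces.

/*
 * Hypothetical sketch of the proposed expansion on kernel entry.
 * dyn_stack_map_page() and dyn_stack_flush_local() do not exist; they
 * stand in for "write the PTE" and "invalidate the VA range on this
 * CPU only".
 */
#define DYN_STACK_PAGES	3		/* 16K stack, top 4K page always mapped */

struct dyn_stack_cache {
	struct page	*pages[4 * DYN_STACK_PAGES];	/* small per-CPU pool */
	int		nr;
};
static DEFINE_PER_CPU(struct dyn_stack_cache, dyn_stack_cache);

/* Called early in enter_from_user_mode(); refill/underflow handling omitted. */
static void dyn_stack_expand(void)
{
	struct dyn_stack_cache *cache = this_cpu_ptr(&dyn_stack_cache);
	unsigned long base = (unsigned long)current->stack;
	int i;

	for (i = 0; i < DYN_STACK_PAGES; i++)
		dyn_stack_map_page(base + i * PAGE_SIZE, cache->pages[--cache->nr]);

	/*
	 * Per the argument above, only the local CPU could hold stale
	 * translations for this range, so no broadcast flush is done here.
	 */
	dyn_stack_flush_local(base, base + DYN_STACK_PAGES * PAGE_SIZE);
}

/*
 * exit_to_user_mode() would do the reverse: clear the three PTEs and put
 * the pages back into this CPU's cache.
 */

Presumably the per-CPU cache exists because the entry path runs with
interrupts disabled and cannot sleep to allocate pages on demand.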
On Mon, Mar 18, 2024 at 11:09:47AM -0400, Pasha Tatashin wrote:
> The TLB load is going to be exactly the same as today, we already use
> small pages for VMA mapped stacks. We won't need to have extra
> flushing either, the mappings are in the kernel space, and once pages
> are removed from the page table, no one is going to access that VA
> space until that thread enters the kernel again. We will need to
> invalidate the VA range only when the pages are mapped, and only on
> the local cpu.

No; we can pass pointers to our kernel stack to other threads. The
obvious one is a mutex; we put a mutex_waiter on our own stack and
add its list_head to the mutex's waiter list. I'm sure you can
think of many other places we do this (eg wait queues, poll(), select(),
etc).
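To make the mutex case concrete, a simplified sketch (not the actual
kernel/locking/mutex.c code; struct stack_waiter is a made-up stand-in for
struct mutex_waiter) of how a node living in one task's stack frame ends up
on a list that another CPU will walk:

/* Simplified stand-in for struct mutex_waiter; illustration only. */
struct stack_waiter {
	struct list_head	list;
	struct task_struct	*task;
};

static void contended_mutex_lock(struct mutex *lock)
{
	struct stack_waiter waiter;	/* lives in *our* kernel stack frame */

	raw_spin_lock(&lock->wait_lock);
	waiter.task = current;
	/* A pointer into our kernel stack escapes onto a shared list. */
	list_add_tail(&waiter.list, &lock->wait_list);
	set_current_state(TASK_UNINTERRUPTIBLE);
	raw_spin_unlock(&lock->wait_lock);

	schedule();	/* the owner, possibly on another CPU, walks
			 * lock->wait_list and touches our stack */
}

Until the owner removes the entry and wakes us, our stack pages are live
data for whichever CPU performs the unlock.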
On Mon, Mar 18, 2024 at 11:19 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Mar 18, 2024 at 11:09:47AM -0400, Pasha Tatashin wrote:
> > The TLB load is going to be exactly the same as today, we already use
> > small pages for VMA mapped stacks. We won't need to have extra
> > flushing either, the mappings are in the kernel space, and once pages
> > are removed from the page table, no one is going to access that VA
> > space until that thread enters the kernel again. We will need to
> > invalidate the VA range only when the pages are mapped, and only on
> > the local cpu.
>
> No; we can pass pointers to our kernel stack to other threads. The
> obvious one is a mutex; we put a mutex_waiter on our own stack and
> add its list_head to the mutex's waiter list. I'm sure you can
> think of many other places we do this (eg wait queues, poll(), select(),
> etc).

Hm, that means the thread is sleeping in kernel space with its stack
pages mapped and invalidated only on the local CPU, so access to those
stack pages from a remote CPU would be problematic.

I think we still won't need an IPI, but VA-range invalidation is
actually needed on unmap, and should happen during context switch,
i.e. every time we go off-CPU. Therefore, what Brian/Andy have
suggested makes more sense than the kernel enter/exit paths.

Pasha
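A rough sketch of that variant, with the shrink hooked into the switch-out
path rather than exit_to_user_mode(). The dyn_stack_*() helpers are
hypothetical; flush_tlb_one_kernel() is the x86 local-only primitive, and
the local-only flush is precisely the assumption questioned in the next
reply.

/*
 * Hypothetical: give the dynamic pages back when the task goes off-CPU
 * instead of on every kernel exit.  Only the local TLB is invalidated.
 */
static void dyn_stack_switch_out(struct task_struct *prev)
{
	unsigned long base = (unsigned long)prev->stack;
	int i;

	if (!dyn_stack_is_expanded(prev))	/* hypothetical */
		return;

	dyn_stack_clear_ptes(prev);		/* hypothetical: zap the 3 dynamic PTEs */

	/* Local-only invalidation of the three dynamic stack pages. */
	for (i = 0; i < 3; i++)
		flush_tlb_one_kernel(base + i * PAGE_SIZE);	/* x86, current CPU only */

	dyn_stack_return_pages(prev);		/* hypothetical: back to the per-CPU cache */
}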
From: Pasha Tatashin
> Sent: 18 March 2024 15:31
>
> On Mon, Mar 18, 2024 at 11:19 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Mar 18, 2024 at 11:09:47AM -0400, Pasha Tatashin wrote:
> > > The TLB load is going to be exactly the same as today, we already use
> > > small pages for VMA mapped stacks. We won't need to have extra
> > > flushing either, the mappings are in the kernel space, and once pages
> > > are removed from the page table, no one is going to access that VA
> > > space until that thread enters the kernel again. We will need to
> > > invalidate the VA range only when the pages are mapped, and only on
> > > the local cpu.
> >
> > No; we can pass pointers to our kernel stack to other threads. The
> > obvious one is a mutex; we put a mutex_waiter on our own stack and
> > add its list_head to the mutex's waiter list. I'm sure you can
> > think of many other places we do this (eg wait queues, poll(), select(),
> > etc).
>
> Hm, that means the thread is sleeping in kernel space with its stack
> pages mapped and invalidated only on the local CPU, so access to those
> stack pages from a remote CPU would be problematic.
>
> I think we still won't need an IPI, but VA-range invalidation is
> actually needed on unmap, and should happen during context switch,
> i.e. every time we go off-CPU. Therefore, what Brian/Andy have
> suggested makes more sense than the kernel enter/exit paths.

I think you'll need to broadcast an invalidate.
Consider:
CPU A: task allocates extra pages and adds something to some list.
CPU B: accesses that data and maybe modifies it.
       A page-table walk sets up the TLB entry.
CPU A: task detects the modify, removes the item from the list,
       collapses back the stack and sleeps.
       Stack pages freed.
CPU A: task wakes up (on the same cpu for simplicity).
       Goes down a deep stack and puts an item on a list.
       Different physical pages are allocated.
CPU B: accesses the associated KVA.
       It had better not have a cached TLB entry.

Doesn't that need an IPI?

Freeing the pages is much harder than allocating them.

	David
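If the freed pages can later back a different mapping while a remote CPU
still caches a translation, the unmap path has to behave like vunmap() and
use the broadcasting flush, and that broadcast is where the IPIs come from.
In the sketch below, flush_tlb_kernel_range() is the real interface vmalloc
teardown uses; the dyn_stack_*() helpers remain hypothetical.

static void dyn_stack_unmap_and_free(struct task_struct *tsk)
{
	unsigned long start = (unsigned long)tsk->stack;
	unsigned long end = start + 3 * PAGE_SIZE;	/* the three dynamic pages */

	dyn_stack_clear_ptes(tsk);		/* hypothetical */

	/*
	 * Because another CPU may have walked these PTEs (mutex waiter,
	 * wait queue entry, ...), the flush must be global.  On x86 this
	 * sends IPIs to the other CPUs - the cost being debated here.
	 */
	flush_tlb_kernel_range(start, end);

	dyn_stack_return_pages(tsk);		/* hypothetical: back to the per-CPU cache */
}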
> I think you'll need to broadcast an invalidate.
> Consider:
> CPU A: task allocates extra pages and adds something to some list.
> CPU B: accesses that data and maybe modifies it.
>        A page-table walk sets up the TLB entry.
> CPU A: task detects the modify, removes the item from the list,
>        collapses back the stack and sleeps.
>        Stack pages freed.
> CPU A: task wakes up (on the same cpu for simplicity).
>        Goes down a deep stack and puts an item on a list.
>        Different physical pages are allocated.
> CPU B: accesses the associated KVA.
>        It had better not have a cached TLB entry.
>
> Doesn't that need an IPI?

Yes, this is annoying. If we share a stack with another CPU, then get
a new stack, and share it again with another CPU, we get in trouble.

Yet, an IPI during context switch would kill performance :-\

I wonder if there is a way to optimize this scenario, like doing the
IPI invalidation only after the stack has been shared?

Pasha
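One possible shape for that optimization, purely as a sketch of the idea
rather than anything proposed in this thread: treat "slept in the kernel
while expanded" as a conservative proxy for "another CPU may have seen this
stack", and only pay for the broadcast flush in that case. The
stack_maybe_shared flag and the dyn_stack_*() helpers are invented for
illustration.

static void dyn_stack_shrink(struct task_struct *tsk)
{
	unsigned long start = (unsigned long)tsk->stack;
	unsigned long end = start + 3 * PAGE_SIZE;	/* the three dynamic pages */

	dyn_stack_clear_ptes(tsk);			/* hypothetical */

	if (tsk->stack_maybe_shared) {			/* hypothetical flag, set on schedule-out */
		/* Remote CPUs may cache these translations: broadcast (IPIs). */
		flush_tlb_kernel_range(start, end);
		tsk->stack_maybe_shared = false;
	} else {
		/* Never visible off-CPU: a local flush is enough. */
		dyn_stack_flush_local(start, end);	/* hypothetical */
	}

	dyn_stack_return_pages(tsk);			/* hypothetical */
}

Setting the flag correctly is the hard part; sleeping while expanded
over-approximates actual sharing, but it would at least keep the IPIs off
the common syscall path.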
On Mon, Mar 18, 2024 at 11:09 AM Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On Sun, Mar 17, 2024 at 2:58 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Pasha Tatashin
> > > Sent: 16 March 2024 19:18
> > ...
> > > Expanding on Matthew's idea of an interface for dynamic kernel stack
> > > sizes, here's what I'm thinking:
> > >
> > > - Kernel Threads: Create all kernel threads with a fully populated
> > > THREAD_SIZE stack. (i.e. 16K)
> > > - User Threads: Create all user threads with THREAD_SIZE kernel stack
> > > but only the top page mapped. (i.e. 4K)
> > > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > > three additional pages from the per-CPU stack cache. This function is
> > > called early in kernel entry points.
> > > - exit_to_user_mode(): Unmap the extra three pages and return them to
> > > the per-CPU cache. This function is called late in the kernel exit
> > > path.
> >
> > Isn't that entirely horrid for TLB use and so will require a lot of IPI?
>
> The TLB load is going to be exactly the same as today, we already use
> small pages for VMA mapped stacks. We won't need to have extra
> flushing either, the mappings are in the kernel space, and once pages
> are removed from the page table, no one is going to access that VA
> space until that thread enters the kernel again. We will need to
> invalidate the VA range only when the pages are mapped, and only on
> the local cpu.

The TLB miss rate is going to increase slightly, but only very
slightly, because stacks are small: 4 pages, of which only 3 are
dynamic, so at most 2-3 new misses per syscall, and only for the
complicated, deep syscalls. Therefore, I suspect it won't affect
real-world performance.

> > Remember, if a thread sleeps in 'extra stack' and is then rescheduled
> > on a different cpu the extra pages get 'pumped' from one cpu to
> > another.
>
> Yes, the per-CPU cache can get unbalanced this way; we can remember
> the original CPU where we acquired the pages and return them to the
> same place.
>
> > I also suspect a stack_probe() is likely to end up being a cache miss
> > and also slow???
>
> Can you please elaborate on this point? I am not aware of
> stack_probe() or how it is used.
>
> > So you wouldn't want one on all calls.
> > I'm not sure you'd want a conditional branch either.
> >
> > The explicit request for 'more stack' can be required to be allowed
> > to sleep - removing a lot of issues.
> > It would also be portable to all architectures.
> > I'd also suspect that any thread that needs extra stack is likely
> > to need to again.
> > So while the memory could be recovered, I'd bet it isn't worth
> > doing except under memory pressure.
> > The call could also return 'no' - perhaps useful for (broken) code
> > that insists on being recursive.
>
> The approach currently under discussion is somewhat different from an
> explicit more-stack-request API. I am investigating how feasible it is
> to use kernel stack multiplexing, so the same pages can be reused by
> many threads while they are actually in use. If the multiplexing
> approach doesn't work, I will come back to the explicit more-stack API.