...
> - exit_to_user_mode(): Unmap the extra three pages and return them to
>   the per-CPU cache. This function is called late in the kernel exit
>   path.

Why bother?
The number of tasks running in user mode is limited to the number
of CPUs, so the most you save is a few pages per CPU.

Plausibly a context switch from an interrupt (eg timer tick)
could suspend a task without saving anything on its kernel stack.
But how common is that in reality?
In a well-behaved system most user threads will be sleeping on
some event - so with an active kernel stack.

I can also imagine that something like sys_epoll() actually
sleeps with not (that much) stack allocated.
But the calls into all the drivers to check the status
could easily go into another page.
You really wouldn't want to keep allocating and deallocating
physical pages (which I'm sure has TLB flushing costs)
all the time for those processes.

Perhaps a 'garbage collection' activity that reclaims stack
pages from processes that have been asleep 'for a while' or
haven't used a lot of stack recently (if the hw 'page accessed'
bit can be used) might make more sense.

Have you done any instrumentation to see which system calls
are actually using more than (say) 8k of stack?
And how often the user threads that make those calls do so?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
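A minimal sketch of the kind of per-CPU cache the quoted exit_to_user_mode()
step could return pages to - not the code from the series; the names
(stack_page_cache, STACK_CACHE_MAX, the helpers) and the GFP flags are
illustrative assumptions:

/*
 * Illustrative per-CPU cache of spare stack pages.  The exit-to-user
 * path would stack_page_put() the pages it unmaps, and the stack fault
 * handler would stack_page_get() instead of hitting the page allocator.
 * Callers are assumed to run with preemption disabled.
 */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/percpu.h>

#define STACK_CACHE_MAX	8	/* pages kept per CPU; arbitrary */

struct stack_page_cache {
	struct page *page[STACK_CACHE_MAX];
	unsigned int nr;
};

static DEFINE_PER_CPU(struct stack_page_cache, stack_page_cache);

/* Take a cached page, or fall back to the allocator in atomic context. */
static struct page *stack_page_get(void)
{
	struct stack_page_cache *c = this_cpu_ptr(&stack_page_cache);

	if (c->nr)
		return c->page[--c->nr];
	return alloc_page(GFP_ATOMIC | __GFP_ZERO);
}

/* Return an unmapped stack page to the cache, or free it if full. */
static void stack_page_put(struct page *page)
{
	struct stack_page_cache *c = this_cpu_ptr(&stack_page_cache);

	if (c->nr < STACK_CACHE_MAX)
		c->page[c->nr++] = page;
	else
		__free_page(page);
}

The point of the cache is exactly the cost David raises below: recycling a
page within a CPU avoids a round trip through the page allocator, though it
does nothing for the TLB-flush side of repeated map/unmap.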
On Mon, Mar 18, 2024 at 11:39 AM David Laight <David.Laight@aculab.com> wrote:
>
> ...
> > - exit_to_user_mode(): Unmap the extra three pages and return them to
> >   the per-CPU cache. This function is called late in the kernel exit
> >   path.
>
> Why bother?
> The number of tasks running in user mode is limited to the number
> of CPUs, so the most you save is a few pages per CPU.
>
> Plausibly a context switch from an interrupt (eg timer tick)
> could suspend a task without saving anything on its kernel stack.
> But how common is that in reality?
> In a well-behaved system most user threads will be sleeping on
> some event - so with an active kernel stack.
>
> I can also imagine that something like sys_epoll() actually
> sleeps with not (that much) stack allocated.
> But the calls into all the drivers to check the status
> could easily go into another page.
> You really wouldn't want to keep allocating and deallocating
> physical pages (which I'm sure has TLB flushing costs)
> all the time for those processes.
>
> Perhaps a 'garbage collection' activity that reclaims stack
> pages from processes that have been asleep 'for a while' or
> haven't used a lot of stack recently (if the hw 'page accessed'
> bit can be used) might make more sense.
>
> Have you done any instrumentation to see which system calls
> are actually using more than (say) 8k of stack?
> And how often the user threads that make those calls do so?

None of our syscalls, AFAIK.

Pasha

>
> 	David
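One way to answer the instrumentation question is to check, at syscall exit,
how much of the current stack has ever been dirtied, in the spirit of what
CONFIG_DEBUG_STACK_USAGE already does.  The sketch below is illustrative
only: it assumes a conventional fully-mapped, zero-initialized stack that
grows down, and stack_bytes_used()/report_deep_stack() are made-up names:

/*
 * Illustrative only: estimate how much of a task's stack has ever been
 * written.  Assumes the stack was zeroed at allocation, is fully mapped,
 * and grows down, so the untouched region sits at the low end.
 */
#include <linux/printk.h>
#include <linux/sched.h>
#include <linux/sched/task_stack.h>

static unsigned long stack_bytes_used(struct task_struct *tsk)
{
	unsigned long *base = task_stack_page(tsk);	/* lowest address */
	unsigned long *end = base + THREAD_SIZE / sizeof(unsigned long);
	unsigned long *p = base;

	/* Find the first word (from the bottom) that was ever dirtied. */
	while (p < end && !*p)
		p++;

	return (end - p) * sizeof(unsigned long);
}

/* Could be called from syscall exit to flag deep stack users. */
static void report_deep_stack(int syscall_nr)
{
	unsigned long used = stack_bytes_used(current);

	if (used > 2 * PAGE_SIZE)	/* "more than (say) 8k" on x86-64 */
		pr_info_ratelimited("syscall %d used %lu bytes of stack\n",
				    syscall_nr, used);
}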
> > Perhaps a 'garbage collection' activity that reclaims stack
> > pages from processes that have been asleep 'for a while' or
> > haven't used a lot of stack recently (if the hw 'page accessed'
> > bit can be used) might make more sense.

Interesting approach: we could take Andy's original suggestion of using
the accessed bit to find which stack pages were never used and unmap them
at context switch, and as an extra optimization add a "garbage collector"
that unmaps stack pages of long-sleeping, rarely used threads. I will
think about this.

Thanks,
Pasha
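A rough sketch of what the accessed-bit idea could look like, not an
implementation from the series: apply_to_existing_page_range() is an
existing helper for walking already-populated kernel PTEs, but the scan
structure, the choice of which pages to skip, and the reclaim policy are
all assumptions, and locking, TLB flushing, and the race with the task
waking up are ignored:

/*
 * Hand-wavy sketch of the accessed-bit idea.  Walk the PTEs backing a
 * sleeping task's vmalloc'ed stack and unmap any extra page whose
 * accessed bit is clear, leaving the top page (where a shallow stack
 * lives) alone.  A real version must flush the TLB and synchronize
 * against the task being scheduled again.
 */
#include <linux/mm.h>
#include <linux/pgtable.h>
#include <linux/sched/task_stack.h>

struct stack_scan {
	unsigned int reclaimed;
};

static int stack_pte_scan(pte_t *ptep, unsigned long addr, void *data)
{
	struct stack_scan *scan = data;
	pte_t pte = ptep_get(ptep);

	if (!pte_present(pte))
		return 0;		/* never faulted in, nothing to do */

	if (!pte_young(pte)) {
		/* Not accessed since the last scan: unmap and reclaim. */
		pte_clear(&init_mm, addr, ptep);
		__free_page(pte_page(pte));	/* or back to a per-CPU cache */
		scan->reclaimed++;
	} else {
		/* Clear the bit so the next scan sees fresh information. */
		set_pte_at(&init_mm, addr, ptep, pte_mkold(pte));
	}
	return 0;
}

/* Reclaim idle stack pages of one task, e.g. from a periodic "GC" pass. */
static unsigned int stack_reclaim_idle(struct task_struct *tsk)
{
	unsigned long start = (unsigned long)task_stack_page(tsk);
	struct stack_scan scan = { };

	/* Skip the topmost page, which stays mapped for the live stack. */
	apply_to_existing_page_range(&init_mm, start, THREAD_SIZE - PAGE_SIZE,
				     stack_pte_scan, &scan);
	return scan.reclaimed;
}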