arch/Kconfig | 5 +++++ arch/x86/Kconfig | 1 + arch/x86/kernel/process.c | 9 ++++++++- include/linux/sched/task.h | 7 +++++++ kernel/exit.c | 2 ++ 5 files changed, 23 insertions(+), 1 deletion(-)
When using shadow stacks the kernel will transparently allocate a shadow
stack for each thread. The intention is that this will be freed when the
thread exits but currently this doesn't actually happen. The deallocation
is done by shstk_free() which is called from exit_thread() and has a guard
to check for !tsk->mm due to the use of vm_unmap(). This doesn't actually
do anything since in do_exit() we call exit_mm() prior to thread_exit() and
exit_mm() disassociates the task from the mm and clears tsk->mm. The result
is that no shadow stacks will be freed until the process exits, leaking
memory for any process which creates and destroys threads.
Fix this by moving the shstk_free() to a new exit_thread_early() call which
runs immediately prior to exit_mm(). We don't do this right at the start of
do_exit() due to the handling for klling init. This could trigger some
incompatibility if there is code that looks at the shadow stack of a thread
which has exited but this seems much less likely than the leaking of shadow
stacks due to thread deletion.
Fixes: 18e66b695e78 ("x86/shstk: Add Kconfig option for shadow stack")
Signed-off-by: Mark Brown <broonie@kernel.org>
---
It is entirely possible I am missing something here, I don't have a
system that allows me to test shadow stack support directly and have
only checked this by inspection and tested with my arm64 GCS series.
If this makes sense it'll need to become a dependency for GCS.
---
arch/Kconfig | 5 +++++
arch/x86/Kconfig | 1 +
arch/x86/kernel/process.c | 9 ++++++++-
include/linux/sched/task.h | 7 +++++++
kernel/exit.c | 2 ++
5 files changed, 23 insertions(+), 1 deletion(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 975dd22a2dbd..2fa3aedc1d23 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1037,6 +1037,11 @@ config HAVE_EXIT_THREAD
help
An architecture implements exit_thread.
+config HAVE_EXIT_THREAD_EARLY
+ bool
+ help
+ An architecture implements exit_thread_early.
+
config ARCH_MMAP_RND_BITS_MIN
int
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 007bab9f2a0e..916616026111 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -225,6 +225,7 @@ config X86
select HAVE_EFFICIENT_UNALIGNED_ACCESS
select HAVE_EISA
select HAVE_EXIT_THREAD
+ select HAVE_EXIT_THREAD_EARLY
select HAVE_GUP_FAST
select HAVE_FENTRY if X86_64 || DYNAMIC_FTRACE
select HAVE_FTRACE_MCOUNT_RECORD
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index f63f8fd00a91..af3cabe08185 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -110,6 +110,14 @@ void arch_release_task_struct(struct task_struct *tsk)
}
#endif
+/*
+ * Free thread data structures etc..
+ */
+void exit_thread_early(struct task_struct *tsk)
+{
+ shstk_free(tsk);
+}
+
/*
* Free thread data structures etc..
*/
@@ -123,7 +131,6 @@ void exit_thread(struct task_struct *tsk)
free_vm86(t);
- shstk_free(tsk);
fpu__drop(fpu);
}
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index d362aacf9f89..cbafc6ba4e3e 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -88,6 +88,13 @@ static inline void exit_thread(struct task_struct *tsk)
{
}
#endif
+#ifdef CONFIG_HAVE_EXIT_THREAD_EARLY
+extern void exit_thread_early(struct task_struct *tsk);
+#else
+static inline void exit_thread_early(struct task_struct *tsk)
+{
+}
+#endif
extern __noreturn void do_group_exit(int);
extern void exit_files(struct task_struct *);
diff --git a/kernel/exit.c b/kernel/exit.c
index 7430852a8571..bc2c2a3da9fb 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -866,6 +866,8 @@ void __noreturn do_exit(long code)
tsk->exit_code = code;
taskstats_exit(tsk, group_dead);
+ exit_thread_early(tsk);
+
exit_mm();
if (group_dead)
---
base-commit: 7c626ce4bae1ac14f60076d00eafe71af30450ba
change-id: 20240910-x86-fix-shstk-leak-add636c1d0e4
Best regards,
--
Mark Brown <broonie@kernel.org>
On Tue, 2024-09-10 at 23:56 +0100, Mark Brown wrote: > When using shadow stacks the kernel will transparently allocate a shadow > stack for each thread. The intention is that this will be freed when the > thread exits but currently this doesn't actually happen. The deallocation > is done by shstk_free() which is called from exit_thread() and has a guard > to check for !tsk->mm due to the use of vm_unmap(). This doesn't actually > do anything since in do_exit() we call exit_mm() prior to thread_exit() and > exit_mm() disassociates the task from the mm and clears tsk->mm. The result > is that no shadow stacks will be freed until the process exits, leaking > memory for any process which creates and destroys threads. > > Fix this by moving the shstk_free() to a new exit_thread_early() call which > runs immediately prior to exit_mm(). We don't do this right at the start of > do_exit() due to the handling for klling init. This could trigger some > incompatibility if there is code that looks at the shadow stack of a thread > which has exited but this seems much less likely than the leaking of shadow > stacks due to thread deletion. > > Fixes: 18e66b695e78 ("x86/shstk: Add Kconfig option for shadow stack") > Signed-off-by: Mark Brown <broonie@kernel.org> > --- > It is entirely possible I am missing something here, I don't have a > system that allows me to test shadow stack support directly and have > only checked this by inspection and tested with my arm64 GCS series. > If this makes sense it'll need to become a dependency for GCS. The common cleanup case is via deactivate_mm()->shstk_free(), which happens when the MM is still attached. But there is also an exit_thread() caller in the fork failure patch (see copy_process()). So by my inspection, the exit_thread_early() is not needed because of the deactivate_mm() path that happens earlier in do_exit() via exit_mm(). But since this patch also removes the shstk_free() from the copy_process() error path, I think we would need clarity that it is unneeded there. A quick search through the arm series and I don't see deactivate_mm() implementation, and instead a separate cleanup solution. Could that be the reason why you saw the leak on arm? Considering the trickiness of the auto allocated shadow stacks lifecycle, I think it would be great if all the implementations had common logic. If possible at least. BTW, two more notes on this whole area: 1. 99% sure glibc has some tests that catch leaks like hypothesized here, by watching for memory grown after repeated thread exits. IIRC I introduced a shadow stack leak at some point during development that failed the test. 2. Weijiang (CCed) is working on a fix for case in the opposite direction. An error path that attempts to free the shadow stack twice and triggers a warning.
On Wed, Sep 11, 2024 at 06:01:00PM +0000, Edgecombe, Rick P wrote: > On Tue, 2024-09-10 at 23:56 +0100, Mark Brown wrote: > > When using shadow stacks the kernel will transparently allocate a shadow > > stack for each thread. The intention is that this will be freed when the > > thread exits but currently this doesn't actually happen. The deallocation > > is done by shstk_free() which is called from exit_thread() and has a guard > > to check for !tsk->mm due to the use of vm_unmap(). This doesn't actually > > do anything since in do_exit() we call exit_mm() prior to thread_exit() and > > exit_mm() disassociates the task from the mm and clears tsk->mm. The result > > is that no shadow stacks will be freed until the process exits, leaking > > memory for any process which creates and destroys threads. ... > > It is entirely possible I am missing something here, I don't have a > > system that allows me to test shadow stack support directly and have > > only checked this by inspection and tested with my arm64 GCS series. > > If this makes sense it'll need to become a dependency for GCS. > The common cleanup case is via deactivate_mm()->shstk_free(), which happens when > the MM is still attached. But there is also an exit_thread() caller in the fork > failure patch (see copy_process()). Ah, yes - glad I was missing something! I saw exit_thread() doing the right thing in the error paths but not deactivate_mm(). > A quick search through the arm series and I don't see deactivate_mm() > implementation, and instead a separate cleanup solution. Could that be the > reason why you saw the leak on arm? Considering the trickiness of the auto > allocated shadow stacks lifecycle, I think it would be great if all the > implementations had common logic. If possible at least. Yes, it's because we don't have a deactivate_mm() implementation and I didn't add one (or lost it at some point), effectively this patch is just adding deactivate_mm() by another name. The hook being a macro definition in the header caught me out I think, I was probably just using regular grep not git grep when I went looking. I was actually considering what some factoring out might look like in the context of the clone3() work, as incremental work on top of landing the ABI so we try to avoid introducing yet more rounds of discussion to delay that. map_shadow_stack() as well. > BTW, two more notes on this whole area: > 1. 99% sure glibc has some tests that catch leaks like hypothesized here, by > watching for memory grown after repeated thread exits. IIRC I introduced a > shadow stack leak at some point during development that failed the test. Guess how this was noticed! It's tst-create-detached.c, with IIRC tweaks to the environment (I think it was turning overcommit off was the main thing) to make the test actually detect something itself. > 2. Weijiang (CCed) is working on a fix for case in the opposite direction. An > error path that attempts to free the shadow stack twice and triggers a warning. Ack.
© 2016 - 2024 Red Hat, Inc.