x86/shstk: Free the thread shadow stack before disassociating from the mm

[PATCH] x86/shstk: Free the thread shadow stack before disassociating from the mm

Posted by Mark Brown 2 months, 2 weeks ago

When using shadow stacks the kernel will transparently allocate a shadow
stack for each thread. The intention is that this will be freed when the
thread exits but currently this doesn't actually happen. The deallocation
is done by shstk_free() which is called from exit_thread() and has a guard
to check for !tsk->mm due to the use of vm_unmap(). This doesn't actually
do anything since in do_exit() we call exit_mm() prior to thread_exit() and
exit_mm() disassociates the task from the mm and clears tsk->mm. The result
is that no shadow stacks will be freed until the process exits, leaking
memory for any process which creates and destroys threads.

Fix this by moving the shstk_free() to a new exit_thread_early() call which
runs immediately prior to exit_mm(). We don't do this right at the start of
do_exit() due to the handling for klling init. This could trigger some
incompatibility if there is code that looks at the shadow stack of a thread
which has exited but this seems much less likely than the leaking of shadow
stacks due to thread deletion.

Fixes: 18e66b695e78 ("x86/shstk: Add Kconfig option for shadow stack")
Signed-off-by: Mark Brown <broonie@kernel.org>
---
It is entirely possible I am missing something here, I don't have a
system that allows me to test shadow stack support directly and have
only checked this by inspection and tested with my arm64 GCS series.
If this makes sense it'll need to become a dependency for GCS.
---
 arch/Kconfig               | 5 +++++
 arch/x86/Kconfig           | 1 +
 arch/x86/kernel/process.c  | 9 ++++++++-
 include/linux/sched/task.h | 7 +++++++
 kernel/exit.c              | 2 ++
 5 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 975dd22a2dbd..2fa3aedc1d23 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1037,6 +1037,11 @@ config HAVE_EXIT_THREAD
 	help
 	  An architecture implements exit_thread.
 
+config HAVE_EXIT_THREAD_EARLY
+	bool
+	help
+	  An architecture implements exit_thread_early.
+
 config ARCH_MMAP_RND_BITS_MIN
 	int
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 007bab9f2a0e..916616026111 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -225,6 +225,7 @@ config X86
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS
 	select HAVE_EISA
 	select HAVE_EXIT_THREAD
+	select HAVE_EXIT_THREAD_EARLY
 	select HAVE_GUP_FAST
 	select HAVE_FENTRY			if X86_64 || DYNAMIC_FTRACE
 	select HAVE_FTRACE_MCOUNT_RECORD
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index f63f8fd00a91..af3cabe08185 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -110,6 +110,14 @@ void arch_release_task_struct(struct task_struct *tsk)
 }
 #endif
 
+/*
+ * Free thread data structures etc..
+ */
+void exit_thread_early(struct task_struct *tsk)
+{
+	shstk_free(tsk);
+}
+
 /*
  * Free thread data structures etc..
  */
@@ -123,7 +131,6 @@ void exit_thread(struct task_struct *tsk)
 
 	free_vm86(t);
 
-	shstk_free(tsk);
 	fpu__drop(fpu);
 }
 
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index d362aacf9f89..cbafc6ba4e3e 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -88,6 +88,13 @@ static inline void exit_thread(struct task_struct *tsk)
 {
 }
 #endif
+#ifdef CONFIG_HAVE_EXIT_THREAD_EARLY
+extern void exit_thread_early(struct task_struct *tsk);
+#else
+static inline void exit_thread_early(struct task_struct *tsk)
+{
+}
+#endif
 extern __noreturn void do_group_exit(int);
 
 extern void exit_files(struct task_struct *);
diff --git a/kernel/exit.c b/kernel/exit.c
index 7430852a8571..bc2c2a3da9fb 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -866,6 +866,8 @@ void __noreturn do_exit(long code)
 	tsk->exit_code = code;
 	taskstats_exit(tsk, group_dead);
 
+	exit_thread_early(tsk);
+
 	exit_mm();
 
 	if (group_dead)

---
base-commit: 7c626ce4bae1ac14f60076d00eafe71af30450ba
change-id: 20240910-x86-fix-shstk-leak-add636c1d0e4

Best regards,
-- 
Mark Brown <broonie@kernel.org>

Re: [PATCH] x86/shstk: Free the thread shadow stack before disassociating from the mm

Posted by Edgecombe, Rick P 2 months, 2 weeks ago

On Tue, 2024-09-10 at 23:56 +0100, Mark Brown wrote:
> When using shadow stacks the kernel will transparently allocate a shadow
> stack for each thread. The intention is that this will be freed when the
> thread exits but currently this doesn't actually happen. The deallocation
> is done by shstk_free() which is called from exit_thread() and has a guard
> to check for !tsk->mm due to the use of vm_unmap(). This doesn't actually
> do anything since in do_exit() we call exit_mm() prior to thread_exit() and
> exit_mm() disassociates the task from the mm and clears tsk->mm. The result
> is that no shadow stacks will be freed until the process exits, leaking
> memory for any process which creates and destroys threads.
> 
> Fix this by moving the shstk_free() to a new exit_thread_early() call which
> runs immediately prior to exit_mm(). We don't do this right at the start of
> do_exit() due to the handling for klling init. This could trigger some
> incompatibility if there is code that looks at the shadow stack of a thread
> which has exited but this seems much less likely than the leaking of shadow
> stacks due to thread deletion.
> 
> Fixes: 18e66b695e78 ("x86/shstk: Add Kconfig option for shadow stack")
> Signed-off-by: Mark Brown <broonie@kernel.org>
> ---
> It is entirely possible I am missing something here, I don't have a
> system that allows me to test shadow stack support directly and have
> only checked this by inspection and tested with my arm64 GCS series.
> If this makes sense it'll need to become a dependency for GCS.

The common cleanup case is via deactivate_mm()->shstk_free(), which happens when
the MM is still attached. But there is also an exit_thread() caller in the fork
failure patch (see copy_process()).

So by my inspection, the exit_thread_early() is not needed because of the
deactivate_mm() path that happens earlier in do_exit() via exit_mm(). But since
this patch also removes the shstk_free() from the copy_process() error path, I
think we would need clarity that it is unneeded there.

A quick search through the arm series and I don't see deactivate_mm()
implementation, and instead a separate cleanup solution. Could that be the
reason why you saw the leak on arm? Considering the trickiness of the auto
allocated shadow stacks lifecycle, I think it would be great if all the
implementations had common logic. If possible at least.

BTW, two more notes on this whole area:
1. 99% sure glibc has some tests that catch leaks like hypothesized here, by
watching for memory grown after repeated thread exits. IIRC I introduced a
shadow stack leak at some point during development that failed the test.
2. Weijiang (CCed) is working on a fix for case in the opposite direction. An
error path that attempts to free the shadow stack twice and triggers a warning.

Re: [PATCH] x86/shstk: Free the thread shadow stack before disassociating from the mm

Posted by Mark Brown 2 months, 2 weeks ago

On Wed, Sep 11, 2024 at 06:01:00PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2024-09-10 at 23:56 +0100, Mark Brown wrote:

> > When using shadow stacks the kernel will transparently allocate a shadow
> > stack for each thread. The intention is that this will be freed when the
> > thread exits but currently this doesn't actually happen. The deallocation
> > is done by shstk_free() which is called from exit_thread() and has a guard
> > to check for !tsk->mm due to the use of vm_unmap(). This doesn't actually
> > do anything since in do_exit() we call exit_mm() prior to thread_exit() and
> > exit_mm() disassociates the task from the mm and clears tsk->mm. The result
> > is that no shadow stacks will be freed until the process exits, leaking
> > memory for any process which creates and destroys threads.

...

> > It is entirely possible I am missing something here, I don't have a
> > system that allows me to test shadow stack support directly and have
> > only checked this by inspection and tested with my arm64 GCS series.
> > If this makes sense it'll need to become a dependency for GCS.

> The common cleanup case is via deactivate_mm()->shstk_free(), which happens when
> the MM is still attached. But there is also an exit_thread() caller in the fork
> failure patch (see copy_process()).

Ah, yes - glad I was missing something!  I saw exit_thread() doing the
right thing in the error paths but not deactivate_mm().

> A quick search through the arm series and I don't see deactivate_mm()
> implementation, and instead a separate cleanup solution. Could that be the
> reason why you saw the leak on arm? Considering the trickiness of the auto
> allocated shadow stacks lifecycle, I think it would be great if all the
> implementations had common logic. If possible at least.

Yes, it's because we don't have a deactivate_mm() implementation and I
didn't add one (or lost it at some point), effectively this patch is
just adding deactivate_mm() by another name.  The hook being a macro
definition in the header caught me out I think, I was probably just
using regular grep not git grep when I went looking.

I was actually considering what some factoring out might look like in
the context of the clone3() work, as incremental work on top of landing
the ABI so we try to avoid introducing yet more rounds of discussion to
delay that.  map_shadow_stack() as well.

> BTW, two more notes on this whole area:
> 1. 99% sure glibc has some tests that catch leaks like hypothesized here, by
> watching for memory grown after repeated thread exits. IIRC I introduced a
> shadow stack leak at some point during development that failed the test.

Guess how this was noticed!  It's tst-create-detached.c, with IIRC
tweaks to the environment (I think it was turning overcommit off was the
main thing) to make the test actually detect something itself.

> 2. Weijiang (CCed) is working on a fix for case in the opposite direction. An
> error path that attempts to free the shadow stack twice and triggers a warning.

Ack.