[RFC][PATCH] exec: Move cred computation under exec_update_lock

Eric W. Biederman posted 1 patch 1 week, 4 days ago
fs/exec.c                    | 88 ++++++++++++++----------------------
fs/fs_struct.c               |  1 -
fs/proc/base.c               |  4 +-
include/linux/fs_struct.h    |  1 -
include/linux/sched/signal.h |  6 ---
init/init_task.c             |  1 -
kernel/cred.c                |  2 +-
kernel/fork.c                |  8 +---
kernel/ptrace.c              |  4 +-
kernel/seccomp.c             | 12 ++---
10 files changed, 45 insertions(+), 82 deletions(-)
[RFC][PATCH] exec: Move cred computation under exec_update_lock
Posted by Eric W. Biederman 1 week, 4 days ago

Instead of computing the new cred before we pass the point of no
return compute the new cred just before we use it.

This allows the removal of fs_struct->in_exec and cred_guard_mutex.

I am not certain why we wanted to compute the cred for the new
executable so early.  Perhaps I missed something but I did not see any
common errors being signaled.   So I don't think we loose anything by
computing the new cred later.

We gain a lot.

We stop holding the cred_guard_mutex over places where the code sleeps
and waits for userspace.  These places include the waiting for the
tracer in PTRACE_EVENT_EXIT, "put_user(0, tsk->clear_child_tid)" in
mm_release, and "get_user(futex_offset, ...") in exit_robust_mutex.

We can remove fs_struct->in_exec.  The case where it was used simply
never comes up, when we compute the cred after de_thread completes.

We remove the possibility of a hang between a tracer calling
PTRACE_ATTACH/PTRACE_SIEZE and the kernel waiting for the tracer
in PTRACE_EVENT_EXIT.

---
Oleg, Kees, Bernd, Can you see anything I am missing?

The code compiles but I haven't test it yet.

I thought I was going to move commit_creds before de_thread, but that
would have taken commit_cred out of exec_update_lock (which introduces
races).

However I can't see any drawbacks of going the other direction.


 fs/exec.c                    | 88 ++++++++++++++----------------------
 fs/fs_struct.c               |  1 -
 fs/proc/base.c               |  4 +-
 include/linux/fs_struct.h    |  1 -
 include/linux/sched/signal.h |  6 ---
 init/init_task.c             |  1 -
 kernel/cred.c                |  2 +-
 kernel/fork.c                |  8 +---
 kernel/ptrace.c              |  4 +-
 kernel/seccomp.c             | 12 ++---
 10 files changed, 45 insertions(+), 82 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 4298e7e08d5d..5ae96584dab0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1090,6 +1090,9 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
 	perf_event_comm(tsk, exec);
 }
 
+static int prepare_bprm_creds(struct linux_binprm *bprm);
+static void check_unsafe_exec(struct linux_binprm *bprm);
+
 /*
  * Calling this is the point of no return. None of the failures will be
  * seen by userspace since either the process is already taking a fatal
@@ -1101,10 +1104,6 @@ int begin_new_exec(struct linux_binprm * bprm)
 	struct task_struct *me = current;
 	int retval;
 
-	/* Once we are committed compute the creds */
-	retval = bprm_creds_from_file(bprm);
-	if (retval)
-		return retval;
 
 	/*
 	 * This tracepoint marks the point before flushing the old exec where
@@ -1123,8 +1122,6 @@ int begin_new_exec(struct linux_binprm * bprm)
 	retval = de_thread(me);
 	if (retval)
 		goto out;
-	/* see the comment in check_unsafe_exec() */
-	current->fs->in_exec = 0;
 	/*
 	 * Cancel any io_uring activity across execve
 	 */
@@ -1251,6 +1248,25 @@ int begin_new_exec(struct linux_binprm * bprm)
 	WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
 	flush_signal_handlers(me, 0);
 
+	retval = prepare_bprm_creds(bprm);
+	if (retval)
+		goto out_unlock;
+
+	/*
+	 * Check for unsafe execution states before exec_binprm(), which
+	 * will call back into begin_new_exec(), into bprm_creds_from_file(),
+	 * where setuid-ness is evaluated.
+	 */
+	check_unsafe_exec(bprm);
+
+	/* Set the unchanging part of bprm->cred */
+	retval = security_bprm_creds_for_exec(bprm);
+
+	/* Once we are committed compute the creds */
+	retval = bprm_creds_from_file(bprm);
+	if (retval)
+		goto out_unlock;
+
 	retval = set_cred_ucounts(bprm->cred);
 	if (retval < 0)
 		goto out_unlock;
@@ -1272,9 +1288,9 @@ int begin_new_exec(struct linux_binprm * bprm)
 	if (get_dumpable(me->mm) != SUID_DUMP_USER)
 		perf_event_exit_task(me);
 	/*
-	 * cred_guard_mutex must be held at least to this point to prevent
+	 * exec_update_lock must be held at least to this point to prevent
 	 * ptrace_attach() from altering our determination of the task's
-	 * credentials; any time after this it may be unlocked.
+	 * credentials.
 	 */
 	security_bprm_committed_creds(bprm);
 
@@ -1291,8 +1307,6 @@ int begin_new_exec(struct linux_binprm * bprm)
 
 out_unlock:
 	up_write(&me->signal->exec_update_lock);
-	if (!bprm->cred)
-		mutex_unlock(&me->signal->cred_guard_mutex);
 
 out:
 	return retval;
@@ -1336,7 +1350,6 @@ void setup_new_exec(struct linux_binprm * bprm)
 	 */
 	me->mm->task_size = TASK_SIZE;
 	up_write(&me->signal->exec_update_lock);
-	mutex_unlock(&me->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(setup_new_exec);
 
@@ -1351,21 +1364,15 @@ void finalize_exec(struct linux_binprm *bprm)
 EXPORT_SYMBOL(finalize_exec);
 
 /*
- * Prepare credentials and lock ->cred_guard_mutex.
- * setup_new_exec() commits the new creds and drops the lock.
- * Or, if exec fails before, free_bprm() should release ->cred
- * and unlock.
+ * Prepare credentials.  begin_new_exec() commits the new creds.
+ * Or, if exec fails before, free_bprm() should release ->cred.
  */
 static int prepare_bprm_creds(struct linux_binprm *bprm)
 {
-	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
-		return -ERESTARTNOINTR;
-
 	bprm->cred = prepare_exec_creds();
 	if (likely(bprm->cred))
 		return 0;
 
-	mutex_unlock(&current->signal->cred_guard_mutex);
 	return -ENOMEM;
 }
 
@@ -1386,9 +1393,7 @@ static void free_bprm(struct linux_binprm *bprm)
 	}
 	free_arg_pages(bprm);
 	if (bprm->cred) {
-		/* in case exec fails before de_thread() succeeds */
-		current->fs->in_exec = 0;
-		mutex_unlock(&current->signal->cred_guard_mutex);
+		/* in case exec fails before commit_creds succeeds */
 		abort_creds(bprm->cred);
 	}
 	do_close_execat(bprm->file);
@@ -1486,13 +1491,12 @@ EXPORT_SYMBOL(bprm_change_interp);
 
 /*
  * determine how safe it is to execute the proposed program
- * - the caller must hold ->cred_guard_mutex to protect against
+ * - the caller must hold ->exec_update_lock to protect against
  *   PTRACE_ATTACH or seccomp thread-sync
  */
 static void check_unsafe_exec(struct linux_binprm *bprm)
 {
-	struct task_struct *p = current, *t;
-	unsigned n_fs;
+	struct task_struct *p = current;
 
 	if (p->ptrace)
 		bprm->unsafe |= LSM_UNSAFE_PTRACE;
@@ -1509,25 +1513,9 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
 	 * suid exec because the differently privileged task
 	 * will be able to manipulate the current directory, etc.
 	 * It would be nice to force an unshare instead...
-	 *
-	 * Otherwise we set fs->in_exec = 1 to deny clone(CLONE_FS)
-	 * from another sub-thread until de_thread() succeeds, this
-	 * state is protected by cred_guard_mutex we hold.
 	 */
-	n_fs = 1;
-	read_seqlock_excl(&p->fs->seq);
-	rcu_read_lock();
-	for_other_threads(p, t) {
-		if (t->fs == p->fs)
-			n_fs++;
-	}
-	rcu_read_unlock();
-
-	/* "users" and "in_exec" locked for copy_fs() */
-	if (p->fs->users > n_fs)
+	if (p->fs->users > 1)
 		bprm->unsafe |= LSM_UNSAFE_SHARE;
-	else
-		p->fs->in_exec = 1;
 	read_sequnlock_excl(&p->fs->seq);
 }
 
@@ -1731,25 +1719,15 @@ static int bprm_execve(struct linux_binprm *bprm)
 {
 	int retval;
 
-	retval = prepare_bprm_creds(bprm);
-	if (retval)
-		return retval;
+	if (bprm->is_check)
+		return 0;
 
-	/*
-	 * Check for unsafe execution states before exec_binprm(), which
-	 * will call back into begin_new_exec(), into bprm_creds_from_file(),
-	 * where setuid-ness is evaluated.
-	 */
-	check_unsafe_exec(bprm);
 	current->in_execve = 1;
 	sched_mm_cid_before_execve(current);
 
 	sched_exec();
 
-	/* Set the unchanging part of bprm->cred */
-	retval = security_bprm_creds_for_exec(bprm);
-	if (retval || bprm->is_check)
-		goto out;
+
 
 	retval = exec_binprm(bprm);
 	if (retval < 0)
diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index 28be762ac1c6..945bc0916f65 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -109,7 +109,6 @@ struct fs_struct *copy_fs_struct(struct fs_struct *old)
 	/* We don't need to lock fs - think why ;-) */
 	if (fs) {
 		fs->users = 1;
-		fs->in_exec = 0;
 		seqlock_init(&fs->seq);
 		fs->umask = old->umask;
 
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6299878e3d97..7041fb4d1689 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2834,14 +2834,14 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
 	}
 
 	/* Guard against adverse ptrace interaction */
-	rv = mutex_lock_interruptible(&current->signal->cred_guard_mutex);
+	rv = down_write_killable(&current->signal->exec_update_lock);
 	if (rv < 0)
 		goto out_free;
 
 	rv = security_setprocattr(PROC_I(inode)->op.lsmid,
 				  file->f_path.dentry->d_name.name, page,
 				  count);
-	mutex_unlock(&current->signal->cred_guard_mutex);
+	up_write(&current->signal->exec_update_lock);
 out_free:
 	kfree(page);
 out:
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index baf200ab5c77..29d0f7d57743 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -10,7 +10,6 @@ struct fs_struct {
 	int users;
 	seqlock_t seq;
 	int umask;
-	int in_exec;
 	struct path root, pwd;
 } __randomize_layout;
 
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 7d6449982822..7e9259c8fb2b 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -241,12 +241,6 @@ struct signal_struct {
 	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
 					 * killed by the oom killer */
 
-	struct mutex cred_guard_mutex;	/* guard against foreign influences on
-					 * credential calculations
-					 * (notably. ptrace)
-					 * Deprecated do not use in new code.
-					 * Use exec_update_lock instead.
-					 */
 	struct rw_semaphore exec_update_lock;	/* Held while task_struct is
 						 * being updated during exec,
 						 * and may have inconsistent
diff --git a/init/init_task.c b/init/init_task.c
index a55e2189206f..4813bffe217e 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -30,7 +30,6 @@ static struct signal_struct init_signals = {
 #ifdef CONFIG_CGROUPS
 	.cgroup_threadgroup_rwsem	= __RWSEM_INITIALIZER(init_signals.cgroup_threadgroup_rwsem),
 #endif
-	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
 	.exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
 #ifdef CONFIG_POSIX_TIMERS
 	.posix_timers		= HLIST_HEAD_INIT,
diff --git a/kernel/cred.c b/kernel/cred.c
index dbf6b687dc5c..80e376ce005f 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -252,7 +252,7 @@ EXPORT_SYMBOL(prepare_creds);
 
 /*
  * Prepare credentials for current to perform an execve()
- * - The caller must hold ->cred_guard_mutex
+ * - The caller must hold ->exec_update_lock
  */
 struct cred *prepare_exec_creds(void)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index 3da0f08615a9..996c649b9a4c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1555,11 +1555,6 @@ static int copy_fs(u64 clone_flags, struct task_struct *tsk)
 	if (clone_flags & CLONE_FS) {
 		/* tsk->fs is already what we want */
 		read_seqlock_excl(&fs->seq);
-		/* "users" and "in_exec" locked for check_unsafe_exec() */
-		if (fs->in_exec) {
-			read_sequnlock_excl(&fs->seq);
-			return -EAGAIN;
-		}
 		fs->users++;
 		read_sequnlock_excl(&fs->seq);
 		return 0;
@@ -1699,7 +1694,6 @@ static int copy_signal(u64 clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
-	mutex_init(&sig->cred_guard_mutex);
 	init_rwsem(&sig->exec_update_lock);
 
 	return 0;
@@ -1710,7 +1704,7 @@ static void copy_seccomp(struct task_struct *p)
 #ifdef CONFIG_SECCOMP
 	/*
 	 * Must be called with sighand->lock held, which is common to
-	 * all threads in the group. Holding cred_guard_mutex is not
+	 * all threads in the group. Holding exec_update_lock is not
 	 * needed because this new task is not yet running and cannot
 	 * be racing exec.
 	 */
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 75a84efad40f..8140d4bfc279 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -444,8 +444,8 @@ static int ptrace_attach(struct task_struct *task, long request,
 	 * SUID, SGID and LSM creds get determined differently
 	 * under ptrace.
 	 */
-	scoped_cond_guard (mutex_intr, return -ERESTARTNOINTR,
-			   &task->signal->cred_guard_mutex) {
+	scoped_cond_guard (rwsem_read_intr, return -ERESTARTNOINTR,
+			   &task->signal->exec_update_lock) {
 
 		scoped_guard (task_lock, task) {
 			retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 25f62867a16d..87de8d47d876 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -479,7 +479,7 @@ static int is_ancestor(struct seccomp_filter *parent,
 /**
  * seccomp_can_sync_threads: checks if all threads can be synchronized
  *
- * Expects sighand and cred_guard_mutex locks to be held.
+ * Expects sighand and exec_update_lock locks to be held.
  *
  * Returns 0 on success, -ve on error, or the pid of a thread which was
  * either not in the correct seccomp mode or did not have an ancestral
@@ -489,7 +489,7 @@ static inline pid_t seccomp_can_sync_threads(void)
 {
 	struct task_struct *thread, *caller;
 
-	BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
+	BUG_ON(!rwsem_is_locked(&current->signal->exec_update_lock));
 	assert_spin_locked(&current->sighand->siglock);
 
 	/* Validate all threads being eligible for synchronization. */
@@ -590,7 +590,7 @@ void seccomp_filter_release(struct task_struct *tsk)
  *
  * @flags: SECCOMP_FILTER_FLAG_* flags to set during sync.
  *
- * Expects sighand and cred_guard_mutex locks to be held, and for
+ * Expects sighand and exec_update_lock locks to be held, and for
  * seccomp_can_sync_threads() to have returned success already
  * without dropping the locks.
  *
@@ -599,7 +599,7 @@ static inline void seccomp_sync_threads(unsigned long flags)
 {
 	struct task_struct *thread, *caller;
 
-	BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
+	BUG_ON(!rwsem_is_locked(&current->signal->exec_update_lock));
 	assert_spin_locked(&current->sighand->siglock);
 
 	/*
@@ -2011,7 +2011,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	 * while another thread is in the middle of calling exec.
 	 */
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
-	    mutex_lock_killable(&current->signal->cred_guard_mutex))
+	    down_read_killable(&current->signal->exec_update_lock))
 		goto out_put_fd;
 
 	spin_lock_irq(&current->sighand->siglock);
@@ -2034,7 +2034,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
 out:
 	spin_unlock_irq(&current->sighand->siglock);
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
-		mutex_unlock(&current->signal->cred_guard_mutex);
+		up_read(&current->signal->exec_update_lock);
 out_put_fd:
 	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
 		if (ret) {
-- 
2.41.0
Re: [RFC][PATCH] exec: Move cred computation under exec_update_lock
Posted by Roberto Sassu 6 days, 13 hours ago
On Thu, 2025-11-20 at 14:57 -0600, Eric W. Biederman wrote:
> Instead of computing the new cred before we pass the point of no
> return compute the new cred just before we use it.
> 
> This allows the removal of fs_struct->in_exec and cred_guard_mutex.
> 
> I am not certain why we wanted to compute the cred for the new
> executable so early.  Perhaps I missed something but I did not see any
> common errors being signaled.   So I don't think we loose anything by
> computing the new cred later.
> 
> We gain a lot.
> 
> We stop holding the cred_guard_mutex over places where the code sleeps
> and waits for userspace.  These places include the waiting for the
> tracer in PTRACE_EVENT_EXIT, "put_user(0, tsk->clear_child_tid)" in
> mm_release, and "get_user(futex_offset, ...") in exit_robust_mutex.
> 
> We can remove fs_struct->in_exec.  The case where it was used simply
> never comes up, when we compute the cred after de_thread completes.
> 
> We remove the possibility of a hang between a tracer calling
> PTRACE_ATTACH/PTRACE_SIEZE and the kernel waiting for the tracer
> in PTRACE_EVENT_EXIT.
> 
> ---
> Oleg, Kees, Bernd, Can you see anything I am missing?

+ Mimi, linux-integrity (would be nice if we are in CC when linux-
security-module is in CC).

Apologies for not answering earlier, it seems I don't receive the
emails from the linux-security-module mailing list (thanks Serge for
letting me know!).

I tested your patch but there are a few warnings like this:

[    2.702374] =====================================
[    2.702854] WARNING: bad unlock balance detected!
[    2.703350] 6.18.0-rc6+ #409 Not tainted
[    2.703755] -------------------------------------
[    2.704241] init/1 is trying to release lock (init_fs.seq) at:
[    2.704829] [<ffffffff81836100>] begin_new_exec+0xfe0/0x1710
[    2.705421] but there are no more locks to release!
[    2.705931] 
[    2.705931] other info that might help us debug this:
[    2.706610] 1 lock held by init/1:
[    2.706958]  #0: ffff88810083e538 (&sig->exec_update_lock){+.+.}-{4:4}, at: begin_new_exec+0x769/0x1710

and then the system hangs.

I see two main effects of this patch. First, the bprm_check_security
hook implementations will not see bprm->cred populated. That was a
problem before we made this patch:

https://patchew.org/linux/20251008113503.2433343-1-roberto.sassu@huaweicloud.com/

to work around the problem of not calculating the final DAC credentials
early enough (well, we actually had to change our CREDS_CHECK hook
behavior).

The second, I could not check. If I remember well, unlike the
capability LSM, SELinux/Apparmor/SMACK calculate the final credentials
based on the first file being executed (thus the script, not the
interpreter). Is this patch keeping the same behavior despite preparing
the credentials when the final binary is found?

Thanks

Roberto

> The code compiles but I haven't test it yet.
> 
> I thought I was going to move commit_creds before de_thread, but that
> would have taken commit_cred out of exec_update_lock (which introduces
> races).
> 
> However I can't see any drawbacks of going the other direction.
> 
> 
>  fs/exec.c                    | 88 ++++++++++++++----------------------
>  fs/fs_struct.c               |  1 -
>  fs/proc/base.c               |  4 +-
>  include/linux/fs_struct.h    |  1 -
>  include/linux/sched/signal.h |  6 ---
>  init/init_task.c             |  1 -
>  kernel/cred.c                |  2 +-
>  kernel/fork.c                |  8 +---
>  kernel/ptrace.c              |  4 +-
>  kernel/seccomp.c             | 12 ++---
>  10 files changed, 45 insertions(+), 82 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 4298e7e08d5d..5ae96584dab0 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1090,6 +1090,9 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
>  	perf_event_comm(tsk, exec);
>  }
>  
> +static int prepare_bprm_creds(struct linux_binprm *bprm);
> +static void check_unsafe_exec(struct linux_binprm *bprm);
> +
>  /*
>   * Calling this is the point of no return. None of the failures will be
>   * seen by userspace since either the process is already taking a fatal
> @@ -1101,10 +1104,6 @@ int begin_new_exec(struct linux_binprm * bprm)
>  	struct task_struct *me = current;
>  	int retval;
>  
> -	/* Once we are committed compute the creds */
> -	retval = bprm_creds_from_file(bprm);
> -	if (retval)
> -		return retval;
>  
>  	/*
>  	 * This tracepoint marks the point before flushing the old exec where
> @@ -1123,8 +1122,6 @@ int begin_new_exec(struct linux_binprm * bprm)
>  	retval = de_thread(me);
>  	if (retval)
>  		goto out;
> -	/* see the comment in check_unsafe_exec() */
> -	current->fs->in_exec = 0;
>  	/*
>  	 * Cancel any io_uring activity across execve
>  	 */
> @@ -1251,6 +1248,25 @@ int begin_new_exec(struct linux_binprm * bprm)
>  	WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
>  	flush_signal_handlers(me, 0);
>  
> +	retval = prepare_bprm_creds(bprm);
> +	if (retval)
> +		goto out_unlock;
> +
> +	/*
> +	 * Check for unsafe execution states before exec_binprm(), which
> +	 * will call back into begin_new_exec(), into bprm_creds_from_file(),
> +	 * where setuid-ness is evaluated.
> +	 */
> +	check_unsafe_exec(bprm);
> +
> +	/* Set the unchanging part of bprm->cred */
> +	retval = security_bprm_creds_for_exec(bprm);
> +
> +	/* Once we are committed compute the creds */
> +	retval = bprm_creds_from_file(bprm);
> +	if (retval)
> +		goto out_unlock;
> +
>  	retval = set_cred_ucounts(bprm->cred);
>  	if (retval < 0)
>  		goto out_unlock;
> @@ -1272,9 +1288,9 @@ int begin_new_exec(struct linux_binprm * bprm)
>  	if (get_dumpable(me->mm) != SUID_DUMP_USER)
>  		perf_event_exit_task(me);
>  	/*
> -	 * cred_guard_mutex must be held at least to this point to prevent
> +	 * exec_update_lock must be held at least to this point to prevent
>  	 * ptrace_attach() from altering our determination of the task's
> -	 * credentials; any time after this it may be unlocked.
> +	 * credentials.
>  	 */
>  	security_bprm_committed_creds(bprm);
>  
> @@ -1291,8 +1307,6 @@ int begin_new_exec(struct linux_binprm * bprm)
>  
>  out_unlock:
>  	up_write(&me->signal->exec_update_lock);
> -	if (!bprm->cred)
> -		mutex_unlock(&me->signal->cred_guard_mutex);
>  
>  out:
>  	return retval;
> @@ -1336,7 +1350,6 @@ void setup_new_exec(struct linux_binprm * bprm)
>  	 */
>  	me->mm->task_size = TASK_SIZE;
>  	up_write(&me->signal->exec_update_lock);
> -	mutex_unlock(&me->signal->cred_guard_mutex);
>  }
>  EXPORT_SYMBOL(setup_new_exec);
>  
> @@ -1351,21 +1364,15 @@ void finalize_exec(struct linux_binprm *bprm)
>  EXPORT_SYMBOL(finalize_exec);
>  
>  /*
> - * Prepare credentials and lock ->cred_guard_mutex.
> - * setup_new_exec() commits the new creds and drops the lock.
> - * Or, if exec fails before, free_bprm() should release ->cred
> - * and unlock.
> + * Prepare credentials.  begin_new_exec() commits the new creds.
> + * Or, if exec fails before, free_bprm() should release ->cred.
>   */
>  static int prepare_bprm_creds(struct linux_binprm *bprm)
>  {
> -	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
> -		return -ERESTARTNOINTR;
> -
>  	bprm->cred = prepare_exec_creds();
>  	if (likely(bprm->cred))
>  		return 0;
>  
> -	mutex_unlock(&current->signal->cred_guard_mutex);
>  	return -ENOMEM;
>  }
>  
> @@ -1386,9 +1393,7 @@ static void free_bprm(struct linux_binprm *bprm)
>  	}
>  	free_arg_pages(bprm);
>  	if (bprm->cred) {
> -		/* in case exec fails before de_thread() succeeds */
> -		current->fs->in_exec = 0;
> -		mutex_unlock(&current->signal->cred_guard_mutex);
> +		/* in case exec fails before commit_creds succeeds */
>  		abort_creds(bprm->cred);
>  	}
>  	do_close_execat(bprm->file);
> @@ -1486,13 +1491,12 @@ EXPORT_SYMBOL(bprm_change_interp);
>  
>  /*
>   * determine how safe it is to execute the proposed program
> - * - the caller must hold ->cred_guard_mutex to protect against
> + * - the caller must hold ->exec_update_lock to protect against
>   *   PTRACE_ATTACH or seccomp thread-sync
>   */
>  static void check_unsafe_exec(struct linux_binprm *bprm)
>  {
> -	struct task_struct *p = current, *t;
> -	unsigned n_fs;
> +	struct task_struct *p = current;
>  
>  	if (p->ptrace)
>  		bprm->unsafe |= LSM_UNSAFE_PTRACE;
> @@ -1509,25 +1513,9 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
>  	 * suid exec because the differently privileged task
>  	 * will be able to manipulate the current directory, etc.
>  	 * It would be nice to force an unshare instead...
> -	 *
> -	 * Otherwise we set fs->in_exec = 1 to deny clone(CLONE_FS)
> -	 * from another sub-thread until de_thread() succeeds, this
> -	 * state is protected by cred_guard_mutex we hold.
>  	 */
> -	n_fs = 1;
> -	read_seqlock_excl(&p->fs->seq);
> -	rcu_read_lock();
> -	for_other_threads(p, t) {
> -		if (t->fs == p->fs)
> -			n_fs++;
> -	}
> -	rcu_read_unlock();
> -
> -	/* "users" and "in_exec" locked for copy_fs() */
> -	if (p->fs->users > n_fs)
> +	if (p->fs->users > 1)
>  		bprm->unsafe |= LSM_UNSAFE_SHARE;
> -	else
> -		p->fs->in_exec = 1;
>  	read_sequnlock_excl(&p->fs->seq);
>  }
>  
> @@ -1731,25 +1719,15 @@ static int bprm_execve(struct linux_binprm *bprm)
>  {
>  	int retval;
>  
> -	retval = prepare_bprm_creds(bprm);
> -	if (retval)
> -		return retval;
> +	if (bprm->is_check)
> +		return 0;
>  
> -	/*
> -	 * Check for unsafe execution states before exec_binprm(), which
> -	 * will call back into begin_new_exec(), into bprm_creds_from_file(),
> -	 * where setuid-ness is evaluated.
> -	 */
> -	check_unsafe_exec(bprm);
>  	current->in_execve = 1;
>  	sched_mm_cid_before_execve(current);
>  
>  	sched_exec();
>  
> -	/* Set the unchanging part of bprm->cred */
> -	retval = security_bprm_creds_for_exec(bprm);
> -	if (retval || bprm->is_check)
> -		goto out;
> +
>  
>  	retval = exec_binprm(bprm);
>  	if (retval < 0)
> diff --git a/fs/fs_struct.c b/fs/fs_struct.c
> index 28be762ac1c6..945bc0916f65 100644
> --- a/fs/fs_struct.c
> +++ b/fs/fs_struct.c
> @@ -109,7 +109,6 @@ struct fs_struct *copy_fs_struct(struct fs_struct *old)
>  	/* We don't need to lock fs - think why ;-) */
>  	if (fs) {
>  		fs->users = 1;
> -		fs->in_exec = 0;
>  		seqlock_init(&fs->seq);
>  		fs->umask = old->umask;
>  
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 6299878e3d97..7041fb4d1689 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2834,14 +2834,14 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
>  	}
>  
>  	/* Guard against adverse ptrace interaction */
> -	rv = mutex_lock_interruptible(&current->signal->cred_guard_mutex);
> +	rv = down_write_killable(&current->signal->exec_update_lock);
>  	if (rv < 0)
>  		goto out_free;
>  
>  	rv = security_setprocattr(PROC_I(inode)->op.lsmid,
>  				  file->f_path.dentry->d_name.name, page,
>  				  count);
> -	mutex_unlock(&current->signal->cred_guard_mutex);
> +	up_write(&current->signal->exec_update_lock);
>  out_free:
>  	kfree(page);
>  out:
> diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
> index baf200ab5c77..29d0f7d57743 100644
> --- a/include/linux/fs_struct.h
> +++ b/include/linux/fs_struct.h
> @@ -10,7 +10,6 @@ struct fs_struct {
>  	int users;
>  	seqlock_t seq;
>  	int umask;
> -	int in_exec;
>  	struct path root, pwd;
>  } __randomize_layout;
>  
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 7d6449982822..7e9259c8fb2b 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -241,12 +241,6 @@ struct signal_struct {
>  	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
>  					 * killed by the oom killer */
>  
> -	struct mutex cred_guard_mutex;	/* guard against foreign influences on
> -					 * credential calculations
> -					 * (notably. ptrace)
> -					 * Deprecated do not use in new code.
> -					 * Use exec_update_lock instead.
> -					 */
>  	struct rw_semaphore exec_update_lock;	/* Held while task_struct is
>  						 * being updated during exec,
>  						 * and may have inconsistent
> diff --git a/init/init_task.c b/init/init_task.c
> index a55e2189206f..4813bffe217e 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -30,7 +30,6 @@ static struct signal_struct init_signals = {
>  #ifdef CONFIG_CGROUPS
>  	.cgroup_threadgroup_rwsem	= __RWSEM_INITIALIZER(init_signals.cgroup_threadgroup_rwsem),
>  #endif
> -	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
>  	.exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
>  #ifdef CONFIG_POSIX_TIMERS
>  	.posix_timers		= HLIST_HEAD_INIT,
> diff --git a/kernel/cred.c b/kernel/cred.c
> index dbf6b687dc5c..80e376ce005f 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -252,7 +252,7 @@ EXPORT_SYMBOL(prepare_creds);
>  
>  /*
>   * Prepare credentials for current to perform an execve()
> - * - The caller must hold ->cred_guard_mutex
> + * - The caller must hold ->exec_update_lock
>   */
>  struct cred *prepare_exec_creds(void)
>  {
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3da0f08615a9..996c649b9a4c 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1555,11 +1555,6 @@ static int copy_fs(u64 clone_flags, struct task_struct *tsk)
>  	if (clone_flags & CLONE_FS) {
>  		/* tsk->fs is already what we want */
>  		read_seqlock_excl(&fs->seq);
> -		/* "users" and "in_exec" locked for check_unsafe_exec() */
> -		if (fs->in_exec) {
> -			read_sequnlock_excl(&fs->seq);
> -			return -EAGAIN;
> -		}
>  		fs->users++;
>  		read_sequnlock_excl(&fs->seq);
>  		return 0;
> @@ -1699,7 +1694,6 @@ static int copy_signal(u64 clone_flags, struct task_struct *tsk)
>  	sig->oom_score_adj = current->signal->oom_score_adj;
>  	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
>  
> -	mutex_init(&sig->cred_guard_mutex);
>  	init_rwsem(&sig->exec_update_lock);
>  
>  	return 0;
> @@ -1710,7 +1704,7 @@ static void copy_seccomp(struct task_struct *p)
>  #ifdef CONFIG_SECCOMP
>  	/*
>  	 * Must be called with sighand->lock held, which is common to
> -	 * all threads in the group. Holding cred_guard_mutex is not
> +	 * all threads in the group. Holding exec_update_lock is not
>  	 * needed because this new task is not yet running and cannot
>  	 * be racing exec.
>  	 */
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 75a84efad40f..8140d4bfc279 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -444,8 +444,8 @@ static int ptrace_attach(struct task_struct *task, long request,
>  	 * SUID, SGID and LSM creds get determined differently
>  	 * under ptrace.
>  	 */
> -	scoped_cond_guard (mutex_intr, return -ERESTARTNOINTR,
> -			   &task->signal->cred_guard_mutex) {
> +	scoped_cond_guard (rwsem_read_intr, return -ERESTARTNOINTR,
> +			   &task->signal->exec_update_lock) {
>  
>  		scoped_guard (task_lock, task) {
>  			retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 25f62867a16d..87de8d47d876 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -479,7 +479,7 @@ static int is_ancestor(struct seccomp_filter *parent,
>  /**
>   * seccomp_can_sync_threads: checks if all threads can be synchronized
>   *
> - * Expects sighand and cred_guard_mutex locks to be held.
> + * Expects sighand and exec_update_lock locks to be held.
>   *
>   * Returns 0 on success, -ve on error, or the pid of a thread which was
>   * either not in the correct seccomp mode or did not have an ancestral
> @@ -489,7 +489,7 @@ static inline pid_t seccomp_can_sync_threads(void)
>  {
>  	struct task_struct *thread, *caller;
>  
> -	BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
> +	BUG_ON(!rwsem_is_locked(&current->signal->exec_update_lock));
>  	assert_spin_locked(&current->sighand->siglock);
>  
>  	/* Validate all threads being eligible for synchronization. */
> @@ -590,7 +590,7 @@ void seccomp_filter_release(struct task_struct *tsk)
>   *
>   * @flags: SECCOMP_FILTER_FLAG_* flags to set during sync.
>   *
> - * Expects sighand and cred_guard_mutex locks to be held, and for
> + * Expects sighand and exec_update_lock locks to be held, and for
>   * seccomp_can_sync_threads() to have returned success already
>   * without dropping the locks.
>   *
> @@ -599,7 +599,7 @@ static inline void seccomp_sync_threads(unsigned long flags)
>  {
>  	struct task_struct *thread, *caller;
>  
> -	BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
> +	BUG_ON(!rwsem_is_locked(&current->signal->exec_update_lock));
>  	assert_spin_locked(&current->sighand->siglock);
>  
>  	/*
> @@ -2011,7 +2011,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	 * while another thread is in the middle of calling exec.
>  	 */
>  	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
> -	    mutex_lock_killable(&current->signal->cred_guard_mutex))
> +	    down_read_killable(&current->signal->exec_update_lock))
>  		goto out_put_fd;
>  
>  	spin_lock_irq(&current->sighand->siglock);
> @@ -2034,7 +2034,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  out:
>  	spin_unlock_irq(&current->sighand->siglock);
>  	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
> -		mutex_unlock(&current->signal->cred_guard_mutex);
> +		up_read(&current->signal->exec_update_lock);
>  out_put_fd:
>  	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
>  		if (ret) {
Are setuid shell scripts safe? (Implied by security_bprm_creds_for_exec)
Posted by Eric W. Biederman 9 hours ago
Roberto Sassu <roberto.sassu@huaweicloud.com> writes:

> + Mimi, linux-integrity (would be nice if we are in CC when linux-
> security-module is in CC).
>
> Apologies for not answering earlier, it seems I don't receive the
> emails from the linux-security-module mailing list (thanks Serge for
> letting me know!).
>
> I see two main effects of this patch. First, the bprm_check_security
> hook implementations will not see bprm->cred populated. That was a
> problem before we made this patch:
>
> https://patchew.org/linux/20251008113503.2433343-1-roberto.sassu@huaweicloud.com/

Thanks, that is definitely needed.

Does calling process_measurement(CREDS_CHECK) on only the final file
pass review?  Do you know of any cases where that will break things?

As it stands I don't think it should be assumed that any LSM has
computed it's final creds until bprm_creds_from_file.  Not just the
uid and gid.

If the patch you posted for review works that helps sort that mess out.

> to work around the problem of not calculating the final DAC credentials
> early enough (well, we actually had to change our CREDS_CHECK hook
> behavior).
>
> The second, I could not check. If I remember well, unlike the
> capability LSM, SELinux/Apparmor/SMACK calculate the final credentials
> based on the first file being executed (thus the script, not the
> interpreter). Is this patch keeping the same behavior despite preparing
> the credentials when the final binary is found?

The patch I posted was.

My brain is still reeling from the realization that our security modules
have the implicit assumption that it is safe to calculate their security
information from shell scripts.

In the first half of the 90's I remember there was lots of effort to try
and make setuid shell scripts and setuid perl scripts work, and the
final conclusion was it was a lost cause.

Now I look at security_bprm_creds_for_exec and security_bprm_check which
both have the implicit assumption that it is indeed safe to compute the
credentials from a shell script.

When passing a file descriptor to execat we have
BINPRM_FLAGS_PATH_INACCESSIBLE and use /dev/fd/NNN as the filename
which reduces some of the races.

However when just plain executing a shell script we pass the filename of
the shell script as a command line argument, and expect the shell to
open the filename again.  This has been a time of check to time of use
race for decades, and one of the reasons we don't have setuid shell
scripts.

Yet the IMA implementation (without the above mentioned patch) assumes
the final creds will be calculated before security_bprm_check is called,
and security_bprm_creds_for_exec busily calculate the final creds.

For some of the security modules I believe anyone can set any label they
want on a file and they remain secure (At which point I don't understand
the point of having labels on files).  I don't believe that is the case
for selinux, or in general.

So just to remove the TOCTOU race the security_bprm_creds_for_exec
and security_bprm_check hooks need to be removed, after moving their
code into something like security_bprm_creds_from_file.

Or am I missing something and even with the TOCTOU race are setuid shell
scripts somehow safe now?

Eric
Re: Are setuid shell scripts safe? (Implied by security_bprm_creds_for_exec)
Posted by Roberto Sassu 8 hours ago
On Mon, 2025-12-01 at 10:06 -0600, Eric W. Biederman wrote:
> Roberto Sassu <roberto.sassu@huaweicloud.com> writes:
> 
> > + Mimi, linux-integrity (would be nice if we are in CC when linux-
> > security-module is in CC).
> > 
> > Apologies for not answering earlier, it seems I don't receive the
> > emails from the linux-security-module mailing list (thanks Serge for
> > letting me know!).
> > 
> > I see two main effects of this patch. First, the bprm_check_security
> > hook implementations will not see bprm->cred populated. That was a
> > problem before we made this patch:
> > 
> > https://patchew.org/linux/20251008113503.2433343-1-roberto.sassu@huaweicloud.com/
> 
> Thanks, that is definitely needed.
> 
> Does calling process_measurement(CREDS_CHECK) on only the final file
> pass review?  Do you know of any cases where that will break things?

We intentionally changed the behavior of CREDS_CHECK to be invoked only
for the final file. We are monitoring for bug reports, if we receive
complains from people that the patch breaks their expectation we will
revisit the issue.

Any LSM implementing bprm_check_security looking for brpm->cred would
be affected by recalculating the DAC credentials for the final binary.

> As it stands I don't think it should be assumed that any LSM has
> computed it's final creds until bprm_creds_from_file.  Not just the
> uid and gid.

Uhm, I can be wrong, but most LSMs calculate their state change in
bprm_creds_for_exec (git grep bprm_creds_for_exec|grep LSM_HOOK_INIT).

> If the patch you posted for review works that helps sort that mess out.

Well, it works because we changed the expectation :)

> > to work around the problem of not calculating the final DAC credentials
> > early enough (well, we actually had to change our CREDS_CHECK hook
> > behavior).
> > 
> > The second, I could not check. If I remember well, unlike the
> > capability LSM, SELinux/Apparmor/SMACK calculate the final credentials
> > based on the first file being executed (thus the script, not the
> > interpreter). Is this patch keeping the same behavior despite preparing
> > the credentials when the final binary is found?
> 
> The patch I posted was.
> 
> My brain is still reeling from the realization that our security modules
> have the implicit assumption that it is safe to calculate their security
> information from shell scripts.

If I'm interpreting this behavior correctly (please any LSM maintainer
could comment on it), the intent is just to transition to a different
security context where a different set of rules could apply (since we
are executing a script).

Imagine if for every script, the security transition is based on the
interpreter, it would be hard to differentiate between scripts and
associate to the respective processes different security labels.

> In the first half of the 90's I remember there was lots of effort to try
> and make setuid shell scripts and setuid perl scripts work, and the
> final conclusion was it was a lost cause.

Definitely I lack a lot of context...

> Now I look at security_bprm_creds_for_exec and security_bprm_check which
> both have the implicit assumption that it is indeed safe to compute the
> credentials from a shell script.
> 
> When passing a file descriptor to execat we have
> BINPRM_FLAGS_PATH_INACCESSIBLE and use /dev/fd/NNN as the filename
> which reduces some of the races.
> 
> However when just plain executing a shell script we pass the filename of
> the shell script as a command line argument, and expect the shell to
> open the filename again.  This has been a time of check to time of use
> race for decades, and one of the reasons we don't have setuid shell
> scripts.

Yes, it would be really nice to fix it!

> Yet the IMA implementation (without the above mentioned patch) assumes
> the final creds will be calculated before security_bprm_check is called,
> and security_bprm_creds_for_exec busily calculate the final creds.
> 
> For some of the security modules I believe anyone can set any label they
> want on a file and they remain secure (At which point I don't understand
> the point of having labels on files).  I don't believe that is the case
> for selinux, or in general.

A simple example for SELinux. Suppose that the parent process has type
initrc_t, then the SELinux policy configures the following transitions
based on the label of the first file executed (sesearch -T -s initrc_t
-c process):

type_transition initrc_t NetworkManager_dispatcher_exec_t:process NetworkManager_dispatcher_t;
type_transition initrc_t NetworkManager_exec_t:process NetworkManager_t;
type_transition initrc_t NetworkManager_initrc_exec_t:process initrc_t;
type_transition initrc_t NetworkManager_priv_helper_exec_t:process NetworkManager_priv_helper_t;
type_transition initrc_t abrt_dump_oops_exec_t:process abrt_dump_oops_t;
type_transition initrc_t abrt_exec_t:process abrt_t;
[...]

(there are 747 rules in my system).

If the transition would be based on the interpreter label, it would be
hard to express with rules.

If the transition does not occur for any reason the parent process
policy would still apply, but maybe it would not have the necessary
permissions for the execution of the script.

> So just to remove the TOCTOU race the security_bprm_creds_for_exec
> and security_bprm_check hooks need to be removed, after moving their
> code into something like security_bprm_creds_from_file.
> 
> Or am I missing something and even with the TOCTOU race are setuid shell
> scripts somehow safe now?

Take this with a looot of salt, if there is a TOCTOU race, the script
will be executed with a security context that does not belong to it.
But the transition already happened. Not sure if it is safe.

I also don't know how the TOCTOU race could be solved, but I also would
like it to be fixed. I'm available to comment on any proposal!

Roberto
Re: Are setuid shell scripts safe? (Implied by security_bprm_creds_for_exec)
Posted by Eric W. Biederman 6 hours ago
Roberto Sassu <roberto.sassu@huaweicloud.com> writes:

> On Mon, 2025-12-01 at 10:06 -0600, Eric W. Biederman wrote:
>> Roberto Sassu <roberto.sassu@huaweicloud.com> writes:
>> 
>> > + Mimi, linux-integrity (would be nice if we are in CC when linux-
>> > security-module is in CC).
>> > 
>> > Apologies for not answering earlier, it seems I don't receive the
>> > emails from the linux-security-module mailing list (thanks Serge for
>> > letting me know!).
>> > 
>> > I see two main effects of this patch. First, the bprm_check_security
>> > hook implementations will not see bprm->cred populated. That was a
>> > problem before we made this patch:
>> > 
>> > https://patchew.org/linux/20251008113503.2433343-1-roberto.sassu@huaweicloud.com/
>> 
>> Thanks, that is definitely needed.
>> 
>> Does calling process_measurement(CREDS_CHECK) on only the final file
>> pass review?  Do you know of any cases where that will break things?
>
> We intentionally changed the behavior of CREDS_CHECK to be invoked only
> for the final file. We are monitoring for bug reports, if we receive
> complains from people that the patch breaks their expectation we will
> revisit the issue.
>
> Any LSM implementing bprm_check_security looking for brpm->cred would
> be affected by recalculating the DAC credentials for the final binary.
>
>> As it stands I don't think it should be assumed that any LSM has
>> computed it's final creds until bprm_creds_from_file.  Not just the
>> uid and gid.
>
> Uhm, I can be wrong, but most LSMs calculate their state change in
> bprm_creds_for_exec (git grep bprm_creds_for_exec|grep LSM_HOOK_INIT).
>
>> If the patch you posted for review works that helps sort that mess out.
>
> Well, it works because we changed the expectation :)

I just haven't seen that code land in Linus's tree yet so I am a bit
cautious in adopting that.  It is definitely needed as the behavior
of IMA as v6.18 simply does not work in general.

>> > to work around the problem of not calculating the final DAC credentials
>> > early enough (well, we actually had to change our CREDS_CHECK hook
>> > behavior).
>> > 
>> > The second, I could not check. If I remember well, unlike the
>> > capability LSM, SELinux/Apparmor/SMACK calculate the final credentials
>> > based on the first file being executed (thus the script, not the
>> > interpreter). Is this patch keeping the same behavior despite preparing
>> > the credentials when the final binary is found?
>> 
>> The patch I posted was.
>> 
>> My brain is still reeling from the realization that our security modules
>> have the implicit assumption that it is safe to calculate their security
>> information from shell scripts.
>
> If I'm interpreting this behavior correctly (please any LSM maintainer
> could comment on it), the intent is just to transition to a different
> security context where a different set of rules could apply (since we
> are executing a script).
>
> Imagine if for every script, the security transition is based on the
> interpreter, it would be hard to differentiate between scripts and
> associate to the respective processes different security labels.
>
>> In the first half of the 90's I remember there was lots of effort to try
>> and make setuid shell scripts and setuid perl scripts work, and the
>> final conclusion was it was a lost cause.
>
> Definitely I lack a lot of context...

From the usenet comp.unix.faq that was probably updated in 1994:
    http://www.faqs.org/faqs/unix-faq/faq/part4/section-7.html

I have been trying to remember enough details by looking it up, but the
short version is that one of the big problems is there is a race between
the kernel doing it's thing and the shell opening the shell script.

Clever people have been able to take advantage of that race and insert
arbitrary code in that window for the shell to execute.  All you have to
do is google for how to find a reproducer if the one in the link above
is not enough.

>> Now I look at security_bprm_creds_for_exec and security_bprm_check which
>> both have the implicit assumption that it is indeed safe to compute the
>> credentials from a shell script.
>> 
>> When passing a file descriptor to execat we have
>> BINPRM_FLAGS_PATH_INACCESSIBLE and use /dev/fd/NNN as the filename
>> which reduces some of the races.
>> 
>> However when just plain executing a shell script we pass the filename of
>> the shell script as a command line argument, and expect the shell to
>> open the filename again.  This has been a time of check to time of use
>> race for decades, and one of the reasons we don't have setuid shell
>> scripts.
>
> Yes, it would be really nice to fix it!

After 30 years I really don't expect that is even a reasonable request.

I think we are solidly into "Don't do that then", and the LSM security
hooks are definitely doing that.

There is the partial solution of passing /dev/fd instead of passing the
name of the script.  I suspect that would break things.  I don't
remember why that was never adopted.

I think even with the TOCTOU race fixed there were other serious issues.

I really think it behooves any security module people who want to use
the shell script as the basis of their security decisions to research
all of the old well known issues and describe how they don't apply.

All I have energy for is to point out it is broken as is and to start
moving code down into bprm_creds_from_file to avoid the race.

Right now as far as I can tell anything based upon the script itself
is worthless junk so changing that would not be breaking anything that
wasn't already broken.

>> Yet the IMA implementation (without the above mentioned patch) assumes
>> the final creds will be calculated before security_bprm_check is called,
>> and security_bprm_creds_for_exec busily calculate the final creds.
>> 
>> For some of the security modules I believe anyone can set any label they
>> want on a file and they remain secure (At which point I don't understand
>> the point of having labels on files).  I don't believe that is the case
>> for selinux, or in general.
>
> A simple example for SELinux. Suppose that the parent process has type
> initrc_t, then the SELinux policy configures the following transitions
> based on the label of the first file executed (sesearch -T -s initrc_t
> -c process):
>
> type_transition initrc_t NetworkManager_dispatcher_exec_t:process NetworkManager_dispatcher_t;
> type_transition initrc_t NetworkManager_exec_t:process NetworkManager_t;
> type_transition initrc_t NetworkManager_initrc_exec_t:process initrc_t;
> type_transition initrc_t NetworkManager_priv_helper_exec_t:process NetworkManager_priv_helper_t;
> type_transition initrc_t abrt_dump_oops_exec_t:process abrt_dump_oops_t;
> type_transition initrc_t abrt_exec_t:process abrt_t;
> [...]
>
> (there are 747 rules in my system).
>
> If the transition would be based on the interpreter label, it would be
> hard to express with rules.

Which is a problem for the people making the rules engine.  Because
30 years of experience with this problem says basing anything on the
script is already broken.

I understand the frustration, but it requires a new way of launching
shell scripts to even begin to make it secure.

> If the transition does not occur for any reason the parent process
> policy would still apply, but maybe it would not have the necessary
> permissions for the execution of the script.

Yep.

>> So just to remove the TOCTOU race the security_bprm_creds_for_exec
>> and security_bprm_check hooks need to be removed, after moving their
>> code into something like security_bprm_creds_from_file.
>> 
>> Or am I missing something and even with the TOCTOU race are setuid shell
>> scripts somehow safe now?
>
> Take this with a looot of salt, if there is a TOCTOU race, the script
> will be executed with a security context that does not belong to it.
> But the transition already happened. Not sure if it is safe.

Historically it hasn't been safe.

> I also don't know how the TOCTOU race could be solved, but I also would
> like it to be fixed. I'm available to comment on any proposal!

I am hoping someone who helped put these security hooks where they are
will speak up, and tell me what I am missing.

All I have the energy for right now is to point out security policies
based upon shell scripts appear to be security policies that only
protect you from well behaved programs.

Eric
Re: Are setuid shell scripts safe? (Implied by security_bprm_creds_for_exec)
Posted by David Laight 3 hours ago
On Mon, 01 Dec 2025 12:53:10 -0600
"Eric W. Biederman" <ebiederm@xmission.com> wrote:

> Roberto Sassu <roberto.sassu@huaweicloud.com> writes:
...
> There is the partial solution of passing /dev/fd instead of passing the
> name of the script.  I suspect that would break things.  I don't
> remember why that was never adopted.

I thought that was what was done - and stopped the problem of a user
flipping a symlink between a suid script and one the user had written.

It has only ever been done for suid scripts when the uid actually changes.
Which makes it possible to set the permissions so that owner can't
run the script!
(The kernel only needs 'x' access, the shell needs 'r' access, so with 'x+s'
the owner can't execute the script but everyone else can.)

There is a much older problem that probably only affected the original 1970s
'sh' (not even the SVSV/Sunos version) that quoted redirects on the command
line would get actioned when the parameter was substituted - which I think
means the original 'sh' did post-substitution syntax analysis (the same
as cmd.exe still does).
That doesn't affect any shells used since the early 1980s.

	David
Re: [RFC][PATCH] exec: Move cred computation under exec_update_lock
Posted by Oleg Nesterov 1 week, 1 day ago
Eric,

sorry for delay, I am on PTO, didn't read emails this week...

On 11/20, Eric W. Biederman wrote:
>
> Instead of computing the new cred before we pass the point of no
> return compute the new cred just before we use it.
>
> This allows the removal of fs_struct->in_exec and cred_guard_mutex.
>
> I am not certain why we wanted to compute the cred for the new
> executable so early.  Perhaps I missed something but I did not see any
> common errors being signaled.   So I don't think we loose anything by
> computing the new cred later.
>
> We gain a lot.

Yes. I LIKE your approach after a quick glance. And I swear, I thought about
it too ;)

But is it correct? I don't know. I'll try to actually read your patch next
week (I am on PTO untill the end of November), but I am not sure I can
provide a valuable feedback.

One "obvious" problem is that, after this patch, the execing process can crash
in a case when currently exec() returns an error...

Oleg.
Re: [RFC][PATCH] exec: Move cred computation under exec_update_lock
Posted by Eric W. Biederman 1 week, 1 day ago
Oleg Nesterov <oleg@redhat.com> writes:

> Eric,
>
> sorry for delay, I am on PTO, didn't read emails this week...
>
> On 11/20, Eric W. Biederman wrote:
>>
>> Instead of computing the new cred before we pass the point of no
>> return compute the new cred just before we use it.
>>
>> This allows the removal of fs_struct->in_exec and cred_guard_mutex.
>>
>> I am not certain why we wanted to compute the cred for the new
>> executable so early.  Perhaps I missed something but I did not see any
>> common errors being signaled.   So I don't think we loose anything by
>> computing the new cred later.
>>
>> We gain a lot.
>
> Yes. I LIKE your approach after a quick glance. And I swear, I thought about
> it too ;)
>
> But is it correct? I don't know. I'll try to actually read your patch next
> week (I am on PTO untill the end of November), but I am not sure I can
> provide a valuable feedback.
>
> One "obvious" problem is that, after this patch, the execing process can crash
> in a case when currently exec() returns an error...

Yes.

I have been testing and looking at it, and I have found a few issues,
and I am trying to see if I can resolve them.

The good news is that with the advent of AT_EXECVE_CHECK we have a
really clear API boundary between errors that must be diagnosed
and errors of happenstance like running out of memory.

The bad news is that the implementation of AT_EXECVE_CHECK seems to been
rather hackish especially with respect to security_bprm_creds_for_exec.

What I am hoping for is to get the 3 causes of errors of brpm->unsafe
( LSM_UNSAFE_SHARE, LSM_UNSAFE_PTRACE, and LSM_UNSAFE_NO_NEW_PRIVS )
handled cleanly outside of the cred_guard_mutex, and simply
retested when it is time to build the credentials of the new process.

In practice that should get the same failures modes as we have now
but it would get SIGSEGV in rare instances where things changed
during exec.  That feels acceptable.



I thought of one other approach that might be enough to put the issue to
bed if cleaning up exec is too much work.  We could have ptrace_attach
use a trylock and fail when it doesn't succeed.  That would solve the
worst of the symptoms.

I think this would be a complete patch:

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 75a84efad40f..5dd2144e5789 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -444,7 +444,7 @@ static int ptrace_attach(struct task_struct *task, long request,
 	 * SUID, SGID and LSM creds get determined differently
 	 * under ptrace.
 	 */
-	scoped_cond_guard (mutex_intr, return -ERESTARTNOINTR,
+	scoped_cond_guard (mutex_try, return -EAGAIN,
 			   &task->signal->cred_guard_mutex) {
 
 		scoped_guard (task_lock, task) {
-- 
2.41.0


Eric
Re: [RFC][PATCH] exec: Move cred computation under exec_update_lock
Posted by Eric W. Biederman 1 week, 4 days ago
"Eric W. Biederman" <ebiederm@xmission.com> writes:

> Instead of computing the new cred before we pass the point of no
> return compute the new cred just before we use it.
>
> This allows the removal of fs_struct->in_exec and cred_guard_mutex.
>
> I am not certain why we wanted to compute the cred for the new
> executable so early.  Perhaps I missed something but I did not see any
> common errors being signaled.   So I don't think we loose anything by
> computing the new cred later.

I should add that the permission checks happen in open_exec,
everything that follows credential wise is just about representing in
struct cred the credentials the new executable will have.

So I am really at a loss why we have had this complicated way of
computing of computed the credentials all of these years full of
time of check to time of use problems.

Eric