kernel/sys: Optimize do_prlimit lock scope to reduce contention

[PATCH] kernel/sys: Optimize do_prlimit lock scope to reduce contention

Posted by Zhen Ni 1 year, 5 months ago

Refine the lock scope in do_prlimit() to reduce contention on
task_lock(tsk->group_leader). The lock now protects only the sections
that read or modify the shared rlim data. The permission check
(capable()) and the security validation (security_task_setrlimit())
are moved outside the lock, as they do not modify rlim and are
independent of shared data protection.

The security_task_setrlimit function is a Linux Security Module (LSM)
hook that evaluates resource limit changes based on security policies.
It does not alter the rlim data structure, as confirmed by existing
LSM implementations (e.g., SELinux and AppArmor). Thus, this function
does not require locking, ensuring correctness while improving
concurrency.

Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
---
 kernel/sys.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index c4c701c6f0b4..ef99b654e8d8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1481,18 +1481,20 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
 
 	/* Holding a refcount on tsk protects tsk->signal from disappearing. */
 	rlim = tsk->signal->rlim + resource;
-	task_lock(tsk->group_leader);
 	if (new_rlim) {
 		/*
 		 * Keep the capable check against init_user_ns until cgroups can
 		 * contain all limits.
 		 */
 		if (new_rlim->rlim_max > rlim->rlim_max &&
-				!capable(CAP_SYS_RESOURCE))
-			retval = -EPERM;
-		if (!retval)
-			retval = security_task_setrlimit(tsk, resource, new_rlim);
+		    !capable(CAP_SYS_RESOURCE))
+			return -EPERM;
+		retval = security_task_setrlimit(tsk, resource, new_rlim);
+		if (retval)
+			return retval;
 	}
+
+	task_lock(tsk->group_leader);
 	if (!retval) {
 		if (old_rlim)
 			*old_rlim = *rlim;
-- 
2.20.1
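
For readers who want to see the shape of the change outside the kernel, here is a minimal user-space sketch of the same pattern: run the checks first, then take the lock only around the read and write of the shared limit. Every name in it (struct limit, limit_lock, privileged, policy_denies, set_limit) is a hypothetical stand-in, with a pthread mutex in place of task_lock(); it is an illustration under those assumptions, not the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names exist in the kernel. */
struct limit {
	unsigned long cur;
	unsigned long max;
};

static pthread_mutex_t limit_lock = PTHREAD_MUTEX_INITIALIZER;
static struct limit shared_limit = { .cur = 1024, .max = 4096 };

static bool privileged;     /* stands in for capable(CAP_SYS_RESOURCE) */
static bool policy_denies;  /* stands in for an LSM policy decision */

/* Stand-in for security_task_setrlimit(): decides, never writes the limit. */
static int policy_check(const struct limit *new_limit)
{
	(void)new_limit;
	return policy_denies ? -EACCES : 0;
}

static int set_limit(const struct limit *new_limit, struct limit *old_limit)
{
	int ret;

	if (new_limit) {
		/*
		 * As in the patch, this comparison reads the current maximum
		 * before the lock is taken, so the value it sees may be stale
		 * by the time the update below runs.
		 */
		if (new_limit->max > shared_limit.max && !privileged)
			return -EPERM;
		ret = policy_check(new_limit);
		if (ret)
			return ret;
	}

	/* Only the copy-out and the update happen under the lock. */
	pthread_mutex_lock(&limit_lock);
	if (old_limit)
		*old_limit = shared_limit;
	if (new_limit)
		shared_limit = *new_limit;
	pthread_mutex_unlock(&limit_lock);
	return 0;
}
```

With the early returns in place, the error paths never touch the lock. The cost is that the comparison against the current maximum is no longer serialized with the update that follows it.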

Re: [PATCH] kernel/sys: Optimize do_prlimit lock scope to reduce contention

Posted by Andrew Morton 1 year, 5 months ago

On Wed, 20 Nov 2024 21:21:56 +0800 Zhen Ni <zhen.ni@easystack.cn> wrote:

> Refine the lock scope in do_prlimit() to reduce contention on
> task_lock(tsk->group_leader). The lock now protects only the sections
> that read or modify the shared rlim data. The permission check
> (capable()) and the security validation (security_task_setrlimit())
> are moved outside the lock, as they do not modify rlim and are
> independent of shared data protection.

Let's cc linux-security-module@vger.kernel.org, as we're proposing
altering their locking environment!

> The security_task_setrlimit function is a Linux Security Module (LSM)
> hook that evaluates resource limit changes based on security policies.
> It does not alter the rlim data structure, as confirmed by existing
> LSM implementations (e.g., SELinux and AppArmor). Thus, this function
> does not require locking, ensuring correctness while improving
> concurrency.

Seems sane.

Does any code call do_prlimit() frequently enough for this to matter?
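
For context on that question, do_prlimit() is reached from user space through getrlimit(2), setrlimit(2) and prlimit(2). The sketch below wraps the glibc prlimit() call to read a limit and to rewrite the soft limit; the helper names get_limit and set_soft_limit are hypothetical. How often real workloads issue these calls is not something the sketch can answer.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* prlimit() is a GNU extension in glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/types.h>

/* Read the limit for `resource` in process `pid` (0 means the caller). */
static int get_limit(pid_t pid, int resource, struct rlimit *out)
{
	return prlimit(pid, resource, NULL, out);
}

/*
 * Set only the soft limit, keeping the current hard limit, and return the
 * previous values through `old`. Both calls end up in do_prlimit().
 */
static int set_soft_limit(pid_t pid, int resource, rlim_t soft,
			  struct rlimit *old)
{
	struct rlimit cur, new_limit;

	if (prlimit(pid, resource, NULL, &cur) != 0)
		return -1;
	new_limit.rlim_cur = soft;
	new_limit.rlim_max = cur.rlim_max;
	return prlimit(pid, resource, &new_limit, old);
}
```

Passing a non-zero pid is the tsk != current case on the kernel side; passing 0 operates on the calling process.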

> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1481,18 +1481,20 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
>  
>  	/* Holding a refcount on tsk protects tsk->signal from disappearing. */
>  	rlim = tsk->signal->rlim + resource;
> -	task_lock(tsk->group_leader);
>  	if (new_rlim) {
>  		/*
>  		 * Keep the capable check against init_user_ns until cgroups can
>  		 * contain all limits.
>  		 */
>  		if (new_rlim->rlim_max > rlim->rlim_max &&
> -				!capable(CAP_SYS_RESOURCE))
> -			retval = -EPERM;
> -		if (!retval)
> -			retval = security_task_setrlimit(tsk, resource, new_rlim);
> +		    !capable(CAP_SYS_RESOURCE))
> +			return -EPERM;
> +		retval = security_task_setrlimit(tsk, resource, new_rlim);
> +		if (retval)
> +			return retval;
>  	}
> +
> +	task_lock(tsk->group_leader);
>  	if (!retval) {
>  		if (old_rlim)
>  			*old_rlim = *rlim;

Re: [PATCH] kernel/sys: Optimize do_prlimit lock scope to reduce contention

Posted by Oleg Nesterov 1 year, 5 months ago

On 11/27, Andrew Morton wrote:
>
> On Wed, 20 Nov 2024 21:21:56 +0800 Zhen Ni <zhen.ni@easystack.cn> wrote:
>
> > The security_task_setrlimit function is a Linux Security Module (LSM)
> > hook that evaluates resource limit changes based on security policies.
> > It does not alter the rlim data structure, as confirmed by existing
> > LSM implementations (e.g., SELinux and AppArmor). Thus, this function
> > does not require locking, ensuring correctness while improving
> > concurrency.
>
> Seems sane.
>
> Does any code call do_prlimit() frequently enough for this to matter?

I have the same question...

> > -	task_lock(tsk->group_leader);
> >  	if (new_rlim) {
> >  		/*
> >  		 * Keep the capable check against init_user_ns until cgroups can
> >  		 * contain all limits.
> >  		 */
> >  		if (new_rlim->rlim_max > rlim->rlim_max &&
> > -				!capable(CAP_SYS_RESOURCE))
> > -			retval = -EPERM;
> > -		if (!retval)
> > -			retval = security_task_setrlimit(tsk, resource, new_rlim);
> > +		    !capable(CAP_SYS_RESOURCE))
> > +			return -EPERM;
> > +		retval = security_task_setrlimit(tsk, resource, new_rlim);
> > +		if (retval)
> > +			return retval;
> >  	}
> > +
> > +	task_lock(tsk->group_leader);

The problem is that task_lock(tsk->group_leader) doesn't look right with or
without this patch. I'll try to make a fix over the weekend.

If the caller is sys_prlimit64() and tsk != current, then ->group_leader is
not stable: do_prlimit() can race with a multi-threaded exec and take the
wrong lock.

Oleg.
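
The hazard can be made concrete with a toy, single-threaded model in which a "lock" is just a flag and the exec-time leader switch is an explicit call. All names here (struct task, model_lock, model_unlock, model_exec_switch_leader, lock_unlock_via_pointer) are hypothetical and only mirror the pattern of evaluating tsk->group_leader once for the lock and again for the unlock; this is not how the kernel implements task_lock().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: the flag stands in for the per-task lock taken by task_lock(). */
struct task {
	bool locked;
	struct task *group_leader;
};

static void model_lock(struct task *t)
{
	assert(!t->locked);
	t->locked = true;
}

/* Returns false when asked to release a lock that is not held. */
static bool model_unlock(struct task *t)
{
	if (!t->locked)
		return false;
	t->locked = false;
	return true;
}

/* Stand-in for a multi-threaded exec rewriting ->group_leader. */
static void model_exec_switch_leader(struct task *tsk, struct task *new_leader)
{
	tsk->group_leader = new_leader;
}

/*
 * Lock and unlock through tsk->group_leader, reading the pointer twice.
 * If `exec_to` is non-NULL, the leader changes in between, which models
 * the race window. Returns true only if the unlock released a held lock.
 */
static bool lock_unlock_via_pointer(struct task *tsk, struct task *exec_to)
{
	model_lock(tsk->group_leader);
	if (exec_to)
		model_exec_switch_leader(tsk, exec_to);
	return model_unlock(tsk->group_leader);
}
```

In the racy case the old leader's lock stays held while the unlock is applied to a task whose lock was never taken, which is the mismatch the model is meant to expose.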

Re: [PATCH] kernel/sys: Optimize do_prlimit lock scope to reduce contention

Posted by Oleg Nesterov 1 year, 5 months ago

On 11/28, Oleg Nesterov wrote:
>
> The problem is that task_lock(tsk->group_leader) doesn't look right with or
> without this patch. I'll try to make a fix over the weekend.
>
> If the caller is sys_prlimit64() and tsk != current, then ->group_leader is
> not stable: do_prlimit() can race with a multi-threaded exec and take the
> wrong lock.

... and task_unlock(tsk->group_leader) is simply unsafe.

Perhaps something like the patch below, but it doesn't look nice; I'll try
to think more. And grep: maybe there are more lockless users of
tsk->group_leader when tsk != current.

Oleg.

--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1464,6 +1464,7 @@ SYSCALL_DEFINE2(setdomainname, char __user *, name, int, len)
 static int do_prlimit(struct task_struct *tsk, unsigned int resource,
 		      struct rlimit *new_rlim, struct rlimit *old_rlim)
 {
+	struct task_struct *leader;
 	struct rlimit *rlim;
 	int retval = 0;
 
@@ -1481,7 +1482,14 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
 
 	/* Holding a refcount on tsk protects tsk->signal from disappearing. */
 	rlim = tsk->signal->rlim + resource;
-	task_lock(tsk->group_leader);
+
+	if (tsk != current)
+		read_lock(&tasklist_lock);
+	leader = READ_ONCE(tsk->group_leader);
+	task_lock(leader);
+	if (tsk != current)
+		read_unlock(&tasklist_lock);
+
 	if (new_rlim) {
 		/*
 		 * Keep the capable check against init_user_ns until cgroups can
@@ -1499,7 +1507,7 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
 		if (new_rlim)
 			*rlim = *new_rlim;
 	}
-	task_unlock(tsk->group_leader);
+	task_unlock(leader);
 
 	/*
 	 * RLIMIT_CPU handling. Arm the posix CPU timer if the limit is not

Re: [PATCH] kernel/sys: Optimize do_prlimit lock scope to reduce contention

Posted by Oleg Nesterov 1 year, 5 months ago

On 11/28, Oleg Nesterov wrote:
>
> On 11/28, Oleg Nesterov wrote:
> >
> > The problem is that task_lock(tsk->group_leader) doesn't look right with or
> > without this patch. I'll try to make a fix over the weekend.
> >
> > If the caller is sys_prlimit64() and tsk != current, then ->group_leader is
> > not stable: do_prlimit() can race with a multi-threaded exec and take the
> > wrong lock.
>
> ... and task_unlock(tsk->group_leader) is simply unsafe.
>
> perhaps something like below,

No, this is wrong too,

> I'll try to think more.

Yes...

Oleg.