RLIMIT_NPROC in ucounts fixups

[RFC PATCH 0/6] RLIMIT_NPROC in ucounts fixups

Posted by Michal Koutný 4 years, 4 months ago

This series is a result of looking deeper into breakage of
tools/testing/selftests/rlimits/rlimits-per-userns.c after 
https://lore.kernel.org/r/20220204181144.24462-1-mkoutny@suse.com/
is applied.

The description of the original problem that lead to RLIMIT_NPROC et al.
ucounts rewrite could be ambiguously interpretted as supporting either
the case of:
- never-fork service or
- fork (RLIMIT_NPROC-1) times service.

The scenario is weird anyway given existence of pids controller.

The realization of that scenario relies not only on tracking number of
processes per user_ns but also newly allows the root to override limit through
set*uid. The commit message didn't mention that, so it's unclear if it
was the intention too.

I also noticed that the RLIMIT_NPROC enforcing in fork seems subject to TOCTOU
race (check(nr_tasks),...,nr_tasks++) so the limit is rather advisory (but
that's not a new thing related to ucounts rewrite).

This series is RFC to discuss relevance of the subtle changes RLIMIT_NPROC to
ucounts rewrite introduced.

Michal Koutný (6):
  set_user: Perform RLIMIT_NPROC capability check against new user
    credentials
  set*uid: Check RLIMIT_PROC against new credentials
  cred: Count tasks by their real uid into RLIMIT_NPROC
  ucounts: Allow root to override RLIMIT_NPROC
  selftests: Challenge RLIMIT_NPROC in user namespaces
  selftests: Test RLIMIT_NPROC in clone-created user namespaces

 fs/exec.c                                     |   2 +-
 include/linux/cred.h                          |   2 +-
 kernel/cred.c                                 |  29 ++-
 kernel/fork.c                                 |   2 +-
 kernel/sys.c                                  |  20 +-
 kernel/ucount.c                               |   3 +
 kernel/user_namespace.c                       |   2 +-
 .../selftests/rlimits/rlimits-per-userns.c    | 233 +++++++++++++++---
 8 files changed, 229 insertions(+), 64 deletions(-)

-- 
2.34.1

Re: [RFC PATCH 0/6] RLIMIT_NPROC in ucounts fixups

Posted by Eric W. Biederman 4 years, 4 months ago

Michal Koutný <mkoutny@suse.com> writes:

> This series is a result of looking deeper into breakage of
> tools/testing/selftests/rlimits/rlimits-per-userns.c after 
> https://lore.kernel.org/r/20220204181144.24462-1-mkoutny@suse.com/
> is applied.
>
> The description of the original problem that lead to RLIMIT_NPROC et al.
> ucounts rewrite could be ambiguously interpretted as supporting either
> the case of:
> - never-fork service or
> - fork (RLIMIT_NPROC-1) times service.
>
> The scenario is weird anyway given existence of pids controller.
>
> The realization of that scenario relies not only on tracking number of
> processes per user_ns but also newly allows the root to override limit through
> set*uid. The commit message didn't mention that, so it's unclear if it
> was the intention too.
>
> I also noticed that the RLIMIT_NPROC enforcing in fork seems subject to TOCTOU
> race (check(nr_tasks),...,nr_tasks++) so the limit is rather advisory (but
> that's not a new thing related to ucounts rewrite).
>
> This series is RFC to discuss relevance of the subtle changes RLIMIT_NPROC to
> ucounts rewrite introduced.

A quick reply (because I don't have a lot of time at the moment).

I agree with the issues your first patch before this series addresses
and the issues the first 3 patches address.

I have not looked at the tests.

I actually disagree with most of your fixes.  Both because of
intrusiveness and because of awkwardness.  My basic problem with
your fixes is I don't think they leave the code in a more maintainable
state.

Hopefully later today I can propose some alternative fixes and we can
continue the discussion.

One thing I think you misunderstood is the capability checks in set_user
have always been there.  There is a very good argument they are badly
placed so are not exactly checking the correct credentials.  Especially
now.

Your patch 4/6 I don't think makes sense.  It has always been the
case that root without capabilities is subject to the rlimit.  If  you
are in a user namespace you are root without capabilities.

Eric

[PATCH 0/8] ucounts: RLIMIT_NPROC fixes

Posted by Eric W. Biederman 4 years, 4 months ago

Michal Koutný recently found some bugs in the enforcement of
RLIMIT_NPROC in the recent ucount rlimit implementation.

I saw some additional bugs and some cleaner ways to fix the problem so
instead of starting with his fixes these are my own.

I am aiming to send the first 5 of these to Linus once they have been
reviewed.  Two more are fixes in principle but I don't think do anything
in practice.  The last one is just a cleanup to prevent future
divergence of RLIMIT_NPROC logic.

Eric W. Biederman (8):
      ucounts: Fix RLIMIT_NPROC regression
      ucounts: Fix set_cred_ucounts
      ucounts: Fix and simplify RLIMIT_NPROC handling during setuid()+execve
      ucounts: Only except the root user in init_user_ns from RLIMIT_NPROC
      ucounts: Handle wrapping in is_ucounts_overlimit
      ucounts: Handle inc_rlimit_ucounts wrapping in fork
      rlimit: For RLIMIT_NPROC test the child not the parent for capabilites
      ucounts: Use the same code to enforce RLIMIT_NPROC in fork and exec

 fs/exec.c                    | 12 +++++-------
 include/linux/sched.h        |  2 +-
 include/linux/sched/signal.h |  2 ++
 kernel/cred.c                | 24 +++++++++++-------------
 kernel/fork.c                | 32 ++++++++++++++++++++++++--------
 kernel/sys.c                 | 14 --------------
 kernel/ucount.c              |  3 ++-
 kernel/user_namespace.c      |  2 ++
 8 files changed, 47 insertions(+), 44 deletions(-)

Eric

Re: [PATCH 0/8] ucounts: RLIMIT_NPROC fixes

Posted by Shuah Khan 4 years, 4 months ago

On 2/10/22 7:01 PM, Eric W. Biederman wrote:
> 
> Michal Koutný recently found some bugs in the enforcement of
> RLIMIT_NPROC in the recent ucount rlimit implementation.
> 
> I saw some additional bugs and some cleaner ways to fix the problem so
> instead of starting with his fixes these are my own.
> 
> I am aiming to send the first 5 of these to Linus once they have been
> reviewed.  Two more are fixes in principle but I don't think do anything
> in practice.  The last one is just a cleanup to prevent future
> divergence of RLIMIT_NPROC logic.
> 
> Eric W. Biederman (8):
>        ucounts: Fix RLIMIT_NPROC regression
>        ucounts: Fix set_cred_ucounts
>        ucounts: Fix and simplify RLIMIT_NPROC handling during setuid()+execve
>        ucounts: Only except the root user in init_user_ns from RLIMIT_NPROC

Should this be "Only exempt"?
  
>        ucounts: Handle wrapping in is_ucounts_overlimit
>        ucounts: Handle inc_rlimit_ucounts wrapping in fork
>        rlimit: For RLIMIT_NPROC test the child not the parent for capabilites
>        ucounts: Use the same code to enforce RLIMIT_NPROC in fork and exec
> 
>   fs/exec.c                    | 12 +++++-------
>   include/linux/sched.h        |  2 +-
>   include/linux/sched/signal.h |  2 ++
>   kernel/cred.c                | 24 +++++++++++-------------
>   kernel/fork.c                | 32 ++++++++++++++++++++++++--------
>   kernel/sys.c                 | 14 --------------
>   kernel/ucount.c              |  3 ++-
>   kernel/user_namespace.c      |  2 ++
>   8 files changed, 47 insertions(+), 44 deletions(-)
> 
> Eric
> 

Do we need updates to selftests - Michal's patch series included changes to
selftests/exec

thanks,
-- Shuah

Re: [PATCH 0/8] ucounts: RLIMIT_NPROC fixes

Posted by Eric W. Biederman 4 years, 4 months ago

Shuah Khan <skhan@linuxfoundation.org> writes:

> On 2/10/22 7:01 PM, Eric W. Biederman wrote:
>> Michal Koutný recently found some bugs in the enforcement of
>> RLIMIT_NPROC in the recent ucount rlimit implementation.
>> I saw some additional bugs and some cleaner ways to fix the problem so
>> instead of starting with his fixes these are my own.
>> I am aiming to send the first 5 of these to Linus once they have been
>> reviewed.  Two more are fixes in principle but I don't think do anything
>> in practice.  The last one is just a cleanup to prevent future
>> divergence of RLIMIT_NPROC logic.
>> Eric W. Biederman (8):
>>        ucounts: Fix RLIMIT_NPROC regression
>>        ucounts: Fix set_cred_ucounts
>>        ucounts: Fix and simplify RLIMIT_NPROC handling during setuid()+execve
>>        ucounts: Only except the root user in init_user_ns from RLIMIT_NPROC
>
> Should this be "Only exempt"?

Yes.
>  >        ucounts: Handle wrapping in is_ucounts_overlimit
>>        ucounts: Handle inc_rlimit_ucounts wrapping in fork
>>        rlimit: For RLIMIT_NPROC test the child not the parent for capabilites
>>        ucounts: Use the same code to enforce RLIMIT_NPROC in fork and exec
>>   fs/exec.c                    | 12 +++++-------
>>   include/linux/sched.h        |  2 +-
>>   include/linux/sched/signal.h |  2 ++
>>   kernel/cred.c                | 24 +++++++++++-------------
>>   kernel/fork.c                | 32 ++++++++++++++++++++++++--------
>>   kernel/sys.c                 | 14 --------------
>>   kernel/ucount.c              |  3 ++-
>>   kernel/user_namespace.c      |  2 ++
>>   8 files changed, 47 insertions(+), 44 deletions(-)
>> Eric
>> 
>
> Do we need updates to selftests - Michal's patch series included changes to
> selftests/exec

selftests would be good.

Eric

Re: [PATCH 0/8] ucounts: RLIMIT_NPROC fixes

Posted by Michal Koutný 4 years, 4 months ago

On Fri, Feb 11, 2022 at 11:22:13AM -0700, Shuah Khan <skhan@linuxfoundation.org> wrote:
> Do we need updates to selftests - Michal's patch series included changes to
> selftests/exec

In my understanding the original rlimits-per-userns.c covers an invalid
use case -- clone(0);setuid();unshare(CLONE_NEWUSER) -- where the
created user_ns is owned by unprivileged user and the global
RLIMIT_NPROC cannot be breached.

My patched variant retains this use-case (should fail) and adds
clone(CLONE_NEWUSER);setuid() [1] variant which should be the valid
use-case for per-user per-user-ns RLIMIT_NPROC.

Michal

[1] In this situation theoretically equivalent to clone(0);unshare(CLONE_NEWUSER);setuid().

[PATCH v2 0/5] ucounts: RLIMIT_NPROC fixes

Posted by Eric W. Biederman 4 years, 4 months ago

Michal Koutný recently found some bugs in the enforcement of
RLIMIT_NPROC in the recent ucount rlimit implementation.

I saw some additional bugs and some cleaner ways to fix the problem so
instead of starting with his fixes these are my own.

I have rewritten about half my fixes since the last time this was
posted.  There is this notion (not entirely wrong) that the code should
be consistent and make sense.  When I dug in I discovered that has not
been the case for the last 20 years.  Fixing the long standing
inconsistencies is something that seems to warrent wider vetting on
linux-api.

So with this set of patches I have developed a very conservative
approach changing only what is necessary to fix the bugs that I can
see clearly.  Cleanups and anything that is making the code more
consistent can follow after we have the code working as it has
historically.

Anyone who can please take a look and tell me if I am doing something silly.

Eric W. Biederman (5):
      rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in set_user
      ucounts: Enforce RLIMIT_NPROC not RLIMIT_NPROC+1
      ucounts: Base set_cred_ucounts changes on the real user
      ucounts: Move RLIMIT_NPROC handling after set_user
      ucounts: Handle wrapping in is_ucounts_overlimit

 kernel/cred.c   |  9 ++-------
 kernel/fork.c   | 10 +++++-----
 kernel/sys.c    | 20 ++++++++++++++------
 kernel/ucount.c |  3 ++-
 4 files changed, 23 insertions(+), 19 deletions(-)

Eric

[GIT PULL] ucounts: RLIMIT_NPROC fixes for v5.17

Posted by Eric W. Biederman 4 years, 4 months ago

Linus,

Please pull the ucount-rlimit-fixes-for-v5.17 branch from the git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git ucount-rlimit-fixes-for-v5.17

HEAD: 0cbae9e24fa7d6c6e9f828562f084da82217a0c5 ucounts: Handle wrapping in is_ucounts_overlimit

Michal Koutný recently found some bugs in the enforcement of
RLIMIT_NPROC in the recent ucount rlimit implementation.

I saw some additional bugs and some cleaner ways to fix the problem so
instead of starting with his fixes these are my own.

I have rewritten about half my fixes since the last time this was
posted. There is this notion (not entirely wrong) that the code should
be consistent and make sense. When I dug in I discovered that has not
been the case for the last 20 years. Fixing the long standing
inconsistencies is something that seems to warrent wider vetting on
linux-api.

So with this set of patches I have developed a very conservative
approach changing only what is necessary to fix the bugs that I can
see clearly. Cleanups and anything that is making the code more
consistent can follow after we have the code working as it has
historically.

I had hoped to let this sit in linux-next for a few days just to be
doubly certain all is well. But these patches are all trivial and
linux-next is on holiday.

v2: https://lkml.kernel.org/r/87ilteiz4a.fsf_-_@email.froward.int.ebiederm.org>
v1: https://lkml.kernel.org/r/87o83e2mbu.fsf@email.froward.int.ebiederm.org>

Eric W. Biederman (5):
rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in set_user
ucounts: Enforce RLIMIT_NPROC not RLIMIT_NPROC+1
ucounts: Base set_cred_ucounts changes on the real user
ucounts: Move RLIMIT_NPROC handling after set_user
ucounts: Handle wrapping in is_ucounts_overlimit

kernel/cred.c | 9 ++-------
kernel/fork.c | 10 +++++-----
kernel/sys.c | 20 ++++++++++++++------
kernel/ucount.c | 3 ++-
4 files changed, 23 insertions(+), 19 deletions(-)

p.s. I should say that the problem is not so much inconsistencies
(although those exist) but that it is very difficult to figure out what
the code should be doing in the case of RLIMIT_NPROC.

All other rlimits are only enforced where the resource is acquired
(allocated). RLIMIT_NPROC by necessity needs to be enforced in
an additional location, and our current implementation stumbled
it's way into that implementation.

Eric

Re: [GIT PULL] ucounts: RLIMIT_NPROC fixes for v5.17

Posted by pr-tracker-bot@kernel.org 4 years, 4 months ago

The pull request you sent on Fri, 18 Feb 2022 09:34:24 -0600:

> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git ucount-rlimit-fixes-for-v5.17

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/2d3409ebc87f4bc4ed23bd39e78db9ffc29eec44

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

[GIT PULL] ucounts: Regression fix for v5.17

Posted by Eric W. Biederman 4 years, 3 months ago


Linus,

Please pull the ucount-rlimit-fixes-for-v5.17 branch from the git tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git ucount-rlimit-fixes-for-v5.17

  HEAD: 0ac983f512033cb7b5e210c9589768ad25b1e36b ucounts: Fix systemd LimitNPROC with private users regression

Etienne Dechamps recently found a regression caused by enforcing
RLIMIT_NPROC for root where the rlimit was not previously enforced.

Michal Koutný had previously pointed out the inconsistency in enforcing
the RLIMIT_NPROC that had been on the root owned process after the root
user creates a user namespace.

Which makes the fix for the regression simply removing the
inconsistency.


From: "Eric W. Biederman" <ebiederm@xmission.com>
Date: Thu, 24 Feb 2022 08:32:28 -0600
Subject: [PATCH] ucounts: Fix systemd LimitNPROC with private users regression

Long story short recursively enforcing RLIMIT_NPROC when it is not
enforced on the process that creates a new user namespace, causes
currently working code to fail.  There is no reason to enforce
RLIMIT_NPROC recursively when we don't enforce it normally so update
the code to detect this case.

I would like to simply use capable(CAP_SYS_RESOURCE) to detect when
RLIMIT_NPROC is not enforced upon the caller.  Unfortunately because
RLIMIT_NPROC is charged and checked for enforcement based upon the
real uid, using capable() which is euid based is inconsistent with reality.
Come as close as possible to testing for capable(CAP_SYS_RESOURCE) by
testing for when the real uid would match the conditions when
CAP_SYS_RESOURCE would be present if the real uid was the effective
uid.

Reported-by: Etienne Dechamps <etienne@edechamps.fr>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215596
Link: https://lkml.kernel.org/r/e9589141-cfeb-90cd-2d0e-83a62787239a@edechamps.fr
Link: https://lkml.kernel.org/r/87sfs8jmpz.fsf_-_@email.froward.int.ebiederm.org
Cc: stable@vger.kernel.org
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/user_namespace.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 6b2e3ca7ee99..5481ba44a8d6 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -58,6 +58,18 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
 	cred->user_ns = user_ns;
 }
 
+static unsigned long enforced_nproc_rlimit(void)
+{
+	unsigned long limit = RLIM_INFINITY;
+
+	/* Is RLIMIT_NPROC currently enforced? */
+	if (!uid_eq(current_uid(), GLOBAL_ROOT_UID) ||
+	    (current_user_ns() != &init_user_ns))
+		limit = rlimit(RLIMIT_NPROC);
+
+	return limit;
+}
+
 /*
  * Create a new user namespace, deriving the creator from the user in the
  * passed credentials, and replacing that user with the new root user for the
@@ -122,7 +134,7 @@ int create_user_ns(struct cred *new)
 	for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++) {
 		ns->ucount_max[i] = INT_MAX;
 	}
-	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC));
+	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, enforced_nproc_rlimit());
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE));
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING));
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));
-- 
2.29.2

Re: [GIT PULL] ucounts: Regression fix for v5.17

Posted by pr-tracker-bot@kernel.org 4 years, 3 months ago

The pull request you sent on Wed, 02 Mar 2022 18:12:40 -0600:

> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git ucount-rlimit-fixes-for-v5.17

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/5859a2b1991101d6b978f3feb5325dad39421f29

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

[PATCH v2 1/5] rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in set_user

Posted by Eric W. Biederman 4 years, 4 months ago

Solar Designer <solar@openwall.com> wrote:
> I'm not aware of anyone actually running into this issue and reporting
> it.  The systems that I personally know use suexec along with rlimits
> still run older/distro kernels, so would not yet be affected.
>
> So my mention was based on my understanding of how suexec works, and
> code review.  Specifically, Apache httpd has the setting RLimitNPROC,
> which makes it set RLIMIT_NPROC:
>
> https://httpd.apache.org/docs/2.4/mod/core.html#rlimitnproc
>
> The above documentation for it includes:
>
> "This applies to processes forked from Apache httpd children servicing
> requests, not the Apache httpd children themselves. This includes CGI
> scripts and SSI exec commands, but not any processes forked from the
> Apache httpd parent, such as piped logs."
>
> In code, there are:
>
> ./modules/generators/mod_cgid.c:        ( (cgid_req.limits.limit_nproc_set) && ((rc = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC,
> ./modules/generators/mod_cgi.c:        ((rc = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC,
> ./modules/filters/mod_ext_filter.c:    rv = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC, conf->limit_nproc);
>
> For example, in mod_cgi.c this is in run_cgi_child().
>
> I think this means an httpd child sets RLIMIT_NPROC shortly before it
> execs suexec, which is a SUID root program.  suexec then switches to the
> target user and execs the CGI script.
>
> Before 2863643fb8b9, the setuid() in suexec would set the flag, and the
> target user's process count would be checked against RLIMIT_NPROC on
> execve().  After 2863643fb8b9, the setuid() in suexec wouldn't set the
> flag because setuid() is (naturally) called when the process is still
> running as root (thus, has those limits bypass capabilities), and
> accordingly execve() would not check the target user's process count
> against RLIMIT_NPROC.

In commit 2863643fb8b9 ("set_user: add capability check when
rlimit(RLIMIT_NPROC) exceeds") capable calls were added to set_user to
make it more consistent with fork.  Unfortunately because of call site
differences those capables calls were checking the credentials of the
user before set*id() instead of after set*id().

This breaks enforcement of RLIMIT_NPROC for applications that set the
rlimit and then call set*id() while holding a full set of
capabilities.  The capabilities are only changed in the new credential
in security_task_fix_setuid().

The code in apache suexec appears to follow this pattern.

Commit 909cc4ae86f3 ("[PATCH] Fix two bugs with process limits
(RLIMIT_NPROC)") where this check was added describes the targes of this
capability check as:

  2/ When a root-owned process (e.g. cgiwrap) sets up process limits and then
      calls setuid, the setuid should fail if the user would then be running
      more than rlim_cur[RLIMIT_NPROC] processes, but it doesn't.  This patch
      adds an appropriate test.  With this patch, and per-user process limit
      imposed in cgiwrap really works.

So the original use case also of this check also appears to match the broken
pattern.

Restore the enforcement of RLIMIT_NPROC by removing the bad capable
checks added in set_user.  This unfortunately restores the
inconsistencies state the code has been in for the last 11 years, but
dealing with the inconsistencies looks like a larger problem.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20210907213042.GA22626@openwall.com/
Link: https://lkml.kernel.org/r/20220212221412.GA29214@openwall.com
Fixes: 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds")
History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/sys.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index ecc4cf019242..8dd938a3d2bf 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -480,8 +480,7 @@ static int set_user(struct cred *new)
 	 * failure to the execve() stage.
 	 */
 	if (is_ucounts_overlimit(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
-			new_user != INIT_USER &&
-			!capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
+			new_user != INIT_USER)
 		current->flags |= PF_NPROC_EXCEEDED;
 	else
 		current->flags &= ~PF_NPROC_EXCEEDED;
-- 
2.29.2

Re: [PATCH v2 1/5] rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in set_user

Posted by Solar Designer 4 years, 4 months ago

On Wed, Feb 16, 2022 at 09:58:28AM -0600, Eric W. Biederman wrote:
> Solar Designer <solar@openwall.com> wrote:
> > I'm not aware of anyone actually running into this issue and reporting
> > it.  The systems that I personally know use suexec along with rlimits
> > still run older/distro kernels, so would not yet be affected.
> >
> > So my mention was based on my understanding of how suexec works, and
> > code review.  Specifically, Apache httpd has the setting RLimitNPROC,
> > which makes it set RLIMIT_NPROC:
> >
> > https://httpd.apache.org/docs/2.4/mod/core.html#rlimitnproc
> >
> > The above documentation for it includes:
> >
> > "This applies to processes forked from Apache httpd children servicing
> > requests, not the Apache httpd children themselves. This includes CGI
> > scripts and SSI exec commands, but not any processes forked from the
> > Apache httpd parent, such as piped logs."
> >
> > In code, there are:
> >
> > ./modules/generators/mod_cgid.c:        ( (cgid_req.limits.limit_nproc_set) && ((rc = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC,
> > ./modules/generators/mod_cgi.c:        ((rc = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC,
> > ./modules/filters/mod_ext_filter.c:    rv = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC, conf->limit_nproc);
> >
> > For example, in mod_cgi.c this is in run_cgi_child().
> >
> > I think this means an httpd child sets RLIMIT_NPROC shortly before it
> > execs suexec, which is a SUID root program.  suexec then switches to the
> > target user and execs the CGI script.
> >
> > Before 2863643fb8b9, the setuid() in suexec would set the flag, and the
> > target user's process count would be checked against RLIMIT_NPROC on
> > execve().  After 2863643fb8b9, the setuid() in suexec wouldn't set the
> > flag because setuid() is (naturally) called when the process is still
> > running as root (thus, has those limits bypass capabilities), and
> > accordingly execve() would not check the target user's process count
> > against RLIMIT_NPROC.
> 
> In commit 2863643fb8b9 ("set_user: add capability check when
> rlimit(RLIMIT_NPROC) exceeds") capable calls were added to set_user to
> make it more consistent with fork.  Unfortunately because of call site
> differences those capables calls were checking the credentials of the

s/capables/capable/

> user before set*id() instead of after set*id().
> 
> This breaks enforcement of RLIMIT_NPROC for applications that set the
> rlimit and then call set*id() while holding a full set of
> capabilities.  The capabilities are only changed in the new credential
> in security_task_fix_setuid().
> 
> The code in apache suexec appears to follow this pattern.
> 
> Commit 909cc4ae86f3 ("[PATCH] Fix two bugs with process limits
> (RLIMIT_NPROC)") where this check was added describes the targes of this
> capability check as:
> 
>   2/ When a root-owned process (e.g. cgiwrap) sets up process limits and then
>       calls setuid, the setuid should fail if the user would then be running
>       more than rlim_cur[RLIMIT_NPROC] processes, but it doesn't.  This patch
>       adds an appropriate test.  With this patch, and per-user process limit
>       imposed in cgiwrap really works.
> 
> So the original use case also of this check also appears to match the broken
> pattern.

Duplicate "also" - drop one.

> Restore the enforcement of RLIMIT_NPROC by removing the bad capable
> checks added in set_user.  This unfortunately restores the
> inconsistencies state the code has been in for the last 11 years, but

s/inconsistencies/inconsistent/

> dealing with the inconsistencies looks like a larger problem.
> 
> Cc: stable@vger.kernel.org
> Link: https://lore.kernel.org/all/20210907213042.GA22626@openwall.com/
> Link: https://lkml.kernel.org/r/20220212221412.GA29214@openwall.com
> Fixes: 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds")
> History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  kernel/sys.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/kernel/sys.c b/kernel/sys.c
> index ecc4cf019242..8dd938a3d2bf 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -480,8 +480,7 @@ static int set_user(struct cred *new)
>  	 * failure to the execve() stage.
>  	 */
>  	if (is_ucounts_overlimit(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
> -			new_user != INIT_USER &&
> -			!capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
> +			new_user != INIT_USER)
>  		current->flags |= PF_NPROC_EXCEEDED;
>  	else
>  		current->flags &= ~PF_NPROC_EXCEEDED;

Reviewed-by: Solar Designer <solar@openwall.com>

Alexander

[PATCH v2 2/5] ucounts: Enforce RLIMIT_NPROC not RLIMIT_NPROC+1

Posted by Eric W. Biederman 4 years, 4 months ago

Michal Koutný <mkoutny@suse.com> wrote:

> It was reported that v5.14 behaves differently when enforcing
> RLIMIT_NPROC limit, namely, it allows one more task than previously.
> This is consequence of the commit 21d1c5e386bc ("Reimplement
> RLIMIT_NPROC on top of ucounts") that missed the sharpness of
> equality in the forking path.

This can be fixed either by fixing the test or by moving the increment
to be before the test.  Fix it my moving copy_creds which contains
the increment before is_ucounts_overlimit.

In the case of CLONE_NEWUSER the ucounts in the task_cred changes.
The function is_ucounts_overlimit needs to use the final version of
the ucounts for the new process.  Which means moving the
is_ucounts_overlimit test after copy_creds is necessary.

Both the test in fork and the test in set_user were semantically
changed when the code moved to ucounts.  The change of the test in
fork was bad because it was before the increment.  The test in
set_user was wrong and the change to ucounts fixed it.  So this
fix only restores the old behavior in one lcation not two.

Link: https://lkml.kernel.org/r/20220204181144.24462-1-mkoutny@suse.com
Cc: stable@vger.kernel.org
Reported-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/fork.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index d75a528f7b21..17d8a8c85e3b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2021,18 +2021,18 @@ static __latent_entropy struct task_struct *copy_process(
 #ifdef CONFIG_PROVE_LOCKING
 	DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
 #endif
+	retval = copy_creds(p, clone_flags);
+	if (retval < 0)
+		goto bad_fork_free;
+
 	retval = -EAGAIN;
 	if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
 		if (p->real_cred->user != INIT_USER &&
 		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
-			goto bad_fork_free;
+			goto bad_fork_cleanup_count;
 	}
 	current->flags &= ~PF_NPROC_EXCEEDED;

-	retval = copy_creds(p, clone_flags);
-	if (retval < 0)
-		goto bad_fork_free;
-
 	/*
 	 * If multiple threads are within copy_process(), then this check
 	 * triggers too late. This doesn't hurt, the check is only there
-- 
2.29.2

[PATCH v2 3/5] ucounts: Base set_cred_ucounts changes on the real user

Posted by Eric W. Biederman 4 years, 4 months ago

Michal Koutný <mkoutny@suse.com> wrote:
> Tasks are associated to multiple users at once. Historically and as per
> setrlimit(2) RLIMIT_NPROC is enforce based on real user ID.
>
> The commit 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
> made the accounting structure "indexed" by euid and hence potentially
> account tasks differently.
>
> The effective user ID may be different e.g. for setuid programs but
> those are exec'd into already existing task (i.e. below limit), so
> different accounting is moot.
>
> Some special setresuid(2) users may notice the difference, justifying
> this fix.

I looked at cred->ucount and it is only used for rlimit operations
that were previously stored in cred->user.  Making the fact
cred->ucount can refer to a different user from cred->user a bug,
affecting all uses of cred->ulimit not just RLIMIT_NPROC.

Fix set_cred_ucounts to always use the real uid not the effective uid.

Further simplify set_cred_ucounts by noticing that set_cred_ucounts
somehow retained a draft version of the check to see if alloc_ucounts
was needed that checks the new->user and new->user_ns against the
current_real_cred().  Remove that draft version of the check.

All that matters for setting the cred->ucounts are the user_ns and uid
fields in the cred.

Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20220207121800.5079-4-mkoutny@suse.com
Reported-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/cred.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/kernel/cred.c b/kernel/cred.c
index 473d17c431f3..933155c96922 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -665,21 +665,16 @@ EXPORT_SYMBOL(cred_fscmp);
 
 int set_cred_ucounts(struct cred *new)
 {
-	struct task_struct *task = current;
-	const struct cred *old = task->real_cred;
 	struct ucounts *new_ucounts, *old_ucounts = new->ucounts;
 
-	if (new->user == old->user && new->user_ns == old->user_ns)
-		return 0;
-
 	/*
 	 * This optimization is needed because alloc_ucounts() uses locks
 	 * for table lookups.
 	 */
-	if (old_ucounts->ns == new->user_ns && uid_eq(old_ucounts->uid, new->euid))
+	if (old_ucounts->ns == new->user_ns && uid_eq(old_ucounts->uid, new->uid))
 		return 0;
 
-	if (!(new_ucounts = alloc_ucounts(new->user_ns, new->euid)))
+	if (!(new_ucounts = alloc_ucounts(new->user_ns, new->uid)))
 		return -EAGAIN;
 
 	new->ucounts = new_ucounts;
-- 
2.29.2

[PATCH v2 4/5] ucounts: Move RLIMIT_NPROC handling after set_user

Posted by Eric W. Biederman 4 years, 4 months ago

During set*id() which cred->ucounts to charge the the current process
to is not known until after set_cred_ucounts.  So move the
RLIMIT_NPROC checking into a new helper flag_nproc_exceeded and call
flag_nproc_exceeded after set_cred_ucounts.

This is very much an arbitrary subset of the places where we currently
change the RLIMIT_NPROC accounting, designed to preserve the existing
logic.

Fixing the existing logic will be the subject of another series of
changes.

Cc: stable@vger.kernel.org
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/sys.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 8dd938a3d2bf..97dc9e5d6bf9 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -472,6 +472,16 @@ static int set_user(struct cred *new)
 	if (!new_user)
 		return -EAGAIN;
 
+	free_uid(new->user);
+	new->user = new_user;
+	return 0;
+}
+
+static void flag_nproc_exceeded(struct cred *new)
+{
+	if (new->ucounts == current_ucounts())
+		return;
+
 	/*
 	 * We don't fail in case of NPROC limit excess here because too many
 	 * poorly written programs don't check set*uid() return code, assuming
@@ -480,14 +490,10 @@ static int set_user(struct cred *new)
 	 * failure to the execve() stage.
 	 */
 	if (is_ucounts_overlimit(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
-			new_user != INIT_USER)
+			new->user != INIT_USER)
 		current->flags |= PF_NPROC_EXCEEDED;
 	else
 		current->flags &= ~PF_NPROC_EXCEEDED;
-
-	free_uid(new->user);
-	new->user = new_user;
-	return 0;
 }
 
 /*
@@ -562,6 +568,7 @@ long __sys_setreuid(uid_t ruid, uid_t euid)
 	if (retval < 0)
 		goto error;
 
+	flag_nproc_exceeded(new);
 	return commit_creds(new);
 
 error:
@@ -624,6 +631,7 @@ long __sys_setuid(uid_t uid)
 	if (retval < 0)
 		goto error;
 
+	flag_nproc_exceeded(new);
 	return commit_creds(new);
 
 error:
@@ -703,6 +711,7 @@ long __sys_setresuid(uid_t ruid, uid_t euid, uid_t suid)
 	if (retval < 0)
 		goto error;
 
+	flag_nproc_exceeded(new);
 	return commit_creds(new);
 
 error:
-- 
2.29.2

[PATCH v2 5/5] ucounts: Handle wrapping in is_ucounts_overlimit

Posted by Eric W. Biederman 4 years, 4 months ago

While examining is_ucounts_overlimit and reading the various messages
I realized that is_ucounts_overlimit fails to deal with counts that
may have wrapped.

Being wrapped should be a transitory state for counts and they should
never be wrapped for long, but it can happen so handle it.

Cc: stable@vger.kernel.org
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/ucount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 65b597431c86..06ea04d44685 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -350,7 +350,8 @@ bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsign
 	if (rlimit > LONG_MAX)
 		max = LONG_MAX;
 	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-		if (get_ucounts_value(iter, type) > max)
+		long val = get_ucounts_value(iter, type);
+		if (val < 0 || val > max)
 			return true;
 		max = READ_ONCE(iter->ns->ucount_max[type]);
 	}
-- 
2.29.2

Re: [PATCH v2 5/5] ucounts: Handle wrapping in is_ucounts_overlimit

Posted by Shuah Khan 4 years, 4 months ago

On 2/16/22 8:58 AM, Eric W. Biederman wrote:
> While examining is_ucounts_overlimit and reading the various messages
> I realized that is_ucounts_overlimit fails to deal with counts that
> may have wrapped.
> 
> Being wrapped should be a transitory state for counts and they should
> never be wrapped for long, but it can happen so handle it.
> 
> Cc: stable@vger.kernel.org
> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>   kernel/ucount.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index 65b597431c86..06ea04d44685 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -350,7 +350,8 @@ bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsign
>   	if (rlimit > LONG_MAX)
>   		max = LONG_MAX;
>   	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
> -		if (get_ucounts_value(iter, type) > max)
> +		long val = get_ucounts_value(iter, type);
> +		if (val < 0 || val > max)
>   			return true;
>   		max = READ_ONCE(iter->ns->ucount_max[type]);
>   	}
> 

This addresses the concerns about overflow.

Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>

thanks,
-- Shuah

[PATCH 1/8] ucounts: Fix RLIMIT_NPROC regression

Posted by Eric W. Biederman 4 years, 4 months ago

Michal Koutný <mkoutny@suse.com> wrote:

> It was reported that v5.14 behaves differently when enforcing
> RLIMIT_NPROC limit, namely, it allows one more task than previously.
> This is consequence of the commit 21d1c5e386bc ("Reimplement
> RLIMIT_NPROC on top of ucounts") that missed the sharpness of
> equality in the forking path.

This can be fixed either by fixing the test or by moving the increment
to be before the test.  Fix it my moving copy_creds which contains
the increment before is_ucounts_overlimit.

This is necessary so that in the case of CLONE_NEWUSER the new credential
with the new ucounts is available of is_ucounts_overlimit to test.

Both the test in fork and the test in set_user were semantically
changed when the code moved to ucounts.  The change of the test in
fork was bad because it was before the increment.  The test in
set_user was wrong and the change to ucounts fixed it.  So this
fix is only restore the old behavior in one lcatio not two.

Link: https://lkml.kernel.org/r/20220204181144.24462-1-mkoutny@suse.com
Cc: stable@vger.kernel.org
Reported-by: Michal Koutný <mkoutny@suse.com>
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/fork.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index d75a528f7b21..17d8a8c85e3b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2021,18 +2021,18 @@ static __latent_entropy struct task_struct *copy_process(
 #ifdef CONFIG_PROVE_LOCKING
 	DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
 #endif
+	retval = copy_creds(p, clone_flags);
+	if (retval < 0)
+		goto bad_fork_free;
+
 	retval = -EAGAIN;
 	if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
 		if (p->real_cred->user != INIT_USER &&
 		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
-			goto bad_fork_free;
+			goto bad_fork_cleanup_count;
 	}
 	current->flags &= ~PF_NPROC_EXCEEDED;

-	retval = copy_creds(p, clone_flags);
-	if (retval < 0)
-		goto bad_fork_free;
-
 	/*
 	 * If multiple threads are within copy_process(), then this check
 	 * triggers too late. This doesn't hurt, the check is only there
-- 
2.29.2

Re: [PATCH 1/8] ucounts: Fix RLIMIT_NPROC regression

Posted by Michal Koutný 4 years, 4 months ago

On Thu, Feb 10, 2022 at 08:13:17PM -0600, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
> This can be fixed either by fixing the test or by moving the increment
> to be before the test.  Fix it my moving copy_creds which contains
> the increment before is_ucounts_overlimit.

This is simpler than my approach and I find it correct too.

> Both the test in fork and the test in set_user were semantically
> changed when the code moved to ucounts.  The change of the test in
> fork was bad because it was before the increment.
>
> The test in set_user was wrong and the change to ucounts fixed it.  So
> this fix is only restore the old behavior in one lcatio not two.

Whom should be the task accounted to in the case of set*uid? (The change
to ucounts made the check against the pre-switch user's ucounts.)

> ---
>  kernel/fork.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)

Reviewed-by: Michal Koutný <mkoutny@suse.com>

Re: [PATCH 1/8] ucounts: Fix RLIMIT_NPROC regression

Posted by Eric W. Biederman 4 years, 4 months ago

Michal Koutný <mkoutny@suse.com> writes:

> On Thu, Feb 10, 2022 at 08:13:17PM -0600, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> This can be fixed either by fixing the test or by moving the increment
>> to be before the test.  Fix it my moving copy_creds which contains
>> the increment before is_ucounts_overlimit.
>
> This is simpler than my approach and I find it correct too.
>
>> Both the test in fork and the test in set_user were semantically
>> changed when the code moved to ucounts.  The change of the test in
>> fork was bad because it was before the increment.
>>
>> The test in set_user was wrong and the change to ucounts fixed it.  So
>> this fix is only restore the old behavior in one lcatio not two.
>
> Whom should be the task accounted to in the case of set*uid? (The change
> to ucounts made the check against the pre-switch user's ucounts.)

It needs to be post-switch in the case of set*id().

I have that fixed in the next version of my patchset.

>> ---
>>  kernel/fork.c | 10 +++++-----
>>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> Reviewed-by: Michal Koutný <mkoutny@suse.com>

[PATCH 2/8] ucounts: Fix set_cred_ucounts

Posted by Eric W. Biederman 4 years, 4 months ago

Michal Koutný <mkoutny@suse.com> wrote:
> Tasks are associated to multiple users at once. Historically and as per
> setrlimit(2) RLIMIT_NPROC is enforce based on real user ID.
>
> The commit 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
> made the accounting structure "indexed" by euid and hence potentially
> account tasks differently.
>
> The effective user ID may be different e.g. for setuid programs but
> those are exec'd into already existing task (i.e. below limit), so
> different accounting is moot.
>
> Some special setresuid(2) users may notice the difference, justifying
> this fix.

I looked at the cred->ucount is only used for rlimit operations that
were previously stored in cred->user.  Making the fact cred->ucount
can refer to a different user from cred->user a bug working will all
rlimits not just RLIMIT_NPROC.

So fix set_cred_ucounts to always use the real uid not the effective uid.

Further simplify set_cred_ucounts by noticing that set_cred_ucounts
somehow retained a draft version of the check to see if alloc_ucounts
was needed that checks the new->user and new->user_ns against the
current_real_cred(), when nothing matters for setting the ucounts
field of a struct cred except the other fields in that same struct
cred.

So delete the confusing and wrong check against the
current_real_cred(), and all of it's intermediate variables.

Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20220207121800.5079-4-mkoutny@suse.com
Reported-by: Michal Koutný <mkoutny@suse.com>
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/cred.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/kernel/cred.c b/kernel/cred.c
index 473d17c431f3..933155c96922 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -665,21 +665,16 @@ EXPORT_SYMBOL(cred_fscmp);
 
 int set_cred_ucounts(struct cred *new)
 {
-	struct task_struct *task = current;
-	const struct cred *old = task->real_cred;
 	struct ucounts *new_ucounts, *old_ucounts = new->ucounts;
 
-	if (new->user == old->user && new->user_ns == old->user_ns)
-		return 0;
-
 	/*
 	 * This optimization is needed because alloc_ucounts() uses locks
 	 * for table lookups.
 	 */
-	if (old_ucounts->ns == new->user_ns && uid_eq(old_ucounts->uid, new->euid))
+	if (old_ucounts->ns == new->user_ns && uid_eq(old_ucounts->uid, new->uid))
 		return 0;
 
-	if (!(new_ucounts = alloc_ucounts(new->user_ns, new->euid)))
+	if (!(new_ucounts = alloc_ucounts(new->user_ns, new->uid)))
 		return -EAGAIN;
 
 	new->ucounts = new_ucounts;
-- 
2.29.2

Re: [PATCH 2/8] ucounts: Fix set_cred_ucounts

Posted by Michal Koutný 4 years, 4 months ago

On Thu, Feb 10, 2022 at 08:13:18PM -0600, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 473d17c431f3..933155c96922 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -665,21 +665,16 @@ EXPORT_SYMBOL(cred_fscmp);
>  
>  int set_cred_ucounts(struct cred *new)
>  {
> -	struct task_struct *task = current;
> -	const struct cred *old = task->real_cred;
>  	struct ucounts *new_ucounts, *old_ucounts = new->ucounts;
>  
> -	if (new->user == old->user && new->user_ns == old->user_ns)
> -		return 0;
> -
>  	/*
>  	 * This optimization is needed because alloc_ucounts() uses locks
>  	 * for table lookups.
>  	 */
> -	if (old_ucounts->ns == new->user_ns && uid_eq(old_ucounts->uid, new->euid))
> +	if (old_ucounts->ns == new->user_ns && uid_eq(old_ucounts->uid, new->uid))
>  		return 0;
>  
> -	if (!(new_ucounts = alloc_ucounts(new->user_ns, new->euid)))
> +	if (!(new_ucounts = alloc_ucounts(new->user_ns, new->uid)))
>  		return -EAGAIN;
>  
>  	new->ucounts = new_ucounts;

Reviewed-by: Michal Koutný <mkoutny@suse.com>

[PATCH 3/8] ucounts: Fix and simplify RLIMIT_NPROC handling during setuid()+execve

Posted by Eric W. Biederman 4 years, 4 months ago

Michal Koutný <mkoutny@suse.com> wrote:
> The check is currently against the current->cred but since those are
> going to change and we want to check RLIMIT_NPROC condition after the
> switch, supply the capability check with the new cred.
> But since we're checking new_user being INIT_USER any new cred's
> capability-based allowance may be redundant when the check fails and the
> alternative solution would be revert of the commit 2863643fb8b9
> ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds")

As of commit 2863643fb8b9 ("set_user: add capability check when
rlimit(RLIMIT_NPROC) exceeds") setting the flag to see if execve
should check RLIMIT_NPROC is buggy, as it tests the capabilites from
before the credential change and not aftwards.

As of commit 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of
ucounts") examining the rlimit is buggy as cred->ucounts has not yet
been properly set in the new credential.

Make the code correct and more robust moving the test to see if
execve() needs to test RLIMIT_NPROC into commit_creds, and defer all
of the rest of the logic into execve() itself.

As the flag only indicateds that RLIMIT_NPROC should be checked
in execve rename it from PF_NPROC_EXCEEDED to PF_NPROC_CHECK.

Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20220207121800.5079-2-mkoutny@suse.com
Link: https://lkml.kernel.org/r/20220207121800.5079-3-mkoutny@suse.com
Reported-by: Michal Koutný <mkoutny@suse.com>
Fixes: 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds")
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c             | 15 ++++++++-------
 include/linux/sched.h |  2 +-
 kernel/cred.c         | 13 +++++++++----
 kernel/fork.c         |  2 +-
 kernel/sys.c          | 14 --------------
 5 files changed, 19 insertions(+), 27 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 79f2c9483302..1e7f757cbc2c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1875,20 +1875,21 @@ static int do_execveat_common(int fd, struct filename *filename,
 		return PTR_ERR(filename);
 
 	/*
-	 * We move the actual failure in case of RLIMIT_NPROC excess from
-	 * set*uid() to execve() because too many poorly written programs
-	 * don't check setuid() return code.  Here we additionally recheck
-	 * whether NPROC limit is still exceeded.
+	 * After calling set*uid() is RLIMT_NPROC exceeded?
+	 * This can not be checked in set*uid() because too many programs don't
+	 * check the setuid() return code.
 	 */
-	if ((current->flags & PF_NPROC_EXCEEDED) &&
-	    is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
+	if ((current->flags & PF_NPROC_CHECK) &&
+	    is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
+	    (current_user() != INIT_USER) &&
+	    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN)) {
 		retval = -EAGAIN;
 		goto out_ret;
 	}
 
 	/* We're below the limit (still or again), so we don't want to make
 	 * further execve() calls fail. */
-	current->flags &= ~PF_NPROC_EXCEEDED;
+	current->flags &= ~PF_NPROC_CHECK;
 
 	bprm = alloc_bprm(fd, filename);
 	if (IS_ERR(bprm)) {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75ba8aa60248..6605a262a6be 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1678,7 +1678,7 @@ extern struct pid *cad_pid;
 #define PF_DUMPCORE		0x00000200	/* Dumped core */
 #define PF_SIGNALED		0x00000400	/* Killed by a signal */
 #define PF_MEMALLOC		0x00000800	/* Allocating memory */
-#define PF_NPROC_EXCEEDED	0x00001000	/* set_user() noticed that RLIMIT_NPROC was exceeded */
+#define PF_NPROC_CHECK		0x00001000	/* Check in execve if RLIMIT_NPROC was exceeded */
 #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
 #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
 #define PF_FROZEN		0x00010000	/* Frozen for system suspend */
diff --git a/kernel/cred.c b/kernel/cred.c
index 933155c96922..229cff081167 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -490,13 +490,18 @@ int commit_creds(struct cred *new)
 	if (!gid_eq(new->fsgid, old->fsgid))
 		key_fsgid_changed(new);
 
-	/* do it
-	 * RLIMIT_NPROC limits on user->processes have already been checked
-	 * in set_user().
+	/*
+	 * Remember if the NPROC limit may be exceeded.  The set*uid() functions
+	 * can not fail if the NPROC limit is exceeded as too many programs
+	 * don't check the return code.  Instead enforce the NPROC limit for
+	 * programs doing set*uid()+execve by harmlessly defering the failure
+	 * to the execve() stage.
 	 */
 	alter_cred_subscribers(new, 2);
-	if (new->user != old->user || new->user_ns != old->user_ns)
+	if (new->user != old->user || new->user_ns != old->user_ns) {
 		inc_rlimit_ucounts(new->ucounts, UCOUNT_RLIMIT_NPROC, 1);
+		task->flags |= PF_NPROC_CHECK;
+	}
 	rcu_assign_pointer(task->real_cred, new);
 	rcu_assign_pointer(task->cred, new);
 	if (new->user != old->user || new->user_ns != old->user_ns)
diff --git a/kernel/fork.c b/kernel/fork.c
index 17d8a8c85e3b..2b6a28a86325 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2031,7 +2031,7 @@ static __latent_entropy struct task_struct *copy_process(
 		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
 			goto bad_fork_cleanup_count;
 	}
-	current->flags &= ~PF_NPROC_EXCEEDED;
+	current->flags &= ~PF_NPROC_CHECK;
 
 	/*
 	 * If multiple threads are within copy_process(), then this check
diff --git a/kernel/sys.c b/kernel/sys.c
index ecc4cf019242..b1ed21d79f3b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -472,20 +472,6 @@ static int set_user(struct cred *new)
 	if (!new_user)
 		return -EAGAIN;
 
-	/*
-	 * We don't fail in case of NPROC limit excess here because too many
-	 * poorly written programs don't check set*uid() return code, assuming
-	 * it never fails if called by root.  We may still enforce NPROC limit
-	 * for programs doing set*uid()+execve() by harmlessly deferring the
-	 * failure to the execve() stage.
-	 */
-	if (is_ucounts_overlimit(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
-			new_user != INIT_USER &&
-			!capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
-		current->flags |= PF_NPROC_EXCEEDED;
-	else
-		current->flags &= ~PF_NPROC_EXCEEDED;
-
 	free_uid(new->user);
 	new->user = new_user;
 	return 0;
-- 
2.29.2

Re: [PATCH 3/8] ucounts: Fix and simplify RLIMIT_NPROC handling during setuid()+execve

Posted by Solar Designer 4 years, 4 months ago

On Thu, Feb 10, 2022 at 08:13:19PM -0600, Eric W. Biederman wrote:
> As of commit 2863643fb8b9 ("set_user: add capability check when
> rlimit(RLIMIT_NPROC) exceeds") setting the flag to see if execve
> should check RLIMIT_NPROC is buggy, as it tests the capabilites from
> before the credential change and not aftwards.
> 
> As of commit 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of
> ucounts") examining the rlimit is buggy as cred->ucounts has not yet
> been properly set in the new credential.
> 
> Make the code correct and more robust moving the test to see if
> execve() needs to test RLIMIT_NPROC into commit_creds, and defer all
> of the rest of the logic into execve() itself.
> 
> As the flag only indicateds that RLIMIT_NPROC should be checked
> in execve rename it from PF_NPROC_EXCEEDED to PF_NPROC_CHECK.
> 
> Cc: stable@vger.kernel.org
> Link: https://lkml.kernel.org/r/20220207121800.5079-2-mkoutny@suse.com
> Link: https://lkml.kernel.org/r/20220207121800.5079-3-mkoutny@suse.com
> Reported-by: Michal Koutn?? <mkoutny@suse.com>
> Fixes: 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds")
> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

On one hand, this looks good.

On the other, you asked about the Apache httpd suexec scenario in the
other thread, and here's what this means for it (per my code review):

In that scenario, we have two execve(): first from httpd to suexec, then
from suexec to the CGI script.  Previously, the limit check only
occurred on the setuid() call by suexec, and its effect was deferred
until execve() of the script.  Now wouldn't it occur on both execve()
calls, because commit_creds() is also called on execve() (such as in
case the program is SUID, which suexec actually is)?  Since the check is
kind of against real uid (not the euid=0 that suexec gains), it'd apply
the limit against httpd pseudo-user's process count.  While it could be
a reasonable kernel policy to impose this limit in more places, this is
a change of behavior for Apache httpd, and is not the intended behavior
there.  However, I think the answer to my question earlier in this
paragraph is actually a "no", the check wouldn't occur on the execve()
of suexec, because "new->user != old->user" would be false.  Right?

As an alternative, you could keep setting the (renamed and reused) flag
in set_user().  That would avoid the (non-)issue I described above - but
again, your patch is probably fine as-is.

I do see it's logical to have these two lines next to each other:

>  		inc_rlimit_ucounts(new->ucounts, UCOUNT_RLIMIT_NPROC, 1);
> +		task->flags |= PF_NPROC_CHECK;

Of course, someone would need to actually test this.

Alexander

Re: [PATCH 3/8] ucounts: Fix and simplify RLIMIT_NPROC handling during setuid()+execve

Posted by Eric W. Biederman 4 years, 4 months ago

Solar Designer <solar@openwall.com> writes:

> On Thu, Feb 10, 2022 at 08:13:19PM -0600, Eric W. Biederman wrote:
>> As of commit 2863643fb8b9 ("set_user: add capability check when
>> rlimit(RLIMIT_NPROC) exceeds") setting the flag to see if execve
>> should check RLIMIT_NPROC is buggy, as it tests the capabilites from
>> before the credential change and not aftwards.
>> 
>> As of commit 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of
>> ucounts") examining the rlimit is buggy as cred->ucounts has not yet
>> been properly set in the new credential.
>> 
>> Make the code correct and more robust moving the test to see if
>> execve() needs to test RLIMIT_NPROC into commit_creds, and defer all
>> of the rest of the logic into execve() itself.
>> 
>> As the flag only indicateds that RLIMIT_NPROC should be checked
>> in execve rename it from PF_NPROC_EXCEEDED to PF_NPROC_CHECK.
>> 
>> Cc: stable@vger.kernel.org
>> Link: https://lkml.kernel.org/r/20220207121800.5079-2-mkoutny@suse.com
>> Link: https://lkml.kernel.org/r/20220207121800.5079-3-mkoutny@suse.com
>> Reported-by: Michal Koutn?? <mkoutny@suse.com>
>> Fixes: 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds")
>> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> On one hand, this looks good.
>
> On the other, you asked about the Apache httpd suexec scenario in the
> other thread, and here's what this means for it (per my code review):
>
> In that scenario, we have two execve(): first from httpd to suexec, then
> from suexec to the CGI script.  Previously, the limit check only
> occurred on the setuid() call by suexec, and its effect was deferred
> until execve() of the script.  Now wouldn't it occur on both execve()
> calls, because commit_creds() is also called on execve() (such as in
> case the program is SUID, which suexec actually is)?

Yes.  Moving the check into commit_creds means that the exec after a
suid exec will perform an RLIMIT_NPROC check and could possibly fail.  I
would call that a bug.  Anything happening in execve should be checked
and handled in execve as execve can fail.

It also points out that our permission checks for increasing
RLIMIT_NPROC are highly inconsistent.

One set of permissions in fork().
Another set of permissions in set*id() and delayed until execve.
No permission checks for the uid change in execve.

Every time I look into the previous behavior of RLIMIT_NPROC I seem
to find issues.  Currently I am planning a posting to linux-api
so sorting out what when RLIMIT_NPROC should be enforced and how
RLIMIT_NPROC gets accounted receives review.  I am also planning a
feature branch to deal with the historical goofiness.

I really like how cleanly this patch seems to be.  Unfortunately it is
wrong.

> Since the check is
> kind of against real uid (not the euid=0 that suexec gains), it'd apply
> the limit against httpd pseudo-user's process count.  While it could be
> a reasonable kernel policy to impose this limit in more places, this is
> a change of behavior for Apache httpd, and is not the intended behavior
> there.  However, I think the answer to my question earlier in this
> paragraph is actually a "no", the check wouldn't occur on the execve()
> of suexec, because "new->user != old->user" would be false.  Right?
>
> As an alternative, you could keep setting the (renamed and reused) flag
> in set_user().  That would avoid the (non-)issue I described above - but
> again, your patch is probably fine as-is.
>
> I do see it's logical to have these two lines next to each other:
>
>>  		inc_rlimit_ucounts(new->ucounts, UCOUNT_RLIMIT_NPROC, 1);
>> +		task->flags |= PF_NPROC_CHECK;
>
> Of course, someone would need to actually test this.

That too.

I am increasingly of the opinion that the process accounting should not
be in cred.c at all.  That we just remember the who was charged with the
process when we created it, and then at exec time we can update that
charge, and verify that the new user is solid.  At exit time we can look
up who was charged with the process and decrement the count.

Of course at this point my opinion may change after I implement that.

Eric

Re: [PATCH 3/8] ucounts: Fix and simplify RLIMIT_NPROC handling during setuid()+execve

Posted by Eric W. Biederman 4 years, 4 months ago

"Eric W. Biederman" <ebiederm@xmission.com> writes:

> Solar Designer <solar@openwall.com> writes:
>
>> On Thu, Feb 10, 2022 at 08:13:19PM -0600, Eric W. Biederman wrote:
>>> As of commit 2863643fb8b9 ("set_user: add capability check when
>>> rlimit(RLIMIT_NPROC) exceeds") setting the flag to see if execve
>>> should check RLIMIT_NPROC is buggy, as it tests the capabilites from
>>> before the credential change and not aftwards.
>>> 
>>> As of commit 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of
>>> ucounts") examining the rlimit is buggy as cred->ucounts has not yet
>>> been properly set in the new credential.
>>> 
>>> Make the code correct and more robust moving the test to see if
>>> execve() needs to test RLIMIT_NPROC into commit_creds, and defer all
>>> of the rest of the logic into execve() itself.
>>> 
>>> As the flag only indicateds that RLIMIT_NPROC should be checked
>>> in execve rename it from PF_NPROC_EXCEEDED to PF_NPROC_CHECK.
>>> 
>>> Cc: stable@vger.kernel.org
>>> Link: https://lkml.kernel.org/r/20220207121800.5079-2-mkoutny@suse.com
>>> Link: https://lkml.kernel.org/r/20220207121800.5079-3-mkoutny@suse.com
>>> Reported-by: Michal Koutn?? <mkoutny@suse.com>
>>> Fixes: 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds")
>>> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>
>> On one hand, this looks good.
>>
>> On the other, you asked about the Apache httpd suexec scenario in the
>> other thread, and here's what this means for it (per my code review):
>>
>> In that scenario, we have two execve(): first from httpd to suexec, then
>> from suexec to the CGI script.  Previously, the limit check only
>> occurred on the setuid() call by suexec, and its effect was deferred
>> until execve() of the script.  Now wouldn't it occur on both execve()
>> calls, because commit_creds() is also called on execve() (such as in
>> case the program is SUID, which suexec actually is)?
>
> Yes.  Moving the check into commit_creds means that the exec after a
> suid exec will perform an RLIMIT_NPROC check and could possibly fail.  I
> would call that a bug.  Anything happening in execve should be checked
> and handled in execve as execve can fail.
>
> It also points out that our permission checks for increasing
> RLIMIT_NPROC are highly inconsistent.
>
> One set of permissions in fork().
> Another set of permissions in set*id() and delayed until execve.
> No permission checks for the uid change in execve.
>
> Every time I look into the previous behavior of RLIMIT_NPROC I seem
> to find issues.  Currently I am planning a posting to linux-api
> so sorting out what when RLIMIT_NPROC should be enforced and how
> RLIMIT_NPROC gets accounted receives review.  I am also planning a
> feature branch to deal with the historical goofiness.
>
> I really like how cleanly this patch seems to be.  Unfortunately it is
> wrong.

Hmm.  Maybe not as wrong as I thought.  An suid execve does not change
the real user.

Still a bit wrong from a conservative change point of view because the
user namespace can change in setns and CLONE_NEWUSER which will change
the accounting now.  Which with the ucount rlimit stuff changes where
things should be accounted.

I am playing with the idea of changing accounting aka (cred->ucounts &
cred->user) to only change in fork (aka clone without CLONE_THREAD) and
exec.  I think that would make maintenance and  cleaning all of this up
easier.

That would also remove serious complications from RLIMIT_SIGPENDING as
well.

I thought SIGPENDING was only a multi-threaded process issue but from
one signal to the next the set*id() family functions can be called.

Eric

Re: [PATCH 3/8] ucounts: Fix and simplify RLIMIT_NPROC handling during setuid()+execve

Posted by Michal Koutný 4 years, 4 months ago

On Mon, Feb 14, 2022 at 09:10:49AM -0600, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
> I really like how cleanly this patch seems to be.  Unfortunately it is
> wrong.

It seems [1] so:

setuid()		// RLIMIT_NPROC is fine at this moment
...		fork()
		...
...		fork()
execve()		// eh, oh

This "punishes" the exec'ing task although the cause is elsewhere.

Michal

[1] The decoupled setuid()+execve() check can be interpretted both ways.
I understood historically the excess at the setuid() moment is relevant.

Re: [PATCH 3/8] ucounts: Fix and simplify RLIMIT_NPROC handling during setuid()+execve

Posted by Eric W. Biederman 4 years, 4 months ago

Michal Koutný <mkoutny@suse.com> writes:

> On Mon, Feb 14, 2022 at 09:10:49AM -0600, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> I really like how cleanly this patch seems to be.  Unfortunately it is
>> wrong.
>
> It seems [1] so:
>
> setuid()		// RLIMIT_NPROC is fine at this moment
> ...		fork()
> 		...
> ...		fork()
> execve()		// eh, oh
>
> This "punishes" the exec'ing task although the cause is elsewhere.
>
> Michal
>
> [1] The decoupled setuid()+execve() check can be interpretted both ways.
> I understood historically the excess at the setuid() moment is
> relevant.

I have been digging into this to understand why we are doing the strange
things we are doing.

Ordinarily for rlimits we are fine with letting things go over limit
until we reach a case where we need the limit (which would be fork in
the RLIMIT_NPROC case).  So things like setrlimit do not check your
counts to see if you will be over the limit.

The practical problem with fork in the unix model is that you can not
change limits or do anything for the new process until it is created
(with clone/fork).  Making it impossible to set the rlimits and change
the user before the new process is created.

The result is that in applications like apache when they run cgi scripts
(as a different user than the apache process) RLIMIT_NPROC did not work
until a check was placed into set*id() as well.  As the typical cgi
script did not fork it just did it's work and exited.

That it was discovered that allowing set*id() to fail was a footgun for
privileged processes.  And we have the existing system.

Which leads me to the starting point that set*id() checking rlimits is
a necessary but fundamentally a special case.

As long as the original use case works I think there is some latitude in
the implementation.  Maybe we set a flag and perform all of the checks
in exec.  Maybe we just send SIGKILL.  Maybe we just say it is an ugly
wart but it is our ugly wart and comment it and leave it alone.
I am leaving  that decision to a clean-up patchset.

[PATCH 4/8] ucounts: Only except the root user in init_user_ns from RLIMIT_NPROC

Posted by Eric W. Biederman 4 years, 4 months ago

In [1] Michal Koutný <mkoutny@suse.com> reported that it does not make
sense to unconditionally exempt the INIT_USER during fork and exec
from RLIMIT_NPROC and then to impose a limit on that same user with
is_ucounts_overlimit.  So I looked into why the exeception was added.

commit 909cc4ae86f3 ("[PATCH] Fix two bugs with process limits (RLIMIT_NPROC)") says:
>  If a setuid process swaps it's real and effective uids and then
>  forks, the fork fails if the new realuid has more processes than
>  the original process was limited to.  This is particularly a
>  problem if a user with a process limit (e.g. 256) runs a
>  setuid-root program which does setuid() + fork() (e.g. lprng) while
>  root already has more than 256 process (which is quite possible).
>
>  The root problem here is that a limit which should be a per-user
>  limit is being implemented as a per-process limit with per-process
>  (e.g. CAP_SYS_RESOURCE) controls.  Being a per-user limit, it
>  should be that the root-user can over-ride it, not just some
>  process with CAP_SYS_RESOURCE.
>
>  This patch adds a test to ignore process limits if the real user is root.

With the moving of the limits from per-user to per-user-per-user_ns it
is clear that testing a user_struct is no longer the proper test and
the test should be a test against the ucounts struct to match the
rest of the logic change that was made.

With RLIMIT_NPROC not being enforced for the global root user anywhere
else should it be enforced in is_ucounts_overlimit for a user
namespace created by the global root user?

Since this is limited to just the global root user, and RLIMIT_NPROC
is already ignored for that user I am going to vote no.

This change does imply that it becomes possible to limit all users in
a user namespace but to not enforce the rlimits on the root user or
anyone with CAP_SYS_ADMIN and CAP_SYS_RESOURCE in the user namespace.
It is not clear to me why any of those exceptions exist so I figure
we should until this is actually a problem for someone before
we relax the permission checks here.

Cc: stable@vger.kernel.org
Reported-by: Michal Koutný <mkoutny@suse.com>
[1] https://lkml.kernel.org/r/20220207121800.5079-5-mkoutny@suse.com
History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c               | 2 +-
 kernel/fork.c           | 2 +-
 kernel/user_namespace.c | 2 ++
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 1e7f757cbc2c..01c8c7bae9b4 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1881,7 +1881,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 	 */
 	if ((current->flags & PF_NPROC_CHECK) &&
 	    is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
-	    (current_user() != INIT_USER) &&
+	    (current_ucounts() != &init_ucounts) &&
 	    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN)) {
 		retval = -EAGAIN;
 		goto out_ret;
diff --git a/kernel/fork.c b/kernel/fork.c
index 2b6a28a86325..6f62d37f3650 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2027,7 +2027,7 @@ static __latent_entropy struct task_struct *copy_process(
 
 	retval = -EAGAIN;
 	if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
-		if (p->real_cred->user != INIT_USER &&
+		if ((task_ucounts(p) != &init_ucounts) &&
 		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
 			goto bad_fork_cleanup_count;
 	}
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 6b2e3ca7ee99..f0c04073403d 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -123,6 +123,8 @@ int create_user_ns(struct cred *new)
 		ns->ucount_max[i] = INT_MAX;
 	}
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC));
+	if (new->ucounts == &init_ucounts)
+		set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, RLIMIT_INFINITY);
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE));
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING));
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));
-- 
2.29.2

Re: [PATCH 4/8] ucounts: Only except the root user in init_user_ns from RLIMIT_NPROC

Posted by Michal Koutný 4 years, 4 months ago

On Thu, Feb 10, 2022 at 08:13:20PM -0600, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
> @@ -1881,7 +1881,7 @@ static int do_execveat_common(int fd, struct filename *filename,
[...]
> -	    (current_user() != INIT_USER) &&
> +	    (current_ucounts() != &init_ucounts) &&
[...]
> @@ -2027,7 +2027,7 @@ static __latent_entropy struct task_struct *copy_process(
[...]
> -		if (p->real_cred->user != INIT_USER &&
> +		if ((task_ucounts(p) != &init_ucounts) &&

These substitutions make sense to me.

>  		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
>  			goto bad_fork_cleanup_count;
>  	}
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 6b2e3ca7ee99..f0c04073403d 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -123,6 +123,8 @@ int create_user_ns(struct cred *new)
>  		ns->ucount_max[i] = INT_MAX;
>  	}
>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC));
> +	if (new->ucounts == &init_ucounts)
> +		set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, RLIMIT_INFINITY);
>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE));
>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING));
>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));

First, I wanted to object this double fork_init() but I realized it's
relevant for newly created user_ns.

Second, I think new->ucounts would be correct at this point and the
check should be

> if (ucounts == &init_ucounts)

i.e. before set_cred_ucounts() new->ucounts may not be correct.

I'd suggest also a comment in the create_user_ns() explaining the
reason is to exempt global root from RLIMINT_NRPOC also indirectly via
descendant user_nss.

Thanks,
Michal

Re: [PATCH 4/8] ucounts: Only except the root user in init_user_ns from RLIMIT_NPROC

Posted by Eric W. Biederman 4 years, 4 months ago

Michal Koutný <mkoutny@suse.com> writes:

> On Thu, Feb 10, 2022 at 08:13:20PM -0600, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> @@ -1881,7 +1881,7 @@ static int do_execveat_common(int fd, struct filename *filename,
> [...]
>> -	    (current_user() != INIT_USER) &&
>> +	    (current_ucounts() != &init_ucounts) &&
> [...]
>> @@ -2027,7 +2027,7 @@ static __latent_entropy struct task_struct *copy_process(
> [...]
>> -		if (p->real_cred->user != INIT_USER &&
>> +		if ((task_ucounts(p) != &init_ucounts) &&
>
> These substitutions make sense to me.
>
>>  		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
>>  			goto bad_fork_cleanup_count;
>>  	}
>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
>> index 6b2e3ca7ee99..f0c04073403d 100644
>> --- a/kernel/user_namespace.c
>> +++ b/kernel/user_namespace.c
>> @@ -123,6 +123,8 @@ int create_user_ns(struct cred *new)
>>  		ns->ucount_max[i] = INT_MAX;
>>  	}
>>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC));
>> +	if (new->ucounts == &init_ucounts)
>> +		set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, RLIMIT_INFINITY);
>>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE));
>>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING));
>>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));
>
> First, I wanted to object this double fork_init() but I realized it's
> relevant for newly created user_ns.
>
> Second, I think new->ucounts would be correct at this point and the
> check should be
>
>> if (ucounts == &init_ucounts)
>
> i.e. before set_cred_ucounts() new->ucounts may not be correct.
>
> I'd suggest also a comment in the create_user_ns() explaining the
> reason is to exempt global root from RLIMINT_NRPOC also indirectly via
> descendant user_nss.

Yes.

This one got culled from my next version of the patchset as it is not
conservative enough.  I think it is probably the right general
direction.

On further reflection I am not convinced that it makes sense to test
user or ucounts.  They are really not fields designed to support
permission checks.

I think if we want to exempt the root user's children from the root
users rlimit using the second set_rlimit_ucount_max is the way to go.

Someone filed a bug that strongly suggests that we want the second
set_rlimit_ucount_max:
https://bugzilla.kernel.org/show_bug.cgi?id=215596

I am still trying to understand that case. 

Eric

[PATCH 5/8] ucounts: Handle wrapping in is_ucounts_overlimit

Posted by Eric W. Biederman 4 years, 4 months ago

While examining is_ucounts_overlimit and reading the various messages
I realized that is_ucounts_overlimit fails to deal with counts that
may have wrapped.

Being wrapped should be a transitory state for counts and they should
never be wrapped for long, but it can happen so handle it.

Cc: stable@vger.kernel.org
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/ucount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 65b597431c86..06ea04d44685 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -350,7 +350,8 @@ bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsign
 	if (rlimit > LONG_MAX)
 		max = LONG_MAX;
 	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-		if (get_ucounts_value(iter, type) > max)
+		long val = get_ucounts_value(iter, type);
+		if (val < 0 || val > max)
 			return true;
 		max = READ_ONCE(iter->ns->ucount_max[type]);
 	}
-- 
2.29.2

Re: [PATCH 5/8] ucounts: Handle wrapping in is_ucounts_overlimit

Posted by Solar Designer 4 years, 4 months ago

On Thu, Feb 10, 2022 at 08:13:21PM -0600, Eric W. Biederman wrote:
> While examining is_ucounts_overlimit and reading the various messages
> I realized that is_ucounts_overlimit fails to deal with counts that
> may have wrapped.
> 
> Being wrapped should be a transitory state for counts and they should
> never be wrapped for long, but it can happen so handle it.
> 
> Cc: stable@vger.kernel.org
> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  kernel/ucount.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index 65b597431c86..06ea04d44685 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -350,7 +350,8 @@ bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsign
>  	if (rlimit > LONG_MAX)
>  		max = LONG_MAX;
>  	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
> -		if (get_ucounts_value(iter, type) > max)
> +		long val = get_ucounts_value(iter, type);
> +		if (val < 0 || val > max)
>  			return true;
>  		max = READ_ONCE(iter->ns->ucount_max[type]);
>  	}

You probably deliberately assume "gcc -fwrapv", but otherwise:

As you're probably aware, a signed integer wrapping is undefined
behavior in C.  In the function above, "val" having wrapped to negative
assumes we had occurred UB elsewhere.  Further, there's an instance of
UB in the function itself:

bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long rlimit)
{
	struct ucounts *iter;
	long max = rlimit;
	if (rlimit > LONG_MAX)
		max = LONG_MAX;

The assignment on "long max = rlimit;" would have already been UB if
"rlimit > LONG_MAX", which is only checked afterwards.  I think the
above would be better written as:

	if (rlimit > LONG_MAX)
		rlimit = LONG_MAX;
	long max = rlimit;

considering that "rlimit" is never used further in that function.

And to more likely avoid wraparound of "val", perhaps have the limit at
a value significantly lower than LONG_MAX, like half that?  So:

	if (rlimit > LONG_MAX / 2)
		rlimit = LONG_MAX / 2;
	long max = rlimit;

And sure, also keep the "val < 0" check as defensive programming, or you
can do:

	if (rlimit > LONG_MAX / 2)
		rlimit = LONG_MAX / 2;
[...]
		if ((unsigned long)get_ucounts_value(iter, type) > rlimit)
			return true;

and drop both "val" and "max".  However, this also assumes the return
type of get_ucounts_value() doesn't become larger than "unsigned long".

I assume that once is_ucounts_overlimit() returned true, it is expected
the value would almost not grow further (except a little due to races).

I also assume there's some reason a signed type is used there.

Alexander

Re: [PATCH 5/8] ucounts: Handle wrapping in is_ucounts_overlimit

Posted by Eric W. Biederman 4 years, 4 months ago

Solar Designer <solar@openwall.com> writes:

> On Thu, Feb 10, 2022 at 08:13:21PM -0600, Eric W. Biederman wrote:
>> While examining is_ucounts_overlimit and reading the various messages
>> I realized that is_ucounts_overlimit fails to deal with counts that
>> may have wrapped.
>> 
>> Being wrapped should be a transitory state for counts and they should
>> never be wrapped for long, but it can happen so handle it.
>> 
>> Cc: stable@vger.kernel.org
>> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  kernel/ucount.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>> 
>> diff --git a/kernel/ucount.c b/kernel/ucount.c
>> index 65b597431c86..06ea04d44685 100644
>> --- a/kernel/ucount.c
>> +++ b/kernel/ucount.c
>> @@ -350,7 +350,8 @@ bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsign
>>  	if (rlimit > LONG_MAX)
>>  		max = LONG_MAX;
>>  	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
>> -		if (get_ucounts_value(iter, type) > max)
>> +		long val = get_ucounts_value(iter, type);
>> +		if (val < 0 || val > max)
>>  			return true;
>>  		max = READ_ONCE(iter->ns->ucount_max[type]);
>>  	}
>
> You probably deliberately assume "gcc -fwrapv", but otherwise:
>
> As you're probably aware, a signed integer wrapping is undefined
> behavior in C.  In the function above, "val" having wrapped to negative
> assumes we had occurred UB elsewhere.  Further, there's an instance of
> UB in the function itself:

While in cases like this we pass the value in a long, the operations on
the value occur in an atomic_long_t.  As atomic_long_t is implemented in
assembly we do escape the problems of undefined behavior.

> bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long rlimit)
> {
> 	struct ucounts *iter;
> 	long max = rlimit;
> 	if (rlimit > LONG_MAX)
> 		max = LONG_MAX;
>
> The assignment on "long max = rlimit;" would have already been UB if
> "rlimit > LONG_MAX", which is only checked afterwards.  I think the
> above would be better written as:
>
> 	if (rlimit > LONG_MAX)
> 		rlimit = LONG_MAX;
> 	long max = rlimit;
>
> considering that "rlimit" is never used further in that function.

Thank you for spotting that.  That looks like a good idea.  Even if it
works in this case it is better to establish patterns that are not
problematic if copy and pasted elsewhere.

> And to more likely avoid wraparound of "val", perhaps have the limit at
> a value significantly lower than LONG_MAX, like half that?  So:

For the case of RLIMIT_NPROC the real world limit is PID_MAX_LIMIT
which is 2^22.

Beyond that the code deliberately uses all values with the high bit/sign
bit set to flag that things went too high.  So the code already reserves
half of the values.

> I assume that once is_ucounts_overlimit() returned true, it is expected
> the value would almost not grow further (except a little due to races).

Pretty much. The function essentially only exists so that we can
handle the weirdness of RLIMIT_NPROC.  Now that I have discovered the
weirdness of RLIMIT_NPROC is old historical sloppiness I expect the
proper solution is to rework how RLIMIT_NPROC operates and to remove
is_ucounts_overlimit all together.   I have to figure out what a proper
RLIMIT_NPROC check looks like in proc.

Eric

Re: [PATCH 5/8] ucounts: Handle wrapping in is_ucounts_overlimit

Posted by Eric W. Biederman 4 years, 4 months ago

"Eric W. Biederman" <ebiederm@xmission.com> writes:

> Solar Designer <solar@openwall.com> writes:
>
>> On Thu, Feb 10, 2022 at 08:13:21PM -0600, Eric W. Biederman wrote:
>>> While examining is_ucounts_overlimit and reading the various messages
>>> I realized that is_ucounts_overlimit fails to deal with counts that
>>> may have wrapped.
>>> 
>>> Being wrapped should be a transitory state for counts and they should
>>> never be wrapped for long, but it can happen so handle it.
>>> 
>>> Cc: stable@vger.kernel.org
>>> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>> ---
>>>  kernel/ucount.c | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>> 
>>> diff --git a/kernel/ucount.c b/kernel/ucount.c
>>> index 65b597431c86..06ea04d44685 100644
>>> --- a/kernel/ucount.c
>>> +++ b/kernel/ucount.c
>>> @@ -350,7 +350,8 @@ bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsign
>>>  	if (rlimit > LONG_MAX)
>>>  		max = LONG_MAX;
>>>  	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
>>> -		if (get_ucounts_value(iter, type) > max)
>>> +		long val = get_ucounts_value(iter, type);
>>> +		if (val < 0 || val > max)
>>>  			return true;
>>>  		max = READ_ONCE(iter->ns->ucount_max[type]);
>>>  	}
>>
>> You probably deliberately assume "gcc -fwrapv", but otherwise:
>>
>> As you're probably aware, a signed integer wrapping is undefined
>> behavior in C.  In the function above, "val" having wrapped to negative
>> assumes we had occurred UB elsewhere.  Further, there's an instance of
>> UB in the function itself:
>
> While in cases like this we pass the value in a long, the operations on
> the value occur in an atomic_long_t.  As atomic_long_t is implemented in
> assembly we do escape the problems of undefined behavior.
>
>
>> bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long rlimit)
>> {
>> 	struct ucounts *iter;
>> 	long max = rlimit;
>> 	if (rlimit > LONG_MAX)
>> 		max = LONG_MAX;
>>
>> The assignment on "long max = rlimit;" would have already been UB if
>> "rlimit > LONG_MAX", which is only checked afterwards.  I think the
>> above would be better written as:
>>
>> 	if (rlimit > LONG_MAX)
>> 		rlimit = LONG_MAX;
>> 	long max = rlimit;
>>
>> considering that "rlimit" is never used further in that function.
>
> Thank you for spotting that.  That looks like a good idea.  Even if it
> works in this case it is better to establish patterns that are not
> problematic if copy and pasted elsewhere.
>
>> And to more likely avoid wraparound of "val", perhaps have the limit at
>> a value significantly lower than LONG_MAX, like half that?  So:
>
> For the case of RLIMIT_NPROC the real world limit is PID_MAX_LIMIT
> which is 2^22.
>
> Beyond that the code deliberately uses all values with the high bit/sign
> bit set to flag that things went too high.  So the code already reserves
> half of the values.
>
>> I assume that once is_ucounts_overlimit() returned true, it is expected
>> the value would almost not grow further (except a little due to races).
>
> Pretty much. The function essentially only exists so that we can
> handle the weirdness of RLIMIT_NPROC.  Now that I have discovered the
> weirdness of RLIMIT_NPROC is old historical sloppiness I expect the
> proper solution is to rework how RLIMIT_NPROC operates and to remove
> is_ucounts_overlimit all together.   I have to figure out what a proper
> RLIMIT_NPROC check looks like in proc.
                                   ^^^^ execve

Eric

Re: [PATCH 5/8] ucounts: Handle wrapping in is_ucounts_overlimit

Posted by Michal Koutný 4 years, 4 months ago

On Mon, Feb 14, 2022 at 09:23:09AM -0600, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
> Pretty much. The function essentially only exists so that we can
> handle the weirdness of RLIMIT_NPROC.
> Now that I have discovered the
> weirdness of RLIMIT_NPROC is old historical sloppiness I expect the
> proper solution is to rework how RLIMIT_NPROC operates and to remove
> is_ucounts_overlimit all together.

The fork path could make do with some kind of atomic add+check (similar
to inc_ucounts) and overflows would be sanitized by that. (Seems to
apply to other former RLIMIT_* per-user counters too.)

The is_ucounts_overlimit() and overflowable increment indeed appears
necessary only to satisfy the set*uid+execve pair.

For the sake of bug-fixing, both the patches 5/8 and 6/8 can have 
Reviewed-by: Michal Koutný <mkoutny@suse.com>

Michal

RE: [PATCH 5/8] ucounts: Handle wrapping in is_ucounts_overlimit

Posted by David Laight 4 years, 4 months ago

From: Solar Designer
> Sent: 12 February 2022 22:37
...
> bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long rlimit)
> {
> 	struct ucounts *iter;
> 	long max = rlimit;
> 	if (rlimit > LONG_MAX)
> 		max = LONG_MAX;
> 
> The assignment on "long max = rlimit;" would have already been UB if
> "rlimit > LONG_MAX", which is only checked afterwards.  I think the
> above would be better written as:

I'm pretty sure assignments and casts of negative values to unsigned
types are actually well defined.
Although the actual value may differ for ones-compliment and
sign-overpunch systems.
But I suspect Linux requires twos-compliment negative numbers.

(In much the same way as it requires that NULL be the all zero
bit pattern - although a load of annoying compiler warnings are only
relevant if that isn't the case.)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

[PATCH 6/8] ucounts: Handle inc_rlimit_ucounts wrapping in fork

Posted by Eric W. Biederman 4 years, 4 months ago

Move inc_rlimit_ucounts from copy_creds into copy_process immediately
after copy_creds where it can be called exactly once.  Test for and
handle it when inc_rlimit_ucounts returns LONG_MAX indicating the
count has wrapped.

This is good hygenine and fixes a theoretical bug.  In practice
PID_MAX_LIMIT is at most 2^22 so there is not a chance the number of
processes would ever wrap even on an architecture with a 32bit long.

Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/cred.c | 2 --
 kernel/fork.c | 2 ++
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/cred.c b/kernel/cred.c
index 229cff081167..96d5fd6ff26f 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -358,7 +358,6 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
 		kdebug("share_creds(%p{%d,%d})",
 		       p->cred, atomic_read(&p->cred->usage),
 		       read_cred_subscribers(p->cred));
-		inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
 		return 0;
 	}
 
@@ -395,7 +394,6 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
 #endif
 
 	p->cred = p->real_cred = get_cred(new);
-	inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
 	alter_cred_subscribers(new, 2);
 	validate_creds(new);
 	return 0;
diff --git a/kernel/fork.c b/kernel/fork.c
index 6f62d37f3650..69333078259c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2026,6 +2026,8 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_free;
 
 	retval = -EAGAIN;
+	if (inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1) == LONG_MAX)
+		goto bad_fork_cleanup_count;
 	if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
 		if ((task_ucounts(p) != &init_ucounts) &&
 		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
-- 
2.29.2

Re: [PATCH 6/8] ucounts: Handle inc_rlimit_ucounts wrapping in fork

Posted by Alexey Gladkov 4 years, 4 months ago

On Thu, Feb 10, 2022 at 08:13:22PM -0600, Eric W. Biederman wrote:
> Move inc_rlimit_ucounts from copy_creds into copy_process immediately
> after copy_creds where it can be called exactly once.  Test for and
> handle it when inc_rlimit_ucounts returns LONG_MAX indicating the
> count has wrapped.
> 
> This is good hygenine and fixes a theoretical bug.  In practice
> PID_MAX_LIMIT is at most 2^22 so there is not a chance the number of
> processes would ever wrap even on an architecture with a 32bit long.
> 
> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  kernel/cred.c | 2 --
>  kernel/fork.c | 2 ++
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 229cff081167..96d5fd6ff26f 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -358,7 +358,6 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>  		kdebug("share_creds(%p{%d,%d})",
>  		       p->cred, atomic_read(&p->cred->usage),
>  		       read_cred_subscribers(p->cred));
> -		inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
>  		return 0;
>  	}
>  
> @@ -395,7 +394,6 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>  #endif
>  
>  	p->cred = p->real_cred = get_cred(new);
> -	inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
>  	alter_cred_subscribers(new, 2);
>  	validate_creds(new);
>  	return 0;
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 6f62d37f3650..69333078259c 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2026,6 +2026,8 @@ static __latent_entropy struct task_struct *copy_process(
>  		goto bad_fork_free;
>  
>  	retval = -EAGAIN;
> +	if (inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1) == LONG_MAX)
> +		goto bad_fork_cleanup_count;
>  	if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
>  		if ((task_ucounts(p) != &init_ucounts) &&
>  		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))

It might make sense to do something like:

	if (inc_rlimit_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1, rlimit(RLIMIT_NPROC)) == LONG_MAX) {
		if ((task_ucounts(p) != &init_ucounts) &&
		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))

and the new function:

long inc_rlimit_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, long v, unsigned long rlimit)
{
	struct ucounts *iter;
	long ret = 0;
	long max = rlimit;
	if (rlimit > LONG_MAX)
		max = LONG_MAX;
	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
		long new = atomic_long_add_return(v, &iter->ucount[type]);
		if (new < 0 || new > max)
			ret = LONG_MAX;
		else if (iter == ucounts)
			ret = new;
		max = READ_ONCE(iter->ns->ucount_max[type]);
	}
	return ret;
}

This will avoid double checking the same userns tree.

Or even modify inc_rlimit_ucounts. This function is used elsewhere like
this:


msgqueue = inc_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
if (msgqueue == LONG_MAX || msgqueue > rlimit(RLIMIT_MSGQUEUE)) {


memlock = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
if (!allowed && (memlock == LONG_MAX || memlock > lock_limit) && !capable(CAP_IPC_LOCK)) {


In all cases, we have max value for comparison.

-- 
Rgrds, legion

Re: [PATCH 6/8] ucounts: Handle inc_rlimit_ucounts wrapping in fork

Posted by Eric W. Biederman 4 years, 4 months ago

Alexey Gladkov <legion@kernel.org> writes:

> On Thu, Feb 10, 2022 at 08:13:22PM -0600, Eric W. Biederman wrote:
>> Move inc_rlimit_ucounts from copy_creds into copy_process immediately
>> after copy_creds where it can be called exactly once.  Test for and
>> handle it when inc_rlimit_ucounts returns LONG_MAX indicating the
>> count has wrapped.
>> 
>> This is good hygenine and fixes a theoretical bug.  In practice
>> PID_MAX_LIMIT is at most 2^22 so there is not a chance the number of
>> processes would ever wrap even on an architecture with a 32bit long.
>> 
>> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  kernel/cred.c | 2 --
>>  kernel/fork.c | 2 ++
>>  2 files changed, 2 insertions(+), 2 deletions(-)
>> 
>> diff --git a/kernel/cred.c b/kernel/cred.c
>> index 229cff081167..96d5fd6ff26f 100644
>> --- a/kernel/cred.c
>> +++ b/kernel/cred.c
>> @@ -358,7 +358,6 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>>  		kdebug("share_creds(%p{%d,%d})",
>>  		       p->cred, atomic_read(&p->cred->usage),
>>  		       read_cred_subscribers(p->cred));
>> -		inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
>>  		return 0;
>>  	}
>>  
>> @@ -395,7 +394,6 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>>  #endif
>>  
>>  	p->cred = p->real_cred = get_cred(new);
>> -	inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
>>  	alter_cred_subscribers(new, 2);
>>  	validate_creds(new);
>>  	return 0;
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 6f62d37f3650..69333078259c 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -2026,6 +2026,8 @@ static __latent_entropy struct task_struct *copy_process(
>>  		goto bad_fork_free;
>>  
>>  	retval = -EAGAIN;
>> +	if (inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1) == LONG_MAX)
>> +		goto bad_fork_cleanup_count;
>>  	if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
>>  		if ((task_ucounts(p) != &init_ucounts) &&
>>  		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
>
> It might make sense to do something like:
>
> 	if (inc_rlimit_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1, rlimit(RLIMIT_NPROC)) == LONG_MAX) {
> 		if ((task_ucounts(p) != &init_ucounts) &&
> 		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
>
> and the new function:
>
> long inc_rlimit_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, long v, unsigned long rlimit)
> {
> 	struct ucounts *iter;
> 	long ret = 0;
> 	long max = rlimit;
> 	if (rlimit > LONG_MAX)
> 		max = LONG_MAX;
> 	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
> 		long new = atomic_long_add_return(v, &iter->ucount[type]);
> 		if (new < 0 || new > max)
> 			ret = LONG_MAX;
> 		else if (iter == ucounts)
> 			ret = new;
> 		max = READ_ONCE(iter->ns->ucount_max[type]);
> 	}
> 	return ret;
> }
>
> This will avoid double checking the same userns tree.
>
> Or even modify inc_rlimit_ucounts. This function is used elsewhere like
> this:
>
>
> msgqueue = inc_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
> if (msgqueue == LONG_MAX || msgqueue > rlimit(RLIMIT_MSGQUEUE)) {
>
>
> memlock = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
> if (!allowed && (memlock == LONG_MAX || memlock > lock_limit) && !capable(CAP_IPC_LOCK)) {
>
>
> In all cases, we have max value for comparison.

Good point.   The downside is that it means we can't use the same code
in exec.  The upside is that the code is more idiomatic.

Eric

Re: [PATCH 6/8] ucounts: Handle inc_rlimit_ucounts wrapping in fork

Posted by Shuah Khan 4 years, 4 months ago

On 2/11/22 10:50 AM, Eric W. Biederman wrote:
> Alexey Gladkov <legion@kernel.org> writes:
> 
>> On Thu, Feb 10, 2022 at 08:13:22PM -0600, Eric W. Biederman wrote:
>>> Move inc_rlimit_ucounts from copy_creds into copy_process immediately
>>> after copy_creds where it can be called exactly once.  Test for and
>>> handle it when inc_rlimit_ucounts returns LONG_MAX indicating the
>>> count has wrapped.
>>>
>>> This is good hygenine and fixes a theoretical bug.  In practice
>>> PID_MAX_LIMIT is at most 2^22 so there is not a chance the number of
>>> processes would ever wrap even on an architecture with a 32bit long.
>>>
>>> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>> ---
>>>   kernel/cred.c | 2 --
>>>   kernel/fork.c | 2 ++
>>>   2 files changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/kernel/cred.c b/kernel/cred.c
>>> index 229cff081167..96d5fd6ff26f 100644
>>> --- a/kernel/cred.c
>>> +++ b/kernel/cred.c
>>> @@ -358,7 +358,6 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>>>   		kdebug("share_creds(%p{%d,%d})",
>>>   		       p->cred, atomic_read(&p->cred->usage),
>>>   		       read_cred_subscribers(p->cred));
>>> -		inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
>>>   		return 0;
>>>   	}
>>>   
>>> @@ -395,7 +394,6 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>>>   #endif
>>>   
>>>   	p->cred = p->real_cred = get_cred(new);
>>> -	inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
>>>   	alter_cred_subscribers(new, 2);
>>>   	validate_creds(new);
>>>   	return 0;
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index 6f62d37f3650..69333078259c 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -2026,6 +2026,8 @@ static __latent_entropy struct task_struct *copy_process(
>>>   		goto bad_fork_free;
>>>   
>>>   	retval = -EAGAIN;
>>> +	if (inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1) == LONG_MAX)
>>> +		goto bad_fork_cleanup_count;
>>>   	if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
>>>   		if ((task_ucounts(p) != &init_ucounts) &&
>>>   		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
>>
>> It might make sense to do something like:
>>
>> 	if (inc_rlimit_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1, rlimit(RLIMIT_NPROC)) == LONG_MAX) {
>> 		if ((task_ucounts(p) != &init_ucounts) &&
>> 		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
>>
>> and the new function:
>>
>> long inc_rlimit_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, long v, unsigned long rlimit)
>> {
>> 	struct ucounts *iter;
>> 	long ret = 0;
>> 	long max = rlimit;
>> 	if (rlimit > LONG_MAX)
>> 		max = LONG_MAX;
>> 	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
>> 		long new = atomic_long_add_return(v, &iter->ucount[type]);
>> 		if (new < 0 || new > max)
>> 			ret = LONG_MAX;
>> 		else if (iter == ucounts)
>> 			ret = new;
>> 		max = READ_ONCE(iter->ns->ucount_max[type]);
>> 	}
>> 	return ret;
>> }
>>
>> This will avoid double checking the same userns tree.
>>
>> Or even modify inc_rlimit_ucounts. This function is used elsewhere like
>> this:
>>
>>
>> msgqueue = inc_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
>> if (msgqueue == LONG_MAX || msgqueue > rlimit(RLIMIT_MSGQUEUE)) {
>>
>>
>> memlock = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
>> if (!allowed && (memlock == LONG_MAX || memlock > lock_limit) && !capable(CAP_IPC_LOCK)) {
>>
>>
>> In all cases, we have max value for comparison.
> 
> Good point.   The downside is that it means we can't use the same code
> in exec.  The upside is that the code is more idiomatic.
> 

Checking on this a bit more on other callers of inc_rlimit_ucounts(),
we might have another issue:

1. mqueue_get_inode() does:
      spin_lock(&mq_lock);
      msgqueue = inc_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
                         if (msgqueue == LONG_MAX || msgqueue > rlimit(RLIMIT_MSGQUEUE)) {
                                 dec_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
                                 spin_unlock(&mq_lock);


2. user_shm_lock() &shmlock_user_lock
     spin_lock(&shmlock_user_lock);
         memlock = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);

         if (!allowed && (memlock == LONG_MAX || memlock > lock_limit) && !capable(CAP_IPC_LOCK)) {
                 dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);

3. user_namespace_sysctl_init() Doesn't check for max value.

4. copy_creds() doesn't check for max value in its 3 calls to inc_rlimit_ucounts()

You can see that each of these instances some callers dec_rlimit_ucounts().
They hold different locks. So do we have a window where LONG_MAX could
overflow and go unnoticed?

thanks,
-- Shuah

Re: [PATCH 6/8] ucounts: Handle inc_rlimit_ucounts wrapping in fork

Posted by Alexey Gladkov 4 years, 4 months ago

On Fri, Feb 11, 2022 at 11:50:46AM -0600, Eric W. Biederman wrote:
> Alexey Gladkov <legion@kernel.org> writes:
> 
> > On Thu, Feb 10, 2022 at 08:13:22PM -0600, Eric W. Biederman wrote:
> >> Move inc_rlimit_ucounts from copy_creds into copy_process immediately
> >> after copy_creds where it can be called exactly once.  Test for and
> >> handle it when inc_rlimit_ucounts returns LONG_MAX indicating the
> >> count has wrapped.
> >> 
> >> This is good hygenine and fixes a theoretical bug.  In practice
> >> PID_MAX_LIMIT is at most 2^22 so there is not a chance the number of
> >> processes would ever wrap even on an architecture with a 32bit long.
> >> 
> >> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
> >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> >> ---
> >>  kernel/cred.c | 2 --
> >>  kernel/fork.c | 2 ++
> >>  2 files changed, 2 insertions(+), 2 deletions(-)
> >> 
> >> diff --git a/kernel/cred.c b/kernel/cred.c
> >> index 229cff081167..96d5fd6ff26f 100644
> >> --- a/kernel/cred.c
> >> +++ b/kernel/cred.c
> >> @@ -358,7 +358,6 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
> >>  		kdebug("share_creds(%p{%d,%d})",
> >>  		       p->cred, atomic_read(&p->cred->usage),
> >>  		       read_cred_subscribers(p->cred));
> >> -		inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
> >>  		return 0;
> >>  	}
> >>  
> >> @@ -395,7 +394,6 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
> >>  #endif
> >>  
> >>  	p->cred = p->real_cred = get_cred(new);
> >> -	inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
> >>  	alter_cred_subscribers(new, 2);
> >>  	validate_creds(new);
> >>  	return 0;
> >> diff --git a/kernel/fork.c b/kernel/fork.c
> >> index 6f62d37f3650..69333078259c 100644
> >> --- a/kernel/fork.c
> >> +++ b/kernel/fork.c
> >> @@ -2026,6 +2026,8 @@ static __latent_entropy struct task_struct *copy_process(
> >>  		goto bad_fork_free;
> >>  
> >>  	retval = -EAGAIN;
> >> +	if (inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1) == LONG_MAX)
> >> +		goto bad_fork_cleanup_count;
> >>  	if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
> >>  		if ((task_ucounts(p) != &init_ucounts) &&
> >>  		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
> >
> > It might make sense to do something like:
> >
> > 	if (inc_rlimit_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1, rlimit(RLIMIT_NPROC)) == LONG_MAX) {
> > 		if ((task_ucounts(p) != &init_ucounts) &&
> > 		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
> >
> > and the new function:
> >
> > long inc_rlimit_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, long v, unsigned long rlimit)
> > {
> > 	struct ucounts *iter;
> > 	long ret = 0;
> > 	long max = rlimit;
> > 	if (rlimit > LONG_MAX)
> > 		max = LONG_MAX;
> > 	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
> > 		long new = atomic_long_add_return(v, &iter->ucount[type]);
> > 		if (new < 0 || new > max)
> > 			ret = LONG_MAX;
> > 		else if (iter == ucounts)
> > 			ret = new;
> > 		max = READ_ONCE(iter->ns->ucount_max[type]);
> > 	}
> > 	return ret;
> > }
> >
> > This will avoid double checking the same userns tree.
> >
> > Or even modify inc_rlimit_ucounts. This function is used elsewhere like
> > this:
> >
> >
> > msgqueue = inc_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
> > if (msgqueue == LONG_MAX || msgqueue > rlimit(RLIMIT_MSGQUEUE)) {
> >
> >
> > memlock = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
> > if (!allowed && (memlock == LONG_MAX || memlock > lock_limit) && !capable(CAP_IPC_LOCK)) {
> >
> >
> > In all cases, we have max value for comparison.
> 
> Good point.   The downside is that it means we can't use the same code
> in exec.  The upside is that the code is more idiomatic.

My suggestion was before I saw the 8/8 patch :)

We can make something like:

static inline bool is_nproc_overlimit(struct task_struct *task)
{
	return (task_ucounts(task) != &init_ucounts) &&
		!has_capability(task, CAP_SYS_RESOURCE) &&
		!has_capability(task, CAP_SYS_ADMIN);
}

In copy_process:

if (inc_rlimit_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1, rlimit(RLIMIT_NPROC)) == LONG_MAX) {
	if (is_nproc_overlimit(p))
		goto bad_fork_cleanup_count;
}

In do_execveat_common:

if ((current->flags & PF_NPROC_CHECK) &&
    is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
    is_nproc_overlimit(current)) {
	retval = -EAGAIN;
	goto out_ret;
}

-- 
Rgrds, legion

Re: [PATCH 6/8] ucounts: Handle inc_rlimit_ucounts wrapping in fork

Posted by Eric W. Biederman 4 years, 4 months ago

Alexey Gladkov <legion@kernel.org> writes:

> On Fri, Feb 11, 2022 at 11:50:46AM -0600, Eric W. Biederman wrote:
>> Alexey Gladkov <legion@kernel.org> writes:
>> 
>> > On Thu, Feb 10, 2022 at 08:13:22PM -0600, Eric W. Biederman wrote:
>> >> Move inc_rlimit_ucounts from copy_creds into copy_process immediately
>> >> after copy_creds where it can be called exactly once.  Test for and
>> >> handle it when inc_rlimit_ucounts returns LONG_MAX indicating the
>> >> count has wrapped.
>> >> 
>> >> This is good hygenine and fixes a theoretical bug.  In practice
>> >> PID_MAX_LIMIT is at most 2^22 so there is not a chance the number of
>> >> processes would ever wrap even on an architecture with a 32bit long.
>> >> 
>> >> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
>> >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> >> ---
>> >>  kernel/cred.c | 2 --
>> >>  kernel/fork.c | 2 ++
>> >>  2 files changed, 2 insertions(+), 2 deletions(-)
>> >> 
>> >> diff --git a/kernel/cred.c b/kernel/cred.c
>> >> index 229cff081167..96d5fd6ff26f 100644
>> >> --- a/kernel/cred.c
>> >> +++ b/kernel/cred.c
>> >> @@ -358,7 +358,6 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>> >>  		kdebug("share_creds(%p{%d,%d})",
>> >>  		       p->cred, atomic_read(&p->cred->usage),
>> >>  		       read_cred_subscribers(p->cred));
>> >> -		inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
>> >>  		return 0;
>> >>  	}
>> >>  
>> >> @@ -395,7 +394,6 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>> >>  #endif
>> >>  
>> >>  	p->cred = p->real_cred = get_cred(new);
>> >> -	inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
>> >>  	alter_cred_subscribers(new, 2);
>> >>  	validate_creds(new);
>> >>  	return 0;
>> >> diff --git a/kernel/fork.c b/kernel/fork.c
>> >> index 6f62d37f3650..69333078259c 100644
>> >> --- a/kernel/fork.c
>> >> +++ b/kernel/fork.c
>> >> @@ -2026,6 +2026,8 @@ static __latent_entropy struct task_struct *copy_process(
>> >>  		goto bad_fork_free;
>> >>  
>> >>  	retval = -EAGAIN;
>> >> +	if (inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1) == LONG_MAX)
>> >> +		goto bad_fork_cleanup_count;
>> >>  	if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
>> >>  		if ((task_ucounts(p) != &init_ucounts) &&
>> >>  		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
>> >
>> > It might make sense to do something like:
>> >
>> > 	if (inc_rlimit_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1, rlimit(RLIMIT_NPROC)) == LONG_MAX) {
>> > 		if ((task_ucounts(p) != &init_ucounts) &&
>> > 		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
>> >
>> > and the new function:
>> >
>> > long inc_rlimit_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, long v, unsigned long rlimit)
>> > {
>> > 	struct ucounts *iter;
>> > 	long ret = 0;
>> > 	long max = rlimit;
>> > 	if (rlimit > LONG_MAX)
>> > 		max = LONG_MAX;
>> > 	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
>> > 		long new = atomic_long_add_return(v, &iter->ucount[type]);
>> > 		if (new < 0 || new > max)
>> > 			ret = LONG_MAX;
>> > 		else if (iter == ucounts)
>> > 			ret = new;
>> > 		max = READ_ONCE(iter->ns->ucount_max[type]);
>> > 	}
>> > 	return ret;
>> > }
>> >
>> > This will avoid double checking the same userns tree.
>> >
>> > Or even modify inc_rlimit_ucounts. This function is used elsewhere like
>> > this:
>> >
>> >
>> > msgqueue = inc_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
>> > if (msgqueue == LONG_MAX || msgqueue > rlimit(RLIMIT_MSGQUEUE)) {
>> >
>> >
>> > memlock = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
>> > if (!allowed && (memlock == LONG_MAX || memlock > lock_limit) && !capable(CAP_IPC_LOCK)) {
>> >
>> >
>> > In all cases, we have max value for comparison.
>> 
>> Good point.   The downside is that it means we can't use the same code
>> in exec.  The upside is that the code is more idiomatic.
>
> My suggestion was before I saw the 8/8 patch :)
>
> We can make something like:
>
> static inline bool is_nproc_overlimit(struct task_struct *task)
> {
> 	return (task_ucounts(task) != &init_ucounts) &&
> 		!has_capability(task, CAP_SYS_RESOURCE) &&
> 		!has_capability(task, CAP_SYS_ADMIN);
> }
>
> In copy_process:
>
> if (inc_rlimit_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1, rlimit(RLIMIT_NPROC)) == LONG_MAX) {
> 	if (is_nproc_overlimit(p))
> 		goto bad_fork_cleanup_count;
> }
>
> In do_execveat_common:
>
> if ((current->flags & PF_NPROC_CHECK) &&
>     is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
>     is_nproc_overlimit(current)) {
> 	retval = -EAGAIN;
> 	goto out_ret;
> }


The more I think about it the more I suspect 8/8 is the wrong way to go.

The report is that adding the capability calls in kernel/sys.c which I
moved into execve broke apache.  As the change was about removing
inconsistencies I expect I should just start with the revert and keep
the difference between the two code paths.

My gut feel is that both the capable and the magic exception of a user
are wrong.  If I am wrong people can report a bug and the code can get
fixed.

But definitely a bug fix branch is the wrong place to be expanding what
is allowed without it clearly being a bug.

Eric

[PATCH 7/8] rlimit: For RLIMIT_NPROC test the child not the parent for capabilites

Posted by Eric W. Biederman 4 years, 4 months ago

Ever since capabilities have become user namespace relative the
capability checks to allow overrriing RLIMIT_NPROC in fork has been
wrong.  It is desirable to test the capabilities the new process will
have not to test the capabilities of the existing process.  In all
cases except when creating a user namespace this does not matter, and
even then not enforcing RLIMIT_NPROC on the root_user probably makes
such a test moot.

Still the test is wrong in principle so fix it to the more stringent test.

Especially now that RLIMIT_NPROC enforcement has become per user namespace.

Fixes: 3486740a4f32 ("userns: security: make capabilities relative to the user namespace")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/fork.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 69333078259c..79661678a5bf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2030,7 +2030,8 @@ static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_cleanup_count;
 	if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
 		if ((task_ucounts(p) != &init_ucounts) &&
-		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
+		    !has_capability(p, CAP_SYS_RESOURCE) &&
+		    !has_capability(p, CAP_SYS_ADMIN))
 			goto bad_fork_cleanup_count;
 	}
 	current->flags &= ~PF_NPROC_CHECK;
-- 
2.29.2

[PATCH 8/8] ucounts: Use the same code to enforce RLIMIT_NPROC in fork and exec

Posted by Eric W. Biederman 4 years, 4 months ago

Historically these pieces of code have slightly diverged and caused
problems.  To avoid that in the future create a common function to see
if RLIMIT_NPROC is over a limit and the limit should be enforced.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c                    |  7 ++-----
 include/linux/sched/signal.h |  2 ++
 kernel/fork.c                | 25 +++++++++++++++++++------
 3 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 01c8c7bae9b4..8913dbb9a479 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1875,14 +1875,11 @@ static int do_execveat_common(int fd, struct filename *filename,
 		return PTR_ERR(filename);
 
 	/*
-	 * After calling set*uid() is RLIMT_NPROC exceeded?
+	 * After calling set*uid() is RLIMIT_NPROC exceeded?
 	 * This can not be checked in set*uid() because too many programs don't
 	 * check the setuid() return code.
 	 */
-	if ((current->flags & PF_NPROC_CHECK) &&
-	    is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
-	    (current_ucounts() != &init_ucounts) &&
-	    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN)) {
+	if ((current->flags & PF_NPROC_CHECK) && is_nproc_overlimit()) {
 		retval = -EAGAIN;
 		goto out_ret;
 	}
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index b6ecb9fc4cd2..b682131c52fa 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -742,4 +742,6 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
+extern bool is_nproc_overlimit(void);
+
 #endif /* _LINUX_SCHED_SIGNAL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 79661678a5bf..a18f15053e22 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1891,6 +1891,23 @@ static void copy_oom_score_adj(u64 clone_flags, struct task_struct *tsk)
 	mutex_unlock(&oom_adj_mutex);
 }
 
+static bool is_task_nproc_overlimit(struct task_struct *task)
+{
+	struct ucounts *ucounts = task_ucounts(task);
+	/* clone does not change RLIMIT_NPROC. The parents value is safe. */
+	unsigned long limit = rlimit(RLIMIT_NPROC);
+
+	return is_ucounts_overlimit(ucounts, UCOUNT_RLIMIT_NPROC, limit) &&
+		(ucounts != &init_ucounts) &&
+		!has_capability(task, CAP_SYS_RESOURCE) &&
+		!has_capability(task, CAP_SYS_ADMIN);
+}
+
+bool is_nproc_overlimit(void)
+{
+	return is_task_nproc_overlimit(current);
+}
+
 /*
  * This creates a new process as a copy of the old one,
  * but does not actually start it yet.
@@ -2028,12 +2045,8 @@ static __latent_entropy struct task_struct *copy_process(
 	retval = -EAGAIN;
 	if (inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1) == LONG_MAX)
 		goto bad_fork_cleanup_count;
-	if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
-		if ((task_ucounts(p) != &init_ucounts) &&
-		    !has_capability(p, CAP_SYS_RESOURCE) &&
-		    !has_capability(p, CAP_SYS_ADMIN))
-			goto bad_fork_cleanup_count;
-	}
+	if (is_task_nproc_overlimit(p))
+		goto bad_fork_cleanup_count;
 	current->flags &= ~PF_NPROC_CHECK;
 
 	/*
-- 
2.29.2

Re: [RFC PATCH 0/6] RLIMIT_NPROC in ucounts fixups

Posted by Etienne Dechamps 4 years, 4 months ago

Hello there,

On 07/02/2022 12:17, Michal Koutný wrote:
> This series is a result of looking deeper into breakage of
> tools/testing/selftests/rlimits/rlimits-per-userns.c after
> https://lore.kernel.org/r/20220204181144.24462-1-mkoutny@suse.com/
> is applied.

Pardon the intrusion, but I thought you might be interested to know that 
as a humble user I noticed actual user-visible breakage from 59ec715 
"ucounts: Fix rlimit max values check": 
https://bugzilla.kernel.org/show_bug.cgi?id=215596

I'm not sure I understand everything that's going on in this thread but 
it does seem very relevant. You guys might want to double-check the 
behavior in the particular scenario described there. I'm mostly sending 
this to make sure everything is cross-linked.

Re: [RFC PATCH 0/6] RLIMIT_NPROC in ucounts fixups

Posted by Michal Koutný 4 years, 4 months ago

On Sat, Feb 12, 2022 at 03:32:30PM +0000, Etienne Dechamps <etienne@edechamps.fr> wrote:
> I'm not sure I understand everything that's going on in this thread but it
> does seem very relevant. You guys might want to double-check the behavior in
> the particular scenario described there. I'm mostly sending this to make
> sure everything is cross-linked.

Thanks for the report with strace.

AFAICT, it's caused by setresuid() after unshare(), i.e. all root's
tasks are (wrongly) compared against the lowered RLIMIT_NPROC.

This is tackled by my RFC patch 2/6 [1] or Eric's variant but 3/8
(equivalent fix for this case but I haven't run that build).

Michal

[1] I could run your test (LimitNPROC=1 actually) against kernel with my
patches and the service starts.

Re: [RFC PATCH 0/6] RLIMIT_NPROC in ucounts fixups

Posted by Eric W. Biederman 4 years, 4 months ago

Michal Koutný <mkoutny@suse.com> writes:

> On Sat, Feb 12, 2022 at 03:32:30PM +0000, Etienne Dechamps <etienne@edechamps.fr> wrote:
>> I'm not sure I understand everything that's going on in this thread but it
>> does seem very relevant. You guys might want to double-check the behavior in
>> the particular scenario described there. I'm mostly sending this to make
>> sure everything is cross-linked.
>
> Thanks for the report with strace.
>
> AFAICT, it's caused by setresuid() after unshare(), i.e. all root's
> tasks are (wrongly) compared against the lowered RLIMIT_NPROC.
>
> This is tackled by my RFC patch 2/6 [1] or Eric's variant but 3/8
> (equivalent fix for this case but I haven't run that build).
>
> Michal
>
> [1] I could run your test (LimitNPROC=1 actually) against kernel with my
> patches and the service starts.


So I looked into this and our previous patchsets (but not my final one)
did resolve this.

What fixed it and what is needed to fix this is not enforcing
RLIMIT_NPROC when the user who creates the user namespace is INIT_USER.

AKA something like the patch below.  It is a regression so if at all
possible it needs to be fixed, and it is certainly possible.

The patch below feels right at first glance, but I am not convinced that
testing cred->user or cred->ucounts is the proper test so I am going to
sleep on this a little bit.

I did want everyone to know I looked into this and I am going to ensure
this gets fixed.

diff --git a/kernel/fork.c b/kernel/fork.c
index 17d8a8c85e3b..532ce5cbf851 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2027,7 +2027,7 @@ static __latent_entropy struct task_struct *copy_process(
 
 	retval = -EAGAIN;
 	if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
-		if (p->real_cred->user != INIT_USER &&
+		if (p->real_cred->ucounts != &init_ucounts &&
 		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
 			goto bad_fork_cleanup_count;
 	}
diff --git a/kernel/sys.c b/kernel/sys.c
index 97dc9e5d6bf9..7b5d74a7845c 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -490,7 +490,7 @@ static void flag_nproc_exceeded(struct cred *new)
 	 * failure to the execve() stage.
 	 */
 	if (is_ucounts_overlimit(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
-			new->user != INIT_USER)
+			new->ucounts != &init_ucounts)
 		current->flags |= PF_NPROC_EXCEEDED;
 	else
 		current->flags &= ~PF_NPROC_EXCEEDED;
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 6b2e3ca7ee99..925fb3579ef3 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -123,6 +123,8 @@ int create_user_ns(struct cred *new)
 		ns->ucount_max[i] = INT_MAX;
 	}
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC));
+	if (new->ucounts == &init_ucounts)
+		set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, RLIM_INFINITY);
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE));
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING));
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));

How should rlimits, suid exec, and capabilities interact?

Posted by Eric W. Biederman 4 years, 4 months ago

[CC'd the security list because I really don't know who the right people
 are to drag into this discussion]
 
While looking at some issues that have cropped up with making it so
that RLIMIT_NPROC cannot be escaped by creating a user namespace I have
stumbled upon a very old issue of how rlimits and suid exec interact
poorly.

This specific saga starts with commit 909cc4ae86f3 ("[PATCH] Fix two
bugs with process limits (RLIMIT_NPROC)") from
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git which
essentially replaced a capable() check with a an open-coded
implementation of suser(), for RLIMIT_NPROC.

The description from Neil Brown was:

   1/ If a setuid process swaps it's real and effective uids and then forks,
     the fork fails if the new realuid has more processes
     than the original process was limited to.
     This is particularly a problem if a user with a process limit
     (e.g. 256) runs a setuid-root program which does setuid() + fork()
     (e.g. lprng) while root already has more than 256 process (which
     is quite possible).
    
     The root problem here is that a limit which should be a per-user
     limit is being implemented as a per-process limit with
     per-process (e.g. CAP_SYS_RESOURCE) controls.
     Being a per-user limit, it should be that the root-user can over-ride
     it, not just some process with CAP_SYS_RESOURCE.
    
     This patch adds a test to ignore process limits if the real user is root.

The test to see if the real user is root was:
	if (p->real_cred->user != INIT_USER) ...
which persists to this day in fs/fork.c:copy_process().

The practical problem with this test is that it works like nothing else
in the kernel, and so does not look like what it is.  Saying:
	if (!uid_eq(p->real_cred->uid, GLOBAL_ROOT_USER)) ...

would at least be more recognizable.

Really this entire test should be if (!capable(CAP_SYS_RESOURCE) because
CAP_SYS_RESOURCE is the capability that controls if you are allowed to
exceed your rlimits.

Which brings us to the practical issues of how all of these things are
wired together today.

The per-user rlimits are accounted based upon a processes real user, not
the effective user.  All other permission checks are based upon the
effective user.  This has the practical effect that uids are swapped as
above that the processes are charged to root, but use the permissions of
an ordinary user.

The problems get worse when you realize that suid exec does not reset
any of the rlimits except for RLIMIT_STACK.

The rlimits that are particularly affected and are per-user are:
RLIMIT_NPROC, RLIMIT_MSGQUEUE, RLIMIT_SIGPENDING, RLIMIT_MEMLOCK.

But I think failing to reset rlimits during exec has the potential to
effect any suid exec.

Does anyone have any historical knowledge or sense of how this should
work?

Right now it feels like we have coded ourselves into a corner and will
have to risk breaking userspace to get out of it.  AKA I think we need
a policy of reseting rlimits on suid exec, and I think we need to store
global rlimits based upon the effective user not the real user.  Those
changes should allow making capable calls where they belong, and
removing the much too magic user == INIT_USER test for RLIMIT_NPROC.

Eric

Re: How should rlimits, suid exec, and capabilities interact?

Posted by Andy Lutomirski 4 years, 4 months ago

On Wed, Feb 23, 2022 at 10:00 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
>
> [CC'd the security list because I really don't know who the right people
>  are to drag into this discussion]
>
> While looking at some issues that have cropped up with making it so
> that RLIMIT_NPROC cannot be escaped by creating a user namespace I have
> stumbled upon a very old issue of how rlimits and suid exec interact
> poorly.

Once upon a time, these resource limits were effectively the only way
to control memory consumption and consumption of historically limited
resources like processes.  (The scheduler used to have serious issues
with too many processes -- this is not so true any more.  And without
cgroups, too many processes could use too much CPU collectively.)
This all worked pretty poorly.  Now we have cgroups, fancy memory
accounting, etc.  So I'm wondering if NPROC is even useful anymore.  I
don't have a brilliant idea of how to deprecate it, but I think it
wouldn't be entirely nuts to take it much less seriously and maybe
even eventually get rid of it.

I doubt there is much existing userspace that would break if a
previously failing fork() started succeeding.

--Andy]

Re: How should rlimits, suid exec, and capabilities interact?

Posted by Willy Tarreau 4 years, 4 months ago

Hi Andy,

On Wed, Feb 23, 2022 at 11:44:51AM -0800, Andy Lutomirski wrote:
> On Wed, Feb 23, 2022 at 10:00 AM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> >
> >
> > [CC'd the security list because I really don't know who the right people
> >  are to drag into this discussion]
> >
> > While looking at some issues that have cropped up with making it so
> > that RLIMIT_NPROC cannot be escaped by creating a user namespace I have
> > stumbled upon a very old issue of how rlimits and suid exec interact
> > poorly.
> 
> Once upon a time, these resource limits were effectively the only way
> to control memory consumption and consumption of historically limited
> resources like processes.  (The scheduler used to have serious issues
> with too many processes -- this is not so true any more.  And without
> cgroups, too many processes could use too much CPU collectively.)
> This all worked pretty poorly.  Now we have cgroups, fancy memory
> accounting, etc.  So I'm wondering if NPROC is even useful anymore.  I
> don't have a brilliant idea of how to deprecate it, but I think it
> wouldn't be entirely nuts to take it much less seriously and maybe
> even eventually get rid of it.
> 
> I doubt there is much existing userspace that would break if a
> previously failing fork() started succeeding.

I strongly disagree. I've been using it for a long time as a security
measure. Setting NPROC to 0 after daemonizing remains a particularly
effective and portable method to mitigate the possible consequences of
an in-process intrusion. While I wouldn't care about approximate non-zero
values, for me it would be a significant security regression to drop the
inability to fork() when the limit is zero. Thus at least I do want to
keep that feature when NPROC is zero.

Willy

Re: How should rlimits, suid exec, and capabilities interact?

Posted by Linus Torvalds 4 years, 4 months ago

On Wed, Feb 23, 2022 at 10:00 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Which brings us to the practical issues of how all of these things are
> wired together today.

I honestly think you should treat the limits as "approximate".

We do that for a number of reasons:

 - sometimes we have racy tests because we don't want to do excessive
locking just for a limit: nobody cares if you can go a couple of
entries past a limit because you were lucky, it's important that you
can't go *much* past the limit.

 - sometimes the limits themselves are fuzzy (example: time. it's
incremented by "ticks", but it's simply not that precise, and it
depends a bit when the ticks happen)

 - sometimes it's ambiguous who we're talking about.

I think suid execs tend to fall in that third category. Be generous.
If the limit doesn't trigger at the suid exec, nobody cares. You want
to make sure it triggers eventually.

For example, let's say that you are the admin, and you made a mistake,
and you had a runaway fork() bomb that was caught by the limits.

Optimally, you still want to be able to be able to log in (one process
that was root when it did the fork(), and did a 'setresuid()' or
similar to drop the things, and then one process that does 'sudo' to
get privileges to kill the darn fork bomb).

See how that 'user' technically went over the limit, and that was A-OK!

Basic rule: it's better to be too lenient than to be too strict.

                  Linus

Re: How should rlimits, suid exec, and capabilities interact?

Posted by Eric W. Biederman 4 years, 4 months ago

Linus Torvalds <linus@torvalds.org> writes:

> Basic rule: it's better to be too lenient than to be too strict.

Thank you.  With that guideline I can explore the space of what is
possible.

Question: Running a suid program today charges the activity of that
program to the user who ran that program, not to the user the program
runs as.  Does anyone see a problem with charging the user the program
runs as?

The reason I want to change which user is charged with a process
(besides it making more sense in my head) is so that
"capable(CAP_SYS_RESOURCE)" can be used instead of the magic incantation
"(cred->user == INIT_USER)".

Today "capable(CAP_SYS_RESOURCE)" with respect to RLIMIT_NPROC is
effectively meaningless for suid programs because the of the mismatch of
charging the real user with the effective users credentials.

An accidental experiment happened in v5.14-rc1 in July when the ucount
rlimit code was merged.  It was only this last week when after Michal
Koutný discovered the discrepancy through code inspection I merged a
bug fix because the code was not preserving the existing behavior as
intended.

This behavior has existed in some form since Linux v1.0 when per user
process limits were added.

The original code in v1.0 was:
> static int find_empty_process(void)
> {
>        int free_task;
>        int i, tasks_free;
>        int this_user_tasks;
> 
> repeat:
>        if ((++last_pid) & 0xffff8000)
>                last_pid=1;
>        this_user_tasks = 0;
>        tasks_free = 0;
>        free_task = -EAGAIN;
>        i = NR_TASKS;
>        while (--i > 0) {
>                if (!task[i]) {
>                        free_task = i;
>                        tasks_free++;
>                        continue;
>                }
>                if (task[i]->uid == current->uid)
>                        this_user_tasks++;
>                if (task[i]->pid == last_pid || task[i]->pgrp == last_pid ||
>                    task[i]->session == last_pid)
>                        goto repeat;
>        }
>        if (tasks_free <= MIN_TASKS_LEFT_FOR_ROOT ||
>            this_user_tasks > MAX_TASKS_PER_USER)
>                if (current->uid)
>                        return -EAGAIN;
>        return free_task;
> }

Having tracked the use of real uid in limits back this far my guess
is that it was an accident of the implementation and real uid vs
effective uid had not be considered.

Does anyone know if choosing the real uid vs the effective uid for
accounting a users processes was a deliberate decision anywhere in the
history of Linux?

Linus you were talking about making it possible to login as I think a
non-root user to be able to use sudo and kill a fork bomb.

The counter case is apache having a dedicated user for running
cgi-scripts and using RLIMIT_NPROC to limit how many of those processes
can exist.  Unless I am misunderstanding something that looks exactly
like your login as non-root so you can run sudo to kill a fork-bomb.

A comment from an in-process cleanup patch explains this as best I can:
         /*
         * In general rlimits are only enforced when a new resource
         * is acquired.  That would be during fork for RLIMIT_NPROC.
         * That is insufficient for RLIMIT_NPROC as many attributes of
         * a new process must be set between fork and exec.
         *
         * A case where this matter is when apache runs forks a process
         * and calls setuid to run cgi-scripts as a different user.
         * Generating those processes through a code sequence like:
         *
         *      fork()
 	 *	setrlimit(RLIMIT_NPROC, ...)         
         *      execve()  -- suid wrapper
         *      setuid()
         *      execve()  -- cgi script
         *
         * The cgi-scripts are unlikely to fork on their own so unless
         * RLIMIT_NPROC is checked after the user change and before
         * the cgi-script starts, RLIMIT_NPROC simply will not be enforced
         * for the cgi-scripts.
         *
         * So the code tracks if between fork and exec if an operation
         * occurs that could cause the RLIMIT_NPROC check to fail.  If
         * such an operation has happened re-check RLIMIT_NPROC.
         */

Answered-Question: I was trying to ask if anyone knows of a reason why
we can't just sanitize the rlimits of the process during suid exec?
Linus your guideline would appear to allow that behavior.  Unfortunately
that looks like it would break current usage of apache suexec.

Eric

Re: How should rlimits, suid exec, and capabilities interact?

Posted by Eric W. Biederman 4 years, 4 months ago

Linus Torvalds <linus@torvalds.org> writes:

> Basic rule: it's better to be too lenient than to be too strict.

Thank you.  With that guideline I can explore the space of what is
possible.

Question: Running a suid program today charges the activity of that
program to the user who ran that program, not to the user the program
runs as.  Does anyone see a problem with charging the user the program
runs as?

The reason I want to change who is charged with a process (besides it
making more sense in my head) is so that capable(CAP_SYS_RESOURCE) can
be used instead of the magic incantation (cred->user == INIT_USER).

An accidental experiment happened in v5.14-rc1 in July when the ucount
rlimit code was merged.  It was only this last week when after Michal
Koutný discovered the discrepency through code inspect a bug fix was
merged.

This changes the behavior that has existed in some form since Linux v1.0
when per user process limits were added.

The original code in v1.0 looked like:
> static int find_empty_process(void)
> {
>        int free_task;
>        int i, tasks_free;
>        int this_user_tasks;
> 
> repeat:
>        if ((++last_pid) & 0xffff8000)
>                last_pid=1;
>        this_user_tasks = 0;
>        tasks_free = 0;
>        free_task = -EAGAIN;
>        i = NR_TASKS;
>        while (--i > 0) {
>                if (!task[i]) {
>                        free_task = i;
>                        tasks_free++;
>                        continue;
>                }
>                if (task[i]->uid == current->uid)
>                        this_user_tasks++;
>                if (task[i]->pid == last_pid || task[i]->pgrp == last_pid ||
>                    task[i]->session == last_pid)
>                        goto repeat;
>        }
>        if (tasks_free <= MIN_TASKS_LEFT_FOR_ROOT ||
>            this_user_tasks > MAX_TASKS_PER_USER)
>                if (current->uid)
>                        return -EAGAIN;
>        return free_task;
> }

Having tracked the use of real uid in limits back this far my guess
is that it was an accident of the implementation and real uid vs
effective uid had not be considered.

Does anyone know if choosing the real uid was a deliberate decision
anywhere in the history of Linux?

Linus you were talking about making it possible to login as I think a
non-root user to be able to use sudo and kill a fork bomb.

The counter case is apache having a dedicated user for running
cgi-scripts and using RLIMIT_NPROC to limit how many of those processes
can exist.  Unless I am misunderstanding something that looks exactly
like your login as non-root so you can run sudo to kill a fork-bomb.

A comment from an in-process cleanup patch explains this as best I can:
         /*
         * In general rlimits are only enforced when a new resource
         * is acquired.  That would be during fork for RLIMIT_NPROC.
         * That is insufficient for RLIMIT_NPROC as many attributes of
         * a new process must be set between fork and exec.
         *
         * A case where this matter is when apache runs forks a process
         * and calls setuid to run cgi-scripts as a different user.
         * Generating those processes through a code sequence like:
         *
         *      fork()
 	 *	setrlimit(RLIMIT_NPROC, ...)         
         *      execve()  -- suid wrapper
         *      setuid()
         *      execve()  -- cgi script
         *
         * The cgi-scripts are unlikely to fork on their own so unless
         * RLIMIT_NPROC is checked after the user change and before
         * the cgi-script starts, RLIMIT_NPROC simply will not be enforced
         * for the cgi-scripts.
         *
         * So the code tracks if between fork and exec if an operation
         * occurs that could cause the RLIMIT_NPROC check to fail.  If
         * such an operation has happened re-check RLIMIT_NPROC.
         */

Answered-Question: I was trying to ask if anyone knows of a reason why
we can't just sanitize the rlimits of the process during suid exec?
Linus your guideline would appear to allow that behavior.  Unfortunately
that looks like it would break current usage of apache suexec.

Eric

Re: How should rlimits, suid exec, and capabilities interact?

Posted by Linus Torvalds 4 years, 4 months ago

On Wed, Feb 23, 2022 at 5:24 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Question: Running a suid program today charges the activity of that
> program to the user who ran that program, not to the user the program
> runs as.  Does anyone see a problem with charging the user the program
> runs as?

So I think that there's actually two independent issues with limits
when you have situations like this where the actual user might be
ambiguous.

 - the "who to charge" question

 - the "how do we *check* the limit" question

and honestly, I think that when it comes to suid binaries, the first
question is fundamentally ambiguous, because it almost certainly
depends on the user.

Which to me implies that there probably isn't an answer that is always
right, and that what you should look at is that second option.

So I would actually suggest that the "execute a suid binary" should
charge the real user, but *because* it is suid, it should then not
check the limit (or, perhaps, should check the hard limit?).

You have to charge somebody, but at that point it's a bit ambiguous
whether it should be allowed.

Exactly so that if you're over a process limit (or something similar -
think "too many files open" or whatever because you screwed up and
opened everything) you could still log in as yourself (ssh/login
charges some admin thing, which probably has high limits or is
unlimited), and hopefully get shell access, and then be able to "exec
sudo" to actually get admin access that should be disabled from the
network.

The above is just one (traditional) example of a fork/open bomb case
where a user isn't really able to no longer function as himself, but
wants to fix things (maybe the user has another terminal open, but
then he can hopefully use a shell-buiiltin 'kill' instead).

And I'm not saying it's "the thing that needs to work". I'm more
making up an example.

So I'm only saying that the above actually has two examples to the two
sides of the coin: "login" lowering privileges to a user that may be
over some limit - and succeeding despite that - and 'suid' succeeding
despite the original user perhaps being over-committed.

So it's intended exactly as an example of "picking the new or the old
user would be wrong in either case if you check limits at the
transition point".

Hmm?

            Linus

RE: How should rlimits, suid exec, and capabilities interact?

Posted by David Laight 4 years, 4 months ago

From: Linus Torvalds
> Sent: 24 February 2022 01:42
> 
> On Wed, Feb 23, 2022 at 5:24 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >
> > Question: Running a suid program today charges the activity of that
> > program to the user who ran that program, not to the user the program
> > runs as.  Does anyone see a problem with charging the user the program
> > runs as?
> 
> So I think that there's actually two independent issues with limits
> when you have situations like this where the actual user might be
> ambiguous.
> 
>  - the "who to charge" question
> 
>  - the "how do we *check* the limit" question
> 
> and honestly, I think that when it comes to suid binaries, the first
> question is fundamentally ambiguous, because it almost certainly
> depends on the user.

Doesn't the rlimit check happen during the fork.
At which time you don't know that a suid exec might follow?

The problem with changing the uid is that when the process exits
you need to "uncharge" the correct uid.
So either you need to remember the original uid or setuid
has to transfer the charge (whichever uid is used).
If you transfer the charge then the setuid system call can't fail.
But a later exec can fail.

Any check will always be done against the process's own rlimit value.
Set that to zero and fork should fail regardless of which uid's
process count is checked.

Now a normal suid program only changes the effective uid.
So keeping the process charged against the real uid makes sense.

If a process changes its real uid you could change the charged uid
but you can't error if over the rlimit value.
OTOH during a later exec you can test things and exec can fail.

At least one unix I've used has three uids for each process.
The 'real uid', 'effective uid' and 'saved by exec uid'.
I suspect the process is always "charged" against the latter.
I think that exec compares the 'real' and 'saved by exec' uids
and, if different, moves the charge to the real uid (which will
check rlimit) then sets the 'saved by exec uid' to the real uid.

So an exec after a setuid() can be allowed to fail if the real user
has too many processes.
But in all other cases exec just works regardless of the process
count for any user.

> 
> Which to me implies that there probably isn't an answer that is always
> right, and that what you should look at is that second option.
> 
> So I would actually suggest that the "execute a suid binary" should
> charge the real user, but *because* it is suid, it should then not
> check the limit (or, perhaps, should check the hard limit?).
> 
> You have to charge somebody, but at that point it's a bit ambiguous
> whether it should be allowed.
> 
> Exactly so that if you're over a process limit (or something similar -
> think "too many files open" or whatever because you screwed up and
> opened everything) you could still log in as yourself (ssh/login
> charges some admin thing, which probably has high limits or is
> unlimited), and hopefully get shell access, and then be able to "exec
> sudo" to actually get admin access that should be disabled from the
> network.

You usually have to use 'rsh machine sh -i' to avoid the shell
running all its startup scripts.
But I doubt that will get you past a fork bomb.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Re: How should rlimits, suid exec, and capabilities interact?

Posted by Eric W. Biederman 4 years, 4 months ago

Linus Torvalds <linus@torvalds.org> writes:

> On Wed, Feb 23, 2022 at 5:24 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Question: Running a suid program today charges the activity of that
>> program to the user who ran that program, not to the user the program
>> runs as.  Does anyone see a problem with charging the user the program
>> runs as?
>
> So I think that there's actually two independent issues with limits
> when you have situations like this where the actual user might be
> ambiguous.
>
>  - the "who to charge" question
>
>  - the "how do we *check* the limit" question
>
> and honestly, I think that when it comes to suid binaries, the first
> question is fundamentally ambiguous, because it almost certainly
> depends on the user.
>
> Which to me implies that there probably isn't an answer that is always
> right, and that what you should look at is that second option.
>
> So I would actually suggest that the "execute a suid binary" should
> charge the real user, but *because* it is suid, it should then not
> check the limit (or, perhaps, should check the hard limit?).
>
> You have to charge somebody, but at that point it's a bit ambiguous
> whether it should be allowed.
>
> Exactly so that if you're over a process limit (or something similar -
> think "too many files open" or whatever because you screwed up and
> opened everything) you could still log in as yourself (ssh/login
> charges some admin thing, which probably has high limits or is
> unlimited), and hopefully get shell access, and then be able to "exec
> sudo" to actually get admin access that should be disabled from the
> network.
>
> The above is just one (traditional) example of a fork/open bomb case
> where a user isn't really able to no longer function as himself, but
> wants to fix things (maybe the user has another terminal open, but
> then he can hopefully use a shell-buiiltin 'kill' instead).
>
> And I'm not saying it's "the thing that needs to work". I'm more
> making up an example.
>
> So I'm only saying that the above actually has two examples to the two
> sides of the coin: "login" lowering privileges to a user that may be
> over some limit - and succeeding despite that - and 'suid' succeeding
> despite the original user perhaps being over-committed.
>
> So it's intended exactly as an example of "picking the new or the old
> user would be wrong in either case if you check limits at the
> transition point".
>
> Hmm?

That doesn't really clarify anything for me.  We have two checks one in
fork and one in exec and you seem to be talking about the check in exec.

The check I have problems with for a suid executable is the check in
fork.  If the new process is accounted to the previous user and we use
the permissions of the effective user for checking it that does not make
sense to me.

If we can sort out that the check in fork.  I think I have clarity about
the other cases.

The check in exec while clumsy and needing cleaning up seems to make
sense to me.  We have a transition that starts with fork and ends with
exec and has operations like setuid in between.  If something like
setuid() is called before exec we check in exec.

The case the check in exec is aimed at supporting are processes spawned
from a parent that have a different user (than the parent) and will
never call fork again.   Those processes would be fundamentally immune
to RLIMIT_NPROC if we don't check somewhere besides fork.  There is
existing code in apache to use RLIMIT_NPROC this way.

For your login case I have no problems with it in principle.  In
practice I think you have to login as root to deal with a fork bomb that
hits RLIMIT_NPROC and does not die gracefully.

What I don't see about your login example is how it is practically
different from the apache cgi script case, that the code has supported
for 20 years, and that would be a regression if stopped supporting.

If we want to stop supporting that case we can just remove all of the
RLIMIT_NPROC tests everywhere except for fork, a nice cleanup.

That still leaves me with mismatched effective vs real uid checks in
fork when the effective and real uids don't match.  Which means testing
for root with "capable(CAP_SYS_ADMIN)" does not work.  Which today is
make the code a bit of a challenge to understand and work with.

Eric

[PATCH] ucounts: Fix systemd LimigtNPROC with private users regression

Posted by Eric W. Biederman 4 years, 4 months ago


Long story short recursively enforcing RLIMIT_NPROC when it is not
enforced on the process that creates a new user namespace, causes
currently working code to fail.  There is no reason to enforce
RLIMIT_NPROC recursively when we don't enforce it normally so update
the code to detect this case.

I would like to simply use capable(CAP_SYS_RESOURCE) to detect when
RLIMIT_NPROC is not enforced upon the caller.  Unfortunately because
RLIMIT_NPROC is charged and checked for enforcement based upon the
real uid, using capable() wich is euid based is inconsistent with reality.
Come as close as possible to testing for capable(CAP_SYS_RESOURCE) by
testing for when the real uid would match the conditions when
CAP_SYS_RESOURCE would be present if the real uid was the effective
uid.

Reported-by: Etienne Dechamps <etienne@edechamps.fr>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215596
Link: https://lkml.kernel.org/r/e9589141-cfeb-90cd-2d0e-83a62787239a@edechamps.fr
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

The previous conversation has given me enough clarity that I can see
which tests I am comfortable with use for this pending regression fix.

I have tested this and it works for me.  Does anyone have any concerns
with this change?

 kernel/user_namespace.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 6b2e3ca7ee99..5481ba44a8d6 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -58,6 +58,18 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
 	cred->user_ns = user_ns;
 }
 
+static unsigned long enforced_nproc_rlimit(void)
+{
+	unsigned long limit = RLIM_INFINITY;
+
+	/* Is RLIMIT_NPROC currently enforced? */
+	if (!uid_eq(current_uid(), GLOBAL_ROOT_UID) ||
+	    (current_user_ns() != &init_user_ns))
+		limit = rlimit(RLIMIT_NPROC);
+
+	return limit;
+}
+
 /*
  * Create a new user namespace, deriving the creator from the user in the
  * passed credentials, and replacing that user with the new root user for the
@@ -122,7 +134,7 @@ int create_user_ns(struct cred *new)
 	for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++) {
 		ns->ucount_max[i] = INT_MAX;
 	}
-	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC));
+	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, enforced_nproc_rlimit());
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE));
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING));
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));
-- 
2.29.2

Re: [PATCH] ucounts: Fix systemd LimigtNPROC with private users regression

Posted by Kees Cook 4 years, 4 months ago

typo: Subject's LimigtNPROC -> LimitNPROC

On Thu, Feb 24, 2022 at 09:41:44AM -0600, Eric W. Biederman wrote:
> 
> Long story short recursively enforcing RLIMIT_NPROC when it is not
> enforced on the process that creates a new user namespace, causes
> currently working code to fail.  There is no reason to enforce
> RLIMIT_NPROC recursively when we don't enforce it normally so update
> the code to detect this case.
> 
> I would like to simply use capable(CAP_SYS_RESOURCE) to detect when
> RLIMIT_NPROC is not enforced upon the caller.  Unfortunately because
> RLIMIT_NPROC is charged and checked for enforcement based upon the
> real uid, using capable() wich is euid based is inconsistent with reality.

typo: wich -> which

> Come as close as possible to testing for capable(CAP_SYS_RESOURCE) by
> testing for when the real uid would match the conditions when
> CAP_SYS_RESOURCE would be present if the real uid was the effective
> uid.
> 
> Reported-by: Etienne Dechamps <etienne@edechamps.fr>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215596
> Link: https://lkml.kernel.org/r/e9589141-cfeb-90cd-2d0e-83a62787239a@edechamps.fr
> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
> 
> The previous conversation has given me enough clarity that I can see
> which tests I am comfortable with use for this pending regression fix.
> 
> I have tested this and it works for me.  Does anyone have any concerns
> with this change?

I'd really love some kind of selftest that exercises the edge cases; do
you have your tests in some form that could be converted?

But otherwise, yes, this looks like the best option here.

Reviewed-by: Kees Cook <keescook@chromium.org>

> 
>  kernel/user_namespace.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 6b2e3ca7ee99..5481ba44a8d6 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -58,6 +58,18 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
>  	cred->user_ns = user_ns;
>  }
>  
> +static unsigned long enforced_nproc_rlimit(void)
> +{
> +	unsigned long limit = RLIM_INFINITY;
> +
> +	/* Is RLIMIT_NPROC currently enforced? */
> +	if (!uid_eq(current_uid(), GLOBAL_ROOT_UID) ||
> +	    (current_user_ns() != &init_user_ns))
> +		limit = rlimit(RLIMIT_NPROC);
> +
> +	return limit;
> +}
> +
>  /*
>   * Create a new user namespace, deriving the creator from the user in the
>   * passed credentials, and replacing that user with the new root user for the
> @@ -122,7 +134,7 @@ int create_user_ns(struct cred *new)
>  	for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++) {
>  		ns->ucount_max[i] = INT_MAX;
>  	}
> -	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC));
> +	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, enforced_nproc_rlimit());
>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE));
>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING));
>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));
> -- 
> 2.29.2
> 

-- 
Kees Cook

Re: [PATCH] ucounts: Fix systemd LimigtNPROC with private users regression

Posted by Eric W. Biederman 4 years, 4 months ago

Kees Cook <keescook@chromium.org> writes:

> typo: Subject's LimigtNPROC -> LimitNPROC
>
> On Thu, Feb 24, 2022 at 09:41:44AM -0600, Eric W. Biederman wrote:
>> 
>> Long story short recursively enforcing RLIMIT_NPROC when it is not
>> enforced on the process that creates a new user namespace, causes
>> currently working code to fail.  There is no reason to enforce
>> RLIMIT_NPROC recursively when we don't enforce it normally so update
>> the code to detect this case.
>> 
>> I would like to simply use capable(CAP_SYS_RESOURCE) to detect when
>> RLIMIT_NPROC is not enforced upon the caller.  Unfortunately because
>> RLIMIT_NPROC is charged and checked for enforcement based upon the
>> real uid, using capable() wich is euid based is inconsistent with reality.
>
> typo: wich -> which

Ahh... Typos.

>> Come as close as possible to testing for capable(CAP_SYS_RESOURCE) by
>> testing for when the real uid would match the conditions when
>> CAP_SYS_RESOURCE would be present if the real uid was the effective
>> uid.
>> 
>> Reported-by: Etienne Dechamps <etienne@edechamps.fr>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215596
>> Link: https://lkml.kernel.org/r/e9589141-cfeb-90cd-2d0e-83a62787239a@edechamps.fr
>> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>> 
>> The previous conversation has given me enough clarity that I can see
>> which tests I am comfortable with use for this pending regression fix.
>> 
>> I have tested this and it works for me.  Does anyone have any concerns
>> with this change?
>
> I'd really love some kind of selftest that exercises the edge cases; do
> you have your tests in some form that could be converted?
>
> But otherwise, yes, this looks like the best option here.

Let's start with Michal Koutný tests.  I keep forgetting to look at
them.  This cold has really been kicking my butt.

For this issue the test case was a systemd unit file.  Which is simple
and demonstrates the real-world regression but not really minimal in the
way a kernel selftest should be.

> Reviewed-by: Kees Cook <keescook@chromium.org>
>
>> 
>>  kernel/user_namespace.c | 14 +++++++++++++-
>>  1 file changed, 13 insertions(+), 1 deletion(-)
>> 
>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
>> index 6b2e3ca7ee99..5481ba44a8d6 100644
>> --- a/kernel/user_namespace.c
>> +++ b/kernel/user_namespace.c
>> @@ -58,6 +58,18 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
>>  	cred->user_ns = user_ns;
>>  }
>>  
>> +static unsigned long enforced_nproc_rlimit(void)
>> +{
>> +	unsigned long limit = RLIM_INFINITY;
>> +
>> +	/* Is RLIMIT_NPROC currently enforced? */
>> +	if (!uid_eq(current_uid(), GLOBAL_ROOT_UID) ||
>> +	    (current_user_ns() != &init_user_ns))
>> +		limit = rlimit(RLIMIT_NPROC);
>> +
>> +	return limit;
>> +}
>> +
>>  /*
>>   * Create a new user namespace, deriving the creator from the user in the
>>   * passed credentials, and replacing that user with the new root user for the
>> @@ -122,7 +134,7 @@ int create_user_ns(struct cred *new)
>>  	for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++) {
>>  		ns->ucount_max[i] = INT_MAX;
>>  	}
>> -	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC));
>> +	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, enforced_nproc_rlimit());
>>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE));
>>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING));
>>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));
>> -- 
>> 2.29.2
>> 

Eric

Re: [PATCH] ucounts: Fix systemd LimigtNPROC with private users regression

Posted by Michal Koutný 4 years, 4 months ago

On Thu, Feb 24, 2022 at 08:28:41AM -0800, Kees Cook <keescook@chromium.org> wrote:
> I'd really love some kind of selftest that exercises the edge cases; do
> you have your tests in some form that could be converted?

There's the original
tools/testing/selftests/rlimits/rlimits-per-userns.c selftest.

I've been rewriting it to cover more situations, I'm sending it as one
monster patch (I'd need spend more time reordering my commits into some
logical patch order) if anyone wishes to try it.

I've tried it on 5c1ee569660d4a205dced9cb4d0306b907fb7599 + this Eric's
patch.

The test rlimit-per-userns-root passes
- together with that I claim this patch

Reviewed-by: Michal Koutný <mkoutny@suse.com>

The test rlimit-per-userns-nonroot fails. It's similar off-by-one
mistake as was in the fork path, but it's in the do_execveat_common():

        if ((current->flags & PF_NPROC_EXCEEDED) &&
            is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
                retval = -EAGAIN;
                goto out_ret;
        }

(If RLIMIT_NPROC should be strictly honored, setuid+execve should fail
when given uid's ucount is at the limit already.)

Funnily, the original
tools/testing/selftests/rlimits/rlimits-per-userns.c passes thanks to
the off-by-one check even though it should not pass because unshare(2)
is called after setuid(2).

Michal

-- >8 --
From be67d903f1f179f585bf302f6c2d2446f24263d7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Michal=20Koutn=C3=BD?= <mkoutny@suse.com>
Date: Thu, 20 Jan 2022 19:32:54 +0100
Subject: [RFC PATCH] selftests: Rewrite RLIMIT_NPROC checks (in user
 namespaces)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This adds two test programs:
- rlimit-per-userns-root -- creates user namespaces owned by root,
- rlimit-per-userns-nonroot -- creates user namespaces owned by non-root.

The forking tree:

    main [init_user_ns]
    ` service [user_ns_1]
      ` worker 1
      ` worker 2
      ...
      ` worker k
    ...
    ` service [user_ns_n]
      ` worker 1
      ` worker 2
      ...
      ` worker k

Expectations rlimit-per-userns-root:

    n > RLIMIT_NPROC		privileged user can spawn mutliple services in different user namespaces
    (k+1) <= RLIMIT_NPROC	limit is honored within user namespace
    k >= RLIMIT_NPROC-1		separate per-user namespace counters

Expectations rlimit-per-userns-nonroot:

    n <= RLIMIT_NPROC		global RLIMIT_NPROC is honored
    (k+1) <= RLIMIT_NPROC	limit is honored within user namespace

Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 tools/testing/selftests/rlimits/Makefile      |   6 +-
 .../rlimits/rlimits-per-userns-nonroot.c      |  37 ++
 .../rlimits/rlimits-per-userns-root.c         |  34 ++
 .../selftests/rlimits/rlimits-per-userns.c    | 161 -------
 .../selftests/rlimits/service_common.c        | 400 ++++++++++++++++++
 .../selftests/rlimits/service_common.h        |  24 ++
 6 files changed, 500 insertions(+), 162 deletions(-)
 create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns-nonroot.c
 create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns-root.c
 delete mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c
 create mode 100644 tools/testing/selftests/rlimits/service_common.c
 create mode 100644 tools/testing/selftests/rlimits/service_common.h

diff --git a/tools/testing/selftests/rlimits/Makefile b/tools/testing/selftests/rlimits/Makefile
index 03aadb406212..8ccb92020206 100644
--- a/tools/testing/selftests/rlimits/Makefile
+++ b/tools/testing/selftests/rlimits/Makefile
@@ -1,6 +1,10 @@
 # SPDX-License-Identifier: GPL-2.0-or-later
 
 CFLAGS += -Wall -O2 -g
-TEST_GEN_PROGS := rlimits-per-userns
+TEST_GEN_PROGS := rlimits-per-userns-root
+TEST_GEN_PROGS += rlimits-per-userns-nonroot
 
 include ../lib.mk
+
+$(OUTPUT)/rlimits-per-userns-root: service_common.c
+$(OUTPUT)/rlimits-per-userns-nonroot: service_common.c
diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns-nonroot.c b/tools/testing/selftests/rlimits/rlimits-per-userns-nonroot.c
new file mode 100644
index 000000000000..ccf021769f88
--- /dev/null
+++ b/tools/testing/selftests/rlimits/rlimits-per-userns-nonroot.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Alexey Gladkov <gladkov.alexey@gmail.com>
+ * Author: Michal Koutný <mkoutny@suse.com>
+ */
+#include <err.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+#include "service_common.h"
+
+int main(int argc, char **argv)
+{
+	struct services_ctx *ctx;
+	pid = getpid();
+
+	if (getenv(ENV_PARAM))
+		return run_service(atoi(getenv(ENV_PARAM)));
+
+	if (getuid() > 0)
+		errx(KSFT_SKIP, "This selftest must start as (global) root user.");
+
+	warnx("(pid=%d) Starting testcase", pid);
+
+	ctx = start_services(argv[0], UM_UNSHARE);
+	stop_services(ctx);
+
+	if (count_services(ctx) > THE_LIMIT)
+		errx(KSFT_FAIL, "(pid=%d): Test failed, exec'd services > RLIMIT_NPROC", pid);
+
+	if (check_services(ctx) < count_services(ctx))
+		errx(KSFT_FAIL, "(pid=%d): Test failed, failed services", pid);
+
+	warnx("(pid=%d): Test passed", pid);
+	exit(KSFT_PASS);
+}
diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns-root.c b/tools/testing/selftests/rlimits/rlimits-per-userns-root.c
new file mode 100644
index 000000000000..3bf0149ac93d
--- /dev/null
+++ b/tools/testing/selftests/rlimits/rlimits-per-userns-root.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Alexey Gladkov <gladkov.alexey@gmail.com>
+ * Author: Michal Koutný <mkoutny@suse.com>
+ */
+#include <err.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+#include "service_common.h"
+
+int main(int argc, char **argv)
+{
+	struct services_ctx *ctx;
+	pid = getpid();
+
+	if (getenv(ENV_PARAM))
+		return run_service(atoi(getenv(ENV_PARAM)));
+
+	if (getuid() > 0)
+		errx(KSFT_SKIP, "This selftest must start as (global) root user.");
+
+	warnx("(pid=%d) Starting testcase", pid);
+
+	ctx = start_services(argv[0], UM_CLONE_NEWUSER);
+	stop_services(ctx);
+
+	if (check_services(ctx) != NR_SERVICES)
+		errx(KSFT_FAIL, "(pid=%d): Test failed, unexpected terminations", pid);
+
+	warnx("(pid=%d): Test passed", pid);
+	exit(KSFT_PASS);
+}
diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns.c b/tools/testing/selftests/rlimits/rlimits-per-userns.c
deleted file mode 100644
index 26dc949e93ea..000000000000
--- a/tools/testing/selftests/rlimits/rlimits-per-userns.c
+++ /dev/null
@@ -1,161 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-or-later
-/*
- * Author: Alexey Gladkov <gladkov.alexey@gmail.com>
- */
-#define _GNU_SOURCE
-#include <sys/types.h>
-#include <sys/wait.h>
-#include <sys/time.h>
-#include <sys/resource.h>
-#include <sys/prctl.h>
-#include <sys/stat.h>
-
-#include <unistd.h>
-#include <stdlib.h>
-#include <stdio.h>
-#include <string.h>
-#include <sched.h>
-#include <signal.h>
-#include <limits.h>
-#include <fcntl.h>
-#include <errno.h>
-#include <err.h>
-
-#define NR_CHILDS 2
-
-static char *service_prog;
-static uid_t user   = 60000;
-static uid_t group  = 60000;
-
-static void setrlimit_nproc(rlim_t n)
-{
-	pid_t pid = getpid();
-	struct rlimit limit = {
-		.rlim_cur = n,
-		.rlim_max = n
-	};
-
-	warnx("(pid=%d): Setting RLIMIT_NPROC=%ld", pid, n);
-
-	if (setrlimit(RLIMIT_NPROC, &limit) < 0)
-		err(EXIT_FAILURE, "(pid=%d): setrlimit(RLIMIT_NPROC)", pid);
-}
-
-static pid_t fork_child(void)
-{
-	pid_t pid = fork();
-
-	if (pid < 0)
-		err(EXIT_FAILURE, "fork");
-
-	if (pid > 0)
-		return pid;
-
-	pid = getpid();
-
-	warnx("(pid=%d): New process starting ...", pid);
-
-	if (prctl(PR_SET_PDEATHSIG, SIGKILL) < 0)
-		err(EXIT_FAILURE, "(pid=%d): prctl(PR_SET_PDEATHSIG)", pid);
-
-	signal(SIGUSR1, SIG_DFL);
-
-	warnx("(pid=%d): Changing to uid=%d, gid=%d", pid, user, group);
-
-	if (setgid(group) < 0)
-		err(EXIT_FAILURE, "(pid=%d): setgid(%d)", pid, group);
-	if (setuid(user) < 0)
-		err(EXIT_FAILURE, "(pid=%d): setuid(%d)", pid, user);
-
-	warnx("(pid=%d): Service running ...", pid);
-
-	warnx("(pid=%d): Unshare user namespace", pid);
-	if (unshare(CLONE_NEWUSER) < 0)
-		err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");
-
-	char *const argv[] = { "service", NULL };
-	char *const envp[] = { "I_AM_SERVICE=1", NULL };
-
-	warnx("(pid=%d): Executing real service ...", pid);
-
-	execve(service_prog, argv, envp);
-	err(EXIT_FAILURE, "(pid=%d): execve", pid);
-}
-
-int main(int argc, char **argv)
-{
-	size_t i;
-	pid_t child[NR_CHILDS];
-	int wstatus[NR_CHILDS];
-	int childs = NR_CHILDS;
-	pid_t pid;
-
-	if (getenv("I_AM_SERVICE")) {
-		pause();
-		exit(EXIT_SUCCESS);
-	}
-
-	service_prog = argv[0];
-	pid = getpid();
-
-	warnx("(pid=%d) Starting testcase", pid);
-
-	/*
-	 * This rlimit is not a problem for root because it can be exceeded.
-	 */
-	setrlimit_nproc(1);
-
-	for (i = 0; i < NR_CHILDS; i++) {
-		child[i] = fork_child();
-		wstatus[i] = 0;
-		usleep(250000);
-	}
-
-	while (1) {
-		for (i = 0; i < NR_CHILDS; i++) {
-			if (child[i] <= 0)
-				continue;
-
-			errno = 0;
-			pid_t ret = waitpid(child[i], &wstatus[i], WNOHANG);
-
-			if (!ret || (!WIFEXITED(wstatus[i]) && !WIFSIGNALED(wstatus[i])))
-				continue;
-
-			if (ret < 0 && errno != ECHILD)
-				warn("(pid=%d): waitpid(%d)", pid, child[i]);
-
-			child[i] *= -1;
-			childs -= 1;
-		}
-
-		if (!childs)
-			break;
-
-		usleep(250000);
-
-		for (i = 0; i < NR_CHILDS; i++) {
-			if (child[i] <= 0)
-				continue;
-			kill(child[i], SIGUSR1);
-		}
-	}
-
-	for (i = 0; i < NR_CHILDS; i++) {
-		if (WIFEXITED(wstatus[i]))
-			warnx("(pid=%d): pid %d exited, status=%d",
-				pid, -child[i], WEXITSTATUS(wstatus[i]));
-		else if (WIFSIGNALED(wstatus[i]))
-			warnx("(pid=%d): pid %d killed by signal %d",
-				pid, -child[i], WTERMSIG(wstatus[i]));
-
-		if (WIFSIGNALED(wstatus[i]) && WTERMSIG(wstatus[i]) == SIGUSR1)
-			continue;
-
-		warnx("(pid=%d): Test failed", pid);
-		exit(EXIT_FAILURE);
-	}
-
-	warnx("(pid=%d): Test passed", pid);
-	exit(EXIT_SUCCESS);
-}
diff --git a/tools/testing/selftests/rlimits/service_common.c b/tools/testing/selftests/rlimits/service_common.c
new file mode 100644
index 000000000000..043c59828a03
--- /dev/null
+++ b/tools/testing/selftests/rlimits/service_common.c
@@ -0,0 +1,400 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Alexey Gladkov <gladkov.alexey@gmail.com>
+ * Author: Michal Koutný <mkoutny@suse.com>
+ */
+/*
+ * The forking tree:
+ *
+ *     main [init_user_ns]
+ *     ` service [user_ns_1]
+ *       ` worker 1
+ *       ` worker 2
+ *       ...
+ *       ` worker k
+ *     ...
+ *     ` service [user_ns_n]
+ *       ` worker 1
+ *       ` worker 2
+ *       ...
+ *       ` worker k
+ *
+ * Sequence (synchronization) diagram:
+ *   main			service
+ *   ----			-------
+ *   setrlimit()
+ *   service=clone([CLONE_NEWUSER])
+ *   define_maps()
+ *   	  	MAP_DEFINE ->
+ *   	  			setuid()
+ *   	  			[unshare(CLONE_NEWUSER)]
+ *   		<- UNSHARE
+ *   rlimit_restore()
+ *   		RLIMIT_RESTORE ->
+ *   				execve()
+ *   		POST_EXEC ->
+ *
+ * Expectations UM_UNSHARE:
+ *
+ *     n <= RLIMIT_NPROC	global RLIMIT_NPROC is honored
+ *     (k+1) <= RLIMIT_NPROC	limit is honored within user namespace
+ *
+ * Expectations UM_CLONE_NEWUSER:
+ *
+ *     n > RLIMIT_NPROC		privileged user can spawn mutliple services in different user namespaces
+ *     k >= RLIMIT_NPROC-1	separate per-user namespace counters
+ *     (k+1) <= RLIMIT_NPROC	limit is honored within user namespace
+ */
+
+#define _GNU_SOURCE
+#include <assert.h>
+#include <err.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/prctl.h>
+#include <sys/resource.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+#include "service_common.h"
+
+#define STACK_SIZE (2 * (1<<20))
+#define SERVICE_RUNTIME	250000 /* μs */
+
+static_assert(NR_SERVICES > THE_LIMIT, "Services must exceed THE_LIMIT for effective test.");
+static_assert(NR_WORKERS > THE_LIMIT-1, "Need enough workers to challenge THE_LIIMT.");
+
+static struct services_ctx {
+	int control_fd[NR_SERVICES];
+	pid_t child[NR_SERVICES];
+	int wstatus[NR_SERVICES];
+	int fork_ed;
+	int exec_ed;
+} services_ctx;
+
+static uid_t user   = 60000;
+static uid_t group  = 60000;
+static struct rlimit saved_limit;
+
+static struct {
+	int control_fd;
+	const char *pathname;
+	enum userns_mode um;
+} child_args;
+
+pid_t pid;
+
+static void setrlimit_nproc(rlim_t n)
+{
+	struct rlimit limit = {
+		.rlim_cur = n,
+		.rlim_max = n
+	};
+	if (getrlimit(RLIMIT_NPROC, &saved_limit) < 0)
+		err(KSFT_FAIL, "(pid=%d): getrlimit(RLIMIT_NPROC)", pid);
+
+	if (setrlimit(RLIMIT_NPROC, &limit) < 0)
+		err(KSFT_FAIL, "(pid=%d): setrlimit(RLIMIT_NPROC)", pid);
+
+	warnx("(pid=%d): Set RLIMIT_NPROC=%ld", pid, n);
+}
+
+static void restore_rlimit_nproc(void)
+{
+	if (setrlimit(RLIMIT_NPROC, &saved_limit) < 0)
+		err(KSFT_FAIL, "(pid=%d): setrlimit(RLIMIT_NPROC, saved)", pid);
+	warnx("(pid=%d) Restored RLIMIT_NPROC", pid);
+}
+
+enum msg_sync {
+	MAP_DEFINE,
+	UNSHARE,
+	RLIMIT_RESTORE,
+	POST_EXEC,
+};
+
+static int _sync_notify(int fd, enum msg_sync m)
+{
+	char tmp = m;
+
+	return write(fd, &tmp, 1);
+}
+static void sync_notify(int fd, enum msg_sync m)
+{
+	if (_sync_notify(fd, m) < 0)
+		warnx("(pid=%d): failed sync-write", pid);
+}
+
+static void sync_wait(int fd, enum msg_sync m)
+{
+	char tmp;
+
+	if (read(fd, &tmp, 1) < 0)
+		warn("(pid=%d): failed sync-read", pid);
+	else if (tmp != m)
+		warnx("(pid=%d): unexpected sync", pid);
+}
+
+static int define_maps(pid_t child_pid, enum userns_mode um)
+{
+	FILE *f;
+	char filename[PATH_MAX];
+
+	if (um != UM_CLONE_NEWUSER)
+		return 0;
+
+	snprintf(filename, PATH_MAX, "/proc/%i/uid_map", child_pid);
+	f = fopen(filename, "w");
+	if (fprintf(f, "%i %i 1\n", user, user) < 0)
+		return -1;
+	fclose(f);
+
+	snprintf(filename, PATH_MAX, "/proc/%i/gid_map", child_pid);
+	f = fopen(filename, "w");
+	if (fprintf(f, "%i %i 1\n", group, group) < 0)
+		return -1;
+	fclose(f);
+
+	return 0;
+}
+
+static int setup_and_exec(void *arg)
+{
+	int control_fd = child_args.control_fd;
+
+	pid = getpid();
+	warnx("(pid=%d): New process starting ...", pid);
+
+	signal(SIGUSR1, SIG_DFL);
+
+	sync_wait(control_fd, MAP_DEFINE);
+	warnx("(pid=%d): Changing to uid=%d, gid=%d", pid, user, group);
+
+	if (setgid(group) < 0)
+		err(EXIT_FAILURE, "(pid=%d): setgid(%d)", pid, group);
+	if (setuid(user) < 0)
+		err(EXIT_FAILURE, "(pid=%d): setuid(%d)", pid, user);
+
+	warnx("(pid=%d): Service running ...", pid);
+
+	if (child_args.um == UM_UNSHARE) {
+		warnx("(pid=%d): Unshare user namespace", pid);
+		if (unshare(CLONE_NEWUSER) < 0)
+			err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");
+	}
+
+	sync_notify(control_fd, UNSHARE);
+	sync_wait(control_fd, RLIMIT_RESTORE);
+
+	char *param = NULL;
+	asprintf(&param, ENV_PARAM "=%i", child_args.um);
+	char *const argv[] = { "service", NULL };
+	char *const envp[] = { param, NULL };
+
+	warnx("(pid=%d): Executing real service ...", pid);
+
+	execve(child_args.pathname, argv, envp);
+
+	/* stay around until parent notifies/signals */
+	warn("(pid=%d): execve failed", pid);
+	sync_wait(control_fd, POST_EXEC);
+	pause();
+	return 0;
+}
+
+static pid_t start_child(const char *pathname, int control_fd, enum userns_mode um)
+{
+	char *stack = malloc(STACK_SIZE);
+	int flags = um == UM_CLONE_NEWUSER ? CLONE_NEWUSER : 0;
+	pid_t new_pid;
+
+	/* Pass via global variable to child */
+	child_args.control_fd = control_fd;
+	child_args.pathname = pathname;
+	child_args.um = um;
+
+	new_pid = clone(setup_and_exec, stack+STACK_SIZE-1, flags, NULL);
+
+	free(stack);
+	close(control_fd);
+	return new_pid;
+}
+
+static void dump_context(size_t n_workers)
+{
+	struct rlimit limit;
+	char user_ns[PATH_MAX];
+	ssize_t len;
+
+	if (getrlimit(RLIMIT_NPROC, &limit) < 0)
+		err(EXIT_FAILURE, "(pid=%d) failed getrlimit", pid);
+	if ((len = readlink("/proc/self/ns/user", user_ns, PATH_MAX)) < 0)
+		err(EXIT_FAILURE, "(pid=%d) failed readlink", pid);
+	user_ns[len] = 0;
+
+	warnx("(pid=%d) Service instance attempts %lu workers, limit %lu:%lu, ns=%s",
+	      pid, n_workers, limit.rlim_cur, limit.rlim_max, user_ns);
+}
+
+int run_service(enum userns_mode um)
+{
+	size_t i;
+	pid_t worker[NR_WORKERS];
+	int ret = EXIT_SUCCESS;
+
+	dump_context(NR_WORKERS);
+
+	/* test RLIMIT_NPROC inside the service, last worker should fail because of service itself */
+	for (i = 0; i < NR_WORKERS; i++) {
+		worker[i] = fork();
+		if (worker[i] == 0) {
+			/* service worker */
+			pause();
+			exit(EXIT_SUCCESS);
+		}
+		if (worker[i] < 0) {
+			warn("(pid=%d) service fork %lu failed", pid, i+1);
+			if (um == UM_CLONE_NEWUSER && !(i >= (THE_LIMIT-1) && errno == EAGAIN))
+				ret = EXIT_FAILURE;
+		} else if (i >= (THE_LIMIT-1)) {
+			warnx("(pid=%d) RLIMIT_NPROC not honored", pid);
+			ret = EXIT_FAILURE;
+		}
+	}
+
+	/* service cleanup */
+	for (i = 0; i < NR_WORKERS; i++)
+		if (worker[i] > 0)
+			kill(worker[i], SIGUSR1);
+
+	for (i = 0; i < NR_WORKERS; i++)
+		if (worker[i] > 0)
+			waitpid(worker[i], NULL, WNOHANG);
+
+	if (ret) {
+		warnx("(pid=%d) service failed, ret=%i", pid, ret);
+		return ret;
+	}
+	/* we must get here before SERVICE_RUNTIME elapses */
+	pause();
+	return EXIT_FAILURE;
+}
+
+struct services_ctx *start_services(const char *pathname, enum userns_mode um)
+{
+	size_t i;
+	int sockets[2];
+	struct services_ctx *ctx = &services_ctx;
+
+	signal(SIGPIPE, SIG_IGN);
+	setrlimit_nproc(THE_LIMIT);
+	ctx->fork_ed = 0;
+	ctx->exec_ed = 0;
+	for (i = 0; i < NR_SERVICES; i++) {
+		if (socketpair(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0, sockets) < 0)
+			err(KSFT_FAIL, "(pid=%d) socketpair failed", pid);
+		ctx->control_fd[i] = sockets[0];
+		ctx->child[i] = start_child(pathname, sockets[1], um);
+		ctx->wstatus[i] = 0;
+		if (ctx->child[i] < 0)
+			continue;
+		ctx->fork_ed++;
+
+		if (define_maps(ctx->child[i], um) < 0)
+			err(KSFT_FAIL, "(pid=%d) user_ns maps definition failed", pid);
+		sync_notify(ctx->control_fd[i], MAP_DEFINE);
+	}
+
+	for (i = 0; i < NR_SERVICES; i++)
+		sync_wait(ctx->control_fd[i], UNSHARE);
+	restore_rlimit_nproc();
+
+	for (i = 0; i < NR_SERVICES; i++) {
+		sync_notify(ctx->control_fd[i], RLIMIT_RESTORE);
+	}
+
+	return ctx;
+}
+
+void stop_services(struct services_ctx *ctx)
+{
+	size_t i;
+	int children = ctx->fork_ed;
+
+	/* Well behaved service would pause() and wait for our SIGUSR1, if it
+	 * failed check it early.
+	 */
+	while (1) {
+		for (i = 0; i < NR_SERVICES; i++) {
+			if (ctx->child[i] <= 0)
+				continue;
+
+			errno = 0;
+			pid_t ret = waitpid(ctx->child[i], &ctx->wstatus[i], WNOHANG | __WALL);
+
+			if (!ret)
+				continue;
+
+			if (ret < 0 && errno != ECHILD)
+				warn("(pid=%d): waitpid(%d)", pid, ctx->child[i]);
+
+			ctx->child[i] *= -1;
+			children -= 1;
+		}
+
+		if (!children)
+			break;
+
+		usleep(SERVICE_RUNTIME);
+
+		for (i = 0; i < NR_SERVICES; i++) {
+			if (ctx->child[i] <= 0)
+				continue;
+			if (_sync_notify(ctx->control_fd[i], POST_EXEC) < 0 &&
+			    (errno == EPIPE || errno == ECONNREFUSED))
+				ctx->exec_ed++;
+			close(ctx->control_fd[i]);
+			kill(ctx->child[i], SIGUSR1);
+		}
+	}
+
+	warnx("(pid=%d): stats: fork_ed=%i exec_ed=%i", pid, ctx->fork_ed, ctx->exec_ed);
+}
+
+int count_services(struct services_ctx *ctx)
+{
+	return ctx->exec_ed;
+}
+
+int check_services(struct services_ctx *ctx)
+{
+	size_t i;
+	int correct = 0;
+	
+	for (i = 0; i < NR_SERVICES; i++) {
+		if (WIFEXITED(ctx->wstatus[i]))
+			warnx("(pid=%d): pid %d exited, status=%d",
+				pid, -ctx->child[i], WEXITSTATUS(ctx->wstatus[i]));
+		else if (WIFSIGNALED(ctx->wstatus[i]))
+			warnx("(pid=%d): pid %d killed by signal %d",
+				pid, -ctx->child[i], WTERMSIG(ctx->wstatus[i]));
+
+		/* The only acceptable service termination */
+		if (WIFSIGNALED(ctx->wstatus[i]) && WTERMSIG(ctx->wstatus[i]) == SIGUSR1)
+			correct++;
+	}
+
+	return correct;
+}
+
+
diff --git a/tools/testing/selftests/rlimits/service_common.h b/tools/testing/selftests/rlimits/service_common.h
new file mode 100644
index 000000000000..4a3cd929d865
--- /dev/null
+++ b/tools/testing/selftests/rlimits/service_common.h
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include <sys/types.h>
+
+#define THE_LIMIT 4
+#define NR_SERVICES 5
+#define NR_WORKERS 5
+
+#define ENV_PARAM "I_AM_SERVICE"
+
+enum userns_mode {
+	UM_UNSHARE,		/* setrlimit,clone(0),setuid,unshare,execve */
+	UM_CLONE_NEWUSER,	/* setrlimit,clone(NEWUSER),setuid,execve */
+};
+
+struct services_ctx;
+
+/* Cache current pid */
+extern pid_t pid;
+
+int run_service(enum userns_mode um);
+struct services_ctx *start_services(const char *pathname, enum userns_mode um);
+void stop_services(struct services_ctx *ctx);
+int count_services(struct services_ctx *ctx);
+int check_services(struct services_ctx *ctx);
-- 
2.34.1