[PATCH] signal: restore the override_rlimit logic

Roman Gushchin posted 1 patch 3 weeks, 3 days ago
There is a newer version of this series
include/linux/user_namespace.h | 3 ++-
kernel/signal.c                | 3 ++-
kernel/ucount.c                | 5 +++--
3 files changed, 7 insertions(+), 4 deletions(-)
[PATCH] signal: restore the override_rlimit logic
Posted by Roman Gushchin 3 weeks, 3 days ago
Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of
ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class
of signals. However now it's enforced unconditionally, even if
override_rlimit is set. This behavior change caused production issues.

For example, if the limit is reached and a process receives a SIGSEGV
signal, sigqueue_alloc fails to allocate the necessary resources for the
signal delivery, preventing the signal from being delivered with
siginfo. This prevents the process from correctly identifying the fault
address and handling the error. From the user-space perspective,
applications are unaware that the limit has been reached and that the
siginfo is effectively 'corrupted'. This can lead to unpredictable
behavior and crashes, as we observed with java applications.

Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and
skip the comparison to max there if override_rlimit is set. This
effectively restores the old behavior.

Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Co-developed-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Gladkov <legion@kernel.org>
Cc: <stable@vger.kernel.org>
---
 include/linux/user_namespace.h | 3 ++-
 kernel/signal.c                | 3 ++-
 kernel/ucount.c                | 5 +++--
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 3625096d5f85..7183e5aca282 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -141,7 +141,8 @@ static inline long get_rlimit_value(struct ucounts *ucounts, enum rlimit_type ty
 
 long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
 bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
-long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type);
+long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
+			    bool override_rlimit);
 void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type);
 bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigned long max);
 
diff --git a/kernel/signal.c b/kernel/signal.c
index 4344860ffcac..cbabb2d05e0a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -419,7 +419,8 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
 	 */
 	rcu_read_lock();
 	ucounts = task_ucounts(t);
-	sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
+	sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING,
+					    override_rlimit);
 	rcu_read_unlock();
 	if (!sigpending)
 		return NULL;
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 16c0ea1cb432..046b3d57ebb4 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -307,7 +307,8 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type)
 	do_dec_rlimit_put_ucounts(ucounts, NULL, type);
 }
 
-long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type)
+long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
+			    bool override_rlimit)
 {
 	/* Caller must hold a reference to ucounts */
 	struct ucounts *iter;
@@ -316,7 +317,7 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type)
 
 	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
 		long new = atomic_long_add_return(1, &iter->rlimit[type]);
-		if (new < 0 || new > max)
+		if (new < 0 || (!override_rlimit && (new > max)))
 			goto unwind;
 		if (iter == ucounts)
 			ret = new;
-- 
2.47.0.163.g1226f6d8fa-goog
Re: [PATCH] signal: restore the override_rlimit logic
Posted by Alexey Gladkov 3 weeks, 1 day ago
On Thu, Oct 31, 2024 at 08:04:38PM +0000, Roman Gushchin wrote:
> Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of
> ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class
> of signals. However now it's enforced unconditionally, even if
> override_rlimit is set. This behavior change caused production issues.
> 
> For example, if the limit is reached and a process receives a SIGSEGV
> signal, sigqueue_alloc fails to allocate the necessary resources for the
> signal delivery, preventing the signal from being delivered with
> siginfo. This prevents the process from correctly identifying the fault
> address and handling the error. From the user-space perspective,
> applications are unaware that the limit has been reached and that the
> siginfo is effectively 'corrupted'. This can lead to unpredictable
> behavior and crashes, as we observed with java applications.
> 
> Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and
> skip the comparison to max there if override_rlimit is set. This
> effectively restores the old behavior.
> 
> Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Co-developed-by: Andrei Vagin <avagin@google.com>
> Signed-off-by: Andrei Vagin <avagin@google.com>
> Cc: Kees Cook <kees@kernel.org>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Alexey Gladkov <legion@kernel.org>
> Cc: <stable@vger.kernel.org>
> ---
>  include/linux/user_namespace.h | 3 ++-
>  kernel/signal.c                | 3 ++-
>  kernel/ucount.c                | 5 +++--
>  3 files changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 3625096d5f85..7183e5aca282 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -141,7 +141,8 @@ static inline long get_rlimit_value(struct ucounts *ucounts, enum rlimit_type ty
>  
>  long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
>  bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
> -long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type);
> +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
> +			    bool override_rlimit);
>  void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type);
>  bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigned long max);
>  
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 4344860ffcac..cbabb2d05e0a 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -419,7 +419,8 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
>  	 */
>  	rcu_read_lock();
>  	ucounts = task_ucounts(t);
> -	sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
> +	sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING,
> +					    override_rlimit);
>  	rcu_read_unlock();
>  	if (!sigpending)
>  		return NULL;
> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index 16c0ea1cb432..046b3d57ebb4 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -307,7 +307,8 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type)
>  	do_dec_rlimit_put_ucounts(ucounts, NULL, type);
>  }
>  
> -long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type)
> +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
> +			    bool override_rlimit)
>  {
>  	/* Caller must hold a reference to ucounts */
>  	struct ucounts *iter;
> @@ -316,7 +317,7 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type)
>  
>  	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
>  		long new = atomic_long_add_return(1, &iter->rlimit[type]);
> -		if (new < 0 || new > max)
> +		if (new < 0 || (!override_rlimit && (new > max)))
>  			goto unwind;
>  		if (iter == ucounts)
>  			ret = new;

It's a bad patch. If we do as you suggest, it will
do_dec_rlimit_put_ucounts() in case of overflow. This means you'll
break the counter and there will be an extra decrement in __sigqueue_free().
We can't just ignore the overflow here.

-- 
Rgrds, legion
Re: [PATCH] signal: restore the override_rlimit logic
Posted by Roman Gushchin 3 weeks, 1 day ago
On Sat, Nov 02, 2024 at 12:28:38AM +0100, Alexey Gladkov wrote:
> On Thu, Oct 31, 2024 at 08:04:38PM +0000, Roman Gushchin wrote:
> > Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of
> > ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class
> > of signals. However now it's enforced unconditionally, even if
> > override_rlimit is set. This behavior change caused production issues.
> > 
> > For example, if the limit is reached and a process receives a SIGSEGV
> > signal, sigqueue_alloc fails to allocate the necessary resources for the
> > signal delivery, preventing the signal from being delivered with
> > siginfo. This prevents the process from correctly identifying the fault
> > address and handling the error. From the user-space perspective,
> > applications are unaware that the limit has been reached and that the
> > siginfo is effectively 'corrupted'. This can lead to unpredictable
> > behavior and crashes, as we observed with java applications.
> > 
> > Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and
> > skip the comparison to max there if override_rlimit is set. This
> > effectively restores the old behavior.
> > 
> > Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
> > Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> > Co-developed-by: Andrei Vagin <avagin@google.com>
> > Signed-off-by: Andrei Vagin <avagin@google.com>
> > Cc: Kees Cook <kees@kernel.org>
> > Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> > Cc: Alexey Gladkov <legion@kernel.org>
> > Cc: <stable@vger.kernel.org>
> > ---
> >  include/linux/user_namespace.h | 3 ++-
> >  kernel/signal.c                | 3 ++-
> >  kernel/ucount.c                | 5 +++--
> >  3 files changed, 7 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> > index 3625096d5f85..7183e5aca282 100644
> > --- a/include/linux/user_namespace.h
> > +++ b/include/linux/user_namespace.h
> > @@ -141,7 +141,8 @@ static inline long get_rlimit_value(struct ucounts *ucounts, enum rlimit_type ty
> >  
> >  long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
> >  bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
> > -long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type);
> > +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
> > +			    bool override_rlimit);
> >  void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type);
> >  bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigned long max);
> >  
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index 4344860ffcac..cbabb2d05e0a 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -419,7 +419,8 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
> >  	 */
> >  	rcu_read_lock();
> >  	ucounts = task_ucounts(t);
> > -	sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
> > +	sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING,
> > +					    override_rlimit);
> >  	rcu_read_unlock();
> >  	if (!sigpending)
> >  		return NULL;
> > diff --git a/kernel/ucount.c b/kernel/ucount.c
> > index 16c0ea1cb432..046b3d57ebb4 100644
> > --- a/kernel/ucount.c
> > +++ b/kernel/ucount.c
> > @@ -307,7 +307,8 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type)
> >  	do_dec_rlimit_put_ucounts(ucounts, NULL, type);
> >  }
> >  
> > -long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type)
> > +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
> > +			    bool override_rlimit)
> >  {
> >  	/* Caller must hold a reference to ucounts */
> >  	struct ucounts *iter;
> > @@ -316,7 +317,7 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type)
> >  
> >  	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
> >  		long new = atomic_long_add_return(1, &iter->rlimit[type]);
> > -		if (new < 0 || new > max)
> > +		if (new < 0 || (!override_rlimit && (new > max)))
> >  			goto unwind;
> >  		if (iter == ucounts)
> >  			ret = new;
> 
> It's a bad patch. If we do as you suggest, it will
> do_dec_rlimit_put_ucounts() in case of overflow. This means you'll
> break the counter and there will be an extra decrement in __sigqueue_free().
> We can't just ignore the overflow here.

Hm, I don't think my code is changing anything in terms of the overflow handling.
The (new < 0) handling is exactly the same as it was, the only difference is
that (new > max) is allowed if override_rlimit is set. But new physically
can't be larger than LONG_MAX, so there is no actual change if the limit
is LONG_MAX.

Maybe I'm missing something here, please, clarify.

Thanks!
Re: [PATCH] signal: restore the override_rlimit logic
Posted by Alexey Gladkov 3 weeks, 1 day ago
On Fri, Nov 01, 2024 at 11:50:54PM +0000, Roman Gushchin wrote:
> On Sat, Nov 02, 2024 at 12:28:38AM +0100, Alexey Gladkov wrote:
> > On Thu, Oct 31, 2024 at 08:04:38PM +0000, Roman Gushchin wrote:
> > > Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of
> > > ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class
> > > of signals. However now it's enforced unconditionally, even if
> > > override_rlimit is set. This behavior change caused production issues.
> > > 
> > > For example, if the limit is reached and a process receives a SIGSEGV
> > > signal, sigqueue_alloc fails to allocate the necessary resources for the
> > > signal delivery, preventing the signal from being delivered with
> > > siginfo. This prevents the process from correctly identifying the fault
> > > address and handling the error. From the user-space perspective,
> > > applications are unaware that the limit has been reached and that the
> > > siginfo is effectively 'corrupted'. This can lead to unpredictable
> > > behavior and crashes, as we observed with java applications.
> > > 
> > > Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and
> > > skip the comparison to max there if override_rlimit is set. This
> > > effectively restores the old behavior.
> > > 
> > > Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
> > > Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> > > Co-developed-by: Andrei Vagin <avagin@google.com>
> > > Signed-off-by: Andrei Vagin <avagin@google.com>
> > > Cc: Kees Cook <kees@kernel.org>
> > > Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> > > Cc: Alexey Gladkov <legion@kernel.org>
> > > Cc: <stable@vger.kernel.org>
> > > ---
> > >  include/linux/user_namespace.h | 3 ++-
> > >  kernel/signal.c                | 3 ++-
> > >  kernel/ucount.c                | 5 +++--
> > >  3 files changed, 7 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> > > index 3625096d5f85..7183e5aca282 100644
> > > --- a/include/linux/user_namespace.h
> > > +++ b/include/linux/user_namespace.h
> > > @@ -141,7 +141,8 @@ static inline long get_rlimit_value(struct ucounts *ucounts, enum rlimit_type ty
> > >  
> > >  long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
> > >  bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
> > > -long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type);
> > > +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
> > > +			    bool override_rlimit);
> > >  void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type);
> > >  bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigned long max);
> > >  
> > > diff --git a/kernel/signal.c b/kernel/signal.c
> > > index 4344860ffcac..cbabb2d05e0a 100644
> > > --- a/kernel/signal.c
> > > +++ b/kernel/signal.c
> > > @@ -419,7 +419,8 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
> > >  	 */
> > >  	rcu_read_lock();
> > >  	ucounts = task_ucounts(t);
> > > -	sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
> > > +	sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING,
> > > +					    override_rlimit);
> > >  	rcu_read_unlock();
> > >  	if (!sigpending)
> > >  		return NULL;
> > > diff --git a/kernel/ucount.c b/kernel/ucount.c
> > > index 16c0ea1cb432..046b3d57ebb4 100644
> > > --- a/kernel/ucount.c
> > > +++ b/kernel/ucount.c
> > > @@ -307,7 +307,8 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type)
> > >  	do_dec_rlimit_put_ucounts(ucounts, NULL, type);
> > >  }
> > >  
> > > -long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type)
> > > +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
> > > +			    bool override_rlimit)
> > >  {
> > >  	/* Caller must hold a reference to ucounts */
> > >  	struct ucounts *iter;
> > > @@ -316,7 +317,7 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type)
> > >  
> > >  	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
> > >  		long new = atomic_long_add_return(1, &iter->rlimit[type]);
> > > -		if (new < 0 || new > max)
> > > +		if (new < 0 || (!override_rlimit && (new > max)))
> > >  			goto unwind;
> > >  		if (iter == ucounts)
> > >  			ret = new;
> > 
> > It's a bad patch. If we do as you suggest, it will
> > do_dec_rlimit_put_ucounts() in case of overflow. This means you'll
> > break the counter and there will be an extra decrement in __sigqueue_free().
> > We can't just ignore the overflow here.
> 
> Hm, I don't think my code is changing anything in terms of the overflow handling.
> The (new < 0) handling is exactly the same as it was, the only difference is
> that (new > max) is allowed if override_rlimit is set. But new physically
> can't be larger than LONG_MAX, so there is no actual change if the limit
> is LONG_MAX.
> 
> Maybe I'm missing something here, please, clarify.

I re-read your patch one more time. Sorry. Yes, you're right, i am wrong.
You're just allow overlimit.

But one thing confuses me.

Now the maximum rlimits of the upper userns are being forced. Changing
rlimit to RLIM_INFINITY affects only the current userns and child userns.

But after this patch, this is not the case for RLIMIT_SIGPENDING and
within userns it is possible to ignore the restrictions of upper-level
userns which ruins the whole idea.

I agree with Eric. If you don't need upper-level userns limits, you don't
need to set them.

-- 
Rgrds, legion
Re: [PATCH] signal: restore the override_rlimit logic
Posted by Eric W. Biederman 3 weeks, 2 days ago
Roman Gushchin <roman.gushchin@linux.dev> writes:

> Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of
> ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class
> of signals. However now it's enforced unconditionally, even if
> override_rlimit is set.

Not true.

It added a limit on the number of siginfo structures that
a container may allocate.  Have you tried not limiting your
container?

>This behavior change caused production issues.

> For example, if the limit is reached and a process receives a SIGSEGV
> signal, sigqueue_alloc fails to allocate the necessary resources for the
> signal delivery, preventing the signal from being delivered with
> siginfo. This prevents the process from correctly identifying the fault
> address and handling the error. From the user-space perspective,
> applications are unaware that the limit has been reached and that the
> siginfo is effectively 'corrupted'. This can lead to unpredictable
> behavior and crashes, as we observed with java applications.

Note.  There are always conditions when the allocation may fail.
The structure is allocated with __GFP_ATOMIC so it is much more likely
to fail than a typical kernel memory allocation.

But I agree it does look like there is a quality of implementation issue
here.

> Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and
> skip the comparison to max there if override_rlimit is set. This
> effectively restores the old behavior.

Instead please just give the container and unlimited number of siginfo
structures it can play with.

The maximum for rlimit(RLIM_SIGPENDING) is the rlimit(RLIM_SIGPENDING)
value when the user namespace is created.

Given that it took 3 and half years to report this.  I am going to
say this really looks like a userspace bug.



Beyond that your patch is actually buggy, and should not be applied.

If we want to change the semantics and ignore the maximum number of
pending signals in a container (when override_rlimit is set) then
the code should change the computation of the max value (pegging it at
LONG_MAX) and not ignore it.

As it is the patch below disables the check that keeps the ucount
counters from wrapping around.  That makes it possible for someone to
overflow those counters and get into all kinds of trouble.

Eric


> Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Co-developed-by: Andrei Vagin <avagin@google.com>
> Signed-off-by: Andrei Vagin <avagin@google.com>
> Cc: Kees Cook <kees@kernel.org>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Alexey Gladkov <legion@kernel.org>
> Cc: <stable@vger.kernel.org>
> ---
>  include/linux/user_namespace.h | 3 ++-
>  kernel/signal.c                | 3 ++-
>  kernel/ucount.c                | 5 +++--
>  3 files changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 3625096d5f85..7183e5aca282 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -141,7 +141,8 @@ static inline long get_rlimit_value(struct ucounts *ucounts, enum rlimit_type ty
>  
>  long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
>  bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
> -long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type);
> +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
> +			    bool override_rlimit);
>  void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type);
>  bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigned long max);
>  
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 4344860ffcac..cbabb2d05e0a 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -419,7 +419,8 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
>  	 */
>  	rcu_read_lock();
>  	ucounts = task_ucounts(t);
> -	sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
> +	sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING,
> +					    override_rlimit);
>  	rcu_read_unlock();
>  	if (!sigpending)
>  		return NULL;
> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index 16c0ea1cb432..046b3d57ebb4 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -307,7 +307,8 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type)
>  	do_dec_rlimit_put_ucounts(ucounts, NULL, type);
>  }
>  
> -long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type)
> +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
> +			    bool override_rlimit)
>  {
>  	/* Caller must hold a reference to ucounts */
>  	struct ucounts *iter;
> @@ -316,7 +317,7 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type)
>  
>  	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
>  		long new = atomic_long_add_return(1, &iter->rlimit[type]);
> -		if (new < 0 || new > max)
> +		if (new < 0 || (!override_rlimit && (new > max)))
>  			goto unwind;
>  		if (iter == ucounts)
>  			ret = new;
Re: [PATCH] signal: restore the override_rlimit logic
Posted by Roman Gushchin 3 weeks, 2 days ago
On Fri, Nov 01, 2024 at 02:51:00PM -0500, Eric W. Biederman wrote:
> Roman Gushchin <roman.gushchin@linux.dev> writes:
> 
> > Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of
> > ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class
> > of signals. However now it's enforced unconditionally, even if
> > override_rlimit is set.
> 
> Not true.
> 
> It added a limit on the number of siginfo structures that
> a container may allocate.  Have you tried not limiting your
> container?
> 
> >This behavior change caused production issues.
> 
> > For example, if the limit is reached and a process receives a SIGSEGV
> > signal, sigqueue_alloc fails to allocate the necessary resources for the
> > signal delivery, preventing the signal from being delivered with
> > siginfo. This prevents the process from correctly identifying the fault
> > address and handling the error. From the user-space perspective,
> > applications are unaware that the limit has been reached and that the
> > siginfo is effectively 'corrupted'. This can lead to unpredictable
> > behavior and crashes, as we observed with java applications.
> 
> Note.  There are always conditions when the allocation may fail.
> The structure is allocated with __GFP_ATOMIC so it is much more likely
> to fail than a typical kernel memory allocation.
> 
> But I agree it does look like there is a quality of implementation issue
> here.
> 
> > Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and
> > skip the comparison to max there if override_rlimit is set. This
> > effectively restores the old behavior.
> 
> Instead please just give the container and unlimited number of siginfo
> structures it can play with.

Well, personally I'd not use this limit too, but I don't think
"it's broken, userspace shouldn't use it" argument is valid.

> 
> The maximum for rlimit(RLIM_SIGPENDING) is the rlimit(RLIM_SIGPENDING)
> value when the user namespace is created.
> 
> Given that it took 3 and half years to report this.  I am going to
> say this really looks like a userspace bug.

The trick here is another bug fixed by https://lkml.org/lkml/2024/10/31/185.
Basically it's a leak of the rlimit value.
If a limit is set and reached in the reality, all following signals
will not have a siginfo attached, causing applications which depend on
handling SIGSEGV to crash.

> Beyond that your patch is actually buggy, and should not be applied.
> 
> If we want to change the semantics and ignore the maximum number of
> pending signals in a container (when override_rlimit is set) then
> the code should change the computation of the max value (pegging it at
> LONG_MAX) and not ignore it.

Hm, isn't the unconditional (new < 0) enough to capture the overflow?
Actually I'm not sure I understand how "long new" can be "> LONG_MAX"
anyway.

Thanks!
Re: [PATCH] signal: restore the override_rlimit logic
Posted by Eric W. Biederman 3 weeks, 1 day ago
Roman Gushchin <roman.gushchin@linux.dev> writes:

> On Fri, Nov 01, 2024 at 02:51:00PM -0500, Eric W. Biederman wrote:
>> Roman Gushchin <roman.gushchin@linux.dev> writes:
>> 
>> > Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of
>> > ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class
>> > of signals. However now it's enforced unconditionally, even if
>> > override_rlimit is set.
>> 
>> Not true.
>> 
>> It added a limit on the number of siginfo structures that
>> a container may allocate.  Have you tried not limiting your
>> container?
>> 
>> >This behavior change caused production issues.
>> 
>> > For example, if the limit is reached and a process receives a SIGSEGV
>> > signal, sigqueue_alloc fails to allocate the necessary resources for the
>> > signal delivery, preventing the signal from being delivered with
>> > siginfo. This prevents the process from correctly identifying the fault
>> > address and handling the error. From the user-space perspective,
>> > applications are unaware that the limit has been reached and that the
>> > siginfo is effectively 'corrupted'. This can lead to unpredictable
>> > behavior and crashes, as we observed with java applications.
>> 
>> Note.  There are always conditions when the allocation may fail.
>> The structure is allocated with __GFP_ATOMIC so it is much more likely
>> to fail than a typical kernel memory allocation.
>> 
>> But I agree it does look like there is a quality of implementation issue
>> here.
>> 
>> > Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and
>> > skip the comparison to max there if override_rlimit is set. This
>> > effectively restores the old behavior.
>> 
>> Instead please just give the container and unlimited number of siginfo
>> structures it can play with.
>
> Well, personally I'd not use this limit too, but I don't think
> "it's broken, userspace shouldn't use it" argument is valid.

I said if you don't want the limit don't use it.

A version of "Doctor it hurts when I do this". To which the doctor
replies "Don't do that then".

I was also asking that you test with the limit disabled (at user
namespace creation time) so that you can verify that is problem.

>> The maximum for rlimit(RLIM_SIGPENDING) is the rlimit(RLIM_SIGPENDING)
>> value when the user namespace is created.
>> 
>> Given that it took 3 and half years to report this.  I am going to
>> say this really looks like a userspace bug.
>
> The trick here is another bug fixed by https://lkml.org/lkml/2024/10/31/185.
> Basically it's a leak of the rlimit value.
> If a limit is set and reached in the reality, all following signals
> will not have a siginfo attached, causing applications which depend on
> handling SIGSEGV to crash.

I will take a deeper look at the patch you are referring to.

>> Beyond that your patch is actually buggy, and should not be applied.
>> 
>> If we want to change the semantics and ignore the maximum number of
>> pending signals in a container (when override_rlimit is set) then
>> the code should change the computation of the max value (pegging it at
>> LONG_MAX) and not ignore it.
>
> Hm, isn't the unconditional (new < 0) enough to capture the overflow?
> Actually I'm not sure I understand how "long new" can be "> LONG_MAX"
> anyway.

Agreed "new < 0" should catch that, but still splitting the logic
between the calculation of max and the test of max is quite confusing.
It makes much more sense to put the logic into the calculate of max.

Eric
Re: [PATCH] signal: restore the override_rlimit logic
Posted by Andrei Vagin 3 weeks, 1 day ago
On Fri, Nov 1, 2024 at 1:58 PM Eric W. Biederman <ebiederm@xmission.com> wrote:

> > Well, personally I'd not use this limit too, but I don't think
> > "it's broken, userspace shouldn't use it" argument is valid.
>
> I said if you don't want the limit don't use it.
>
> A version of "Doctor it hurts when I do this". To which the doctor
> replies "Don't do that then".

Unfortunately, it isn't an option here. This is a non-root user that
creates a new user-namespace. It can't change RLIMIT_SIGPENDING
beyond the current limit.

We have to distinguish between two types of signals:

* Kernel signals: These are essential for proper application behavior.
If a user process triggers a memory fault, it gets SIGSEGV and it can’t
ignore it. The number of such signals is limited by one per user thread.

* User signals: These are sent by users and can be blocked by the
receiving process, potentially leading to a large queue of pending
signals. This is where the RLIMIT_SIGPENDING limit plays its role in
preventing excessive resource consumption.

New user namespaces inherit the current sigpending rlimit, so it is
expected that  the behavior of the user-namespace limit is aligned with
the overall RLIMIT_SIGPENDING mechanism.

>
> I was also asking that you test with the limit disabled (at user
> namespace creation time) so that you can verify that is problem.
>
> >> The maximum for rlimit(RLIM_SIGPENDING) is the rlimit(RLIM_SIGPENDING)
> >> value when the user namespace is created.
> >>
> >> Given that it took 3 and half years to report this.  I am going to
> >> say this really looks like a userspace bug.

This issue was introduced in  v5.15. The latest rhel release, 9.4, is
based on v5.14...
3.5 years is not a long time...

Thanks,
Andrei
Re: [PATCH] signal: restore the override_rlimit logic
Posted by Alexey Gladkov 3 weeks, 1 day ago
On Fri, Nov 01, 2024 at 03:44:48PM -0700, Andrei Vagin wrote:
> On Fri, Nov 1, 2024 at 1:58 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> 
> > > Well, personally I'd not use this limit too, but I don't think
> > > "it's broken, userspace shouldn't use it" argument is valid.
> >
> > I said if you don't want the limit don't use it.
> >
> > A version of "Doctor it hurts when I do this". To which the doctor
> > replies "Don't do that then".
> 
> Unfortunately, it isn't an option here. This is a non-root user that
> creates a new user-namespace. It can't change RLIMIT_SIGPENDING
> beyond the current limit.
> 
> We have to distinguish between two types of signals:
> 
> * Kernel signals: These are essential for proper application behavior.
> If a user process triggers a memory fault, it gets SIGSEGV and it can’t
> ignore it. The number of such signals is limited by one per user thread.
> 
> * User signals: These are sent by users and can be blocked by the
> receiving process, potentially leading to a large queue of pending
> signals. This is where the RLIMIT_SIGPENDING limit plays its role in
> preventing excessive resource consumption.
> 
> New user namespaces inherit the current sigpending rlimit, so it is
> expected that  the behavior of the user-namespace limit is aligned with
> the overall RLIMIT_SIGPENDING mechanism.

Hm. I think I understand the problem now.

+Cc Oleg Nesterov.

-- 
Rgrds, legion

Re: [PATCH] signal: restore the override_rlimit logic
Posted by Oleg Nesterov 3 weeks ago
On 11/02, Alexey Gladkov wrote:
>
> +Cc Oleg Nesterov.

Well, I tend to agree with Roman and his patch looks good to me.

But it seems that the change in inc_rlimit_get_ucounts() can be
a bit simpler and more readable, see below.

Oleg.
---

--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -307,7 +307,8 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type)
 	do_dec_rlimit_put_ucounts(ucounts, NULL, type);
 }
 
-long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type)
+long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
+			    bool override_rlimit)
 {
 	/* Caller must hold a reference to ucounts */
 	struct ucounts *iter;
@@ -320,7 +321,8 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type)
 			goto unwind;
 		if (iter == ucounts)
 			ret = new;
-		max = get_userns_rlimit_max(iter->ns, type);
+		if (!override_rlimit)
+			max = get_userns_rlimit_max(iter->ns, type);
 		/*
 		 * Grab an extra ucount reference for the caller when
 		 * the rlimit count was previously 0.
Re: [PATCH] signal: restore the override_rlimit logic
Posted by Roman Gushchin 2 weeks, 6 days ago
On Sun, Nov 03, 2024 at 05:50:49PM +0100, Oleg Nesterov wrote:
> On 11/02, Alexey Gladkov wrote:
> >
> > +Cc Oleg Nesterov.
> 
> Well, I tend to agree with Roman and his patch looks good to me.

Thanks, Oleg!

> 
> But it seems that the change in inc_rlimit_get_ucounts() can be
> a bit simpler and more readable, see below.

Eric suggested the same approach earlier in this thread. I personally
don't have a strong preference here or actually I slightly prefer my
own version because this comparison to LONG_MAX looks confusing to me.
But if you have a strong preference, I'm happy to send out v2. Please,
let me know.

Thanks!
Re: [PATCH] signal: restore the override_rlimit logic
Posted by Oleg Nesterov 2 weeks, 6 days ago
On 11/04, Roman Gushchin wrote:
>
> On Sun, Nov 03, 2024 at 05:50:49PM +0100, Oleg Nesterov wrote:
> >
> > But it seems that the change in inc_rlimit_get_ucounts() can be
> > a bit simpler and more readable, see below.
>
> Eric suggested the same approach earlier in this thread.

Ah, good, I didn't know ;)

> I personally
> don't have a strong preference here or actually I slightly prefer my
> own version because this comparison to LONG_MAX looks confusing to me.
> But if you have a strong preference, I'm happy to send out v2. Please,
> let me know.

Well, I won't insist.

To me the change proposed by Eric and me looks much more readable, but
of course this is subjective.

But you know, you can safely ignore me. Alexey and Eric understand this
code much better, so I leave this to you/Alexey/Eric.

Oleg.
Re: [PATCH] signal: restore the override_rlimit logic
Posted by Alexey Gladkov 2 weeks, 6 days ago
On Mon, Nov 04, 2024 at 07:44:43PM +0100, Oleg Nesterov wrote:
> On 11/04, Roman Gushchin wrote:
> >
> > On Sun, Nov 03, 2024 at 05:50:49PM +0100, Oleg Nesterov wrote:
> > >
> > > But it seems that the change in inc_rlimit_get_ucounts() can be
> > > a bit simpler and more readable, see below.
> >
> > Eric suggested the same approach earlier in this thread.
> 
> Ah, good, I didn't know ;)
> 
> > I personally
> > don't have a strong preference here or actually I slightly prefer my
> > own version because this comparison to LONG_MAX looks confusing to me.
> > But if you have a strong preference, I'm happy to send out v2. Please,
> > let me know.
> 
> Well, I won't insist.
> 
> To me the change proposed by Eric and me looks much more readable, but
> of course this is subjective.
> 
> But you know, you can safely ignore me. Alexey and Eric understand this
> code much better, so I leave this to you/Alexey/Eric.

Personally, I like Oleg's patch more.

-- 
Rgrds, legion
Re: [PATCH] signal: restore the override_rlimit logic
Posted by Roman Gushchin 2 weeks, 6 days ago
On Mon, Nov 04, 2024 at 08:02:35PM +0100, Alexey Gladkov wrote:
> On Mon, Nov 04, 2024 at 07:44:43PM +0100, Oleg Nesterov wrote:
> > On 11/04, Roman Gushchin wrote:
> > >
> > > On Sun, Nov 03, 2024 at 05:50:49PM +0100, Oleg Nesterov wrote:
> > > >
> > > > But it seems that the change in inc_rlimit_get_ucounts() can be
> > > > a bit simpler and more readable, see below.
> > >
> > > Eric suggested the same approach earlier in this thread.
> > 
> > Ah, good, I didn't know ;)
> > 
> > > I personally
> > > don't have a strong preference here or actually I slightly prefer my
> > > own version because this comparison to LONG_MAX looks confusing to me.
> > > But if you have a strong preference, I'm happy to send out v2. Please,
> > > let me know.
> > 
> > Well, I won't insist.
> > 
> > To me the change proposed by Eric and me looks much more readable, but
> > of course this is subjective.
> > 
> > But you know, you can safely ignore me. Alexey and Eric understand this
> > code much better, so I leave this to you/Alexey/Eric.
> 
> Personally, I like Oleg's patch more.

Ok, I'll send out a v2 shortly.

Thanks!
Re: [PATCH] signal: restore the override_rlimit logic
Posted by Roman Gushchin 3 weeks, 1 day ago
On Fri, Nov 01, 2024 at 03:58:07PM -0500, Eric W. Biederman wrote:
> Roman Gushchin <roman.gushchin@linux.dev> writes:
> 
> > On Fri, Nov 01, 2024 at 02:51:00PM -0500, Eric W. Biederman wrote:
> >> Roman Gushchin <roman.gushchin@linux.dev> writes:
> >> 
> >> > Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of
> >> > ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class
> >> > of signals. However now it's enforced unconditionally, even if
> >> > override_rlimit is set.
> >> 
> >> Not true.
> >> 
> >> It added a limit on the number of siginfo structures that
> >> a container may allocate.  Have you tried not limiting your
> >> container?
> >> 
> >> >This behavior change caused production issues.
> >> 
> >> > For example, if the limit is reached and a process receives a SIGSEGV
> >> > signal, sigqueue_alloc fails to allocate the necessary resources for the
> >> > signal delivery, preventing the signal from being delivered with
> >> > siginfo. This prevents the process from correctly identifying the fault
> >> > address and handling the error. From the user-space perspective,
> >> > applications are unaware that the limit has been reached and that the
> >> > siginfo is effectively 'corrupted'. This can lead to unpredictable
> >> > behavior and crashes, as we observed with java applications.
> >> 
> >> Note.  There are always conditions when the allocation may fail.
> >> The structure is allocated with __GFP_ATOMIC so it is much more likely
> >> to fail than a typical kernel memory allocation.
> >> 
> >> But I agree it does look like there is a quality of implementation issue
> >> here.
> >> 
> >> > Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and
> >> > skip the comparison to max there if override_rlimit is set. This
> >> > effectively restores the old behavior.
> >> 
> >> Instead please just give the container and unlimited number of siginfo
> >> structures it can play with.
> >
> > Well, personally I'd not use this limit too, but I don't think
> > "it's broken, userspace shouldn't use it" argument is valid.
> 
> I said if you don't want the limit don't use it.
> 
> A version of "Doctor it hurts when I do this". To which the doctor
> replies "Don't do that then".
> 
> I was also asking that you test with the limit disabled (at user
> namespace creation time) so that you can verify that is problem.
> 
> >> The maximum for rlimit(RLIM_SIGPENDING) is the rlimit(RLIM_SIGPENDING)
> >> value when the user namespace is created.
> >> 
> >> Given that it took 3 and half years to report this.  I am going to
> >> say this really looks like a userspace bug.
> >
> > The trick here is another bug fixed by https://lkml.org/lkml/2024/10/31/185.
> > Basically it's a leak of the rlimit value.
> > If a limit is set and reached in the reality, all following signals
> > will not have a siginfo attached, causing applications which depend on
> > handling SIGSEGV to crash.
> 
> I will take a deeper look at the patch you are referring to.
> 
> >> Beyond that your patch is actually buggy, and should not be applied.
> >> 
> >> If we want to change the semantics and ignore the maximum number of
> >> pending signals in a container (when override_rlimit is set) then
> >> the code should change the computation of the max value (pegging it at
> >> LONG_MAX) and not ignore it.
> >
> > Hm, isn't the unconditional (new < 0) enough to capture the overflow?
> > Actually I'm not sure I understand how "long new" can be "> LONG_MAX"
> > anyway.
> 
> Agreed "new < 0" should catch that, but still splitting the logic
> between the calculation of max and the test of max is quite confusing.
> It makes much more sense to put the logic into the calculate of max.

You mean something like this?

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 046b3d57ebb4..49fcec41e5b4 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -317,11 +317,12 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,

        for (iter = ucounts; iter; iter = iter->ns->ucounts) {
                long new = atomic_long_add_return(1, &iter->rlimit[type]);
-               if (new < 0 || (!override_rlimit && (new > max)))
+               if (new < 0 || new > max)
                        goto unwind;
                if (iter == ucounts)
                        ret = new;
-               max = get_userns_rlimit_max(iter->ns, type);
+               if (!override_rlimit)
+                       max = get_userns_rlimit_max(iter->ns, type);
                /*
                 * Grab an extra ucount reference for the caller when
                 * the rlimit count was previously 0.

--

If you strongly prefer this version, I can send a v2. I like my original
version slightly better, but not a strong preference. Please, let me know.

Thanks!