[RFC next v2 0/5] ucount: add rlimit cache for ucount
Posted by Chen Ridong 9 months ago
We ran the will-it-scale test case signal1 [1], and the results show that
the signal-sending system call does not scale linearly. To investigate
further, we launched varying numbers of Docker containers and monitored
the throughput of each individual container. The results are as follows:

	| Dockers     |1      |4      |8      |16     |32     |64     |
	| Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |

The data shows a clear trend: as the number of containers increases, the
per-container throughput declines. Analysis identified the root cause of
this degradation: the ucounts code accounts rlimit usage with atomic
operations, and when many CPUs update the same atomic variable they cause
a large number of cache misses and remote accesses, which degrades
performance.

Notably, the problem persists even though a new user_namespace is created
when each container starts. All of these containers share the same parent
node, so their rlimit accounting keeps modifying the same atomic variable.

Currently, when a specific rlimit is incremented by 1 in a child user
namespace, the corresponding rlimit in the parent node must also be
incremented by 1. For example, if the ucounts of a task in Docker B is
ucount_b_1, then after incrementing an rlimit of ucount_b_1 by 1, the same
rlimit of the parent node, init_ucounts, must also be incremented by 1.
Each increment must stay within the limits configured for the user
namespaces along the chain; a rough sketch of the current walk follows the
diagram below.

	init_user_ns                             init_ucounts
	^                                              ^
	|                        |                     |
	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
	|                        |                     |
	|<---- usr_ns_b(docker B)|usr_ns_b->ucount---->|
					^
					|
					|
					|
					ucount_b_1
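
For reference, the upward walk in inc_rlimit_get_ucounts() currently has
roughly this shape (paraphrased and abbreviated from kernel/ucount.c):

	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
		/* Every container under the same parent contends on this atomic. */
		long new = atomic_long_add_return(1, &iter->rlimit[type]);

		if (new < 0 || new > max)
			goto dec_unwind;	/* limit exceeded: roll back and fail */
		if (!override_rlimit)
			max = get_userns_rlimit_max(iter->ns, type);
	}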

The expectation is that containers running in separate namespaces remain
isolated and do not interfere with one another. Unfortunately, the current
signal system call does not achieve this level of isolation.

Proposal:

To address this, we propose adding a cache for each user namespace's
rlimit counts. With such a cache, a batch of rlimit charges can be taken
from the parent node in one go. While resources are plentiful, they do not
have to be returned to the parent immediately, and as long as charges are
available in the cache, a user namespace does not need to go back to the
parent node at all. A rough sketch of the idea follows the diagram below.

	init_user_ns                             init_ucounts
	^                                              ^
	|                        |                     |
	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
	|                        |                     |
	|<---- usr_ns_b(docker B)|usr_ns_b->ucount---->|
			^		^
			|		|
			cache_rlimit--->|
					|
					ucount_b_1
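
As a rough illustration only (the structure and helper below are
illustrative, not the actual implementation in this series), the cache
could work along these lines:

	/* Illustrative: a per-namespace cache of charges pre-taken from the parent. */
	struct rlimit_cache {
		atomic_long_t cached;	/* charges already taken from the parent */
		long batch;		/* how many charges to take from the parent at once */
	};

	/*
	 * Consume one charge locally if possible; otherwise charge a whole
	 * batch against the parent counter with a single atomic operation.
	 */
	static bool rlimit_cache_charge(struct rlimit_cache *c,
					atomic_long_t *parent, long parent_max)
	{
		if (atomic_long_add_unless(&c->cached, -1, 0))
			return true;			/* served from the local cache */

		if (atomic_long_add_return(c->batch, parent) > parent_max) {
			atomic_long_sub(c->batch, parent);
			return false;			/* parent limit would be exceeded */
		}
		atomic_long_add(c->batch - 1, &c->cached);
		return true;				/* one charge used, the rest cached */
	}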


The ultimate objective of this solution is to isolate namespaces from one
another as completely as possible. With this patch set applied, the
signal1 results no longer degrade as the number of containers increases,
which meets the goal of linear scalability.

	| Dockers     |1      |4      |8      |16     |32     |64     |
	| Throughput  |381809 |382284 |380640 |383515 |381318 |380120 |

Challenges:

When checking the pending signals of the parent node with
cat /proc/self/status | grep SigQ, the reported value includes the signal
counts cached by its child nodes. As a result, the SigQ value of the
parent node no longer reflects the actual number of pending signals
accurately and instantaneously.

	# cat /proc/self/status | grep SigQ
	SigQ:	16/6187667

TODO:

Add caches for the other rlimits.

[1] https://github.com/antonblanchard/will-it-scale/blob/master/tests/

Chen Ridong (5):
  user_namespace: add children list node
  user_namespace: make user_namespace RCU safe
  user_namespace: add user_ns iteration helper
  ucounts: factor out __inc_rlimit_get_ucounts/__dec_rlimit_put_ucounts
  ucount: add rlimit cache for ucount

 include/linux/user_namespace.h |  23 ++++-
 kernel/signal.c                |   2 +-
 kernel/ucount.c                | 181 +++++++++++++++++++++++++++++----
 kernel/user.c                  |   2 +
 kernel/user_namespace.c        |  60 ++++++++++-
 5 files changed, 243 insertions(+), 25 deletions(-)

-- 
2.34.1
Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
Posted by Christian Brauner 8 months, 4 weeks ago
On Fri, May 09, 2025 at 07:20:49AM +0000, Chen Ridong wrote:
> [...]
> TODO:
> 
> Add cache for the other rlimits.

Woah, I don't think we want to go down that route. That sounds so overly
complex. We should only do that if we absolutely have to. If we can get
away with the percpu counter and some optimizations we might be better
off in the long run.
Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
Posted by Chen Ridong 8 months, 4 weeks ago

On 2025/5/15 18:29, Christian Brauner wrote:
> Woah, I don't think we want to go down that route. That sounds so overly
> complex. We should only do that if we absolutely have to. If we can get
> away with the percpu counter and some optimizations we might be better
> off in the long run.

Thank you for your reply, I will send the next version with percpu_counter.

Thanks,
Ridong
Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
Posted by Andrew Morton 9 months ago
On Fri,  9 May 2025 07:20:49 +0000 Chen Ridong <chenridong@huaweicloud.com> wrote:

> The will-it-scale test case signal1 [1] has been observed. and the test
> results reveal that the signal sending system call lacks linearity.
> To further investigate this issue, we initiated a series of tests by
> launching varying numbers of dockers and closely monitored the throughput
> of each individual docker. The detailed test outcomes are presented as
> follows:
> 
> 	| Dockers     |1      |4      |8      |16     |32     |64     |
> 	| Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
> 
> The data clearly demonstrates a discernible trend: as the quantity of
> dockers increases, the throughput per container progressively declines.
> In-depth analysis has identified the root cause of this performance
> degradation. The ucouts module conducts statistics on rlimit, which
> involves a significant number of atomic operations. These atomic
> operations, when acting on the same variable, trigger a substantial number
> of cache misses or remote accesses, ultimately resulting in a drop in
> performance.

Did you consider simply turning that atomic_t counter into a
percpu_counter?
Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
Posted by Sebastian Andrzej Siewior 9 months ago
On 2025-05-09 13:18:49 [-0700], Andrew Morton wrote:
> On Fri,  9 May 2025 07:20:49 +0000 Chen Ridong <chenridong@huaweicloud.com> wrote:
> 
> > [...]
> 
> Did you consider simply turning that atomic_t counter into a
> percpu_counter?

That sounds like a smaller change. Also, do these 1…64 Docker containers
play signal ping-pong, or is there a real workload behind it?

Sebastian
Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
Posted by Chen Ridong 9 months ago

On 2025/5/12 18:48, Sebastian Andrzej Siewior wrote:
> On 2025-05-09 13:18:49 [-0700], Andrew Morton wrote:
>> On Fri,  9 May 2025 07:20:49 +0000 Chen Ridong <chenridong@huaweicloud.com> wrote:
>>
>>> [...]
>>
>> Did you consider simply turning that atomic_t counter into a
>> percpu_counter?
> 
> That sounds like a smaller change. Also, do these 1…64 docker container
> play signal ping-pong or is there a real workload behind it?
> 
> Sebastian

Hi Andrew and Sebastian,

Thanks for your prompt reply. I'm currently running the will-it-scale
test on a 384-core machine configured to run up to 64 Docker containers
(the maximum number). Even with just 32 containers, the throughput drops
by 53%, which indicates significant scaling challenges.

Your suggestion about using percpu_counter was spot on. I've since
implemented a demo to benchmark its performance. Here's the code I wrote
for testing:

@@ -281,10 +289,10 @@ static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
 {
        struct ucounts *iter, *next;
        for (iter = ucounts; iter != last; iter = next) {
-               long dec = atomic_long_sub_return(1, &iter->rlimit[type]);
-               WARN_ON_ONCE(dec < 0);
+               percpu_counter_sub(&iter->rlimit[type], 1);
+
                next = iter->ns->ucounts;
-               if (dec == 0)
+               if (percpu_counter_compare(&iter->rlimit[type], 0) == 0)
                        put_ucounts(iter);
        }
 }
@@ -295,36 +303,40 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type)
 }

 long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
-                           bool override_rlimit)
+                           bool override_rlimit, long limit)
 {
        /* Caller must hold a reference to ucounts */
        struct ucounts *iter;
        long max = LONG_MAX;
-       long dec, ret = 0;
+       long ret = 0;
+
+       if (override_rlimit)
+               limit = LONG_MAX;

        for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-               long new = atomic_long_add_return(1, &iter->rlimit[type]);
-               if (new < 0 || new > max)
+               max = min(limit, max);
+
+               if (!percpu_counter_limited_add(&iter->rlimit[type], max, 1)) {
+                       ret = -1;
                        goto dec_unwind;
-               if (iter == ucounts)
-                       ret = new;
+               }
                if (!override_rlimit)
                        max = get_userns_rlimit_max(iter->ns, type);
                /*
                 * Grab an extra ucount reference for the caller when
                 * the rlimit count was previously 0.
                 */
-               if (new != 1)
+               if (percpu_counter_compare(&iter->rlimit[type], 1) != 0)
                        continue;
-               if (!get_ucounts(iter))
+               if (!get_ucounts(iter)) {
+                       ret = 0;
                        goto dec_unwind;
+               }
        }
-       return ret;
+       return 1;
 dec_unwind:
-       dec = atomic_long_sub_return(1, &iter->rlimit[type]);
-       WARN_ON_ONCE(dec < 0);
        do_dec_rlimit_put_ucounts(ucounts, iter, type);
-       return 0;
+       return ret;
 }

As shown in the demo code, the current implementation takes a ucounts
reference on the initial rlimit increment and releases it when the rlimit
count drops back to zero. This mechanism was introduced by the following
commits:

  fda31c50292a5062332fa0343c084bd9f46604d9 ("signal: avoid double atomic counter increments for user accounting")

  d64696905554e919321e31afc210606653b8f6a4 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")

  15bc01effefe97757ef02ca09e9d1b927ab22725 ("ucounts: Fix signal ucount refcounting")

This means we have to sum all the per-CPU rlimit counters every time we
increment or decrement the rlimit, and that is expensive.
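
For context, __percpu_counter_compare() can only take its fast path when
the rough count is far enough from the value being compared against;
otherwise it has to do a full per-CPU sum. Roughly, paraphrased from
lib/percpu_counter.c:

	int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
	{
		s64 count;

		count = percpu_counter_read(fbc);
		/* The rough count is usable only if it is far away from rhs. */
		if (abs(count - rhs) > (batch * num_online_cpus()))
			return count > rhs ? 1 : -1;

		/* Otherwise fall back to summing every per-CPU counter. */
		count = percpu_counter_sum(fbc);
		if (count > rhs)
			return 1;
		if (count < rhs)
			return -1;
		return 0;
	}

In the signal1 case the rlimit counter hovers around 0 and 1, so the
comparisons in the demo almost always take the slow percpu_counter_sum()
path, which matches the profile below.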

Running the demo code in a single Docker container yielded a throughput
of only 73,970, significantly lower than expected. Performance profiling
with perf revealed that __percpu_counter_sum is the primary bottleneck:

+   97.44%     0.27%  signal1_process    [k] entry_SYSCALL_64_after_hwframe
+   97.13%     1.96%  signal1_process    [k] do_syscall_64
+   91.54%     0.00%  signal1_process    [.] 0x00007fb905c8d13f
+   78.66%     0.03%  signal1_process    [k] __percpu_counter_compare
+   78.63%    68.18%  signal1_process    [k] __percpu_counter_sum
+   45.17%     0.37%  signal1_process    [k] syscall_exit_to_user_mode
+   44.95%     0.20%  signal1_process    [k] __x64_sys_tgkill
+   44.51%     0.40%  signal1_process    [k] do_send_specific
+   44.16%     0.07%  signal1_process    [k] arch_do_signal_or_restart
+   43.03%     0.37%  signal1_process    [k] do_send_sig_info
+   42.08%     0.34%  signal1_process    [k] __send_signal_locked
+   40.87%     0.03%  signal1_process    [k] sig_get_ucounts
+   40.74%     0.44%  signal1_process    [k] inc_rlimit_get_ucounts
+   40.55%     0.54%  signal1_process    [k] get_signal
+   39.81%     0.07%  signal1_process    [k] dequeue_signal
+   39.00%     0.07%  signal1_process    [k] __sigqueue_free
+   38.94%     0.27%  signal1_process    [k] do_dec_rlimit_put_ucounts
+    8.60%     8.60%  signal1_process    [k] _find_next_or_bit

However, if we can implement a mechanism to unpin ucounts for pending
signal accounting (eliminating the need for the get/put refcount
operations), the percpu_counter approach should effectively resolve this
scalability issue. I am trying to figure this out; if you have any ideas,
please let me know and I will test them.

Thank you for your guidance on this matter.

Best regards,
Ridong

Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
Posted by Alexey Gladkov 8 months, 4 weeks ago
On Fri, May 09, 2025 at 07:20:49AM +0000, Chen Ridong wrote:
> The will-it-scale test case signal1 [1] has been observed. and the test
> results reveal that the signal sending system call lacks linearity.

The signal1 testcase is pretty synthetic. It sends a signal in a busy loop.

Do you have an example of a closer-to-life scenario where this delay
becomes a bottleneck?

https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c

-- 
Rgrds, legion
Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
Posted by Chen Ridong 8 months, 3 weeks ago

On 2025/5/16 19:48, Alexey Gladkov wrote:
> On Fri, May 09, 2025 at 07:20:49AM +0000, Chen Ridong wrote:
>> The will-it-scale test case signal1 [1] has been observed. and the test
>> results reveal that the signal sending system call lacks linearity.
> 
> The signal1 testcase is pretty synthetic. It sends a signal in a busy loop.
> 
> Do you have an example of a closer-to-life scenario where this delay
> becomes a bottleneck ?
> 
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c
> 

Thank you for your prompt reply. Unfortunately, I do not have a specific
scenario.

Motivation:
I plan to use servers with 384 cores, and potentially even more in the
future. Therefore, I am testing these system calls to identify any
scalability bottlenecks that could arise in massively parallel
high-density computing environments.

In addition, we hope that the containers can be isolated as much as
possible to avoid interfering with each other.

Best regards,
Ridong

Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
Posted by Alexey Gladkov 8 months, 3 weeks ago
On Mon, May 19, 2025 at 09:39:34PM +0800, Chen Ridong wrote:
> 
> 
> On 2025/5/16 19:48, Alexey Gladkov wrote:
> > On Fri, May 09, 2025 at 07:20:49AM +0000, Chen Ridong wrote:
> >> The will-it-scale test case signal1 [1] has been observed. and the test
> >> results reveal that the signal sending system call lacks linearity.
> > 
> > The signal1 testcase is pretty synthetic. It sends a signal in a busy loop.
> > 
> > Do you have an example of a closer-to-life scenario where this delay
> > becomes a bottleneck ?
> > 
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c
> > 
> 
> Thank you for your prompt reply. Unfortunately, I do not have the
> specific scenario.
> 
> Motivation:
> I plan to use servers with 384 cores, and potentially even more in the
> future. Therefore, I am testing these system calls to identify any
> scalability bottlenecks that could arise in massively parallel
> high-density computing environments.

But it turns out that you're proposing complex changes for something that
is essentially a non-issue. In the real world, applications don't spam
signals and I'm not sure we want to support that scenario.

> In addition, we hope that the containers can be isolated as much as
> possible to avoid interfering with each other.

But that's impossible. Even before the migration to ucounts, some rlimits
(RLIMIT_MSGQUEUE, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_NPROC) were
bound to user_struct. I mean, the atomic counter and the "bottleneck" were
already there. We can't remove the counters for those rlimits and they
will have an impact.

These rlimits are now counted per-namespace. In real life, docker/podman
creates a separate user namespace for each container from init_user_ns.
Usually only one additional counter is added for each rlimit in this way.

All I'm saying is that "bottleneck" with atomic counter was there before
and can't be removed anywhere.

-- 
Rgrds, legion
Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
Posted by Chen Ridong 8 months, 3 weeks ago

On 2025/5/20 0:32, Alexey Gladkov wrote:
> On Mon, May 19, 2025 at 09:39:34PM +0800, Chen Ridong wrote:
>>
>>
>> On 2025/5/16 19:48, Alexey Gladkov wrote:
>>> On Fri, May 09, 2025 at 07:20:49AM +0000, Chen Ridong wrote:
>>>> The will-it-scale test case signal1 [1] has been observed. and the test
>>>> results reveal that the signal sending system call lacks linearity.
>>>
>>> The signal1 testcase is pretty synthetic. It sends a signal in a busy loop.
>>>
>>> Do you have an example of a closer-to-life scenario where this delay
>>> becomes a bottleneck ?
>>>
>>> https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c
>>>
>>
>> Thank you for your prompt reply. Unfortunately, I do not have the
>> specific scenario.
>>
>> Motivation:
>> I plan to use servers with 384 cores, and potentially even more in the
>> future. Therefore, I am testing these system calls to identify any
>> scalability bottlenecks that could arise in massively parallel
>> high-density computing environments.
> 
> But it turns out that you're proposing complex changes for something that

Using percpu_counter is not as complex as this patch set.

> is essentially a non-issue. In the real world, applications don't spam
> signals and I'm not sure we want to support that scenario.
> 
>> In addition, we hope that the containers can be isolated as much as
>> possible to avoid interfering with each other.
> 
> But that's impossible. Even before migration to ucounts, some rlimits
> (RLIMIT_MSGQUEUE, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_NPROC) were
> bound to user_struct. I mean, atomic counter and "bottleneck" was there.
> We can't remove the counters for that rlimits and they will have an
> impact.
> 
> These rlimits are now counted per-namespace. In real life, docker/podman
> creates a separate user namespace for each container from init_user_ns.
> Usually only one additional counter is added for each rlimit in this way.
> 
The problem is that when an rlimit count is increased inside a Docker
container, the count also has to be increased in init_user_ns, even though
a separate user namespace has been created. This is why I believe this
issue deserves to be fixed.


> All I'm saying is that "bottleneck" with atomic counter was there before
> and can't be removed anywhere.
> 

Yes, it cannot be removed entirely, but maybe we can make it better.

Best regards,
Ridong

Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
Posted by Alexey Gladkov 8 months, 3 weeks ago
On Wed, May 21, 2025 at 09:32:12AM +0800, Chen Ridong wrote:
> 
> 
> On 2025/5/20 0:32, Alexey Gladkov wrote:
> > On Mon, May 19, 2025 at 09:39:34PM +0800, Chen Ridong wrote:
> >>
> >>
> >> On 2025/5/16 19:48, Alexey Gladkov wrote:
> >>> On Fri, May 09, 2025 at 07:20:49AM +0000, Chen Ridong wrote:
> >>>> The will-it-scale test case signal1 [1] has been observed. and the test
> >>>> results reveal that the signal sending system call lacks linearity.
> >>>
> >>> The signal1 testcase is pretty synthetic. It sends a signal in a busy loop.
> >>>
> >>> Do you have an example of a closer-to-life scenario where this delay
> >>> becomes a bottleneck ?
> >>>
> >>> https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c
> >>>
> >>
> >> Thank you for your prompt reply. Unfortunately, I do not have the
> >> specific scenario.
> >>
> >> Motivation:
> >> I plan to use servers with 384 cores, and potentially even more in the
> >> future. Therefore, I am testing these system calls to identify any
> >> scalability bottlenecks that could arise in massively parallel
> >> high-density computing environments.
> > 
> > But it turns out that you're proposing complex changes for something that
> 
> Using percpu_couter is not as complex as this patch set.
> 
> > is essentially a non-issue. In the real world, applications don't spam
> > signals and I'm not sure we want to support that scenario.
> > 
> >> In addition, we hope that the containers can be isolated as much as
> >> possible to avoid interfering with each other.
> > 
> > But that's impossible. Even before migration to ucounts, some rlimits
> > (RLIMIT_MSGQUEUE, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_NPROC) were
> > bound to user_struct. I mean, atomic counter and "bottleneck" was there.
> > We can't remove the counters for that rlimits and they will have an
> > impact.
> > 
> > These rlimits are now counted per-namespace. In real life, docker/podman
> > creates a separate user namespace for each container from init_user_ns.
> > Usually only one additional counter is added for each rlimit in this way.
> > 
> The problem is that when increasing an rlimit in Docker, the limit has
> to be increased in the init_user_ns even if a separate user namespace
> has been created. This is why I believe this issue deserves to be fixed.

man setrlimit(2):

  An unprivileged process may set only its soft limit to a value in the
  range from 0 up to the hard limit, and (irreversibly) lower its hard
  limit. A privileged process may make arbitrary changes to either limit
  value.

This is well-documented behavior, and it applies in all user namespaces.
If an unprivileged user has rlimits in init_user_ns, they should not be
exceeded even if that process uses a different user namespace.

So even if you increase rlimits in the new user namespace as root (in the
new userns), it doesn't mean that rlimits in the parent user namespace
cease to apply or should somehow increase. You still have limitations from
the upper user namespace.

I don't see an issue here.

> > All I'm saying is that "bottleneck" with atomic counter was there before
> > and can't be removed anywhere.
> > 
> 
> Yes, it can not be removed anywhere, maybe we can make it better.

Yes, we probably can, but we need to have a reason to complicate the code.
And we're still talking about a synthetic test.

-- 
Rgrds, legion
Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
Posted by Andrei Vagin 8 months, 3 weeks ago
On Wed, May 21, 2025 at 12:29 AM Alexey Gladkov <legion@kernel.org> wrote:
<....>
>
> > > All I'm saying is that "bottleneck" with atomic counter was there before
> > > and can't be removed anywhere.
> > >
> >
> > Yes, it can not be removed anywhere, maybe we can make it better.
>
> Yes, we probably can, but we need to have a reason to complicate the code.
> And we're still talking about a synthetic test.

I think I have a real use case that will be negatively impacted by this
issue. This involves gVisor with the systrap platform. gVisor is an
application kernel, similar to user-mode Linux. The systrap platform
utilizes seccomp to intercept guest syscalls, meaning each guest syscall
triggers a SIGSYS signal. For some workloads, the signal handling overhead
accounts for over 50% of the total workload execution time.

However, considering the gVisor problem, I think the solution could be
simpler. Each task could reserve one signal in advance. Then, when a signal
is triggered by an 'exception' (e.g., seccomp or page fault), the kernel
could queue a force signal without incurring a ucount charge. Even
currently, such signals are allocated with the override_rlimit flag set,
meaning they are not subject to standard resource limits.
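
A very rough sketch of what that reservation could look like (purely
illustrative; the reserved_sigq field and the helper below are
hypothetical, not existing kernel code):

	/*
	 * Hypothetical: assume each task pre-allocates one struct sigqueue
	 * (e.g. at fork time) and stashes it in task->reserved_sigq.
	 */
	static struct sigqueue *take_reserved_sigqueue(struct task_struct *t)
	{
		struct sigqueue *q = t->reserved_sigq;	/* hypothetical field */

		if (q)
			t->reserved_sigq = NULL;	/* consumed; refill on return to user space */

		/* NULL means fall back to the normal, ucount-charged allocation. */
		return q;
	}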

Thanks,
Andrei