The will-it-scale test case signal1 [1] was run, and the results show that
the signal-sending system call does not scale linearly. To investigate
further, we launched varying numbers of Docker containers and monitored
the throughput of each individual container. The results are as follows:

| Dockers    |1      |4      |8      |16     |32     |64     |
| Throughput |380068 |353204 |308948 |306453 |180659 |129152 |

The data shows a clear trend: as the number of containers increases, the
per-container throughput progressively declines. Analysis identified the
root cause of this degradation: the ucounts module accounts rlimits with a
large number of atomic operations. When these atomic operations act on the
same variable, they cause many cache misses or remote accesses, which
ultimately hurts performance.

Notably, the problem persists even though a new user_namespace is created
when each container starts, because all of these containers share the same
parent node, so the rlimit accounting keeps modifying the same atomic
variable.

Currently, when a specific rlimit is incremented by 1 in a child user
namespace, the corresponding rlimit in the parent node must also be
incremented by 1. For example, if the ucounts of a task in Docker B is
ucount_b_1, then after incrementing the rlimit of ucount_b_1 by 1, the
rlimit of the parent node, init_ucounts, must also be incremented by 1.
This operation must stay within the limits configured for the user
namespaces.

   init_user_ns                            init_ucounts
        ^                                       ^
        |                        |              |
        |<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
        |                        |              |
        |<---- usr_ns_b(docker B)|usr_ns_b->ucount---->|
                                        ^
                                        |
                                        |
                                        |
                                   ucount_b_1

What is expected is that containers operating in separate namespaces
remain isolated and do not interfere with one another. Regrettably, the
current signal system call does not achieve this level of isolation.

Proposal:

To address these issues, this series adds a cache for each user
namespace's rlimits. With such a cache, a batch of rlimit charges can be
allocated to a namespace in one go. While resources are plentiful, they do
not have to be returned to the parent node immediately, and as long as
charges remain in a namespace's cache there is no need to request more
from the parent node.

   init_user_ns                            init_ucounts
        ^                                       ^
        |                        |              |
        |<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
        |                        |              |
        |<---- usr_ns_b(docker B)|usr_ns_b->ucount---->|
                           ^                    ^
                           |                    |
            cache_rlimit--->|                   |
                                           ucount_b_1

The ultimate objective is complete isolation among namespaces. With this
patch set applied, the signal1 results no longer degrade as the number of
containers increases, which meets the goal of linear scalability:

| Dockers    |1      |4      |8      |16     |32     |64     |
| Throughput |381809 |382284 |380640 |383515 |381318 |380120 |

Challenges:

When the pending signals of the parent node are checked with
cat /proc/self/status | grep SigQ, the reported value includes the signal
charges cached by its child nodes. As a result, the SigQ value of the
parent node does not accurately and instantaneously reflect the actual
number of pending signals.

  # cat /proc/self/status | grep SigQ
  SigQ: 16/6187667

TODO:

Add a cache for the other rlimits.
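[Editorial note: for illustration, the batching scheme described above can
be modeled in a few lines of user-space C. This is only a sketch of the
idea; all names here (ns_counter, CACHE_BATCH, charge_bulk, charge_one)
are hypothetical, and the actual series instead modifies the rlimit
accounting in kernel/ucount.c and kernel/user_namespace.c.]

/* toy_rlimit_cache.c - user-space model of the proposed batching scheme.
 * All names are hypothetical; this is not the kernel implementation.
 */
#include <stdio.h>
#include <stdbool.h>

#define CACHE_BATCH 32	/* charges borrowed from the parent in one go */

struct ns_counter {
	struct ns_counter *parent;
	long limit;	/* rlimit configured for this level */
	long charged;	/* charges visible at this level (includes cache) */
	long cache;	/* borrowed from the parent but not yet consumed */
};

/* Propagate a bulk charge up the chain; in the kernel each step would be
 * the contended per-ucounts counter update. */
static bool charge_bulk(struct ns_counter *ns, long n)
{
	for (struct ns_counter *i = ns; i; i = i->parent)
		if (i->charged + n > i->limit)
			return false;
	for (struct ns_counter *i = ns; i; i = i->parent)
		i->charged += n;
	return true;
}

/* Fast path: serve single increments from the local cache. */
static bool charge_one(struct ns_counter *ns)
{
	if (ns->cache == 0) {
		if (!charge_bulk(ns, CACHE_BATCH))
			return false;
		ns->cache = CACHE_BATCH;
	}
	ns->cache--;
	return true;
}

int main(void)
{
	struct ns_counter init_ns  = { .limit = 6187667 };
	struct ns_counter docker_b = { .parent = &init_ns, .limit = 1024 };

	for (int i = 0; i < 100; i++)
		charge_one(&docker_b);

	/* The parent counter was updated only ~100/CACHE_BATCH times, but
	 * its count now also includes the child's unused cache (the SigQ
	 * over-reporting noted under "Challenges"). */
	printf("child charged=%ld, parent charged=%ld\n",
	       docker_b.charged, init_ns.charged);
	return 0;
}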
[1] https://github.com/antonblanchard/will-it-scale/blob/master/tests/

Chen Ridong (5):
  user_namespace: add children list node
  usernamespace: make usernamespace rcu safe
  user_namespace: add user_ns iteration helper
  uounts: factor out __inc_rlimit_get_ucounts/__dec_rlimit_put_ucounts
  ucount: add rlimit cache for ucount

 include/linux/user_namespace.h |  23 ++++-
 kernel/signal.c                |   2 +-
 kernel/ucount.c                | 181 +++++++++++++++++++++++++++++----
 kernel/user.c                  |   2 +
 kernel/user_namespace.c        |  60 ++++++++++-
 5 files changed, 243 insertions(+), 25 deletions(-)

-- 
2.34.1
On Fri, May 09, 2025 at 07:20:49AM +0000, Chen Ridong wrote:
> The will-it-scale test case signal1 [1] has been observed. and the test
> results reveal that the signal sending system call lacks linearity.
[...]
> Proposal:
>
> To address the aforementioned issues, the concept of implementing a cache
> for each namespace's rlimit has been proposed. If a cache is added for
> each user namespace's rlimit, a certain amount of rlimits can be allocated
> to a particular namespace in one go.
[...]
> TODO:
>
> Add cache for the other rlimits.

Woah, I don't think we want to go down that route. That sounds so overly
complex. We should only do that if we absolutely have to. If we can get
away with the percpu counter and some optimizations we might be better
off in the long run.
On 2025/5/15 18:29, Christian Brauner wrote:
> Woah, I don't think we want to go down that route. That sounds so overly
> complex. We should only do that if we absolutely have to. If we can get
> away with the percpu counter and some optimizations we might be better
> off in the long run.

Thank you for your reply, I will send the next version with percpu_counter.

Thanks,
Ridong
On Fri, 9 May 2025 07:20:49 +0000 Chen Ridong <chenridong@huaweicloud.com> wrote:

> The will-it-scale test case signal1 [1] has been observed. and the test
> results reveal that the signal sending system call lacks linearity.
[...]
> In-depth analysis has identified the root cause of this performance
> degradation. The ucouts module conducts statistics on rlimit, which
> involves a significant number of atomic operations. These atomic
> operations, when acting on the same variable, trigger a substantial number
> of cache misses or remote accesses, ultimately resulting in a drop in
> performance.

Did you consider simply turning that atomic_t counter into a
percpu_counter?
On 2025-05-09 13:18:49 [-0700], Andrew Morton wrote:
> On Fri, 9 May 2025 07:20:49 +0000 Chen Ridong <chenridong@huaweicloud.com> wrote:
>
> > The will-it-scale test case signal1 [1] has been observed. and the test
> > results reveal that the signal sending system call lacks linearity.
[...]
> > | Dockers |1 |4 |8 |16 |32 |64 |
> > | Throughput |380068 |353204 |308948 |306453 |180659 |129152 |
[...]
> Did you consider simply turning that atomic_t counter into a
> percpu_counter?

That sounds like a smaller change. Also, do these 1…64 docker container
play signal ping-pong or is there a real workload behind it?

Sebastian
On 2025/5/12 18:48, Sebastian Andrzej Siewior wrote:
> On 2025-05-09 13:18:49 [-0700], Andrew Morton wrote:
>> On Fri, 9 May 2025 07:20:49 +0000 Chen Ridong <chenridong@huaweicloud.com> wrote:
>>
>>> The will-it-scale test case signal1 [1] has been observed. and the test
>>> results reveal that the signal sending system call lacks linearity.
>>> To further investigate this issue, we initiated a series of tests by
>>> launching varying numbers of dockers and closely monitored the throughput
>>> of each individual docker. The detailed test outcomes are presented as
>>> follows:
>>>
>>> | Dockers |1 |4 |8 |16 |32 |64 |
>>> | Throughput |380068 |353204 |308948 |306453 |180659 |129152 |
>>>
>>> The data clearly demonstrates a discernible trend: as the quantity of
>>> dockers increases, the throughput per container progressively declines.
>>> In-depth analysis has identified the root cause of this performance
>>> degradation. The ucouts module conducts statistics on rlimit, which
>>> involves a significant number of atomic operations. These atomic
>>> operations, when acting on the same variable, trigger a substantial number
>>> of cache misses or remote accesses, ultimately resulting in a drop in
>>> performance.
>>
>> Did you consider simply turning that atomic_t counter into a
>> percpu_counter?
>
> That sounds like a smaller change. Also, do these 1…64 docker container
> play signal ping-pong or is there a real workload behind it?
>
> Sebastian
Hi Andrew and Sebastian,
Thanks for your prompt reply. I'm currently conducting a "will-it-scale"
test on a 384-core machine configured to run up to 64 Docker
containers (the maximum in this setup). Even with just 32 containers, the throughput drops
by 53%, which indicates significant scaling challenges.
Your suggestion about using percpu_counter was spot on. I've since
implemented a demo to benchmark its performance. Here's the code I wrote
for testing:
@@ -281,10 +289,10 @@ static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
 {
 	struct ucounts *iter, *next;
 	for (iter = ucounts; iter != last; iter = next) {
-		long dec = atomic_long_sub_return(1, &iter->rlimit[type]);
-		WARN_ON_ONCE(dec < 0);
+		percpu_counter_sub(&iter->rlimit[type], 1);
+
 		next = iter->ns->ucounts;
-		if (dec == 0)
+		if (percpu_counter_compare(&iter->rlimit[type], 0) == 0)
 			put_ucounts(iter);
 	}
 }
@@ -295,36 +303,40 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type)
 }
 
 long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
-			    bool override_rlimit)
+			    bool override_rlimit, long limit)
 {
 	/* Caller must hold a reference to ucounts */
 	struct ucounts *iter;
 	long max = LONG_MAX;
-	long dec, ret = 0;
+	long ret = 0;
+
+	if (override_rlimit)
+		limit = LONG_MAX;
 
 	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-		long new = atomic_long_add_return(1, &iter->rlimit[type]);
-		if (new < 0 || new > max)
+		max = min(limit, max);
+
+		if (!percpu_counter_limited_add(&iter->rlimit[type], max, 1)) {
+			ret = -1;
 			goto dec_unwind;
-		if (iter == ucounts)
-			ret = new;
+		}
 		if (!override_rlimit)
 			max = get_userns_rlimit_max(iter->ns, type);
 		/*
 		 * Grab an extra ucount reference for the caller when
 		 * the rlimit count was previously 0.
 		 */
-		if (new != 1)
+		if (percpu_counter_compare(&iter->rlimit[type], 1) != 0)
 			continue;
-		if (!get_ucounts(iter))
+		if (!get_ucounts(iter)) {
+			ret = 0;
 			goto dec_unwind;
+		}
 	}
-	return ret;
+	return 1;
 
 dec_unwind:
-	dec = atomic_long_sub_return(1, &iter->rlimit[type]);
-	WARN_ON_ONCE(dec < 0);
 	do_dec_rlimit_put_ucounts(ucounts, iter, type);
-	return 0;
+	return ret;
 }
As shown in the demo code, the current implementation retrieves ucounts
during the initial rlimit increment and releases them when the rlimit
hits zero. This mechanism was introduced with the commits:
fda31c50292a5062332fa0343c084bd9f46604d9 signal: avoid double atomic
counter increments for user accounting
d64696905554e919321e31afc210606653b8f6a4 Reimplement
RLIMIT_SIGPENDING on top of ucounts
15bc01effefe97757ef02ca09e9d1b927ab22725 ucounts: Fix signal ucount
refcounting
This means that we have to sum all the percpu rlimit counters every
time we increment or decrement the rlimit, and that is expensive.
Running the demo code in a single Docker container yielded a throughput
of 73,970, significantly lower than expected. Performance profiling via
perf revealed that __percpu_counter_sum is the primary performance bottleneck:
+ 97.44% 0.27% signal1_process [k] entry_SYSCALL_64_after_hwframe
+ 97.13% 1.96% signal1_process [k] do_syscall_64
+ 91.54% 0.00% signal1_process [.] 0x00007fb905c8d13f
+ 78.66% 0.03% signal1_process [k] __percpu_counter_compare
+ 78.63% 68.18% signal1_process [k] __percpu_counter_sum
+ 45.17% 0.37% signal1_process [k] syscall_exit_to_user_mode
+ 44.95% 0.20% signal1_process [k] __x64_sys_tgkill
+ 44.51% 0.40% signal1_process [k] do_send_specific
+ 44.16% 0.07% signal1_process [k] arch_do_signal_or_restart
+ 43.03% 0.37% signal1_process [k] do_send_sig_info
+ 42.08% 0.34% signal1_process [k] __send_signal_locked
+ 40.87% 0.03% signal1_process [k] sig_get_ucounts
+ 40.74% 0.44% signal1_process [k] inc_rlimit_get_ucounts
+ 40.55% 0.54% signal1_process [k] get_signal
+ 39.81% 0.07% signal1_process [k] dequeue_signal
+ 39.00% 0.07% signal1_process [k] __sigqueue_free
+ 38.94% 0.27% signal1_process [k] do_dec_rlimit_put_ucounts
+ 8.60% 8.60% signal1_process [k] _find_next_or_bit
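[Editorial note: a large part of this cost comes from how percpu_counter
comparisons work. The cheap approximate read can only be trusted when the
count is farther from the compared value than the worst-case per-cpu
drift, and the demo compares against 0 and 1 on every decrement and
increment, so on a 384-core machine the precise sum is taken almost every
time. Below is a simplified paraphrase of the logic in
lib/percpu_counter.c, illustrative only, not the exact source.]

/* Simplified paraphrase of __percpu_counter_compare(): the fast path only
 * applies when the rough count is farther from 'rhs' than the maximum
 * per-cpu error (batch * number of online CPUs). */
#include <linux/percpu_counter.h>
#include <linux/cpumask.h>

static int percpu_counter_compare_sketch(struct percpu_counter *fbc,
					 s64 rhs, s32 batch)
{
	s64 count = percpu_counter_read(fbc);	/* approximate, no summing */

	if (abs(count - rhs) > (s64)batch * num_online_cpus())
		return count > rhs ? 1 : -1;

	/* Within the error margin: fall back to the expensive exact sum. */
	count = percpu_counter_sum(fbc);
	if (count > rhs)
		return 1;
	if (count < rhs)
		return -1;
	return 0;
}

[Comparing the counter against 0 or 1, as the demo does in
do_dec_rlimit_put_ucounts() and inc_rlimit_get_ucounts(), almost always
lands within that margin, which is why __percpu_counter_sum dominates the
profile above.]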
However, if we can implement a mechanism to unpin ucounts for
signal-pending operations (eliminating the need for the get/put refcount
operations), the percpu_counter approach should effectively resolve this
scalability issue. I am trying to figure this out; if you have any
ideas, please let me know and I will test them.
Thank you for your guidance on this matter.
Best regards,
Ridong
On Fri, May 09, 2025 at 07:20:49AM +0000, Chen Ridong wrote:
> The will-it-scale test case signal1 [1] has been observed. and the test
> results reveal that the signal sending system call lacks linearity.

The signal1 testcase is pretty synthetic. It sends a signal in a busy loop.

Do you have an example of a closer-to-life scenario where this delay
becomes a bottleneck ?

https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c

[...]

-- 
Rgrds, legion
On 2025/5/16 19:48, Alexey Gladkov wrote:
> On Fri, May 09, 2025 at 07:20:49AM +0000, Chen Ridong wrote:
>> The will-it-scale test case signal1 [1] has been observed. and the test
>> results reveal that the signal sending system call lacks linearity.
>
> The signal1 testcase is pretty synthetic. It sends a signal in a busy loop.
>
> Do you have an example of a closer-to-life scenario where this delay
> becomes a bottleneck ?
>
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c

Thank you for your prompt reply. Unfortunately, I do not have a specific
scenario.

Motivation:
I plan to use servers with 384 cores, and potentially even more in the
future. Therefore, I am testing these system calls to identify any
scalability bottlenecks that could arise in massively parallel
high-density computing environments.

In addition, we hope that the containers can be isolated as much as
possible to avoid interfering with each other.

Best regards,
Ridong
On Mon, May 19, 2025 at 09:39:34PM +0800, Chen Ridong wrote:
> On 2025/5/16 19:48, Alexey Gladkov wrote:
> > The signal1 testcase is pretty synthetic. It sends a signal in a busy loop.
> >
> > Do you have an example of a closer-to-life scenario where this delay
> > becomes a bottleneck ?
>
> Thank you for your prompt reply. Unfortunately, I do not have a specific
> scenario.
>
> Motivation:
> I plan to use servers with 384 cores, and potentially even more in the
> future. Therefore, I am testing these system calls to identify any
> scalability bottlenecks that could arise in massively parallel
> high-density computing environments.

But it turns out that you're proposing complex changes for something that
is essentially a non-issue. In the real world, applications don't spam
signals and I'm not sure we want to support that scenario.

> In addition, we hope that the containers can be isolated as much as
> possible to avoid interfering with each other.

But that's impossible. Even before migration to ucounts, some rlimits
(RLIMIT_MSGQUEUE, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_NPROC) were
bound to user_struct. I mean, atomic counter and "bottleneck" was there.
We can't remove the counters for that rlimits and they will have an
impact.

These rlimits are now counted per-namespace. In real life, docker/podman
creates a separate user namespace for each container from init_user_ns.
Usually only one additional counter is added for each rlimit in this way.

All I'm saying is that "bottleneck" with atomic counter was there before
and can't be removed anywhere.

[...]

-- 
Rgrds, legion
On 2025/5/20 0:32, Alexey Gladkov wrote:
> On Mon, May 19, 2025 at 09:39:34PM +0800, Chen Ridong wrote:
>> Motivation:
>> I plan to use servers with 384 cores, and potentially even more in the
>> future. Therefore, I am testing these system calls to identify any
>> scalability bottlenecks that could arise in massively parallel
>> high-density computing environments.
>
> But it turns out that you're proposing complex changes for something that

Using percpu_counter is not as complex as this patch set.

> is essentially a non-issue. In the real world, applications don't spam
> signals and I'm not sure we want to support that scenario.
>
>> In addition, we hope that the containers can be isolated as much as
>> possible to avoid interfering with each other.
>
> But that's impossible. Even before migration to ucounts, some rlimits
> (RLIMIT_MSGQUEUE, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_NPROC) were
> bound to user_struct. I mean, atomic counter and "bottleneck" was there.
> We can't remove the counters for that rlimits and they will have an
> impact.
>
> These rlimits are now counted per-namespace. In real life, docker/podman
> creates a separate user namespace for each container from init_user_ns.
> Usually only one additional counter is added for each rlimit in this way.

The problem is that when increasing an rlimit in Docker, the limit has
to be increased in the init_user_ns even if a separate user namespace
has been created. This is why I believe this issue deserves to be fixed.

> All I'm saying is that "bottleneck" with atomic counter was there before
> and can't be removed anywhere.

Yes, it cannot be removed anywhere; maybe we can make it better.

Best regards,
Ridong
On Wed, May 21, 2025 at 09:32:12AM +0800, Chen Ridong wrote:
> On 2025/5/20 0:32, Alexey Gladkov wrote:
> > But it turns out that you're proposing complex changes for something that
>
> Using percpu_counter is not as complex as this patch set.
>
> > is essentially a non-issue. In the real world, applications don't spam
> > signals and I'm not sure we want to support that scenario.
> >
> > These rlimits are now counted per-namespace. In real life, docker/podman
> > creates a separate user namespace for each container from init_user_ns.
> > Usually only one additional counter is added for each rlimit in this way.
>
> The problem is that when increasing an rlimit in Docker, the limit has
> to be increased in the init_user_ns even if a separate user namespace
> has been created. This is why I believe this issue deserves to be fixed.

man setrlimit(2):

  An unprivileged process may set only its soft limit to a value in the
  range from 0 up to the hard limit, and (irreversibly) lower its hard
  limit. A privileged process may make arbitrary changes to either limit
  value.

This is a well-documented behavior. But it works in all user namespaces.
If an unprivileged user has rlimits in init_user_ns, they should not be
exceeded even if that process uses a different user namespace.

So even if you increase rlimits in the new user namespace as root (in the
new userns), it doesn't mean that rlimits in the parent user namespace
cease to apply or should somehow increase. You still have limitations
from the upper user namespace.

I don't see an issue here.

> > All I'm saying is that "bottleneck" with atomic counter was there before
> > and can't be removed anywhere.
>
> Yes, it cannot be removed anywhere; maybe we can make it better.

Yes, we probably can, but we need to have a reason to complicate the code.
And we're still talking about a synthetic test.

-- 
Rgrds, legion
On Wed, May 21, 2025 at 12:29 AM Alexey Gladkov <legion@kernel.org> wrote:
<....>
> > > All I'm saying is that "bottleneck" with atomic counter was there before
> > > and can't be removed anywhere.
> >
> > Yes, it cannot be removed anywhere; maybe we can make it better.
>
> Yes, we probably can, but we need to have a reason to complicate the code.
> And we're still talking about a synthetic test.

I think I have a real use case that will be negatively impacted by this
issue. This involves gVisor with the systrap platform. gVisor is an
application kernel, similar to user-mode Linux. The systrap platform
utilizes seccomp to intercept guest syscalls, meaning each guest syscall
triggers a SIGSYS signal. For some workloads, the signal handling
overhead accounts for over 50% of the total workload execution time.

However, considering the gVisor problem, I think the solution could be
simpler. Each task could reserve one signal in advance. Then, when a
signal is triggered by an 'exception' (e.g., seccomp or page fault), the
kernel could queue a force signal without incurring a ucount charge.
Even currently, such signals are allocated with the override_rlimit flag
set, meaning they are not subject to standard resource limits.

Thanks,
Andrei
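[Editorial note: a rough sketch of how the per-task reservation idea above
might look. It is purely hypothetical; none of these helpers exist in the
kernel today, and a real design would also need to handle teardown and
nested forced signals.]

/* Hypothetical sketch of the "reserve one signal per task" idea: forced
 * "exception" signals (seccomp SIGSYS, page faults) would take a slot
 * pre-allocated when the task was created and skip the
 * UCOUNT_RLIMIT_SIGPENDING charge, much as override_rlimit already skips
 * the limit check. */
struct sigqueue_reserve {
	struct sigqueue	*q;	/* allocated once when the task is created */
	bool		in_use;
};

static struct sigqueue *sigqueue_take_reserved(struct sigqueue_reserve *res)
{
	if (res->in_use)
		return NULL;	/* fall back to the normal, charged path */
	res->in_use = true;
	return res->q;
}

static void sigqueue_put_reserved(struct sigqueue_reserve *res)
{
	/* On dequeue of the forced signal; no ucount uncharge is needed. */
	res->in_use = false;
}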