From: Chen Ridong <chenridong@huawei.com>

We ran the will-it-scale test case signal1 [1], and the results reveal
that the signal-sending system call does not scale linearly. To
investigate further, we launched varying numbers of Docker containers
and monitored the throughput of each individual container. The results
are as follows:

| Dockers    | 1      | 4      | 8      | 16     | 32     | 64     |
| Throughput | 380068 | 353204 | 308948 | 306453 | 180659 | 129152 |

The data shows a clear trend: as the number of containers increases,
the throughput per container progressively declines.

Analysis identified the root cause of this performance degradation:
the ucounts module accounts rlimit usage with a large number of atomic
operations. When these atomic operations act on the same variable, they
cause a substantial number of cache misses and remote accesses,
ultimately degrading performance.

This patch set addresses the scalability issue in ucounts rlimit
accounting by replacing the atomic rlimit counters with percpu_counter,
which distributes counts across CPU cores to reduce cache contention
under heavy load.

Patch 1 ensures that a ucount is freed only once both its refcount and
its rlimit counts have dropped to zero, minimizing redundant
summations. Patch 2 turns the atomic rlimit counters into
percpu_counters, as suggested by Andrew.

[1] https://github.com/antonblanchard/will-it-scale/blob/master/tests/

---
v2: use percpu_counter instead of caching rlimit.
v1: https://lore.kernel.org/lkml/20250509072054.148257-1-chenridong@huaweicloud.com/

Chen Ridong (2):
  ucounts: free ucount only count and rlimit are zero
  ucounts: turn the atomic rlimit to percpu_counter

 include/linux/user_namespace.h |  17 +++-
 init/main.c                    |   1 +
 ipc/mqueue.c                   |   6 +-
 kernel/signal.c                |   8 +-
 kernel/ucount.c                | 169 +++++++++++++++++++++++----------
 mm/mlock.c                     |   5 +-
 6 files changed, 138 insertions(+), 68 deletions(-)

--
2.34.1
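For readers unfamiliar with percpu_counter, here is a minimal sketch of
the accounting pattern the series moves to. The struct, the function
names, and the limit-check policy below are illustrative assumptions,
not the actual patch:

#include <linux/gfp.h>
#include <linux/percpu_counter.h>

/*
 * Illustrative sketch only: one percpu_counter per tracked rlimit.
 * Increments update a per-CPU slot, so concurrent writers on
 * different CPUs no longer bounce a shared cacheline.
 */
struct rlimit_counter_sketch {
	struct percpu_counter count;
	long max;		/* the rlimit ceiling */
};

static int rlimit_sketch_init(struct rlimit_counter_sketch *rc, long max)
{
	rc->max = max;
	return percpu_counter_init(&rc->count, 0, GFP_KERNEL);
}

/* Charge one unit and report whether the ceiling has been crossed. */
static bool rlimit_sketch_inc_over(struct rlimit_counter_sketch *rc)
{
	percpu_counter_add(&rc->count, 1);
	/*
	 * percpu_counter_read() is cheap but fuzzy (off by up to
	 * batch * nr_cpus); pay for the exact percpu_counter_sum()
	 * only when the approximate value has reached the ceiling.
	 */
	if (percpu_counter_read(&rc->count) < rc->max)
		return false;
	return percpu_counter_sum(&rc->count) > rc->max;
}

Uncharging is percpu_counter_add(&rc->count, -1) and teardown is
percpu_counter_destroy(); the point is that the hot path touches only
a per-CPU slot, while the expensive cross-CPU summation is confined to
the rare near-limit case.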
On Mon, May 19, 2025 at 3:25 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> We ran the will-it-scale test case signal1 [1], and the results reveal
> that the signal-sending system call does not scale linearly. To
> investigate further, we launched varying numbers of Docker containers
> and monitored the throughput of each individual container. The results
> are as follows:
>
> | Dockers    | 1      | 4      | 8      | 16     | 32     | 64     |
> | Throughput | 380068 | 353204 | 308948 | 306453 | 180659 | 129152 |
>
> The data shows a clear trend: as the number of containers increases,
> the throughput per container progressively declines.

But is that actually a problem? Do you have real workloads that
concurrently send so many signals, or create inotify watches so
quickly, that this has an actual performance impact?

> Analysis identified the root cause of this performance degradation:
> the ucounts module accounts rlimit usage with a large number of atomic
> operations. When these atomic operations act on the same variable, they
> cause a substantial number of cache misses and remote accesses,
> ultimately degrading performance.

You're probably running into the namespace-associated ucounts here? So
the issue is probably that Docker creates all your containers with the
same owner UID (EUID at namespace creation), causing them all to
account towards a single ucount, while normally outside of containers,
each RUID has its own ucount instance?

Sharing of rlimits between containers is probably normally undesirable
even without the cacheline bouncing, because it means that too much
resource usage in one container can cause resource allocations in
another container to fail... so I think the real problem here is at a
higher level, in the namespace setup code. Maybe root should be able
to create a namespace that doesn't inherit ucount limits of its owner
UID, or something like that...
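The binding Jann describes happens at user namespace creation. Below is
a condensed paraphrase (not verbatim; inc_user_namespaces() is internal
to kernel/user_namespace.c) of the relevant logic in create_user_ns():

/*
 * Condensed paraphrase of create_user_ns(): the new namespace is
 * charged to a ucounts instance keyed by (parent namespace, creator's
 * EUID), so every container created by the same UID shares -- and
 * contends on -- a single ucounts.
 */
int create_user_ns_sketch(struct cred *new)
{
	struct user_namespace *parent_ns = new->user_ns;
	kuid_t owner = new->euid;	/* same UID for every Docker container */
	struct ucounts *ucounts;

	ucounts = inc_user_namespaces(parent_ns, owner);
	if (!ucounts)
		return -EUSERS;
	/* ... allocate the new ns, then: ns->ucounts = ucounts; ... */
	return 0;
}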
On 2025/5/20 3:32, Jann Horn wrote:
> On Mon, May 19, 2025 at 3:25 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
>> From: Chen Ridong <chenridong@huawei.com>
>>
>> [...]
>>
>> The data shows a clear trend: as the number of containers increases,
>> the throughput per container progressively declines.
>
> But is that actually a problem? Do you have real workloads that
> concurrently send so many signals, or create inotify watches so
> quickly, that this has an actual performance impact?

Thanks Jann,

Unfortunately, I do not have a specific real-world scenario.

>> Analysis identified the root cause of this performance degradation:
>> the ucounts module accounts rlimit usage with a large number of atomic
>> operations. [...]
>
> You're probably running into the namespace-associated ucounts here? So
> the issue is probably that Docker creates all your containers with the
> same owner UID (EUID at namespace creation), causing them all to
> account towards a single ucount, while normally outside of containers,
> each RUID has its own ucount instance?

Yes, every time rlimits change inside the containers, they also have to
update the owner's rlimits in the parent namespace with atomic
operations, even though the containers have their own user_namespace.

Best regards,
Ridong

> Sharing of rlimits between containers is probably normally undesirable
> even without the cacheline bouncing, because it means that too much
> resource usage in one container can cause resource allocations in
> another container to fail... so I think the real problem here is at a
> higher level, in the namespace setup code. Maybe root should be able
> to create a namespace that doesn't inherit ucount limits of its owner
> UID, or something like that...
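The walk Ridong describes looks roughly like the following, a
simplified paraphrase of inc_rlimit_ucounts() in kernel/ucount.c before
this series (the per-namespace max bookkeeping is elided):

/*
 * Simplified paraphrase of inc_rlimit_ucounts(): the charge is
 * propagated to every ancestor namespace's ucounts, and each step is
 * an atomic RMW on a counter shared by all containers owned by the
 * same UID -- the cacheline that bounces in the benchmark above.
 */
long inc_rlimit_ucounts_sketch(struct ucounts *ucounts,
			       enum rlimit_type type, long v)
{
	struct ucounts *iter;
	long ret = 0;

	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
		long new = atomic_long_add_return(v, &iter->rlimit[type]);

		if (new < 0)		/* overflow */
			ret = LONG_MAX;
		else if (iter == ucounts)
			ret = new;
	}
	return ret;
}

With 64 containers owned by the same UID, every signal enqueue in every
container ends in an atomic_long_add_return() on the same per-UID
counter in the parent namespace.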
On Mon, May 19, 2025 at 09:32:17PM +0200, Jann Horn wrote:
> On Mon, May 19, 2025 at 3:25 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
>> [...]
>
> [...]
>
> Sharing of rlimits between containers is probably normally undesirable
> even without the cacheline bouncing, because it means that too much
> resource usage in one container can cause resource allocations in
> another container to fail... so I think the real problem here is at a
> higher level, in the namespace setup code. Maybe root should be able
> to create a namespace that doesn't inherit ucount limits of its owner
> UID, or something like that...

If we allow rlimits not to be inherited in the userns being created,
the user will be able to bypass their rlimits by running a fork bomb
inside the new userns.

Or did I miss your point?

In init_user_ns, all rlimits bound to it are set to RLIM_INFINITY, so
root can only reduce rlimits.

https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c#n1091

--
Rgrds, legion
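The enforcement Alexey refers to also walks the whole ancestor chain; a
simplified paraphrase of is_rlimit_overlimit() from kernel/ucount.c:

/*
 * Simplified paraphrase of is_rlimit_overlimit(): the check visits the
 * owner's ucounts in every ancestor namespace, which is why a fork
 * bomb inside a child userns still hits the creator's limit outside
 * of it.
 */
bool is_rlimit_overlimit_sketch(struct ucounts *ucounts,
				enum rlimit_type type, unsigned long rlimit)
{
	struct ucounts *iter;
	long max = rlimit > LONG_MAX ? LONG_MAX : (long)rlimit;

	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
		long val = atomic_long_read(&iter->rlimit[type]);

		if (val < 0 || val > max)
			return true;
		/* each ancestor namespace may impose a tighter ceiling */
		max = get_userns_rlimit_max(iter->ns, type);
	}
	return false;
}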
On Mon, May 19, 2025 at 11:01 PM Alexey Gladkov <legion@kernel.org> wrote:
> On Mon, May 19, 2025 at 09:32:17PM +0200, Jann Horn wrote:
>> [...]
>>
>> Sharing of rlimits between containers is probably normally undesirable
>> even without the cacheline bouncing, because it means that too much
>> resource usage in one container can cause resource allocations in
>> another container to fail... so I think the real problem here is at a
>> higher level, in the namespace setup code. Maybe root should be able
>> to create a namespace that doesn't inherit ucount limits of its owner
>> UID, or something like that...
>
> If we allow rlimits not to be inherited in the userns being created,
> the user will be able to bypass their rlimits by running a fork bomb
> inside the new userns.
>
> Or did I miss your point?

You're right, I guess it would actually still be necessary to have one
shared limit across the entire container, so rather than not having a
namespace-level ucount, maybe it would make more sense to have a
private ucount instance for a container...

(But to be clear, I'm not invested in this suggestion at all; I just
looked at the patch and was wondering about alternatives, if this is
actually a real performance problem...)
On 2025/5/20 5:24, Jann Horn wrote:
> On Mon, May 19, 2025 at 11:01 PM Alexey Gladkov <legion@kernel.org> wrote:
>> [...]
>>
>> If we allow rlimits not to be inherited in the userns being created,
>> the user will be able to bypass their rlimits by running a fork bomb
>> inside the new userns.
>
> You're right, I guess it would actually still be necessary to have one
> shared limit across the entire container, so rather than not having a
> namespace-level ucount, maybe it would make more sense to have a
> private ucount instance for a container...

It sounds like such a private ucount is what I was trying to implement
in version 1? That approach applied batched counts from the parent for
each user namespace, but it was complex.

Best regards,
Ridong

> (But to be clear, I'm not invested in this suggestion at all; I just
> looked at the patch and was wondering about alternatives, if this is
> actually a real performance problem...)