[RFC next v2 0/2] ucounts: turn the atomic rlimit to percpu_counter

Chen Ridong posted 2 patches 7 months ago
[RFC next v2 0/2] ucounts: turn the atomic rlimit to percpu_counter
Posted by Chen Ridong 7 months ago
From: Chen Ridong <chenridong@huawei.com>

The will-it-scale test case signal1 [1] shows that the signal-sending
system call does not scale linearly. To investigate further, we ran a
series of tests that launched varying numbers of Docker containers and
monitored the throughput of each individual container. The results are
as follows:

	| Dockers     |1      |4      |8      |16     |32     |64     |
	| Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |

The data shows a clear trend: as the number of containers increases, the
per-container throughput progressively declines. Analysis identified the
root cause of this degradation: the ucounts module accounts rlimit usage
with atomic operations, and because those operations all act on the same
variables they cause a large number of cache misses and remote accesses,
which ultimately drags down performance.

This patch set addresses the scalability problem in ucounts rlimit
accounting by replacing the atomic rlimit counters with percpu_counter,
which distributes the counts across CPUs and reduces cache contention
under heavy load.

Patch 1 changes the lifetime rules so that a ucount is freed only once
both its refcount and its rlimit counts have dropped to zero, minimizing
redundant summations. Patch 2 converts the atomic rlimit counters to
percpu_counter, as suggested by Andrew.
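
To make the direction concrete, the hot path roughly changes from a
shared atomic read-modify-write into a cheap per-CPU add plus an
occasional sum near the limit. The sketch below is illustrative only
(rlimit_count, rlimit_count_init and rlimit_try_charge are made-up
names, and the real series keeps the per-namespace structure); a
percpu_counter must be initialized after the percpu allocator is up,
which is presumably why the series adds a line to init/main.c:

    #include <linux/init.h>
    #include <linux/gfp.h>
    #include <linux/percpu_counter.h>

    /* placeholder for one rlimit counter; the series keeps these per ucounts */
    static struct percpu_counter rlimit_count;

    /* percpu_counter needs the percpu allocator, so initialize it late */
    static int __init rlimit_count_init(void)
    {
            return percpu_counter_init(&rlimit_count, 0, GFP_KERNEL);
    }

    static bool rlimit_try_charge(long max)
    {
            /* per-CPU add instead of a contended atomic RMW on one cache line */
            percpu_counter_add(&rlimit_count, 1);

            /*
             * percpu_counter_compare() only folds the per-CPU deltas into
             * an exact sum when the approximate total is close to "max".
             */
            if (percpu_counter_compare(&rlimit_count, max) > 0) {
                    percpu_counter_add(&rlimit_count, -1);
                    return false;
            }
            return true;
    }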

[1] https://github.com/antonblanchard/will-it-scale/blob/master/tests/

---
v2: use percpu_counter instead of caching rlimits.

v1: https://lore.kernel.org/lkml/20250509072054.148257-1-chenridong@huaweicloud.com/

Chen Ridong (2):
  ucounts: free ucount only when count and rlimit are zero
  ucounts: turn the atomic rlimit to percpu_counter

 include/linux/user_namespace.h |  17 +++-
 init/main.c                    |   1 +
 ipc/mqueue.c                   |   6 +-
 kernel/signal.c                |   8 +-
 kernel/ucount.c                | 169 +++++++++++++++++++++++----------
 mm/mlock.c                     |   5 +-
 6 files changed, 138 insertions(+), 68 deletions(-)

-- 
2.34.1
Re: [RFC next v2 0/2] ucounts: turn the atomic rlimit to percpu_counter
Posted by Jann Horn 7 months ago
On Mon, May 19, 2025 at 3:25 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> The will-it-scale test case signal1 [1] has been observed. and the test
> results reveal that the signal sending system call lacks linearity.
> To further investigate this issue, we initiated a series of tests by
> launching varying numbers of dockers and closely monitored the throughput
> of each individual docker. The detailed test outcomes are presented as
> follows:
>
>         | Dockers     |1      |4      |8      |16     |32     |64     |
>         | Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
>
> The data clearly demonstrates a discernible trend: as the quantity of
> dockers increases, the throughput per container progressively declines.

But is that actually a problem? Do you have real workloads that
concurrently send so many signals, or create inotify watches so
quickly, that this has an actual performance impact?

> In-depth analysis has identified the root cause of this performance
> degradation. The ucouts module conducts statistics on rlimit, which
> involves a significant number of atomic operations. These atomic
> operations, when acting on the same variable, trigger a substantial number
> of cache misses or remote accesses, ultimately resulting in a drop in
> performance.

You're probably running into the namespace-associated ucounts here? So
the issue is probably that Docker creates all your containers with the
same owner UID (EUID at namespace creation), causing them all to
account towards a single ucount, while normally outside of containers,
each RUID has its own ucount instance?

Sharing of rlimits between containers is probably normally undesirable
even without the cacheline bouncing, because it means that too much
resource usage in one container can cause resource allocations in
another container to fail... so I think the real problem here is at a
higher level, in the namespace setup code. Maybe root should be able
to create a namespace that doesn't inherit ucount limits of its owner
UID, or something like that...
Re: [RFC next v2 0/2] ucounts: turn the atomic rlimit to percpu_counter
Posted by Chen Ridong 7 months ago

On 2025/5/20 3:32, Jann Horn wrote:
> On Mon, May 19, 2025 at 3:25 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
>> From: Chen Ridong <chenridong@huawei.com>
>>
>> The will-it-scale test case signal1 [1] has been observed. and the test
>> results reveal that the signal sending system call lacks linearity.
>> To further investigate this issue, we initiated a series of tests by
>> launching varying numbers of dockers and closely monitored the throughput
>> of each individual docker. The detailed test outcomes are presented as
>> follows:
>>
>>         | Dockers     |1      |4      |8      |16     |32     |64     |
>>         | Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
>>
>> The data clearly demonstrates a discernible trend: as the quantity of
>> dockers increases, the throughput per container progressively declines.
> 
> But is that actually a problem? Do you have real workloads that
> concurrently send so many signals, or create inotify watches so
> quickly, that this is has an actual performance impact?
> 

Thanks, Jann.
Unfortunately, I do not have a specific real-world scenario.

>> In-depth analysis has identified the root cause of this performance
>> degradation. The ucouts module conducts statistics on rlimit, which
>> involves a significant number of atomic operations. These atomic
>> operations, when acting on the same variable, trigger a substantial number
>> of cache misses or remote accesses, ultimately resulting in a drop in
>> performance.
> 
> You're probably running into the namespace-associated ucounts here? So
> the issue is probably that Docker creates all your containers with the
> same owner UID (EUID at namespace creation), causing them all to
> account towards a single ucount, while normally outside of containers,
> each RUID has its own ucount instance?
> 

Yes, every time the rlimit counts change inside a container, the
parent's counts have to be updated as well with atomic operations, even
though these containers have their own user_namespace.
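
To make the contention concrete, the existing charge path is roughly the
following (heavily simplified from kernel/ucount.c; max tracking and the
return value are omitted):

    static void charge_rlimit(struct ucounts *ucounts, enum rlimit_type type, long v)
    {
            struct ucounts *iter;

            /* every charge walks from the task's ucounts up to the root */
            for (iter = ucounts; iter; iter = iter->ns->ucounts)
                    atomic_long_add_return(v, &iter->rlimit[type]);
    }

All containers whose user namespaces are owned by the same UID share the
ancestor entries in that walk, so these atomics keep bouncing the same
cache lines between CPUs.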

Best regards,
Ridong

> Sharing of rlimits between containers is probably normally undesirable
> even without the cacheline bouncing, because it means that too much
> resource usage in one container can cause resource allocations in
> another container to fail... so I think the real problem here is at a
> higher level, in the namespace setup code. Maybe root should be able
> to create a namespace that doesn't inherit ucount limits of its owner
> UID, or something like that...

Re: [RFC next v2 0/2] ucounts: turn the atomic rlimit to percpu_counter
Posted by Alexey Gladkov 7 months ago
On Mon, May 19, 2025 at 09:32:17PM +0200, Jann Horn wrote:
> On Mon, May 19, 2025 at 3:25 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
> > From: Chen Ridong <chenridong@huawei.com>
> >
> > The will-it-scale test case signal1 [1] has been observed. and the test
> > results reveal that the signal sending system call lacks linearity.
> > To further investigate this issue, we initiated a series of tests by
> > launching varying numbers of dockers and closely monitored the throughput
> > of each individual docker. The detailed test outcomes are presented as
> > follows:
> >
> >         | Dockers     |1      |4      |8      |16     |32     |64     |
> >         | Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
> >
> > The data clearly demonstrates a discernible trend: as the quantity of
> > dockers increases, the throughput per container progressively declines.
> 
> But is that actually a problem? Do you have real workloads that
> concurrently send so many signals, or create inotify watches so
> quickly, that this is has an actual performance impact?
> 
> > In-depth analysis has identified the root cause of this performance
> > degradation. The ucouts module conducts statistics on rlimit, which
> > involves a significant number of atomic operations. These atomic
> > operations, when acting on the same variable, trigger a substantial number
> > of cache misses or remote accesses, ultimately resulting in a drop in
> > performance.
> 
> You're probably running into the namespace-associated ucounts here? So
> the issue is probably that Docker creates all your containers with the
> same owner UID (EUID at namespace creation), causing them all to
> account towards a single ucount, while normally outside of containers,
> each RUID has its own ucount instance?
> 
> Sharing of rlimits between containers is probably normally undesirable
> even without the cacheline bouncing, because it means that too much
> resource usage in one container can cause resource allocations in
> another container to fail... so I think the real problem here is at a
> higher level, in the namespace setup code. Maybe root should be able
> to create a namespace that doesn't inherit ucount limits of its owner
> UID, or something like that...

If we allow rlimits not to be inherited in the userns being created, the
user will be able to bypass their rlimits by running a fork bomb inside
the new userns.

Or did I miss your point?

In init_user_ns, all the rlimits bound to it are set to RLIM_INFINITY,
so root can only reduce rlimits.

https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c#n1091
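
For example, the RLIMIT_NPROC check on the fork path looks roughly like
this (simplified from copy_process(), not an exact quote):

    if (is_rlimit_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC,
                            rlimit(RLIMIT_NPROC))) {
            if (p->real_cred->user != INIT_USER &&
                !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
                    goto bad_fork_cleanup_count;
    }

Because is_rlimit_overlimit() walks the whole ucounts chain up to
init_user_ns, a fork bomb inside a child userns is still capped by the
owner's RLIMIT_NPROC; skipping the inheritance would defeat that check.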

-- 
Rgrds, legion

Re: [RFC next v2 0/2] ucounts: turn the atomic rlimit to percpu_counter
Posted by Jann Horn 7 months ago
On Mon, May 19, 2025 at 11:01 PM Alexey Gladkov <legion@kernel.org> wrote:
> On Mon, May 19, 2025 at 09:32:17PM +0200, Jann Horn wrote:
> > On Mon, May 19, 2025 at 3:25 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
> > > From: Chen Ridong <chenridong@huawei.com>
> > >
> > > The will-it-scale test case signal1 [1] has been observed. and the test
> > > results reveal that the signal sending system call lacks linearity.
> > > To further investigate this issue, we initiated a series of tests by
> > > launching varying numbers of dockers and closely monitored the throughput
> > > of each individual docker. The detailed test outcomes are presented as
> > > follows:
> > >
> > >         | Dockers     |1      |4      |8      |16     |32     |64     |
> > >         | Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
> > >
> > > The data clearly demonstrates a discernible trend: as the quantity of
> > > dockers increases, the throughput per container progressively declines.
> >
> > But is that actually a problem? Do you have real workloads that
> > concurrently send so many signals, or create inotify watches so
> > quickly, that this is has an actual performance impact?
> >
> > > In-depth analysis has identified the root cause of this performance
> > > degradation. The ucouts module conducts statistics on rlimit, which
> > > involves a significant number of atomic operations. These atomic
> > > operations, when acting on the same variable, trigger a substantial number
> > > of cache misses or remote accesses, ultimately resulting in a drop in
> > > performance.
> >
> > You're probably running into the namespace-associated ucounts here? So
> > the issue is probably that Docker creates all your containers with the
> > same owner UID (EUID at namespace creation), causing them all to
> > account towards a single ucount, while normally outside of containers,
> > each RUID has its own ucount instance?
> >
> > Sharing of rlimits between containers is probably normally undesirable
> > even without the cacheline bouncing, because it means that too much
> > resource usage in one container can cause resource allocations in
> > another container to fail... so I think the real problem here is at a
> > higher level, in the namespace setup code. Maybe root should be able
> > to create a namespace that doesn't inherit ucount limits of its owner
> > UID, or something like that...
>
> If we allow rlimits not to be inherited in the userns being created, the
> user will be able to bypass their rlimits by running a fork bomb inside
> the new userns.
>
> Or I missed your point ?

You're right, I guess it would actually still be necessary to have one
shared limit across the entire container, so rather than not having a
namespace-level ucount, maybe it would make more sense to have a
private ucount instance for a container...

(But to be clear I'm not invested in this suggestion at all, I just
looked at that patch and was wondering about alternatives if that is
actually a real performance problem...)
Re: [RFC next v2 0/2] ucounts: turn the atomic rlimit to percpu_counter
Posted by Chen Ridong 7 months ago

On 2025/5/20 5:24, Jann Horn wrote:
> On Mon, May 19, 2025 at 11:01 PM Alexey Gladkov <legion@kernel.org> wrote:
>> On Mon, May 19, 2025 at 09:32:17PM +0200, Jann Horn wrote:
>>> On Mon, May 19, 2025 at 3:25 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
>>>> From: Chen Ridong <chenridong@huawei.com>
>>>>
>>>> The will-it-scale test case signal1 [1] has been observed. and the test
>>>> results reveal that the signal sending system call lacks linearity.
>>>> To further investigate this issue, we initiated a series of tests by
>>>> launching varying numbers of dockers and closely monitored the throughput
>>>> of each individual docker. The detailed test outcomes are presented as
>>>> follows:
>>>>
>>>>         | Dockers     |1      |4      |8      |16     |32     |64     |
>>>>         | Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
>>>>
>>>> The data clearly demonstrates a discernible trend: as the quantity of
>>>> dockers increases, the throughput per container progressively declines.
>>>
>>> But is that actually a problem? Do you have real workloads that
>>> concurrently send so many signals, or create inotify watches so
>>> quickly, that this is has an actual performance impact?
>>>
>>>> In-depth analysis has identified the root cause of this performance
>>>> degradation. The ucouts module conducts statistics on rlimit, which
>>>> involves a significant number of atomic operations. These atomic
>>>> operations, when acting on the same variable, trigger a substantial number
>>>> of cache misses or remote accesses, ultimately resulting in a drop in
>>>> performance.
>>>
>>> You're probably running into the namespace-associated ucounts here? So
>>> the issue is probably that Docker creates all your containers with the
>>> same owner UID (EUID at namespace creation), causing them all to
>>> account towards a single ucount, while normally outside of containers,
>>> each RUID has its own ucount instance?
>>>
>>> Sharing of rlimits between containers is probably normally undesirable
>>> even without the cacheline bouncing, because it means that too much
>>> resource usage in one container can cause resource allocations in
>>> another container to fail... so I think the real problem here is at a
>>> higher level, in the namespace setup code. Maybe root should be able
>>> to create a namespace that doesn't inherit ucount limits of its owner
>>> UID, or something like that...
>>
>> If we allow rlimits not to be inherited in the userns being created, the
>> user will be able to bypass their rlimits by running a fork bomb inside
>> the new userns.
>>
>> Or I missed your point ?
> 
> You're right, I guess it would actually still be necessary to have one
> shared limit across the entire container, so rather than not having a
> namespace-level ucount, maybe it would make more sense to have a
> private ucount instance for a container...
> 

It sounds like these private ucounts are what I was trying to implement
in version 1? That version applied batched counts from the parent for
each user namespace, but the approach was complex.

Best regards,
Ridong

> (But to be clear I'm not invested in this suggestion at all, I just
> looked at that patch and was wondering about alternatives if that is
> actually a real performance problem...)