[PATCH v8 0/7] cgroup/cpuset: Support remote partitions
Posted by Waiman Long 2 years, 3 months ago
 v8:
  - Add a new patch 1 to fix a load balance state problem.
  - Add new test cases to the test script and fix some bugs in error
    handling.

 v7:
  - https://lore.kernel.org/lkml/d0380dfa-ee2e-e492-38e3-31bf6644e511@redhat.com/
  - Fix a compilation problem in patch 1 & a memory allocation bug in
    patch 2.
  - Change exclusive_cpus type to cpumask_var_t to match other cpumasks
    and make code more consistent.

 v6:
  - https://lore.kernel.org/lkml/20230713172601.3285847-1-longman@redhat.com/
  - Add another read-only cpuset.cpus.exclusive.effective control file
    to expose the effective set of exclusive CPUs.
  - Update the documentation and test accordingly.


This patch series introduces two new cpuset control
files, "cpuset.cpus.exclusive" (read-write) and
"cpuset.cpus.exclusive.effective" (read-only), for better control
over which exclusive CPUs are distributed down the cgroup hierarchy
for creating cpuset partitions.

Each exclusive CPU can be distributed to at most one child cpuset.
Input to "cpuset.cpus.exclusive" that violates this sibling
exclusivity rule will be rejected. These new control files have no
effect on the behavior of a cpuset until it becomes a partition root.
At that point, its effective CPUs will be set to its exclusive CPUs,
except for any that are offline.
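
As a minimal sketch of the new files in a local partition (assuming
cgroup v2 is mounted at /sys/fs/cgroup, the cpuset controller is
enabled in the root's cgroup.subtree_control, and CPUs 0-3 are
online; the cgroup names "part1" and "part2" are made up for
illustration):

  cd /sys/fs/cgroup
  mkdir part1 part2
  echo 0-1 > part1/cpuset.cpus
  echo 0-1 > part1/cpuset.cpus.exclusive       # reserve CPUs 0-1
  echo root > part1/cpuset.cpus.partition      # make part1 a partition root
  cat part1/cpuset.cpus.exclusive.effective    # expected: 0-1
  echo 0-3 > part2/cpuset.cpus
  echo 1-2 > part2/cpuset.cpus.exclusive       # expected to be rejected:
                                               # CPU 1 overlaps part1's
                                               # exclusive set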

This patch series also introduces a new category of cpuset partition
called remote partitions. The existing partition category, where the
partition roots have to be clustered around the root cgroup in a
hierarchical way, is now referred to as local partitions.

A remote partition can be formed far from the root cgroup and does
not need a partition root parent. While a local partition can be
created without touching "cpuset.cpus.exclusive", since that file is
set automatically when a cpuset becomes a local partition root,
creating a remote partition requires properly setting
"cpuset.cpus.exclusive" values down the hierarchy.

Both scheduling and isolated partitions can be formed as a remote
partition. A local partition can be created under a remote partition.
A remote partition, however, cannot be formed under a local partition
for now.
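
For example, a remote isolated partition two levels below the root
could be set up as follows. This is only a sketch: it assumes cgroup
v2 is mounted at /sys/fs/cgroup with the cpuset controller enabled at
the root, CPUs 0-3 are online, and the cgroup names "mid" and "cont"
are made up for illustration.

  cd /sys/fs/cgroup
  mkdir mid
  echo 0-3 > mid/cpuset.cpus
  echo 2-3 > mid/cpuset.cpus.exclusive          # CPUs to pass down
  echo +cpuset > mid/cgroup.subtree_control
  mkdir mid/cont
  echo 2-3 > mid/cont/cpuset.cpus
  echo 2-3 > mid/cont/cpuset.cpus.exclusive
  echo isolated > mid/cont/cpuset.cpus.partition
  cat mid/cont/cpuset.cpus.partition            # expected: isolated
  cat mid/cpuset.cpus.effective                 # expected: 0-1; mid can
                                                # still run tasks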

Modern container orchestration tools like Kubernetes use the cgroup
hierarchy to manage different containers, and they rely on other
middleware like systemd to help manage it. If a container needs to
use isolated CPUs, it is hard to get them with local partitions, as
that requires the administrative parent cgroup to be a partition root
too, which tools like systemd may not be ready to manage.

With this patch series, we allow the creation of remote partitions
far from the root. A container management tool can manage the
"cpuset.cpus.exclusive" file without impacting the other cpuset
files that are managed by other middleware. Of course, invalid
"cpuset.cpus.exclusive" values will be rejected.

Waiman Long (7):
  cgroup/cpuset: Fix load balance state in update_partition_sd_lb()
  cgroup/cpuset: Add cpuset.cpus.exclusive.effective for v2
  cgroup/cpuset: Add cpuset.cpus.exclusive for v2
  cgroup/cpuset: Introduce remote partition
  cgroup/cpuset: Check partition conflict with housekeeping setup
  cgroup/cpuset: Documentation update for partition
  cgroup/cpuset: Extend test_cpuset_prs.sh to test remote partition

 Documentation/admin-guide/cgroup-v2.rst       |  123 +-
 kernel/cgroup/cpuset.c                        | 1279 ++++++++++++-----
 .../selftests/cgroup/test_cpuset_prs.sh       |  458 ++++--
 3 files changed, 1366 insertions(+), 494 deletions(-)

-- 
2.31.1
Re: [PATCH v8 0/7] cgroup/cpuset: Support remote partitions
Posted by Michal Koutný 2 years, 2 months ago
Hello.

(I know this is heading for 6.7. Still, I wanted to have a look at
this after it stabilized somewhat, to understand the new concept
better, but I still have some questions below.)

On Tue, Sep 05, 2023 at 09:32:36AM -0400, Waiman Long <longman@redhat.com> wrote:
> Both scheduling and isolated partitions can be formed as a remote
> partition. A local partition can be created under a remote partition.
> A remote partition, however, cannot be formed under a local partition
> for now.
> 
> 
> With this patch series, we allow the creation of remote partitions
> far from the root. A container management tool can manage the
> "cpuset.cpus.exclusive" file without impacting the other cpuset
> files that are managed by other middleware. Of course, invalid
> "cpuset.cpus.exclusive" values will be rejected.

Take the example of a nested cgroup `cont` to which I want to
dedicate two CPUs (0 and 1).
IIUC, I can do this either with a chain of local partition roots or
as a single remote partition.


[chain]
  root
  |                           \
  mid1a                        mid1b
   cpuset.cpus=0-1              cpuset.cpus=2-15
   cpuset.cpus.partition=root   
  |
  mid2
   cpuset.cpus=0-1
   cpuset.cpus.partition=root
  |
  cont
   cpuset.cpus=0-1
   cpuset.cpus.partition=root


[remote]
  root
  |                           \
  mid1a                        mid1b
   cpuset.cpus.exclusive=0-1    cpuset.cpus=2-15
  |
  mid2
   cpuset.cpus.exclusive=0-1
  |
  cont
   cpuset.cpus.exclusive=0-1
   cpuset.cpus.partition=root

In the former case I must configure cpuset.cpus and
cpuset.cpus.partition along the whole path; in the latter case, I set
cpuset.cpus.exclusive along the whole path but partition=root only at
the bottom.

What is the difference between the two configs above?
(Or can you please give an example where the remote partitions are
better illustrated?)

<snip>
> Modern container orchestration tools like Kubernetes use the cgroup
> hierarchy to manage different containers, and they rely on other
> middleware like systemd to help manage it. If a container needs to
> use isolated CPUs, it is hard to get them with local partitions, as
> that requires the administrative parent cgroup to be a partition root
> too, which tools like systemd may not be ready to manage.

Such tools aren't ready to manage cpuset.cpus.exclusive, are they?
IOW, tools need to distinguish exclusive and "shared" CPUs, which is
equivalent to distinguishing root and member partitions.

Thanks,
Michal


Re: [PATCH v8 0/7] cgroup/cpuset: Support remote partitions
Posted by Waiman Long 2 years, 2 months ago
On 10/13/23 11:50, Michal Koutný wrote:
> Hello.
>
> (I know this is heading for 6.7. Still I wanted to have a look at this
> after it stabilized somehow to understand the new concept better but I
> still have some questions below.)
>
> On Tue, Sep 05, 2023 at 09:32:36AM -0400, Waiman Long <longman@redhat.com> wrote:
>> Both scheduling and isolated partitions can be formed as a remote
>> partition. A local partition can be created under a remote partition.
>> A remote partition, however, cannot be formed under a local partition
>> for now.
>>
>>
>> With this patch series, we allow the creation of remote partitions
>> far from the root. A container management tool can manage the
>> "cpuset.cpus.exclusive" file without impacting the other cpuset
>> files that are managed by other middleware. Of course, invalid
>> "cpuset.cpus.exclusive" values will be rejected.
> Take the example of a nested cgroup `cont` to which I want to
> dedicate two CPUs (0 and 1).
> IIUC, I can do this either with a chain of local partition roots or
> as a single remote partition.
>
>
> [chain]
>    root
>    |                           \
>    mid1a                        mid1b
>     cpuset.cpus=0-1              cpuset.cpus=2-15
>     cpuset.cpus.partition=root
>    |
>    mid2
>     cpuset.cpus=0-1
>     cpuset.cpus.partition=root
>    |
>    cont
>     cpuset.cpus=0-1
>     cpuset.cpus.partition=root
In this case, the effective CPUs of both mid1a and mid2 will be empty. 
IOW, you can't have any task in these 2 cpusets.
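
A quick way to see this (a sketch, assuming the [chain] config above
has been applied under /sys/fs/cgroup):

  cat mid1a/cpuset.cpus.effective   # expected: empty
  echo $$ > mid1a/cgroup.procs      # expected to fail: a task cannot be
                                    # attached to a partition root with
                                    # no effective CPUs
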
>
> [remote]
>    root
>    |                           \
>    mid1a                        mid1b
>     cpuset.cpus.exclusive=0-1    cpuset.cpus=2-15
>    |
>    mid2
>     cpuset.cpus.exclusive=0-1
>    |
>    cont
>     cpuset.cpus.exclusive=0-1
>     cpuset.cpus.partition=root
>
> In the former case I must configure cpuset.cpus and
> cpuset.cpus.partition along the whole path; in the latter case, I set
> cpuset.cpus.exclusive along the whole path but partition=root only at
> the bottom.
>
> What is the difference between the two configs above?
> (Or can you please give an example where the remote partitions are
> better illustrated?)

For the remote case, you can have tasks in the intermediate cpusets
mid1a and mid2 as long as cpuset.cpus contains more CPUs than
cpuset.cpus.exclusive.
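
For instance, a sketch following the [remote] diagram, with
cpuset.cpus added to the intermediate cpusets (assuming CPUs 0-3 are
online):

  echo 0-3 > mid1a/cpuset.cpus            # 2 more CPUs than the exclusive set
  echo 0-1 > mid1a/cpuset.cpus.exclusive
  # ... likewise for mid2; then cont becomes a remote partition on 0-1 ...
  cat mid1a/cpuset.cpus.effective         # expected: 2-3, still usable
                                          # by tasks in mid1a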


> <snip>
>> Modern container orchestration tools like Kubernetes use the cgroup
>> hierarchy to manage different containers, and they rely on other
>> middleware like systemd to help manage it. If a container needs to
>> use isolated CPUs, it is hard to get them with local partitions, as
>> that requires the administrative parent cgroup to be a partition root
>> too, which tools like systemd may not be ready to manage.
> Such tools aren't ready to manage cpuset.cpus.exclusive, are they?
> IOW, tools need to distinguish exclusive and "shared" CPUs, which is
> equivalent to distinguishing root and member partitions.

They will be ready eventually. This requirement for remote partitions
actually came from our OpenShift team, as the use of just local
partitions did not meet their needs. They don't need access to
exclusive CPUs in the parent cgroup layer for their management
daemons, but they do need to activate isolated partitions in selected
child cgroups to support our Telco customers running workloads like
DPDK.

So they will add the support to upstream Kubernetes.

Cheers,
Longman


Re: [PATCH v8 0/7] cgroup/cpuset: Support remote partitions
Posted by Michal Koutný 2 years, 1 month ago
On Fri, Oct 13, 2023 at 12:03:18PM -0400, Waiman Long <longman@redhat.com> wrote:
> > [chain]
> >    root
> >    |                           \
> >    mid1a                        mid1b
> >     cpuset.cpus=0-1              cpuset.cpus=2-15
> >     cpuset.cpus.partition=root
> >    |
> >    mid2
> >     cpuset.cpus=0-1
> >     cpuset.cpus.partition=root
> >    |
> >    cont
> >     cpuset.cpus=0-1
> >     cpuset.cpus.partition=root
> In this case, the effective CPUs of both mid1a and mid2 will be empty. IOW,
> you can't have any task in these 2 cpusets.

I see, that is relevant only to a threaded subtree, where the admin /
app knows how to distribute CPUs and place threads on internal nodes.

> For the remote case, you can have tasks in the intermediate cpusets
> mid1a and mid2 as long as cpuset.cpus contains more CPUs than
> cpuset.cpus.exclusive.

It's obvious that cpuset.cpus.exclusive should be exclusive among
siblings.
Should it also be so along the vertical path?

  root
  |                           
  mid1a                       
   cpuset.cpus=0-2
   cpuset.cpus.exclusive=0    
  |
  mid2
   cpuset.cpus=0-2
   cpuset.cpus.exclusive=1
  |
  cont
   cpuset.cpus=0-2
   cpuset.cpus.exclusive=2
   cpuset.cpus.partition=root

IIUC, this should be a valid config regardless of cpuset.cpus.partition
setting on mid1a and mid2.
Whereas

  root
  |                           
  mid1a                       
   cpuset.cpus=0-2
   cpuset.cpus.exclusive=0    
  |
  mid2
   cpuset.cpus=0-2
   cpuset.cpus.exclusive=1-2
   cpuset.cpus.partition=root
  |
  cont
   cpuset.cpus=1-2
   cpuset.cpus.exclusive=1-2
   cpuset.cpus.partition=root

Here, I'm hesitating, will mid2 have any exclusively owned cpus?

(I have flashes of understanding cpus.exclusive as being a more
expressive mechanism than partitions. OTOH, it seems non-intuitive when
both are combined, thus I'm asking to internalize it better.
Should partitions be deprecated for simplicity? They're still good to
provide the notification mechanism of invalidation;
cpuset.cpus.exclusive.effective doesn't have that.)

> They will be ready eventually. This requirement for remote partitions
> actually came from our OpenShift team, as the use of just local
> partitions did not meet their needs. They don't need access to
> exclusive CPUs in the parent cgroup layer for their management
> daemons, but they do need to activate isolated partitions in selected
> child cgroups to support our Telco customers running workloads like
> DPDK.
> 
> So they will add the support to upstream Kubernetes.

Is it worth implementing anything touching (ancestral)
cpuset.cpus.partition then?

Thanks,
Michal
Re: [PATCH v8 0/7] cgroup/cpuset: Support remote partitions
Posted by Waiman Long 2 years, 1 month ago
On 10/24/23 12:13, Michal Koutný wrote:
> On Fri, Oct 13, 2023 at 12:03:18PM -0400, Waiman Long <longman@redhat.com> wrote:
>>> [chain]
>>>     root
>>>     |                           \
>>>     mid1a                        mid1b
>>>      cpuset.cpus=0-1              cpuset.cpus=2-15
>>>      cpuset.cpus.partition=root
>>>     |
>>>     mid2
>>>      cpuset.cpus=0-1
>>>      cpuset.cpus.partition=root
>>>     |
>>>     cont
>>>      cpuset.cpus=0-1
>>>      cpuset.cpus.partition=root
>> In this case, the effective CPUs of both mid1a and mid2 will be empty. IOW,
>> you can't have any task in these 2 cpusets.
> I see, that is relevant only to a threaded subtree, where the admin /
> app knows how to distribute CPUs and place threads on internal nodes.
>
>> For the remote case, you can have tasks in the intermediate cpusets
>> mid1a and mid2 as long as cpuset.cpus contains more CPUs than
>> cpuset.cpus.exclusive.
> It's obvious that cpuset.cpus.exclusive should be exclusive among
> siblings.
> Should it also be so along the vertical path?

Sorry for the late reply. I forgot to respond earlier.

We don't support that vertical exclusion check; cgroup v1's
cpuset.cpu_exclusive doesn't have it either.
>    root
>    |
>    mid1a
>     cpuset.cpus=0-2
>     cpuset.cpus.exclusive=0
>    |
>    mid2
>     cpuset.cpus=0-2
>     cpuset.cpus.exclusive=1
>    |
>    cont
>     cpuset.cpus=0-2
>     cpuset.cpus.exclusive=2
>     cpuset.cpus.partition=root
>
> IIUC, this should be a valid config regardless of cpuset.cpus.partition
> setting on mid1a and mid2.
> Whereas
>
>    root
>    |
>    mid1a
>     cpuset.cpus=0-2
>     cpuset.cpus.exclusive=0
>    |
>    mid2
>     cpuset.cpus=0-2
>     cpuset.cpus.exclusive=1-2
>     cpuset.cpus.partition=root
>    |
>    cont
>     cpuset.cpus=1-2
>     cpuset.cpus.exclusive=1-2
>     cpuset.cpus.partition=root
>
> Here, I'm hesitating, will mid2 have any exclusively owned cpus?
>
> (I have flashes of understanding cpus.exclusive as being a more
> expressive mechanism than partitions. OTOH, it seems non-intuitive when
> both are combined, thus I'm asking to internalize it better.
> Should partitions be deprecated for simplicity? They're still good to
> provide the notification mechanism of invalidation;
> cpuset.cpus.exclusive.effective doesn't have that.)

cpuset.cpus.exclusive follows the same hierarchical rule as
cpuset.cpus. IOW, the CPUs in cpuset.cpus.exclusive will be ignored
if they are not present in its ancestor nodes. The value in
cpuset.cpus.exclusive shows the intent of the user, while
cpuset.cpus.exclusive.effective shows the real exclusive CPUs when a
partition is enabled. So we just can't use cpuset.cpus.exclusive as a
replacement for cpuset.cpus.partition.

As a result, we can't actually support the vertical CPU exclusion you
suggest above.
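
As a small sketch of that filtering (hypothetical cgroups A and A/B,
with the cpuset controller enabled):

  echo 0-1 > A/cpuset.cpus.exclusive
  echo 2-3 > A/B/cpuset.cpus.exclusive      # 2-3 not in A's exclusive set
  cat A/B/cpuset.cpus.exclusive             # user intent kept: 2-3
  cat A/B/cpuset.cpus.exclusive.effective   # expected: empty, since the
                                            # ancestor filters out 2-3
  echo root > A/B/cpuset.cpus.partition     # expected to fail or become
                                            # invalid: no usable
                                            # exclusive CPUs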

>
>> They will be ready eventually. This requirement for remote partitions
>> actually came from our OpenShift team, as the use of just local
>> partitions did not meet their needs. They don't need access to
>> exclusive CPUs in the parent cgroup layer for their management
>> daemons, but they do need to activate isolated partitions in selected
>> child cgroups to support our Telco customers running workloads like
>> DPDK.
>>
>> So they will add the support to upstream Kubernetes.
> Is it worth implementing anything touching (ancestral)
> cpuset.cpus.partition then?

I don't quite get what you want to ask here.

Cheers,
Longman

Re: [PATCH v8 0/7] cgroup/cpuset: Support remote partitions
Posted by Tejun Heo 2 years, 2 months ago
Applied the series to cgroup/for-6.7.

Thanks.

-- 
tejun