[PATCH v6] cpuset: Avoid invalidating sibling partitions on cpuset.cpus conflict.

Sun Shaojie posted 1 patch 2 months, 1 week ago
kernel/cgroup/cpuset-internal.h               |  3 +
kernel/cgroup/cpuset-v1.c                     | 19 ++++++
kernel/cgroup/cpuset.c                        | 60 ++++++++++++-------
.../selftests/cgroup/test_cpuset_prs.sh       | 12 ++--
4 files changed, 65 insertions(+), 29 deletions(-)
[PATCH v6] cpuset: Avoid invalidating sibling partitions on cpuset.cpus conflict.
Posted by Sun Shaojie 2 months, 1 week ago
Currently, when setting a cpuset's cpuset.cpus to a value that conflicts
with its sibling partition, the sibling's partition state becomes invalid.
However, this invalidation is often unnecessary.

For example: On a machine with 128 CPUs, there are m (m < 128) cpusets
under the root cgroup. Each cpuset is used by a single user(user-1 use
A1, ... , user-m use Am), and the partition states of these cpusets are
configured as follows:

                           root cgroup
        /             /                  \                 \
       A1            A2        ...       An                Am
     (root)        (root)      ...     (root) (root/root invalid/member)

Assume that A1 through Am have not set cpuset.cpus.exclusive. When
user-m modifies Am's cpuset.cpus to "0-127", it will cause all partition
states from A1 to An to change from root to root invalid, as shown
below.

                           root cgroup
        /              /                 \                 \
       A1             A2       ...       An                Am
 (root invalid) (root invalid) ... (root invalid) (root invalid/member)

This outcome is entirely undeserved for all users from A1 to An.

This patch prevents such outcomes by ensuring that modifications to
cpuset.cpus do not affect the partition state of other sibling cpusets.
Therefore, with this patch applied, when user-m configures Am's
cpuset.cpus to "0-127", the result will be as follows.

                           root cgroup
        /             /                  \                 \
       A1            A2        ...       An                Am
     (root)        (root)      ...     (root)     (root invalid/member)

It is worth noting that, since this patch enforces the exclusivity of
sibling cpusets, setting exclusivity now follows a "first-come,
first-served" principle.

For example, consider the following four steps: before applying this
patch, regardless of the order in which they are executed, the final
partition state of both A1 and B1 would always be "root invalid."

 Step                                       | A1's prstate | B1's prstate |
 #1> echo "0-1" > A1/cpuset.cpus            | member       | member       |
 #2> echo "root" > A1/cpuset.cpus.partition | root         | member       |
 #3> echo "1-2" > B1/cpuset.cpus            | root invalid | member       |
 #4> echo "root" > B1/cpuset.cpus.partition | root invalid | root invalid |

After applying this patch, the first party to set "root" will maintain
its exclusive validity. As follows:

 Step                                       | A1's prstate | B1's prstate |
 #1> echo "0-1" > A1/cpuset.cpus            | member       | member       |
 #2> echo "root" > A1/cpuset.cpus.partition | root         | member       |
 #3> echo "1-2" > B1/cpuset.cpus            | root         | member       |
 #4> echo "root" > B1/cpuset.cpus.partition | root         | root invalid |

 Step                                       | A1's prstate | B1's prstate |
 #1> echo "0-1" > B1/cpuset.cpus            | member       | member       |
 #2> echo "root" > B1/cpuset.cpus.partition | member       | root         |
 #3> echo "1-2" > A1/cpuset.cpus            | member       | root         |
 #4> echo "root" > A1/cpuset.cpus.partition | root invalid | root         |

In summary, if the current cpuset conflicts with its sibling cpusets on
exclusive CPUs (If a cpuset is exclusive and its exclusive CPUs are empty,
its allowed CPUs will be treated as exclusive CPUs), only the current
cpuset should bear the consequences.

Signed-off-by: Sun Shaojie <sunshaojie@kylinos.cn>
---
 kernel/cgroup/cpuset-internal.h               |  3 +
 kernel/cgroup/cpuset-v1.c                     | 19 ++++++
 kernel/cgroup/cpuset.c                        | 60 ++++++++++++-------
 .../selftests/cgroup/test_cpuset_prs.sh       | 12 ++--
 4 files changed, 65 insertions(+), 29 deletions(-)

diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index 337608f408ce..c53111998432 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -292,6 +292,7 @@ void cpuset1_hotplug_update_tasks(struct cpuset *cs,
 			    struct cpumask *new_cpus, nodemask_t *new_mems,
 			    bool cpus_updated, bool mems_updated);
 int cpuset1_validate_change(struct cpuset *cur, struct cpuset *trial);
+bool cpuset1_cpus_excl_conflict(struct cpuset *cs1, struct cpuset *cs2);
 #else
 static inline void fmeter_init(struct fmeter *fmp) {}
 static inline void cpuset1_update_task_spread_flags(struct cpuset *cs,
@@ -302,6 +303,8 @@ static inline void cpuset1_hotplug_update_tasks(struct cpuset *cs,
 			    bool cpus_updated, bool mems_updated) {}
 static inline int cpuset1_validate_change(struct cpuset *cur,
 				struct cpuset *trial) { return 0; }
+static inline bool cpuset1_cpus_excl_conflict(struct cpuset *cs1,
+				struct cpuset *cs2) {return false; }
 #endif /* CONFIG_CPUSETS_V1 */
 
 #endif /* __CPUSET_INTERNAL_H */
diff --git a/kernel/cgroup/cpuset-v1.c b/kernel/cgroup/cpuset-v1.c
index 12e76774c75b..5aa0ac092ef6 100644
--- a/kernel/cgroup/cpuset-v1.c
+++ b/kernel/cgroup/cpuset-v1.c
@@ -373,6 +373,25 @@ int cpuset1_validate_change(struct cpuset *cur, struct cpuset *trial)
 	return ret;
 }
 
+/*
+ * cpuset1_cpus_excl_conflict() - Check if two cpusets have exclusive CPU conflicts
+ *                                to legacy (v1)
+ * @cs1: first cpuset to check
+ * @cs2: second cpuset to check
+ *
+ * Returns: true if CPU exclusivity conflict exists, false otherwise
+ *
+ * If either cpuset is CPU exclusive, their allowed CPUs cannot intersect.
+ */
+bool cpuset1_cpus_excl_conflict(struct cpuset *cs1, struct cpuset *cs2)
+{
+	if (is_cpu_exclusive(cs1) || is_cpu_exclusive(cs2))
+		return cpumask_intersects(cs1->cpus_allowed,
+					  cs2->cpus_allowed);
+
+	return false;
+}
+
 #ifdef CONFIG_PROC_PID_CPUSET
 /*
  * proc_cpuset_show()
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 52468d2c178a..e58dd26e074a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -586,14 +586,24 @@ static inline bool cpusets_are_exclusive(struct cpuset *cs1, struct cpuset *cs2)
  * Returns: true if CPU exclusivity conflict exists, false otherwise
  *
  * Conflict detection rules:
- * 1. If either cpuset is CPU exclusive, they must be mutually exclusive
+ * For cgroup-v1:
+ *     see cpuset1_cpus_excl_conflict()
+ * For cgroup-v2:
+ * 1. If both cs1 and cs2 are exclusive, cs1 and cs2 must be mutually exclusive
  * 2. exclusive_cpus masks cannot intersect between cpusets
  * 3. The allowed CPUs of one cpuset cannot be a subset of another's exclusive CPUs
+ * 4. If a cpuset is exclusive and its exclusive CPUs are empty, its allowed CPUs
+ *    will be treated as exclusive CPUs; therefore, its allowed CPUs must not
+ *    intersect with another's exclusive CPUs.
  */
 static inline bool cpus_excl_conflict(struct cpuset *cs1, struct cpuset *cs2)
 {
-	/* If either cpuset is exclusive, check if they are mutually exclusive */
-	if (is_cpu_exclusive(cs1) || is_cpu_exclusive(cs2))
+	/* For cgroup-v1 */
+	if (!cpuset_v2())
+		return cpuset1_cpus_excl_conflict(cs1, cs2);
+
+	/* If cpusets are exclusive, check if they are mutually exclusive*/
+	if (is_cpu_exclusive(cs1) && is_cpu_exclusive(cs2))
 		return !cpusets_are_exclusive(cs1, cs2);
 
 	/* Exclusive_cpus cannot intersect */
@@ -609,6 +619,20 @@ static inline bool cpus_excl_conflict(struct cpuset *cs1, struct cpuset *cs2)
 	    cpumask_subset(cs2->cpus_allowed, cs1->exclusive_cpus))
 		return true;
 
+	/*
+	 * When a cpuset is exclusive and its exclusive CPUs are empty,
+	 * its cpus_allowed cannot intersect with another cpuset's exclusive_cpus.
+	 */
+	if (is_cpu_exclusive(cs1) &&
+	    cpumask_empty(cs1->exclusive_cpus) &&
+	    cpumask_intersects(cs1->cpus_allowed, cs2->exclusive_cpus))
+		return true;
+
+	if (is_cpu_exclusive(cs2) &&
+	    cpumask_empty(cs2->exclusive_cpus) &&
+	    cpumask_intersects(cs2->cpus_allowed, cs1->exclusive_cpus))
+		return true;
+
 	return false;
 }
 
@@ -2411,34 +2435,17 @@ static int cpus_allowed_validate_change(struct cpuset *cs, struct cpuset *trialc
 					struct tmpmasks *tmp)
 {
 	int retval;
-	struct cpuset *parent = parent_cs(cs);
 
 	retval = validate_change(cs, trialcs);
 
 	if ((retval == -EINVAL) && cpuset_v2()) {
-		struct cgroup_subsys_state *css;
-		struct cpuset *cp;
-
 		/*
 		 * The -EINVAL error code indicates that partition sibling
 		 * CPU exclusivity rule has been violated. We still allow
 		 * the cpumask change to proceed while invalidating the
-		 * partition. However, any conflicting sibling partitions
-		 * have to be marked as invalid too.
+		 * partition.
 		 */
 		trialcs->prs_err = PERR_NOTEXCL;
-		rcu_read_lock();
-		cpuset_for_each_child(cp, css, parent) {
-			struct cpumask *xcpus = user_xcpus(trialcs);
-
-			if (is_partition_valid(cp) &&
-			    cpumask_intersects(xcpus, cp->effective_xcpus)) {
-				rcu_read_unlock();
-				update_parent_effective_cpumask(cp, partcmd_invalidate, NULL, tmp);
-				rcu_read_lock();
-			}
-		}
-		rcu_read_unlock();
 		retval = 0;
 	}
 	return retval;
@@ -2506,8 +2513,15 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (alloc_tmpmasks(&tmp))
 		return -ENOMEM;
 
-	compute_trialcs_excpus(trialcs, cs);
-	trialcs->prs_err = PERR_NONE;
+	/*
+	 * if there is exclusive CPUs conflict with the siblings,
+	 * we still allow the cpumask change to proceed while
+	 * invalidating the partition.
+	 */
+	if (compute_trialcs_excpus(trialcs, cs))
+		trialcs->prs_err = PERR_NOTEXCL;
+	else
+		trialcs->prs_err = PERR_NONE;
 
 	retval = cpus_allowed_validate_change(cs, trialcs, &tmp);
 	if (retval < 0)
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index a17256d9f88a..75154e22c702 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -269,7 +269,7 @@ TEST_MATRIX=(
 	" C0-3:S+ C1-3:S+ C2-3     .    X2-3   X3:P2    .      .     0 A1:0-2|A2:3|A3:3 A1:P0|A2:P2 3"
 	" C0-3:S+ C1-3:S+ C2-3     .    X2-3   X2-3  X2-3:P2   .     0 A1:0-1|A2:1|A3:2-3 A1:P0|A3:P2 2-3"
 	" C0-3:S+ C1-3:S+ C2-3     .    X2-3   X2-3 X2-3:P2:C3 .     0 A1:0-1|A2:1|A3:2-3 A1:P0|A3:P2 2-3"
-	" C0-3:S+ C1-3:S+ C2-3   C2-3     .      .      .      P2    0 A1:0-3|A2:1-3|A3:2-3|B1:2-3 A1:P0|A3:P0|B1:P-2"
+	" C0-3:S+ C1-3:S+ C2-3   C2-3     .      .      .      P2    0 A1:0-1|A2:1|A3:1|B1:2-3 A1:P0|A3:P0|B1:P2"
 	" C0-3:S+ C1-3:S+ C2-3   C4-5     .      .      .      P2    0 B1:4-5 B1:P2 4-5"
 	" C0-3:S+ C1-3:S+ C2-3    C4    X2-3   X2-3  X2-3:P2   P2    0 A3:2-3|B1:4 A3:P2|B1:P2 2-4"
 	" C0-3:S+ C1-3:S+ C2-3    C4    X2-3   X2-3 X2-3:P2:C1-3 P2  0 A3:2-3|B1:4 A3:P2|B1:P2 2-4"
@@ -318,7 +318,7 @@ TEST_MATRIX=(
 	# Invalid to valid local partition direct transition tests
 	" C1-3:S+:P2 X4:P2  .      .      .      .      .      .     0 A1:1-3|XA1:1-3|A2:1-3:XA2: A1:P2|A2:P-2 1-3"
 	" C1-3:S+:P2 X4:P2  .      .      .    X3:P2    .      .     0 A1:1-2|XA1:1-3|A2:3:XA2:3 A1:P2|A2:P2 1-3"
-	"  C0-3:P2   .      .    C4-6   C0-4     .      .      .     0 A1:0-4|B1:4-6 A1:P-2|B1:P0"
+	"  C0-3:P2   .      .    C4-6   C0-4     .      .      .     0 A1:0-4|B1:5-6 A1:P2|B1:P0"
 	"  C0-3:P2   .      .    C4-6 C0-4:C0-3  .      .      .     0 A1:0-3|B1:4-6 A1:P2|B1:P0 0-3"
 
 	# Local partition invalidation tests
@@ -388,10 +388,10 @@ TEST_MATRIX=(
 	"  C0-1:S+  C1      .    C2-3     .      P2     .      .     0 A1:0-1|A2:1 A1:P0|A2:P-2"
 	"  C0-1:S+ C1:P2    .    C2-3     P1     .      .      .     0 A1:0|A2:1 A1:P1|A2:P2 0-1|1"
 
-	# A non-exclusive cpuset.cpus change will invalidate partition and its siblings
-	"  C0-1:P1   .      .    C2-3   C0-2     .      .      .     0 A1:0-2|B1:2-3 A1:P-1|B1:P0"
-	"  C0-1:P1   .      .  P1:C2-3  C0-2     .      .      .     0 A1:0-2|B1:2-3 A1:P-1|B1:P-1"
-	"   C0-1     .      .  P1:C2-3  C0-2     .      .      .     0 A1:0-2|B1:2-3 A1:P0|B1:P-1"
+	# A non-exclusive cpuset.cpus change will not invalidate its siblings partition.
+	"  C0-1:P1   .      .    C2-3   C0-2     .      .      .     0 A1:0-2|B1:3 A1:P1|B1:P0"
+	"  C0-1:P1   .      .  P1:C2-3  C0-2     .      .      .     0 A1:0-1|B1:2-3 A1:P-1|B1:P1"
+	"   C0-1     .      .  P1:C2-3  C0-2     .      .      .     0 A1:0-1|B1:2-3 A1:P0|B1:P1"
 
 	# cpuset.cpus can overlap with sibling cpuset.cpus.exclusive but not subsumed by it
 	"   C0-3     .      .    C4-5     X5     .      .      .     0 A1:0-3|B1:4-5"
-- 
2.25.1
Re: [PATCH v6] cpuset: Avoid invalidating sibling partitions on cpuset.cpus conflict.
Posted by Michal Koutný 1 month, 2 weeks ago
Hello Shaojie.

On Mon, Dec 01, 2025 at 05:38:06PM +0800, Sun Shaojie <sunshaojie@kylinos.cn> wrote:
> Currently, when setting a cpuset's cpuset.cpus to a value that conflicts
> with its sibling partition, the sibling's partition state becomes invalid.
> However, this invalidation is often unnecessary.
> 
> For example: On a machine with 128 CPUs, there are m (m < 128) cpusets
> under the root cgroup. Each cpuset is used by a single user(user-1 use
> A1, ... , user-m use Am), and the partition states of these cpusets are
> configured as follows:
> 
>                            root cgroup
>         /             /                  \                 \
>        A1            A2        ...       An                Am
>      (root)        (root)      ...     (root) (root/root invalid/member)
> 
> Assume that A1 through Am have not set cpuset.cpus.exclusive. When
> user-m modifies Am's cpuset.cpus to "0-127", it will cause all partition
> states from A1 to An to change from root to root invalid, as shown
> below.
> 
>                            root cgroup
>         /              /                 \                 \
>        A1             A2       ...       An                Am
>  (root invalid) (root invalid) ... (root invalid) (root invalid/member)
> 
> This outcome is entirely undeserved for all users from A1 to An.

s/cpuset.cpus/memory.max/ 

When the permissions are such that the last (any) sibling can come and
claim so much to cause overcommit, then it can set up large limit and
(potentially) reclaim from others.

s/cpuset.cpus/memory.min/

Here is the overcommit approached by recalculating effective values of
memory.min, again one sibling can skew toward itself and reduce every
other's effective value.

Above are not exact analogies because first of them is Limits, the
second is Protections and cpusets are Allocations (refering to Resource
Distribution Models from Documentation/admin-guide/cgroup-v2.rst).

But the advice to get some guarantees would be same in all cases -- if
some guarantees are expected, the permissions (of respective cgroup
attributes) should be configured so that it decouples the owner of the
cgroup from the owner of the resource (i.e. Ai/cpuset.cpus belongs to
root or there's a middle level cgroup that'd cap each of the siblings
individually).


> After applying this patch, the first party to set "root" will maintain
> its exclusive validity. As follows:
> 
>  Step                                       | A1's prstate | B1's prstate |
>  #1> echo "0-1" > A1/cpuset.cpus            | member       | member       |
>  #2> echo "root" > A1/cpuset.cpus.partition | root         | member       |
>  #3> echo "1-2" > B1/cpuset.cpus            | root         | member       |
>  #4> echo "root" > B1/cpuset.cpus.partition | root         | root invalid |
> 
>  Step                                       | A1's prstate | B1's prstate |
>  #1> echo "0-1" > B1/cpuset.cpus            | member       | member       |
>  #2> echo "root" > B1/cpuset.cpus.partition | member       | root         |
>  #3> echo "1-2" > A1/cpuset.cpus            | member       | root         |
>  #4> echo "root" > A1/cpuset.cpus.partition | root invalid | root         |

I'm worried that the ordering dependency would lead to situations where
users may not be immediately aware their config is overcommitting the system.
Consider that CPUs are vital for A1 but B1 can somehow survive the
degraded state, depending on the starting order the system may either
run fine (A1 valid) or fail because of A1.

I'm curious about Waiman's take.

Thanks,
Michal
Re: [PATCH v6] cpuset: Avoid invalidating sibling partitions on cpuset.cpus conflict.
Posted by Waiman Long 1 month, 2 weeks ago
On 12/22/25 10:26 AM, Michal Koutný wrote:
> Hello Shaojie.
>
> On Mon, Dec 01, 2025 at 05:38:06PM +0800, Sun Shaojie <sunshaojie@kylinos.cn> wrote:
>> Currently, when setting a cpuset's cpuset.cpus to a value that conflicts
>> with its sibling partition, the sibling's partition state becomes invalid.
>> However, this invalidation is often unnecessary.
>>
>> For example: On a machine with 128 CPUs, there are m (m < 128) cpusets
>> under the root cgroup. Each cpuset is used by a single user(user-1 use
>> A1, ... , user-m use Am), and the partition states of these cpusets are
>> configured as follows:
>>
>>                             root cgroup
>>          /             /                  \                 \
>>         A1            A2        ...       An                Am
>>       (root)        (root)      ...     (root) (root/root invalid/member)
>>
>> Assume that A1 through Am have not set cpuset.cpus.exclusive. When
>> user-m modifies Am's cpuset.cpus to "0-127", it will cause all partition
>> states from A1 to An to change from root to root invalid, as shown
>> below.
>>
>>                             root cgroup
>>          /              /                 \                 \
>>         A1             A2       ...       An                Am
>>   (root invalid) (root invalid) ... (root invalid) (root invalid/member)
>>
>> This outcome is entirely undeserved for all users from A1 to An.
> s/cpuset.cpus/memory.max/
>
> When the permissions are such that the last (any) sibling can come and
> claim so much to cause overcommit, then it can set up large limit and
> (potentially) reclaim from others.
>
> s/cpuset.cpus/memory.min/
>
> Here is the overcommit approached by recalculating effective values of
> memory.min, again one sibling can skew toward itself and reduce every
> other's effective value.
>
> Above are not exact analogies because first of them is Limits, the
> second is Protections and cpusets are Allocations (refering to Resource
> Distribution Models from Documentation/admin-guide/cgroup-v2.rst).
>
> But the advice to get some guarantees would be same in all cases -- if
> some guarantees are expected, the permissions (of respective cgroup
> attributes) should be configured so that it decouples the owner of the
> cgroup from the owner of the resource (i.e. Ai/cpuset.cpus belongs to
> root or there's a middle level cgroup that'd cap each of the siblings
> individually).
>
 From sibling point of view, CPUs in partitions are exclusive. A cpuset 
either have all the requested CPUs to form a partition (assuming that at 
least one can be granted from the parent cpuset) or it doesn't have all 
of them and fails to form a valid partition. It is different from memory 
that a cgroup can have a reduced amount of memory than requested and can 
still work fine.

Anyway, I consider using cpuset.cpus to form a partition is legacy and 
is supported for backward compatibility reason. Now the proper way to 
form a partition is to use cpuset.cpus.exclusive, the setting of it can 
fail if it conflicts with siblings.

By using cpuset.cpus only to form partitions, the cpuset.cpus value will 
be treated the same as cpuset.cpus.exclusive if a valid partition is 
formed. In that sense, the examples listed in the patch will have the 
same result if cpuset.cpu.exclusive is used instead of cpuset.cpus. The 
difference is that writing to the cpuset.cpus.exclusive will fail 
instead of forming an invalid partition in the case of cpust.cpus.

>> After applying this patch, the first party to set "root" will maintain
>> its exclusive validity. As follows:
>>
>>   Step                                       | A1's prstate | B1's prstate |
>>   #1> echo "0-1" > A1/cpuset.cpus            | member       | member       |
>>   #2> echo "root" > A1/cpuset.cpus.partition | root         | member       |
>>   #3> echo "1-2" > B1/cpuset.cpus            | root         | member       |
>>   #4> echo "root" > B1/cpuset.cpus.partition | root         | root invalid |
>>
>>   Step                                       | A1's prstate | B1's prstate |
>>   #1> echo "0-1" > B1/cpuset.cpus            | member       | member       |
>>   #2> echo "root" > B1/cpuset.cpus.partition | member       | root         |
>>   #3> echo "1-2" > A1/cpuset.cpus            | member       | root         |
>>   #4> echo "root" > A1/cpuset.cpus.partition | root invalid | root         |
> I'm worried that the ordering dependency would lead to situations where
> users may not be immediately aware their config is overcommitting the system.
> Consider that CPUs are vital for A1 but B1 can somehow survive the
> degraded state, depending on the starting order the system may either
> run fine (A1 valid) or fail because of A1.
>
> I'm curious about Waiman's take.

That is why I will recommend users to use cpuset.cpus.exclusive to form 
partition as they can get early feedback if they are overcommitting. Of 
course, setting cpuset.cpus.exclusive without failure still doesn't 
guarantee the formation of a valid partition if none of the exclusive 
CPUs can be granted from the parent.

Cheers,
Longman

Re: [PATCH v6] cpuset: Avoid invalidating sibling partitions on cpuset.cpus conflict.
Posted by Waiman Long 1 month, 2 weeks ago
On 12/1/25 4:38 AM, Sun Shaojie wrote:
> Currently, when setting a cpuset's cpuset.cpus to a value that conflicts
> with its sibling partition, the sibling's partition state becomes invalid.
> However, this invalidation is often unnecessary.
>
> For example: On a machine with 128 CPUs, there are m (m < 128) cpusets
> under the root cgroup. Each cpuset is used by a single user(user-1 use
> A1, ... , user-m use Am), and the partition states of these cpusets are
> configured as follows:
>
>                             root cgroup
>          /             /                  \                 \
>         A1            A2        ...       An                Am
>       (root)        (root)      ...     (root) (root/root invalid/member)
>
> Assume that A1 through Am have not set cpuset.cpus.exclusive. When
> user-m modifies Am's cpuset.cpus to "0-127", it will cause all partition
> states from A1 to An to change from root to root invalid, as shown
> below.
>
>                             root cgroup
>          /              /                 \                 \
>         A1             A2       ...       An                Am
>   (root invalid) (root invalid) ... (root invalid) (root invalid/member)
>
> This outcome is entirely undeserved for all users from A1 to An.
>
> This patch prevents such outcomes by ensuring that modifications to
> cpuset.cpus do not affect the partition state of other sibling cpusets.
> Therefore, with this patch applied, when user-m configures Am's
> cpuset.cpus to "0-127", the result will be as follows.
>
>                             root cgroup
>          /             /                  \                 \
>         A1            A2        ...       An                Am
>       (root)        (root)      ...     (root)     (root invalid/member)
>
> It is worth noting that, since this patch enforces the exclusivity of
> sibling cpusets, setting exclusivity now follows a "first-come,
> first-served" principle.
>
> For example, consider the following four steps: before applying this
> patch, regardless of the order in which they are executed, the final
> partition state of both A1 and B1 would always be "root invalid."
>
>   Step                                       | A1's prstate | B1's prstate |
>   #1> echo "0-1" > A1/cpuset.cpus            | member       | member       |
>   #2> echo "root" > A1/cpuset.cpus.partition | root         | member       |
>   #3> echo "1-2" > B1/cpuset.cpus            | root invalid | member       |
>   #4> echo "root" > B1/cpuset.cpus.partition | root invalid | root invalid |
>
> After applying this patch, the first party to set "root" will maintain
> its exclusive validity. As follows:
>
>   Step                                       | A1's prstate | B1's prstate |
>   #1> echo "0-1" > A1/cpuset.cpus            | member       | member       |
>   #2> echo "root" > A1/cpuset.cpus.partition | root         | member       |
>   #3> echo "1-2" > B1/cpuset.cpus            | root         | member       |
>   #4> echo "root" > B1/cpuset.cpus.partition | root         | root invalid |
>
>   Step                                       | A1's prstate | B1's prstate |
>   #1> echo "0-1" > B1/cpuset.cpus            | member       | member       |
>   #2> echo "root" > B1/cpuset.cpus.partition | member       | root         |
>   #3> echo "1-2" > A1/cpuset.cpus            | member       | root         |
>   #4> echo "root" > A1/cpuset.cpus.partition | root invalid | root         |
>
> In summary, if the current cpuset conflicts with its sibling cpusets on
> exclusive CPUs (If a cpuset is exclusive and its exclusive CPUs are empty,
> its allowed CPUs will be treated as exclusive CPUs), only the current
> cpuset should bear the consequences.
>
> Signed-off-by: Sun Shaojie <sunshaojie@kylinos.cn>

I agree with you that it is probably not a good idea to invalidate 
partitions whenever there is a conflict. However, I have a different 
idea of how to do it. I am going to post another patch to show my idea. 
Let me know what you think about it and whether it can meet your need.

Cheers,
Longman

> ---
>   kernel/cgroup/cpuset-internal.h               |  3 +
>   kernel/cgroup/cpuset-v1.c                     | 19 ++++++
>   kernel/cgroup/cpuset.c                        | 60 ++++++++++++-------
>   .../selftests/cgroup/test_cpuset_prs.sh       | 12 ++--
>   4 files changed, 65 insertions(+), 29 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
> index 337608f408ce..c53111998432 100644
> --- a/kernel/cgroup/cpuset-internal.h
> +++ b/kernel/cgroup/cpuset-internal.h
> @@ -292,6 +292,7 @@ void cpuset1_hotplug_update_tasks(struct cpuset *cs,
>   			    struct cpumask *new_cpus, nodemask_t *new_mems,
>   			    bool cpus_updated, bool mems_updated);
>   int cpuset1_validate_change(struct cpuset *cur, struct cpuset *trial);
> +bool cpuset1_cpus_excl_conflict(struct cpuset *cs1, struct cpuset *cs2);
>   #else
>   static inline void fmeter_init(struct fmeter *fmp) {}
>   static inline void cpuset1_update_task_spread_flags(struct cpuset *cs,
> @@ -302,6 +303,8 @@ static inline void cpuset1_hotplug_update_tasks(struct cpuset *cs,
>   			    bool cpus_updated, bool mems_updated) {}
>   static inline int cpuset1_validate_change(struct cpuset *cur,
>   				struct cpuset *trial) { return 0; }
> +static inline bool cpuset1_cpus_excl_conflict(struct cpuset *cs1,
> +				struct cpuset *cs2) {return false; }
>   #endif /* CONFIG_CPUSETS_V1 */
>   
>   #endif /* __CPUSET_INTERNAL_H */
> diff --git a/kernel/cgroup/cpuset-v1.c b/kernel/cgroup/cpuset-v1.c
> index 12e76774c75b..5aa0ac092ef6 100644
> --- a/kernel/cgroup/cpuset-v1.c
> +++ b/kernel/cgroup/cpuset-v1.c
> @@ -373,6 +373,25 @@ int cpuset1_validate_change(struct cpuset *cur, struct cpuset *trial)
>   	return ret;
>   }
>   
> +/*
> + * cpuset1_cpus_excl_conflict() - Check if two cpusets have exclusive CPU conflicts
> + *                                to legacy (v1)
> + * @cs1: first cpuset to check
> + * @cs2: second cpuset to check
> + *
> + * Returns: true if CPU exclusivity conflict exists, false otherwise
> + *
> + * If either cpuset is CPU exclusive, their allowed CPUs cannot intersect.
> + */
> +bool cpuset1_cpus_excl_conflict(struct cpuset *cs1, struct cpuset *cs2)
> +{
> +	if (is_cpu_exclusive(cs1) || is_cpu_exclusive(cs2))
> +		return cpumask_intersects(cs1->cpus_allowed,
> +					  cs2->cpus_allowed);
> +
> +	return false;
> +}
> +
>   #ifdef CONFIG_PROC_PID_CPUSET
>   /*
>    * proc_cpuset_show()
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 52468d2c178a..e58dd26e074a 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -586,14 +586,24 @@ static inline bool cpusets_are_exclusive(struct cpuset *cs1, struct cpuset *cs2)
>    * Returns: true if CPU exclusivity conflict exists, false otherwise
>    *
>    * Conflict detection rules:
> - * 1. If either cpuset is CPU exclusive, they must be mutually exclusive
> + * For cgroup-v1:
> + *     see cpuset1_cpus_excl_conflict()
> + * For cgroup-v2:
> + * 1. If both cs1 and cs2 are exclusive, cs1 and cs2 must be mutually exclusive
>    * 2. exclusive_cpus masks cannot intersect between cpusets
>    * 3. The allowed CPUs of one cpuset cannot be a subset of another's exclusive CPUs
> + * 4. If a cpuset is exclusive and its exclusive CPUs are empty, its allowed CPUs
> + *    will be treated as exclusive CPUs; therefore, its allowed CPUs must not
> + *    intersect with another's exclusive CPUs.
>    */
>   static inline bool cpus_excl_conflict(struct cpuset *cs1, struct cpuset *cs2)
>   {
> -	/* If either cpuset is exclusive, check if they are mutually exclusive */
> -	if (is_cpu_exclusive(cs1) || is_cpu_exclusive(cs2))
> +	/* For cgroup-v1 */
> +	if (!cpuset_v2())
> +		return cpuset1_cpus_excl_conflict(cs1, cs2);
> +
> +	/* If cpusets are exclusive, check if they are mutually exclusive*/
> +	if (is_cpu_exclusive(cs1) && is_cpu_exclusive(cs2))
>   		return !cpusets_are_exclusive(cs1, cs2);
>   
>   	/* Exclusive_cpus cannot intersect */
> @@ -609,6 +619,20 @@ static inline bool cpus_excl_conflict(struct cpuset *cs1, struct cpuset *cs2)
>   	    cpumask_subset(cs2->cpus_allowed, cs1->exclusive_cpus))
>   		return true;
>   
> +	/*
> +	 * When a cpuset is exclusive and its exclusive CPUs are empty,
> +	 * its cpus_allowed cannot intersect with another cpuset's exclusive_cpus.
> +	 */
> +	if (is_cpu_exclusive(cs1) &&
> +	    cpumask_empty(cs1->exclusive_cpus) &&
> +	    cpumask_intersects(cs1->cpus_allowed, cs2->exclusive_cpus))
> +		return true;
> +
> +	if (is_cpu_exclusive(cs2) &&
> +	    cpumask_empty(cs2->exclusive_cpus) &&
> +	    cpumask_intersects(cs2->cpus_allowed, cs1->exclusive_cpus))
> +		return true;
> +
>   	return false;
>   }
>   
> @@ -2411,34 +2435,17 @@ static int cpus_allowed_validate_change(struct cpuset *cs, struct cpuset *trialc
>   					struct tmpmasks *tmp)
>   {
>   	int retval;
> -	struct cpuset *parent = parent_cs(cs);
>   
>   	retval = validate_change(cs, trialcs);
>   
>   	if ((retval == -EINVAL) && cpuset_v2()) {
> -		struct cgroup_subsys_state *css;
> -		struct cpuset *cp;
> -
>   		/*
>   		 * The -EINVAL error code indicates that partition sibling
>   		 * CPU exclusivity rule has been violated. We still allow
>   		 * the cpumask change to proceed while invalidating the
> -		 * partition. However, any conflicting sibling partitions
> -		 * have to be marked as invalid too.
> +		 * partition.
>   		 */
>   		trialcs->prs_err = PERR_NOTEXCL;
> -		rcu_read_lock();
> -		cpuset_for_each_child(cp, css, parent) {
> -			struct cpumask *xcpus = user_xcpus(trialcs);
> -
> -			if (is_partition_valid(cp) &&
> -			    cpumask_intersects(xcpus, cp->effective_xcpus)) {
> -				rcu_read_unlock();
> -				update_parent_effective_cpumask(cp, partcmd_invalidate, NULL, tmp);
> -				rcu_read_lock();
> -			}
> -		}
> -		rcu_read_unlock();
>   		retval = 0;
>   	}
>   	return retval;
> @@ -2506,8 +2513,15 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
>   	if (alloc_tmpmasks(&tmp))
>   		return -ENOMEM;
>   
> -	compute_trialcs_excpus(trialcs, cs);
> -	trialcs->prs_err = PERR_NONE;
> +	/*
> +	 * if there is exclusive CPUs conflict with the siblings,
> +	 * we still allow the cpumask change to proceed while
> +	 * invalidating the partition.
> +	 */
> +	if (compute_trialcs_excpus(trialcs, cs))
> +		trialcs->prs_err = PERR_NOTEXCL;
> +	else
> +		trialcs->prs_err = PERR_NONE;
>   
>   	retval = cpus_allowed_validate_change(cs, trialcs, &tmp);
>   	if (retval < 0)
> diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> index a17256d9f88a..75154e22c702 100755
> --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> @@ -269,7 +269,7 @@ TEST_MATRIX=(
>   	" C0-3:S+ C1-3:S+ C2-3     .    X2-3   X3:P2    .      .     0 A1:0-2|A2:3|A3:3 A1:P0|A2:P2 3"
>   	" C0-3:S+ C1-3:S+ C2-3     .    X2-3   X2-3  X2-3:P2   .     0 A1:0-1|A2:1|A3:2-3 A1:P0|A3:P2 2-3"
>   	" C0-3:S+ C1-3:S+ C2-3     .    X2-3   X2-3 X2-3:P2:C3 .     0 A1:0-1|A2:1|A3:2-3 A1:P0|A3:P2 2-3"
> -	" C0-3:S+ C1-3:S+ C2-3   C2-3     .      .      .      P2    0 A1:0-3|A2:1-3|A3:2-3|B1:2-3 A1:P0|A3:P0|B1:P-2"
> +	" C0-3:S+ C1-3:S+ C2-3   C2-3     .      .      .      P2    0 A1:0-1|A2:1|A3:1|B1:2-3 A1:P0|A3:P0|B1:P2"
>   	" C0-3:S+ C1-3:S+ C2-3   C4-5     .      .      .      P2    0 B1:4-5 B1:P2 4-5"
>   	" C0-3:S+ C1-3:S+ C2-3    C4    X2-3   X2-3  X2-3:P2   P2    0 A3:2-3|B1:4 A3:P2|B1:P2 2-4"
>   	" C0-3:S+ C1-3:S+ C2-3    C4    X2-3   X2-3 X2-3:P2:C1-3 P2  0 A3:2-3|B1:4 A3:P2|B1:P2 2-4"
> @@ -318,7 +318,7 @@ TEST_MATRIX=(
>   	# Invalid to valid local partition direct transition tests
>   	" C1-3:S+:P2 X4:P2  .      .      .      .      .      .     0 A1:1-3|XA1:1-3|A2:1-3:XA2: A1:P2|A2:P-2 1-3"
>   	" C1-3:S+:P2 X4:P2  .      .      .    X3:P2    .      .     0 A1:1-2|XA1:1-3|A2:3:XA2:3 A1:P2|A2:P2 1-3"
> -	"  C0-3:P2   .      .    C4-6   C0-4     .      .      .     0 A1:0-4|B1:4-6 A1:P-2|B1:P0"
> +	"  C0-3:P2   .      .    C4-6   C0-4     .      .      .     0 A1:0-4|B1:5-6 A1:P2|B1:P0"
>   	"  C0-3:P2   .      .    C4-6 C0-4:C0-3  .      .      .     0 A1:0-3|B1:4-6 A1:P2|B1:P0 0-3"
>   
>   	# Local partition invalidation tests
> @@ -388,10 +388,10 @@ TEST_MATRIX=(
>   	"  C0-1:S+  C1      .    C2-3     .      P2     .      .     0 A1:0-1|A2:1 A1:P0|A2:P-2"
>   	"  C0-1:S+ C1:P2    .    C2-3     P1     .      .      .     0 A1:0|A2:1 A1:P1|A2:P2 0-1|1"
>   
> -	# A non-exclusive cpuset.cpus change will invalidate partition and its siblings
> -	"  C0-1:P1   .      .    C2-3   C0-2     .      .      .     0 A1:0-2|B1:2-3 A1:P-1|B1:P0"
> -	"  C0-1:P1   .      .  P1:C2-3  C0-2     .      .      .     0 A1:0-2|B1:2-3 A1:P-1|B1:P-1"
> -	"   C0-1     .      .  P1:C2-3  C0-2     .      .      .     0 A1:0-2|B1:2-3 A1:P0|B1:P-1"
> +	# A non-exclusive cpuset.cpus change will not invalidate its siblings partition.
> +	"  C0-1:P1   .      .    C2-3   C0-2     .      .      .     0 A1:0-2|B1:3 A1:P1|B1:P0"
> +	"  C0-1:P1   .      .  P1:C2-3  C0-2     .      .      .     0 A1:0-1|B1:2-3 A1:P-1|B1:P1"
> +	"   C0-1     .      .  P1:C2-3  C0-2     .      .      .     0 A1:0-1|B1:2-3 A1:P0|B1:P1"
>   
>   	# cpuset.cpus can overlap with sibling cpuset.cpus.exclusive but not subsumed by it
>   	"   C0-3     .      .    C4-5     X5     .      .      .     0 A1:0-3|B1:4-5"
[PING][PATCH v6] cpuset: Avoid invalidating sibling partitions on cpuset.cpus conflict
Posted by Sun Shaojie 1 month, 3 weeks ago
Hi, Longman,
 
Just a friendly ping regarding the patch "[PATCH v6] cpuset: Avoid
invalidating sibling partitions on cpuset.cpus conflict" sent on
[Mon,  1 Dec 2025 17:38:06 +0800].
Link: https://lore.kernel.org/cgroups/20251201093806.107157-1-sunshaojie@kylinos.cn/

Could you please take a look when you have a moment? We'd appreciate any
initial feedback or suggestions you might have.

Thank you again for your time and consideration.

Thanks,
Sun Shaojie
Re: [PING][PATCH v6] cpuset: Avoid invalidating sibling partitions on cpuset.cpus conflict
Posted by Waiman Long 1 month, 2 weeks ago
On 12/17/25 4:45 AM, Sun Shaojie wrote:
> Hi, Longman,
>   
> Just a friendly ping regarding the patch "[PATCH v6] cpuset: Avoid
> invalidating sibling partitions on cpuset.cpus conflict" sent on
> [Mon,  1 Dec 2025 17:38:06 +0800].
> Link: https://lore.kernel.org/cgroups/20251201093806.107157-1-sunshaojie@kylinos.cn/
>
> Could you please take a look when you have a moment? We'd appreciate any
> initial feedback or suggestions you might have.
>
> Thank you again for your time and consideration.

I am sorry that I am late in reviewing your patch. I was busy in the 
last few weeks. Now I will try to review your patch later this week.

Cheers,
Longman