[v4] cgroup/cpuset: Fix partition related locking issues

[PATCH/for-next v4 2/4] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue

Posted by Waiman Long 1 day, 23 hours ago

The update_isolation_cpumasks() function can be called either directly
from regular cpuset control file write with cpuset_full_lock() called
or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.

As we are going to enable dynamic update to the nozh_full housekeeping
cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
allowing the CPU hotplug path to call into housekeeping_update() directly
from update_isolation_cpumasks() will likely cause deadlock. So we
have to defer any call to housekeeping_update() after the CPU hotplug
operation has finished. This is now done via the workqueue where
the actual housekeeping_update() call, if needed, will happen after
cpus_write_lock is released.

We can't use the synchronous task_work API as call from CPU hotplug
path happen in the per-cpu kthread of the CPU that is being shut down
or brought up. Because of the asynchronous nature of workqueue, the
HK_TYPE_DOMAIN housekeeping cpumask will be updated a bit later than the
"cpuset.cpus.isolated" control file in this case.

Also add a check in test_cpuset_prs.sh and modify some existing
test cases to confirm that "cpuset.cpus.isolated" and HK_TYPE_DOMAIN
housekeeping cpumask will both be updated.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c                        | 41 +++++++++++++++++--
 .../selftests/cgroup/test_cpuset_prs.sh       | 13 ++++--
 2 files changed, 48 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a4c6386a594d..eb0eabd85e8c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1302,6 +1302,17 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
 	return false;
 }
 
+static void isolcpus_workfn(struct work_struct *work)
+{
+	cpuset_full_lock();
+	if (isolated_cpus_updating) {
+		isolated_cpus_updating = false;
+		WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
+		rebuild_sched_domains_locked();
+	}
+	cpuset_full_unlock();
+}
+
 /*
  * update_isolation_cpumasks - Update external isolation related CPU masks
  *
@@ -1310,14 +1321,38 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
  */
 static void update_isolation_cpumasks(void)
 {
-	int ret;
+	static DECLARE_WORK(isolcpus_work, isolcpus_workfn);
 
+	lockdep_assert_cpuset_lock_held();
 	if (!isolated_cpus_updating)
 		return;
 
-	ret = housekeeping_update(isolated_cpus);
-	WARN_ON_ONCE(ret < 0);
+	/*
+	 * This function can be reached either directly from regular cpuset
+	 * control file write or via CPU hotplug. In the latter case, it is
+	 * the per-cpu kthread that calls cpuset_handle_hotplug() on behalf
+	 * of the task that initiates CPU shutdown or bringup.
+	 *
+	 * To have better flexibility and prevent the possibility of deadlock
+	 * when calling from CPU hotplug, we defer the housekeeping_update()
+	 * call to after the current cpuset critical section has finished.
+	 * This is done via workqueue.
+	 */
+	if (current->flags & PF_KTHREAD) {
+		/*
+		 * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work
+		 * item that is still pending. Before the pending bit is
+		 * cleared, the work data is copied out and work item dequeued.
+		 * So it is possible to queue the work again before the
+		 * isolcpus_workfn() is invoked to process the previously
+		 * queued work. Since isolcpus_workfn() doesn't use the work
+		 * item at all, this is not a problem.
+		 */
+		queue_work(system_unbound_wq, &isolcpus_work);
+		return;
+	}
 
+	WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
 	isolated_cpus_updating = false;
 }
 
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index 5dff3ad53867..0502b156582b 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -245,8 +245,9 @@ TEST_MATRIX=(
 	"C2-3:P1:S+  C3:P2  .      .     O2=0   O2=1    .      .     0 A1:2|A2:3 A1:P1|A2:P2"
 	"C2-3:P1:S+  C3:P1  .      .     O2=0    .      .      .     0 A1:|A2:3 A1:P1|A2:P1"
 	"C2-3:P1:S+  C3:P1  .      .     O3=0    .      .      .     0 A1:2|A2: A1:P1|A2:P1"
-	"C2-3:P1:S+  C3:P1  .      .    T:O2=0   .      .      .     0 A1:3|A2:3 A1:P1|A2:P-1"
-	"C2-3:P1:S+  C3:P1  .      .      .    T:O3=0   .      .     0 A1:2|A2:2 A1:P1|A2:P-1"
+	"C2-3:P1:S+  C3:P2  .      .    T:O2=0   .      .      .     0 A1:3|A2:3 A1:P1|A2:P-2"
+	"C1-3:P1:S+  C3:P2  .      .      .    T:O3=0   .      .     0 A1:1-2|A2:1-2 A1:P1|A2:P-2 3|"
+	"C1-3:P1:S+  C3:P2  .      .      .    T:O3=0  O3=1    .     0 A1:1-2|A2:3 A1:P1|A2:P2  3"
 	"$SETUP_A123_PARTITIONS    .     O1=0    .      .      .     0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
 	"$SETUP_A123_PARTITIONS    .     O2=0    .      .      .     0 A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
 	"$SETUP_A123_PARTITIONS    .     O3=0    .      .      .     0 A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
@@ -764,7 +765,7 @@ check_cgroup_states()
 # only CPUs in isolated partitions as well as those that are isolated at
 # boot time.
 #
-# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
+# $1 - expected isolated cpu list(s) <isolcpus1>{|<isolcpus2>}
 # <isolcpus1> - expected sched/domains value
 # <isolcpus2> - cpuset.cpus.isolated value = <isolcpus1> if not defined
 #
@@ -773,6 +774,7 @@ check_isolcpus()
 	EXPECTED_ISOLCPUS=$1
 	ISCPUS=${CGROUP2}/cpuset.cpus.isolated
 	ISOLCPUS=$(cat $ISCPUS)
+	HKICPUS=$(cat /sys/devices/system/cpu/isolated)
 	LASTISOLCPU=
 	SCHED_DOMAINS=/sys/kernel/debug/sched/domains
 	if [[ $EXPECTED_ISOLCPUS = . ]]
@@ -810,6 +812,11 @@ check_isolcpus()
 	ISOLCPUS=
 	EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
 
+	#
+	# The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS
+	#
+	[[ "$ISOLCPUS" != "$HKICPUS" ]] && return 1
+
 	#
 	# Use the sched domain in debugfs to check isolated CPUs, if available
 	#
-- 
2.52.0

Re: [PATCH/for-next v4 2/4] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue

Posted by Frederic Weisbecker 1 day, 21 hours ago

Le Fri, Feb 06, 2026 at 03:37:10PM -0500, Waiman Long a écrit :
> The update_isolation_cpumasks() function can be called either directly
> from regular cpuset control file write with cpuset_full_lock() called
> or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.
> 
> As we are going to enable dynamic update to the nozh_full housekeeping
> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
> allowing the CPU hotplug path to call into housekeeping_update() directly
> from update_isolation_cpumasks() will likely cause deadlock. So we

Why do we need to call housekeeping_update() from hotplug? I would
expect it to be called only when cpuset control file are written since
housekeeping cpumask don't deal with online CPUs but with possible
CPUs.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs

Re: [PATCH/for-next v4 2/4] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue

Posted by Waiman Long 17 hours ago

On 2/6/26 5:28 PM, Frederic Weisbecker wrote:
> Le Fri, Feb 06, 2026 at 03:37:10PM -0500, Waiman Long a écrit :
>> The update_isolation_cpumasks() function can be called either directly
>> from regular cpuset control file write with cpuset_full_lock() called
>> or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.
>>
>> As we are going to enable dynamic update to the nozh_full housekeeping
>> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
>> allowing the CPU hotplug path to call into housekeeping_update() directly
>> from update_isolation_cpumasks() will likely cause deadlock. So we
> Why do we need to call housekeeping_update() from hotplug? I would
> expect it to be called only when cpuset control file are written since
> housekeeping cpumask don't deal with online CPUs but with possible
> CPUs.

It needs to call housekeeping_update() only in the special case where 
there is only one active CPU in an isolated partition and that CPU goes 
offline. In this case, the partition becomes disabled that causes change 
in the isolated CPUs. I know this special case shouldn't happen in real 
world, but I do have test case to test that.

Theoretically, we can add code to handle this special case to keep this 
offline isolated CPU in a special pool without changing isolated_cpus 
and hence  HK_TYPE_DOMAIN cpumask. In this way, we shouldn't need to 
call housekeeping_update() from CPU hotplug. I will probably do that as 
CPU hotplug will be used when we make HK_TYPE_KERNEL_NOISE cpumask 
dynamic in the near future.

Cheers,
Longman

[PATCH/for-next v4 1/4] cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
[PATCH/for-next v4 2/4] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
[PATCH/for-next v4 3/4] cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
[PATCH/for-next v4 4/4] cgroup/cpuset: Eliminate some duplicated rebuild_sched_domains() calls