The current cpuset partition code is able to dynamically update
the sched domains of a running system and the corresponding
HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
"isolcpus=domain,..." boot command line feature at run time.
The housekeeping cpumask update requires flushing a number of different
workqueues, which may not be safe with cpus_read_lock() held, as the
workqueue flushing code may acquire cpus_read_lock() directly or acquire
locks that have a locking dependency on cpus_read_lock() further down the
chain. Below is an example of such a circular locking problem.
======================================================
WARNING: possible circular locking dependency detected
6.18.0-test+ #2 Tainted: G S
------------------------------------------------------
test_cpuset_prs/10971 is trying to acquire lock:
ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180
but task is already holding lock:
ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (cpuset_mutex){+.+.}-{4:4}:
-> #3 (cpu_hotplug_lock){++++}-{0:0}:
-> #2 (rtnl_mutex){+.+.}-{4:4}:
-> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
-> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
Chain exists of:
(wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
5 locks held by test_cpuset_prs/10971:
#0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
#1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
#2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
#3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
#4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
Call Trace:
<TASK>
:
touch_wq_lockdep_map+0x93/0x180
__flush_workqueue+0x111/0x10b0
housekeeping_update+0x12d/0x2d0
update_parent_effective_cpumask+0x595/0x2440
update_prstate+0x89d/0xce0
cpuset_partition_write+0xc5/0x130
cgroup_file_write+0x1a5/0x680
kernfs_fop_write_iter+0x3df/0x5f0
vfs_write+0x525/0xfd0
ksys_write+0xf9/0x1d0
do_syscall_64+0x95/0x520
entry_SYSCALL_64_after_hwframe+0x76/0x7e
To avoid such a circular locking dependency problem, we have to
call housekeeping_update() without holding cpus_read_lock() and
cpuset_mutex. The current set of wq's flushed by housekeeping_update()
may not have work functions that call cpus_read_lock() directly,
but we are likely to extend the list of wq's that are flushed in the
future. Moreover, the current set of work functions may acquire locks
that have cpu_hotplug_lock further down their dependency chains.
One way to do that is to introduce a new top-level cpuset_top_mutex
which will be acquired first. This new cpuset_top_mutex will provide
the needed mutual exclusion without the need to hold cpus_read_lock().
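Schematically, the resulting lock ordering on the cpuset write side then
looks like the following. This is only a simplified sketch of what
cpuset_full_lock()/cpuset_full_unlock() and update_isolation_cpumasks()
end up doing, not the literal code:

    mutex_lock(&cpuset_top_mutex);      /* new top-level lock, taken first */
    cpus_read_lock();
    mutex_lock(&cpuset_mutex);

    /* ... cpuset updates, isolated_cpus recomputed ... */

    /* drop the locks that conflict with workqueue flushing */
    mutex_unlock(&cpuset_mutex);
    cpus_read_unlock();

    WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);

    cpus_read_lock();
    mutex_lock(&cpuset_mutex);

    /* ... remaining updates ... */

    mutex_unlock(&cpuset_mutex);
    cpus_read_unlock();
    mutex_unlock(&cpuset_top_mutex);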
As cpus_read_lock() is no longer held when
tmigr_isolated_exclude_cpumask() is called, that function now has to
acquire it directly.
lockdep_is_cpuset_held() is also updated to check the new
cpuset_top_mutex.
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 101 +++++++++++++++++++++++-----------
kernel/sched/isolation.c | 4 +-
kernel/time/timer_migration.c | 3 +-
3 files changed, 70 insertions(+), 38 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0b0eb1df09d5..edccfa2df9da 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -78,13 +78,13 @@ static cpumask_var_t subpartitions_cpus;
static cpumask_var_t isolated_cpus;
/*
- * isolated_cpus updating flag (protected by cpuset_mutex)
+ * isolated_cpus updating flag (protected by cpuset_top_mutex)
* Set if isolated_cpus is going to be updated in the current
* cpuset_mutex crtical section.
*/
static bool isolated_cpus_updating;
-/* Both cpuset_mutex and cpus_read_locked acquired */
+/* cpuset_top_mutex acquired */
static bool cpuset_locked;
/*
@@ -222,29 +222,44 @@ struct cpuset top_cpuset = {
};
/*
- * There are two global locks guarding cpuset structures - cpuset_mutex and
- * callback_lock. The cpuset code uses only cpuset_mutex. Other kernel
- * subsystems can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset
- * structures. Note that cpuset_mutex needs to be a mutex as it is used in
- * paths that rely on priority inheritance (e.g. scheduler - on RT) for
- * correctness.
+ * CPUSET Locking Convention
+ * -------------------------
*
- * A task must hold both locks to modify cpusets. If a task holds
- * cpuset_mutex, it blocks others, ensuring that it is the only task able to
- * also acquire callback_lock and be able to modify cpusets. It can perform
- * various checks on the cpuset structure first, knowing nothing will change.
- * It can also allocate memory while just holding cpuset_mutex. While it is
- * performing these checks, various callback routines can briefly acquire
- * callback_lock to query cpusets. Once it is ready to make the changes, it
- * takes callback_lock, blocking everyone else.
+ * Below are the four global locks guarding cpuset structures in lock
+ * acquisition order:
+ * - cpuset_top_mutex
+ * - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
+ * - cpuset_mutex
+ * - callback_lock (raw spinlock)
*
- * Calls to the kernel memory allocator can not be made while holding
- * callback_lock, as that would risk double tripping on callback_lock
- * from one of the callbacks into the cpuset code from within
- * __alloc_pages().
+ * The first cpuset_top_mutex will be held except when calling into
+ * cpuset_handle_hotplug() from the CPU hotplug code where cpus_write_lock
+ * and cpuset_mutex will be held instead.
*
- * If a task is only holding callback_lock, then it has read-only
- * access to cpusets.
+ * As cpuset will now indirectly flush a number of different workqueues in
+ * housekeeping_update() when the set of isolated CPUs is going to be changed,
+ * it may not be safe from the circular locking perspective to hold the
+ * cpus_read_lock. So cpus_read_lock and cpuset_mutex will be released before
+ * calling housekeeping_update() and re-acquired afterward.
+ *
+ * A task must hold all the remaining three locks to modify externally visible
+ * or used fields of cpusets, though some of the internally used cpuset fields
+ * can be modified without holding callback_lock. If only reliable read access
+ * of the externally used fields are needed, a task can hold either
+ * cpuset_mutex or callback_lock which are exposed to other subsystems.
+ *
+ * If a task holds cpu_hotplug_lock and cpuset_mutex, it blocks others,
+ * ensuring that it is the only task able to also acquire callback_lock and
+ * be able to modify cpusets. It can perform various checks on the cpuset
+ * structure first, knowing nothing will change. It can also allocate memory
+ * without holding callback_lock. While it is performing these checks, various
+ * callback routines can briefly acquire callback_lock to query cpusets. Once
+ * it is ready to make the changes, it takes callback_lock, blocking everyone
+ * else.
+ *
+ * Calls to the kernel memory allocator cannot be made while holding
+ * callback_lock which is a spinlock, as the memory allocator may sleep or
+ * call back into cpuset code and acquire callback_lock.
*
* Now, the task_struct fields mems_allowed and mempolicy may be changed
* by other task, we use alloc_lock in the task_struct fields to protect
@@ -255,6 +270,7 @@ struct cpuset top_cpuset = {
* cpumasks and nodemasks.
*/
+static DEFINE_MUTEX(cpuset_top_mutex);
static DEFINE_MUTEX(cpuset_mutex);
/**
@@ -278,6 +294,18 @@ void lockdep_assert_cpuset_lock_held(void)
lockdep_assert_held(&cpuset_mutex);
}
+static void cpuset_partial_lock(void)
+{
+ cpus_read_lock();
+ mutex_lock(&cpuset_mutex);
+}
+
+static void cpuset_partial_unlock(void)
+{
+ mutex_unlock(&cpuset_mutex);
+ cpus_read_unlock();
+}
+
/**
* cpuset_full_lock - Acquire full protection for cpuset modification
*
@@ -286,22 +314,22 @@ void lockdep_assert_cpuset_lock_held(void)
*/
void cpuset_full_lock(void)
{
- cpus_read_lock();
- mutex_lock(&cpuset_mutex);
+ mutex_lock(&cpuset_top_mutex);
+ cpuset_partial_lock();
cpuset_locked = true;
}
void cpuset_full_unlock(void)
{
cpuset_locked = false;
- mutex_unlock(&cpuset_mutex);
- cpus_read_unlock();
+ cpuset_partial_unlock();
+ mutex_unlock(&cpuset_top_mutex);
}
#ifdef CONFIG_LOCKDEP
bool lockdep_is_cpuset_held(void)
{
- return lockdep_is_held(&cpuset_mutex);
+ return lockdep_is_held(&cpuset_top_mutex);
}
#endif
@@ -1292,12 +1320,12 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
static void isolcpus_workfn(struct work_struct *work)
{
- cpuset_full_lock();
- if (isolated_cpus_updating) {
- WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
- isolated_cpus_updating = false;
- }
- cpuset_full_unlock();
+ guard(mutex)(&cpuset_top_mutex);
+ if (!isolated_cpus_updating)
+ return;
+
+ WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
+ isolated_cpus_updating = false;
}
/*
@@ -1331,8 +1359,15 @@ static void update_isolation_cpumasks(void)
return;
}
+ lockdep_assert_held(&cpuset_top_mutex);
+ /*
+ * Release cpus_read_lock & cpuset_mutex before calling
+ * housekeeping_update() and re-acquiring them afterward.
+ */
+ cpuset_partial_unlock();
WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
isolated_cpus_updating = false;
+ cpuset_partial_lock();
}
/**
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 3b725d39c06e..ef152d401fe2 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -123,8 +123,6 @@ int housekeeping_update(struct cpumask *isol_mask)
struct cpumask *trial, *old = NULL;
int err;
- lockdep_assert_cpus_held();
-
trial = kmalloc(cpumask_size(), GFP_KERNEL);
if (!trial)
return -ENOMEM;
@@ -136,7 +134,7 @@ int housekeeping_update(struct cpumask *isol_mask)
}
if (!housekeeping.flags)
- static_branch_enable_cpuslocked(&housekeeping_overridden);
+ static_branch_enable(&housekeeping_overridden);
if (housekeeping.flags & HK_FLAG_DOMAIN)
old = housekeeping_cpumask_dereference(HK_TYPE_DOMAIN);
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 6da9cd562b20..244a8d025e78 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1559,8 +1559,6 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
cpumask_var_t cpumask __free(free_cpumask_var) = CPUMASK_VAR_NULL;
int cpu;
- lockdep_assert_cpus_held();
-
if (!works)
return -ENOMEM;
if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
@@ -1570,6 +1568,7 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
* First set previously isolated CPUs as available (unisolate).
* This cpumask contains only CPUs that switched to available now.
*/
+ guard(cpus_read_lock)();
cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask);
cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask);
--
2.52.0
On 2026/1/30 23:42, Waiman Long wrote:
> To avoid such a circular locking dependency problem, we have to
> call housekeeping_update() without holding cpus_read_lock() and
> cpuset_mutex. The current set of wq's flushed by housekeeping_update()
> may not have work functions that call cpus_read_lock() directly,
> but we are likely to extend the list of wq's that are flushed in the
> future. Moreover, the current set of work functions may acquire locks
> that have cpu_hotplug_lock further down their dependency chains.
>
> One way to do that is to introduce a new top-level cpuset_top_mutex
> which will be acquired first. This new cpuset_top_mutex will provide
> the needed mutual exclusion without the need to hold cpus_read_lock().
>
Introducing a new global lock warrants careful consideration. I wonder if we
could make all updates to isolated_cpus asynchronous. If that is feasible, we
could avoid adding a global lock altogether. If not, we need to clarify which
updates must remain synchronous and which ones can be handled asynchronously.
--
Best regards,
Ridong
On 1/30/26 9:53 PM, Chen Ridong wrote:
> Introducing a new global lock warrants careful consideration. I wonder if we
> could make all updates to isolated_cpus asynchronous. If that is feasible, we
> could avoid adding a global lock altogether. If not, we need to clarify which
> updates must remain synchronous and which ones can be handled asynchronously.
Almost all the cpuset code runs with cpuset_mutex held together with
either cpus_read_lock or cpus_write_lock, so there is no concurrent
access to or update of any of the cpuset internal data. The new
cpuset_top_mutex is added to resolve the possible deadlock scenarios
with the new housekeeping_update() call without breaking this model.
Allowing parallel concurrent access/update to cpuset data would greatly
complicate the code, and we would likely miss some corner cases that we
would have to fix in the future. We would only do that if cpuset were in
a critical performance path, but it is not. It is not just isolated_cpus
that we are protecting; all the other cpuset data may be at risk if we
don't have another top-level mutex to protect them.
Cheers,
Longman
On 2026/2/1 7:13, Waiman Long wrote:
> Almost all the cpuset code runs with cpuset_mutex held together with either
> cpus_read_lock or cpus_write_lock, so there is no concurrent access to or
> update of any of the cpuset internal data. The new cpuset_top_mutex is added
> to resolve the possible deadlock scenarios with the new housekeeping_update()
> call without breaking this model. Allowing parallel concurrent access/update
> to cpuset data would greatly complicate the code, and we would likely miss
> some corner cases that we
I agree with that point. However, we already have paths where isolated_cpus is
updated asynchronously, meaning parallel concurrent access/update is already
happening. Therefore, we cannot entirely avoid such scenarios, so why not keep
the locking simple (make all updates to isolated_cpus asynchronous)?
This is just a thought.
> would have to fix in the future. We would only do that if cpuset were in a
> critical performance path, but it is not. It is not just isolated_cpus that
> we are protecting; all the other cpuset data may be at risk if we don't have
> another top-level mutex to protect them.
>
> Cheers,
> Longman
>
--
Best regards,
Ridong
On 2/1/26 8:11 PM, Chen Ridong wrote:
> I agree with that point. However, we already have paths where isolated_cpus is
> updated asynchronously, meaning parallel concurrent access/update is already
> happening. Therefore, we cannot entirely avoid such scenarios, so why not keep
> the locking simple (make all updates to isolated_cpus asynchronous)?
isolated_cpus should only be updated in isolated_cpus_update() where
both cpuset_mutex and callback_lock are held. It can be read
asynchronously if either cpuset_mutex or callback_lock is held. Can you
show me the places where this rule isn't followed?
Cheers,
Longman
On 2026/2/3 2:29, Waiman Long wrote:
>
> isolated_cpus should only be updated in isolated_cpus_update() where both
> cpuset_mutex and callback_lock are held. It can be read asynchronously if either
> cpuset_mutex or callback_lock is held. Can you show me the places where this
> rule isn't followed?
>
I was considering that, since the hotplug path calls update_isolation_cpumasks()
asynchronously, other cpuset paths (such as setting CPUs or partitions) could
also call update_isolation_cpumasks() asynchronously. If so, the global
cpuset_top_mutex lock might be unnecessary. Note that isolated_cpus is updated
synchronously, while housekeeping_update() is invoked asynchronously.
Just a thought for discussion, and I’d really appreciate your insights on this.
--
Best regards,
Ridong
On 2/3/26 8:55 PM, Chen Ridong wrote:
> I was considering that, since the hotplug path calls update_isolation_cpumasks()
> asynchronously, other cpuset paths (such as setting CPUs or partitions) could
> also call update_isolation_cpumasks() asynchronously. If so, the global
> cpuset_top_mutex lock might be unnecessary. Note that isolated_cpus is updated
> synchronously, while housekeeping_update() is invoked asynchronously.
update_isolation_cpumasks() is always called synchronously, as
cpuset_mutex will always be held. With the current patchset, the only
asynchronous piece is CPU hotplug vs the housekeeping_update() call,
as the latter is now made without holding cpus_read_lock(). AFAICS, it
should not be a problem. Please let me know if you are aware of some
potential hazard with the current setup.
Cheers,
Longman