cpuset_handle_hotplug() may need to invoke housekeeping_update(),
for instance, when an isolated partition is invalidated because its
last active CPU has been put offline.

As we are going to enable dynamic update to the nohz_full housekeeping
cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
allowing the CPU hotplug path to call into housekeeping_update() directly
from update_isolation_cpumasks() will likely cause deadlock. So we
have to defer any call to housekeeping_update() until after the CPU
hotplug operation has finished. This is now done via a workqueue, where
update_hk_sched_domains() is invoked from the hk_sd_workfn() work
function.

A concurrent cpuset control file write may have executed the required
update_hk_sched_domains() function before the work function is called. So
the work function call may become a no-op when it is invoked.

Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 31 ++++++++++++++++---
.../selftests/cgroup/test_cpuset_prs.sh | 11 ++++++-
2 files changed, 36 insertions(+), 6 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 3d0d18bf182f..2c80bfc30bbc 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1323,6 +1323,16 @@ static void update_hk_sched_domains(void)
 	rebuild_sched_domains_locked();
 }

+/*
+ * Work function to invoke update_hk_sched_domains()
+ */
+static void hk_sd_workfn(struct work_struct *work)
+{
+	cpuset_full_lock();
+	update_hk_sched_domains();
+	cpuset_full_unlock();
+}
+
 /**
  * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
  * @parent: Parent cpuset containing all siblings
@@ -3795,6 +3805,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
  */
 static void cpuset_handle_hotplug(void)
 {
+	static DECLARE_WORK(hk_sd_work, hk_sd_workfn);
 	static cpumask_t new_cpus;
 	static nodemask_t new_mems;
 	bool cpus_updated, mems_updated;
@@ -3877,11 +3888,21 @@ static void cpuset_handle_hotplug(void)
}
-	if (update_housekeeping || force_sd_rebuild) {
-		mutex_lock(&cpuset_mutex);
-		update_hk_sched_domains();
-		mutex_unlock(&cpuset_mutex);
-	}
+	/*
+	 * Queue a work to call housekeeping_update() & rebuild_sched_domains()
+	 * There will be a slight delay before the HK_TYPE_DOMAIN housekeeping
+	 * cpumask can correctly reflect what is in isolated_cpus.
+	 *
+	 * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work item that
+	 * is still pending. Before the pending bit is cleared, the work data
+	 * is copied out and work item dequeued. So it is possible to queue
+	 * the work again before the hk_sd_workfn() is invoked to process the
+	 * previously queued work. Since hk_sd_workfn() doesn't use the work
+	 * item at all, this is not a problem.
+	 */
+	if (update_housekeeping || force_sd_rebuild)
+		queue_work(system_unbound_wq, &hk_sd_work);
+
 	free_tmpmasks(ptmp);
 }
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index 0c5db118f2d1..dc2dff361ec6 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -246,6 +246,9 @@ TEST_MATRIX=(
" C2-3:P1 C3:P1 . . O3=0 . . . 0 A1:2|A2: A1:P1|A2:P1"
" C2-3:P1 C3:P1 . . T:O2=0 . . . 0 A1:3|A2:3 A1:P1|A2:P-1"
" C2-3:P1 C3:P1 . . . T:O3=0 . . 0 A1:2|A2:2 A1:P1|A2:P-1"
+ " C2-3:P1 C3:P2 . . T:O2=0 . . . 0 A1:3|A2:3 A1:P1|A2:P-2"
+ " C1-3:P1 C3:P2 . . . T:O3=0 . . 0 A1:1-2|A2:1-2 A1:P1|A2:P-2 3|"
+ " C1-3:P1 C3:P2 . . . T:O3=0 O3=1 . 0 A1:1-2|A2:3 A1:P1|A2:P2 3"
"$SETUP_A123_PARTITIONS . O1=0 . . . 0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
"$SETUP_A123_PARTITIONS . O2=0 . . . 0 A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
"$SETUP_A123_PARTITIONS . O3=0 . . . 0 A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
@@ -762,7 +765,7 @@ check_cgroup_states()
# only CPUs in isolated partitions as well as those that are isolated at
# boot time.
#
-# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
+# $1 - expected isolated cpu list(s) <isolcpus1>{|<isolcpus2>}
# <isolcpus1> - expected sched/domains value
# <isolcpus2> - cpuset.cpus.isolated value = <isolcpus1> if not defined
#
@@ -771,6 +774,7 @@ check_isolcpus()
EXPECTED_ISOLCPUS=$1
ISCPUS=${CGROUP2}/cpuset.cpus.isolated
ISOLCPUS=$(cat $ISCPUS)
+ HKICPUS=$(cat /sys/devices/system/cpu/isolated)
LASTISOLCPU=
SCHED_DOMAINS=/sys/kernel/debug/sched/domains
if [[ $EXPECTED_ISOLCPUS = . ]]
@@ -808,6 +812,11 @@ check_isolcpus()
ISOLCPUS=
EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
+ #
+ # The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS
+ #
+ [[ "$ISOLCPUS" != "$HKICPUS" ]] && return 1
+
#
# Use the sched domain in debugfs to check isolated CPUs, if available
#
--
2.53.0
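
The deferral and coalescing behaviour described in the commit message and in the code comment can be illustrated with a small user-space model. This is only a sketch in Python, not kernel code: `WorkItem`, `ToyWorkqueue`, `IsolationState` and its `dirty` flag are hypothetical names invented for illustration, and the model is single-threaded. It assumes only what the patch states: a still-pending item is not queued twice, the pending flag is cleared before the callback runs, and the callback may find nothing left to do.

```python
from collections import deque


class WorkItem:
    """Toy stand-in for a kernel work item with a single pending flag."""

    def __init__(self, fn):
        self.fn = fn
        self.pending = False


class ToyWorkqueue:
    """Single-threaded model; run_one() plays the role of a worker thread."""

    def __init__(self):
        self.items = deque()

    def queue_work(self, work):
        # Return False if the item is still pending, True if this call
        # actually queued it.
        if work.pending:
            return False
        work.pending = True
        self.items.append(work)
        return True

    def run_one(self):
        work = self.items.popleft()
        # The pending flag is cleared before the callback runs, so the
        # item may be queued again while the callback is executing.
        work.pending = False
        work.fn()


class IsolationState:
    """Hypothetical state: 'dirty' means the housekeeping mask is stale."""

    def __init__(self):
        self.dirty = False
        self.updates = 0

    def update_hk_sched_domains(self):
        # No-op when something else (for example a control file write)
        # has already brought the state up to date.
        if not self.dirty:
            return
        self.dirty = False
        self.updates += 1


state = IsolationState()
wq = ToyWorkqueue()
hk_sd_work = WorkItem(state.update_hk_sched_domains)

# Two hotplug events in a row: the second queue attempt is coalesced.
state.dirty = True
first = wq.queue_work(hk_sd_work)
second = wq.queue_work(hk_sd_work)

# A concurrent control file write performs the update first.
state.update_hk_sched_domains()

# The worker finally runs and finds nothing left to do.
wq.run_one()
```

In this model the update happens exactly once even though the work was requested twice, which is the property the comment in cpuset_handle_hotplug() relies on.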
Hi Waiman,
On 21/02/2026 18:54, Waiman Long wrote:
> The cpuset_handle_hotplug() may need to invoke housekeeping_update(),
> for instance, when an isolated partition is invalidated because its
> last active CPU has been put offline.
>
> As we are going to enable dynamic update to the nohz_full housekeeping
> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
> allowing the CPU hotplug path to call into housekeeping_update() directly
> from update_isolation_cpumasks() will likely cause deadlock. So we
> have to defer any call to housekeeping_update() after the CPU hotplug
> operation has finished. This is now done via the workqueue where
> the update_hk_sched_domains() function will be invoked via the
> hk_sd_workfn().
>
> A concurrent cpuset control file write may have executed the required
> update_hk_sched_domains() function before the work function is called. So
> the work function call may become a no-op when it is invoked.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset.c | 31 ++++++++++++++++---
> .../selftests/cgroup/test_cpuset_prs.sh | 11 ++++++-
> 2 files changed, 36 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 3d0d18bf182f..2c80bfc30bbc 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1323,6 +1323,16 @@ static void update_hk_sched_domains(void)
> rebuild_sched_domains_locked();
> }
>
> +/*
> + * Work function to invoke update_hk_sched_domains()
> + */
> +static void hk_sd_workfn(struct work_struct *work)
> +{
> + cpuset_full_lock();
> + update_hk_sched_domains();
> + cpuset_full_unlock();
> +}
> +
> /**
> * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
> * @parent: Parent cpuset containing all siblings
> @@ -3795,6 +3805,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
> */
> static void cpuset_handle_hotplug(void)
> {
> + static DECLARE_WORK(hk_sd_work, hk_sd_workfn);
> static cpumask_t new_cpus;
> static nodemask_t new_mems;
> bool cpus_updated, mems_updated;
> @@ -3877,11 +3888,21 @@ static void cpuset_handle_hotplug(void)
> }
>
>
> - if (update_housekeeping || force_sd_rebuild) {
> - mutex_lock(&cpuset_mutex);
> - update_hk_sched_domains();
> - mutex_unlock(&cpuset_mutex);
> - }
> + /*
> + * Queue a work to call housekeeping_update() & rebuild_sched_domains()
> + * There will be a slight delay before the HK_TYPE_DOMAIN housekeeping
> + * cpumask can correctly reflect what is in isolated_cpus.
> + *
> + * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work item that
> + * is still pending. Before the pending bit is cleared, the work data
> + * is copied out and work item dequeued. So it is possible to queue
> + * the work again before the hk_sd_workfn() is invoked to process the
> + * previously queued work. Since hk_sd_workfn() doesn't use the work
> + * item at all, this is not a problem.
> + */
> + if (update_housekeeping || force_sd_rebuild)
> + queue_work(system_unbound_wq, &hk_sd_work);
> +
> free_tmpmasks(ptmp);
> }
>
> diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> index 0c5db118f2d1..dc2dff361ec6 100755
> --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> @@ -246,6 +246,9 @@ TEST_MATRIX=(
> " C2-3:P1 C3:P1 . . O3=0 . . . 0 A1:2|A2: A1:P1|A2:P1"
> " C2-3:P1 C3:P1 . . T:O2=0 . . . 0 A1:3|A2:3 A1:P1|A2:P-1"
> " C2-3:P1 C3:P1 . . . T:O3=0 . . 0 A1:2|A2:2 A1:P1|A2:P-1"
> + " C2-3:P1 C3:P2 . . T:O2=0 . . . 0 A1:3|A2:3 A1:P1|A2:P-2"
> + " C1-3:P1 C3:P2 . . . T:O3=0 . . 0 A1:1-2|A2:1-2 A1:P1|A2:P-2 3|"
> + " C1-3:P1 C3:P2 . . . T:O3=0 O3=1 . 0 A1:1-2|A2:3 A1:P1|A2:P2 3"
> "$SETUP_A123_PARTITIONS . O1=0 . . . 0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
> "$SETUP_A123_PARTITIONS . O2=0 . . . 0 A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
> "$SETUP_A123_PARTITIONS . O3=0 . . . 0 A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
> @@ -762,7 +765,7 @@ check_cgroup_states()
> # only CPUs in isolated partitions as well as those that are isolated at
> # boot time.
> #
> -# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
> +# $1 - expected isolated cpu list(s) <isolcpus1>{|<isolcpus2>}
> # <isolcpus1> - expected sched/domains value
> # <isolcpus2> - cpuset.cpus.isolated value = <isolcpus1> if not defined
> #
> @@ -771,6 +774,7 @@ check_isolcpus()
> EXPECTED_ISOLCPUS=$1
> ISCPUS=${CGROUP2}/cpuset.cpus.isolated
> ISOLCPUS=$(cat $ISCPUS)
> + HKICPUS=$(cat /sys/devices/system/cpu/isolated)
> LASTISOLCPU=
> SCHED_DOMAINS=/sys/kernel/debug/sched/domains
> if [[ $EXPECTED_ISOLCPUS = . ]]
> @@ -808,6 +812,11 @@ check_isolcpus()
> ISOLCPUS=
> EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
>
> + #
> + # The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS
> + #
> + [[ "$ISOLCPUS" != "$HKICPUS" ]] && return 1
> +
> #
> # Use the sched domain in debugfs to check isolated CPUs, if available
> #
We have a CPU hotplug test that cycles through all CPUs off-lining them
and on-lining them in different combinations. Since this change was
added to -next, this test is failing on our Tegra210 boards. Bisecting
the issue, it pointed to this commit and reverting this on top of -next
fixes the issue.
The test is quite simple and part of Thierry's tegra-test suite [0].
$ ./tegra-tests/tests/cpu.py --verbose hotplug
cpu: hotplug: CPU#0: mask: 1
cpu: hotplug: CPU#1: mask: 2
cpu: hotplug: CPU#2: mask: 4
cpu: hotplug: CPU#3: mask: 8
cpu: hotplug: applying mask 0xf
cpu: hotplug: applying mask 0xe
cpu: hotplug: applying mask 0xd
cpu: hotplug: applying mask 0xc
cpu: hotplug: applying mask 0xb
cpu: hotplug: applying mask 0xa
...
cpu: hotplug: applying mask 0x1
Traceback (most recent call last):
File "./tegra-tests/tests/cpu.py", line 159, in <module>
runner.standalone(module)
File "./tegra-tests/tests/runner.py", line 147, in standalone
log.test(log = log, args = args)
File "./tegra-tests/tests/cpu.py", line 29, in __call__
cpus.apply_mask(mask)
File "./tegra-tests/linux/system.py", line 149, in apply_mask
cpu.set_online(False)
File "./tegra-tests/linux/system.py", line 45, in set_online
self.online = online
OSError: [Errno 16] Device or resource busy
From looking at different runs it appears to fail at different places.
Let me know if you have any thoughts.
Thanks
Jon
[0] https://github.com/thierryreding/tegra-tests/blob/master/tests/cpu.py
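
For readers without the test suite at hand, the mask-cycling pattern visible in the log above can be sketched roughly as follows. This is not the tegra-tests implementation; the function names and the injectable `set_online` callback are assumptions made for illustration. Only the sysfs path `/sys/devices/system/cpu/cpuN/online` is the standard hotplug control file, and writing to it needs root.

```python
import errno


def cpu_online_path(cpu):
    # Standard sysfs control file for CPU hotplug.
    return f"/sys/devices/system/cpu/cpu{cpu}/online"


def write_sysfs_online(cpu, online):
    # Raises OSError (for example EBUSY) if the kernel refuses the request.
    with open(cpu_online_path(cpu), "w") as f:
        f.write("1" if online else "0")


def apply_mask(mask, ncpus, set_online=write_sysfs_online):
    """Bring CPUs whose bit is set in mask online and the rest offline."""
    for cpu in range(ncpus):
        set_online(cpu, bool(mask & (1 << cpu)))


def cycle_masks(ncpus, set_online=write_sysfs_online):
    """Walk masks from all-online down to 0x1.

    Returns the mask being applied when EBUSY was hit, or None if every
    mask was applied successfully.
    """
    for mask in range((1 << ncpus) - 1, 0, -1):
        try:
            apply_mask(mask, ncpus, set_online)
        except OSError as e:
            if e.errno == errno.EBUSY:
                return mask
            raise
    return None
```

Passing a fake `set_online` callback lets the loop be exercised without touching real CPUs; on a test board, `cycle_masks(4)` would drive the four CPUs listed in the log.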
--
nvpublic
On 3/3/26 10:18 AM, Jon Hunter wrote:
> Hi Waiman,
>
> On 21/02/2026 18:54, Waiman Long wrote:
>> The cpuset_handle_hotplug() may need to invoke housekeeping_update(),
>> for instance, when an isolated partition is invalidated because its
>> last active CPU has been put offline.
>>
>> As we are going to enable dynamic update to the nohz_full housekeeping
>> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
>> allowing the CPU hotplug path to call into housekeeping_update()
>> directly
>> from update_isolation_cpumasks() will likely cause deadlock. So we
>> have to defer any call to housekeeping_update() after the CPU hotplug
>> operation has finished. This is now done via the workqueue where
>> the update_hk_sched_domains() function will be invoked via the
>> hk_sd_workfn().
>>
>> A concurrent cpuset control file write may have executed the required
>> update_hk_sched_domains() function before the work function is
>> called. So
>> the work function call may become a no-op when it is invoked.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>> kernel/cgroup/cpuset.c | 31 ++++++++++++++++---
>> .../selftests/cgroup/test_cpuset_prs.sh | 11 ++++++-
>> 2 files changed, 36 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 3d0d18bf182f..2c80bfc30bbc 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -1323,6 +1323,16 @@ static void update_hk_sched_domains(void)
>> rebuild_sched_domains_locked();
>> }
>> +/*
>> + * Work function to invoke update_hk_sched_domains()
>> + */
>> +static void hk_sd_workfn(struct work_struct *work)
>> +{
>> + cpuset_full_lock();
>> + update_hk_sched_domains();
>> + cpuset_full_unlock();
>> +}
>> +
>> /**
>> * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by
>> sibling cpusets
>> * @parent: Parent cpuset containing all siblings
>> @@ -3795,6 +3805,7 @@ static void cpuset_hotplug_update_tasks(struct
>> cpuset *cs, struct tmpmasks *tmp)
>> */
>> static void cpuset_handle_hotplug(void)
>> {
>> + static DECLARE_WORK(hk_sd_work, hk_sd_workfn);
>> static cpumask_t new_cpus;
>> static nodemask_t new_mems;
>> bool cpus_updated, mems_updated;
>> @@ -3877,11 +3888,21 @@ static void cpuset_handle_hotplug(void)
>> }
>> - if (update_housekeeping || force_sd_rebuild) {
>> - mutex_lock(&cpuset_mutex);
>> - update_hk_sched_domains();
>> - mutex_unlock(&cpuset_mutex);
>> - }
>> + /*
>> + * Queue a work to call housekeeping_update() &
>> rebuild_sched_domains()
>> + * There will be a slight delay before the HK_TYPE_DOMAIN
>> housekeeping
>> + * cpumask can correctly reflect what is in isolated_cpus.
>> + *
>> + * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work item
>> that
>> + * is still pending. Before the pending bit is cleared, the work
>> data
>> + * is copied out and work item dequeued. So it is possible to queue
>> + * the work again before the hk_sd_workfn() is invoked to
>> process the
>> + * previously queued work. Since hk_sd_workfn() doesn't use the
>> work
>> + * item at all, this is not a problem.
>> + */
>> + if (update_housekeeping || force_sd_rebuild)
>> + queue_work(system_unbound_wq, &hk_sd_work);
>> +
>> free_tmpmasks(ptmp);
>> }
>> diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
>> b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
>> index 0c5db118f2d1..dc2dff361ec6 100755
>> --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
>> +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
>> @@ -246,6 +246,9 @@ TEST_MATRIX=(
>> " C2-3:P1 C3:P1 . . O3=0 . . . 0
>> A1:2|A2: A1:P1|A2:P1"
>> " C2-3:P1 C3:P1 . . T:O2=0 . . . 0
>> A1:3|A2:3 A1:P1|A2:P-1"
>> " C2-3:P1 C3:P1 . . . T:O3=0 . . 0
>> A1:2|A2:2 A1:P1|A2:P-1"
>> + " C2-3:P1 C3:P2 . . T:O2=0 . . . 0
>> A1:3|A2:3 A1:P1|A2:P-2"
>> + " C1-3:P1 C3:P2 . . . T:O3=0 . . 0
>> A1:1-2|A2:1-2 A1:P1|A2:P-2 3|"
>> + " C1-3:P1 C3:P2 . . . T:O3=0 O3=1 . 0
>> A1:1-2|A2:3 A1:P1|A2:P2 3"
>> "$SETUP_A123_PARTITIONS . O1=0 . . . 0
>> A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
>> "$SETUP_A123_PARTITIONS . O2=0 . . . 0
>> A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
>> "$SETUP_A123_PARTITIONS . O3=0 . . . 0
>> A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
>> @@ -762,7 +765,7 @@ check_cgroup_states()
>> # only CPUs in isolated partitions as well as those that are
>> isolated at
>> # boot time.
>> #
>> -# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
>> +# $1 - expected isolated cpu list(s) <isolcpus1>{|<isolcpus2>}
>> # <isolcpus1> - expected sched/domains value
>> # <isolcpus2> - cpuset.cpus.isolated value = <isolcpus1> if not
>> defined
>> #
>> @@ -771,6 +774,7 @@ check_isolcpus()
>> EXPECTED_ISOLCPUS=$1
>> ISCPUS=${CGROUP2}/cpuset.cpus.isolated
>> ISOLCPUS=$(cat $ISCPUS)
>> + HKICPUS=$(cat /sys/devices/system/cpu/isolated)
>> LASTISOLCPU=
>> SCHED_DOMAINS=/sys/kernel/debug/sched/domains
>> if [[ $EXPECTED_ISOLCPUS = . ]]
>> @@ -808,6 +812,11 @@ check_isolcpus()
>> ISOLCPUS=
>> EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
>> + #
>> + # The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match
>> $ISOLCPUS
>> + #
>> + [[ "$ISOLCPUS" != "$HKICPUS" ]] && return 1
>> +
>> #
>> # Use the sched domain in debugfs to check isolated CPUs, if
>> available
>> #
>
> We have a CPU hotplug test that cycles through all CPUs off-lining
> them and on-lining them in different combinations. Since this change
> was added to -next, this test is failing on our Tegra210 boards.
> Bisecting the issue, it pointed to this commit and reverting this on
> top of -next fixes the issue.
>
> The test is quite simple and part of Thierry's tegra-test suite [0].
>
> $ ./tegra-tests/tests/cpu.py --verbose hotplug
> cpu: hotplug: CPU#0: mask: 1
> cpu: hotplug: CPU#1: mask: 2
> cpu: hotplug: CPU#2: mask: 4
> cpu: hotplug: CPU#3: mask: 8
> cpu: hotplug: applying mask 0xf
> cpu: hotplug: applying mask 0xe
> cpu: hotplug: applying mask 0xd
> cpu: hotplug: applying mask 0xc
> cpu: hotplug: applying mask 0xb
> cpu: hotplug: applying mask 0xa
> ...
> cpu: hotplug: applying mask 0x1
> Traceback (most recent call last):
> File "./tegra-tests/tests/cpu.py", line 159, in <module>
> runner.standalone(module)
> File "./tegra-tests/tests/runner.py", line 147, in standalone
> log.test(log = log, args = args)
> File "./tegra-tests/tests/cpu.py", line 29, in __call__
> cpus.apply_mask(mask)
> File "./tegra-tests/linux/system.py", line 149, in apply_mask
> cpu.set_online(False)
> File "./tegra-tests/linux/system.py", line 45, in set_online
> self.online = online
> OSError: [Errno 16] Device or resource busy
>
> From looking at different runs it appears to fail at different places.
>
> Let me know if you have any thoughts.
>
> Thanks
> Jon
>
> [0] https://github.com/thierryreding/tegra-tests/blob/master/tests/cpu.py
>
It looks like -EBUSY was returned when the script tried to
online/offline a CPU. I ran a simple script that repeatedly does
offline/online operations and couldn't reproduce the problem. I don't
have access to the Tegra board that you use for testing. Would you mind
trying out the following patch to see if it gets rid of the problem?
Thanks,
Longman
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e200de7c60b6..5a5953fb391c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3936,8 +3936,10 @@ static void cpuset_handle_hotplug(void)
 	 * previously queued work. Since hk_sd_workfn() doesn't use the work
 	 * item at all, this is not a problem.
 	 */
-	if (update_housekeeping || force_sd_rebuild)
-		queue_work(system_unbound_wq, &hk_sd_work);
+	if (force_sd_rebuild)
+		rebuild_sched_domains_cpuslocked();
+	if (update_housekeeping)
+		queue_work(system_dfl_wq, &hk_sd_work);

 	free_tmpmasks(ptmp);
 }
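
The effect of splitting the two conditions in the patch above can be shown with a toy model. This is a hedged Python sketch, not the kernel logic: `ToyHotplugState` and its attributes are hypothetical, and the model makes no claim about why -EBUSY was seen. It only shows that the sched domain rebuild now completes inside the hotplug path while the housekeeping update stays deferred.

```python
class ToyHotplugState:
    """Hypothetical model; attribute names are illustrative only."""

    def __init__(self):
        self.sched_domains_stale = False
        self.housekeeping_stale = False
        self.deferred = []  # callbacks a worker would run later

    def rebuild_sched_domains(self):
        self.sched_domains_stale = False

    def update_housekeeping(self):
        self.housekeeping_stale = False

    def handle_hotplug(self, force_sd_rebuild, update_housekeeping):
        self.sched_domains_stale |= force_sd_rebuild
        self.housekeeping_stale |= update_housekeeping
        if force_sd_rebuild:
            # Done inline, before the hotplug operation returns.
            self.rebuild_sched_domains()
        if update_housekeeping:
            # Still deferred until after the hotplug operation.
            self.deferred.append(self.update_housekeeping)

    def run_deferred(self):
        while self.deferred:
            self.deferred.pop(0)()


state = ToyHotplugState()
state.handle_hotplug(force_sd_rebuild=True, update_housekeeping=True)
after_hotplug = (state.sched_domains_stale, state.housekeeping_stale)

state.run_deferred()
after_worker = (state.sched_domains_stale, state.housekeeping_stale)
```

Right after the hotplug handler returns, only the housekeeping state is still stale in this model; it catches up once the deferred work has run.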
On 04/03/2026 03:58, Waiman Long wrote:

...

> It looks like -EBUSY was returned when the script tried to
> online/offline a CPU. I ran a simple script that repeatedly does
> offline/online operations and couldn't reproduce the problem. I don't
> have access to the Tegra board that you use for testing. Would you mind
> trying out the following patch to see if it gets rid of the problem?
>
> Thanks,
> Longman
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index e200de7c60b6..5a5953fb391c 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3936,8 +3936,10 @@ static void cpuset_handle_hotplug(void)
>  	 * previously queued work. Since hk_sd_workfn() doesn't use the work
>  	 * item at all, this is not a problem.
>  	 */
> -	if (update_housekeeping || force_sd_rebuild)
> -		queue_work(system_unbound_wq, &hk_sd_work);
> +	if (force_sd_rebuild)
> +		rebuild_sched_domains_cpuslocked();
> +	if (update_housekeeping)
> +		queue_work(system_dfl_wq, &hk_sd_work);
>
>  	free_tmpmasks(ptmp);
>  }
>

Yes that did the trick. Works for me. Feel free to add my ...

Tested-by: Jon Hunter <jonathanh@nvidia.com>

Thanks
Jon

--
nvpublic
On 3/4/26 6:07 AM, Jon Hunter wrote:
>
> On 04/03/2026 03:58, Waiman Long wrote:
>
> ...
>
>> It looks like -EBUSY was returned when the script tried to
>> online/offline a CPU. I ran a simple script that repeatedly does
>> offline/online operations and couldn't reproduce the problem. I don't
>> have access to the Tegra board that you use for testing. Would you mind
>> trying out the following patch to see if it gets rid of the problem?
>>
>> Thanks,
>> Longman
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index e200de7c60b6..5a5953fb391c 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -3936,8 +3936,10 @@ static void cpuset_handle_hotplug(void)
>>  	 * previously queued work. Since hk_sd_workfn() doesn't use the work
>>  	 * item at all, this is not a problem.
>>  	 */
>> -	if (update_housekeeping || force_sd_rebuild)
>> -		queue_work(system_unbound_wq, &hk_sd_work);
>> +	if (force_sd_rebuild)
>> +		rebuild_sched_domains_cpuslocked();
>> +	if (update_housekeeping)
>> +		queue_work(system_dfl_wq, &hk_sd_work);
>>
>>  	free_tmpmasks(ptmp);
>>  }
>>
>
> Yes that did the trick. Works for me. Feel free to add my ...
>
> Tested-by: Jon Hunter <jonathanh@nvidia.com>

Thanks for the confirmation.

Cheers,
Longman
On 3/3/26 10:18 AM, Jon Hunter wrote:
Thanks for the report. Will take a further look into this problem and
report back what I find.
Cheers,
Longman
On Sat, Feb 21, 2026 at 01:54:17PM -0500, Waiman Long wrote:
> The cpuset_handle_hotplug() may need to invoke housekeeping_update(),
> for instance, when an isolated partition is invalidated because its
> last active CPU has been put offline.
>
> As we are going to enable dynamic update to the nohz_full housekeeping
> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
> allowing the CPU hotplug path to call into housekeeping_update() directly
> from update_isolation_cpumasks() will likely cause deadlock.

It would be nice to describe the deadlock scenario here.

Thanks.
On Sat, Feb 21, 2026 at 01:54:17PM -0500, Waiman Long wrote:
> The cpuset_handle_hotplug() may need to invoke housekeeping_update(),
> for instance, when an isolated partition is invalidated because its
> last active CPU has been put offline.
>
> As we are going to enable dynamic update to the nohz_full housekeeping
> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
> allowing the CPU hotplug path to call into housekeeping_update() directly
> from update_isolation_cpumasks() will likely cause deadlock.
I am a bit confused here. Why would CPU hotplug path need to call
update_isolation_cpumasks() -> housekeeping_update() for
HK_TYPE_KERNEL_NOISE?
> So we
> have to defer any call to housekeeping_update() after the CPU hotplug
> operation has finished. This is now done via the workqueue where
> the update_hk_sched_domains() function will be invoked via the
> hk_sd_workfn().
>
> A concurrent cpuset control file write may have executed the required
> update_hk_sched_domains() function before the work function is called. So
> the work function call may become a no-op when it is invoked.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset.c | 31 ++++++++++++++++---
> .../selftests/cgroup/test_cpuset_prs.sh | 11 ++++++-
> 2 files changed, 36 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 3d0d18bf182f..2c80bfc30bbc 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1323,6 +1323,16 @@ static void update_hk_sched_domains(void)
> rebuild_sched_domains_locked();
> }
>
> +/*
> + * Work function to invoke update_hk_sched_domains()
> + */
> +static void hk_sd_workfn(struct work_struct *work)
> +{
> + cpuset_full_lock();
> + update_hk_sched_domains();
> + cpuset_full_unlock();
> +}
> +
> /**
> * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
> * @parent: Parent cpuset containing all siblings
> @@ -3795,6 +3805,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
> */
> static void cpuset_handle_hotplug(void)
> {
> + static DECLARE_WORK(hk_sd_work, hk_sd_workfn);
> static cpumask_t new_cpus;
> static nodemask_t new_mems;
> bool cpus_updated, mems_updated;
> @@ -3877,11 +3888,21 @@ static void cpuset_handle_hotplug(void)
> }
>
>
> - if (update_housekeeping || force_sd_rebuild) {
> - mutex_lock(&cpuset_mutex);
> - update_hk_sched_domains();
> - mutex_unlock(&cpuset_mutex);
> - }
> + /*
> + * Queue a work to call housekeeping_update() & rebuild_sched_domains()
> + * There will be a slight delay before the HK_TYPE_DOMAIN housekeeping
> + * cpumask can correctly reflect what is in isolated_cpus.
> + *
> + * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work item that
> + * is still pending. Before the pending bit is cleared, the work data
> + * is copied out and work item dequeued. So it is possible to queue
> + * the work again before the hk_sd_workfn() is invoked to process the
> + * previously queued work. Since hk_sd_workfn() doesn't use the work
> + * item at all, this is not a problem.
> + */
> + if (update_housekeeping || force_sd_rebuild)
> + queue_work(system_unbound_wq, &hk_sd_work);

Nit about recent wq renames:

s/system_unbound_wq/system_dfl_wq

But what makes sure this work is executed by the end of the hotplug operations?
Is there a risk for a stale hierarchy to be observed when it shouldn't? Or a
stale housekeeping cpumask?

Thanks.

--
Frederic Weisbecker
SUSE Labs

On 2/26/26 11:06 AM, Frederic Weisbecker wrote:
> On Sat, Feb 21, 2026 at 01:54:17PM -0500, Waiman Long wrote:
>> The cpuset_handle_hotplug() may need to invoke housekeeping_update(),
>> for instance, when an isolated partition is invalidated because its
>> last active CPU has been put offline.
>>
>> As we are going to enable dynamic update to the nohz_full housekeeping
>> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
>> allowing the CPU hotplug path to call into housekeeping_update() directly
>> from update_isolation_cpumasks() will likely cause deadlock.
> I am a bit confused here. Why would CPU hotplug path need to call
> update_isolation_cpumasks() -> housekeeping_update() for
> HK_TYPE_KERNEL_NOISE?

Oh, this is not the current behavior. However, to make nohz_full fully
dynamically changeable in the near future, we will have to do that
eventually.

Cheers,
Longman

>> So we
>> have to defer any call to housekeeping_update() after the CPU hotplug
>> operation has finished. This is now done via the workqueue where
>> the update_hk_sched_domains() function will be invoked via the
>> hk_sd_workfn().
>>
>> A concurrent cpuset control file write may have executed the required
>> update_hk_sched_domains() function before the work function is called. So
>> the work function call may become a no-op when it is invoked.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>> kernel/cgroup/cpuset.c | 31 ++++++++++++++++---
>> .../selftests/cgroup/test_cpuset_prs.sh | 11 ++++++-
>> 2 files changed, 36 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 3d0d18bf182f..2c80bfc30bbc 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -1323,6 +1323,16 @@ static void update_hk_sched_domains(void)
>> rebuild_sched_domains_locked();
>> }
>>
>> +/*
>> + * Work function to invoke update_hk_sched_domains()
>> + */
>> +static void hk_sd_workfn(struct work_struct *work)
>> +{
>> + cpuset_full_lock();
>> + update_hk_sched_domains();
>> + cpuset_full_unlock();
>> +}
>> +
>> /**
>> * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
>> * @parent: Parent cpuset containing all siblings
>> @@ -3795,6 +3805,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
>> */
>> static void cpuset_handle_hotplug(void)
>> {
>> + static DECLARE_WORK(hk_sd_work, hk_sd_workfn);
>> static cpumask_t new_cpus;
>> static nodemask_t new_mems;
>> bool cpus_updated, mems_updated;
>> @@ -3877,11 +3888,21 @@ static void cpuset_handle_hotplug(void)
>> }
>>
>>
>> - if (update_housekeeping || force_sd_rebuild) {
>> - mutex_lock(&cpuset_mutex);
>> - update_hk_sched_domains();
>> - mutex_unlock(&cpuset_mutex);
>> - }
>> + /*
>> + * Queue a work to call housekeeping_update() & rebuild_sched_domains()
>> + * There will be a slight delay before the HK_TYPE_DOMAIN housekeeping
>> + * cpumask can correctly reflect what is in isolated_cpus.
>> + *
>> + * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work item that
>> + * is still pending. Before the pending bit is cleared, the work data
>> + * is copied out and work item dequeued. So it is possible to queue
>> + * the work again before the hk_sd_workfn() is invoked to process the
>> + * previously queued work. Since hk_sd_workfn() doesn't use the work
>> + * item at all, this is not a problem.
>> + */
>> + if (update_housekeeping || force_sd_rebuild)
>> + queue_work(system_unbound_wq, &hk_sd_work);
> Nit about recent wq renames:
>
> s/system_unbound_wq/system_dfl_wq
Good point. I will send an additional patch to do the rename.
>
> But what makes sure this work is executed by the end of the hotplug operations?
> Is there a risk for a stale hierarchy to be observed when it shouldn't? Or a
> stale housekeeping cpumask?

If you look at the work function, it will make a copy of HK_TYPE_DOMAIN
cpumask while holding rcu_read_lock(). So the current hotplug operation
must have finished at that point. Of course, if there is another
hot-add/remove operation right after the rcu_read_lock is released, the
cpumask passed down to housekeeping_update() may not be the latest one.
In this case, another work will be scheduled to call
housekeeping_update() with the new cpumask again.

Cheers,
Longman

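
The requeue behavior described in the quoted WORK_STRUCT_PENDING_BIT comment can be modeled in plain userspace C. This is only a sketch under stated assumptions: `struct model_work`, `model_queue_work()`, `model_dequeue()` and `model_workfn()` are hypothetical stand-ins invented for illustration, not kernel APIs. The model assumes, as the comment says, that queueing a still-pending item is a no-op and that the pending bit is cleared when the item is dequeued, before the work function runs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a work item; NOT struct work_struct. */
struct model_work {
	bool pending;	/* plays the role of WORK_STRUCT_PENDING_BIT */
	int queued;	/* entries waiting on the model queue (0 or 1) */
};

static int model_runs;	/* number of times the work function executed */

/* Like hk_sd_workfn(), the function does not look at the work item. */
static void model_workfn(void)
{
	model_runs++;
}

/* Returns false when the item is already pending (nothing is queued). */
static bool model_queue_work(struct model_work *w)
{
	if (w->pending)
		return false;
	w->pending = true;
	w->queued++;
	return true;
}

/* Worker side: dequeue and clear pending BEFORE the function is called. */
static bool model_dequeue(struct model_work *w)
{
	if (w->queued == 0)
		return false;
	w->queued--;
	w->pending = false;
	return true;
}

/* Replays three hotplug events and returns how often the function ran. */
static int model_demo(void)
{
	struct model_work w = { false, 0 };
	bool ok;

	model_runs = 0;
	ok = model_queue_work(&w);	/* event 1: queued */
	assert(ok);
	ok = model_queue_work(&w);	/* event 2: still pending, no-op */
	assert(!ok);
	ok = model_dequeue(&w);		/* worker dequeues, pending cleared */
	assert(ok);
	ok = model_queue_work(&w);	/* event 3: queued again before fn runs */
	assert(ok);
	model_workfn();			/* run for the first dequeue */
	if (model_dequeue(&w))
		model_workfn();		/* run for the requeued item */
	return model_runs;
}
```

Calling `model_demo()` from a `main()` shows three events collapsing into two executions. The second run is harmless in this model for the same reason given in the comment: the function keeps no per-item state.
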
On Tue, Mar 03, 2026 at 11:00:54AM -0500, Waiman Long wrote:
> On 2/26/26 11:06 AM, Frederic Weisbecker wrote:
> If you look at the work function, it will make a copy of HK_TYPE_DOMAIN
> cpumask while holding rcu_read_lock().

Where?

> So the current hotplug operation must
> have finished at that point.

I'm confused. This is called from sched_cpu_deactivate(), right?
So the work is scheduled at that point. But the work does cpuset_full_lock()
which includes cpu hotplug read lock, so the sched domain rebuild can only
happen at the end of cpu_down().

This means that between CPUHP_TEARDOWN_CPU and CPUHP_OFFLINE, the offline
CPU still appears in the scheduler topology because the scheduler domains
haven't been rebuilt.

And even if the work wouldn't take the cpu hotplug read lock, what guarantees
that it executes before reaching CPUHP_TEARDOWN_CPU?

> Of course, if there is another hot-add/remove
> operation right after the rcu_read_lock is released, the cpumask passed down
> to housekeeping_update() may not be the latest one. In this case, another
> work will be scheduled to call housekeeping_update() with the new cpumask
> again.

I'm not so much worried about housekeeping_update() (yet). I'm worried about
topology rebuild to happen before CPUHP_TEARDOWN_CPU. Offline CPUs shouldn't
exist in the topology.

Thanks.

--
Frederic Weisbecker
SUSE Labs

On 3/3/26 5:48 PM, Frederic Weisbecker wrote:
> On Tue, Mar 03, 2026 at 11:00:54AM -0500, Waiman Long wrote:
>> On 2/26/26 11:06 AM, Frederic Weisbecker wrote:
>> If you look at the work function, it will make a copy of HK_TYPE_DOMAIN
>> cpumask while holding rcu_read_lock().
> Where?
>
>> So the current hotplug operation must
>> have finished at that point.
> I'm confused. This is called from sched_cpu_deactivate(), right?
> So the work is scheduled at that point. But the work does cpuset_full_lock()
> which includes cpu hotplug read lock, so the sched domain rebuild can only
> happen at the end of cpu_down().
>
> This means that between CPUHP_TEARDOWN_CPU and CPUHP_OFFLINE, the offline
> CPU still appears in the scheduler topology because the scheduler domains
> haven't been rebuilt.
>
> And even if the work wouldn't take the cpu hotplug read lock, what guarantees
> that it executes before reaching CPUHP_TEARDOWN_CPU?
>
>> Of course, if there is another hot-add/remove
>> operation right after the rcu_read_lock is released, the cpumask passed down
>> to housekeeping_update() may not be the latest one. In this case, another
>> work will be scheduled to call housekeeping_update() with the new cpumask
>> again.
> I'm not so much worried about housekeeping_update() (yet). I'm worried about
> topology rebuild to happen before CPUHP_TEARDOWN_CPU. Offline CPUs shouldn't
> exist in the topology.

Yes, I am aware that this could be a problem. I am working on a fix patch
that will always do a rebuild_sched_domains_cpuslocked() call directly in
the hotplug path if needed as shown in the patch that I sent to Jon. I want
to get a confirmation first before I send it out. There will be other minor
code/comment adjustments as well.

Cheers,
Longman
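
The ordering concern raised in this exchange can be illustrated with a small userspace model. It is a sketch only, built on the assumptions stated in the thread: the work is queued from sched_cpu_deactivate(), and the work function needs the CPU hotplug read lock, which is unavailable until cpu_down() finishes. All names here (`model_step`, `model_state`, `model_try_run_work()`, `model_cpu_down()`) are hypothetical and exist only for illustration; none of this is kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical step names, loosely following the states named in the thread. */
enum model_step {
	MODEL_DEACTIVATE,	/* sched_cpu_deactivate(): work is queued here */
	MODEL_TEARDOWN,		/* stands in for CPUHP_TEARDOWN_CPU */
	MODEL_OFFLINE,		/* stands in for CPUHP_OFFLINE */
	MODEL_NR_STEPS
};

struct model_state {
	bool hotplug_lock_held;	/* assumed held for the whole cpu_down() */
	bool work_queued;	/* deferred sched domain rebuild is waiting */
	bool cpu_in_topology;	/* outgoing CPU still in the sched domains */
	bool stale_seen[MODEL_NR_STEPS];	/* CPU in topology at each step */
};

/* The deferred work needs the hotplug lock, so it waits while it is held. */
static void model_try_run_work(struct model_state *s)
{
	if (!s->work_queued || s->hotplug_lock_held)
		return;
	s->cpu_in_topology = false;	/* the rebuild drops the offline CPU */
	s->work_queued = false;
}

/* Walks one modeled cpu_down() and records what each step observes. */
static struct model_state model_cpu_down(void)
{
	struct model_state s = { 0 };
	int step;

	s.cpu_in_topology = true;
	s.hotplug_lock_held = true;
	for (step = MODEL_DEACTIVATE; step < MODEL_NR_STEPS; step++) {
		if (step == MODEL_DEACTIVATE)
			s.work_queued = true;
		model_try_run_work(&s);	/* worker gets a turn but is blocked */
		s.stale_seen[step] = s.cpu_in_topology;
	}
	s.hotplug_lock_held = false;	/* end of the modeled cpu_down() */
	model_try_run_work(&s);		/* only now can the rebuild happen */
	return s;
}
```

In this model the outgoing CPU is still part of the topology at both the teardown and offline steps, and it is removed only after the lock is dropped. A rebuild done directly in the hotplug path, as proposed in the reply above, would correspond to clearing `cpu_in_topology` inside the loop instead of deferring it.
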