When testing kexec-reboot on a 144-CPU machine with
isolcpus=managed_irq,domain,1-71,73-143 on the kernel command line, I
encountered the following bug:
[ 97.114759] psci: CPU142 killed (polled 0 ms)
[ 97.333236] Failed to offline CPU143 - error=-16
[ 97.333246] ------------[ cut here ]------------
[ 97.342682] kernel BUG at kernel/cpu.c:1569!
[ 97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[ 97.353281] Modules linked in: rfkill sunrpc dax_hmem cxl_acpi cxl_port cxl_core einj vfat fat arm_smmuv3_pmu nvidia_cspmu arm_spe_pmu coresight_trbe arm_cspmu_module rndis_host ipmi_ssif cdc_ether i2c_smbus spi_nor usbnet ast coresight_tmc mii ixgbe i2c_algo_bit mdio mtd coresight_funnel coresight_stm stm_core coresight_etm4x coresight cppc_cpufreq loop fuse nfnetlink xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt nvme nvme_core nvme_auth i2c_tegra acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler dm_mirror dm_region_hash dm_log dm_mod
[ 97.404119] CPU: 0 UID: 0 PID: 2583 Comm: kexec Kdump: loaded Not tainted 6.12.0-41.el10.aarch64 #1
[ 97.413371] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 2.0 07/12/2024
[ 97.420400] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 97.427518] pc : smp_shutdown_nonboot_cpus+0x104/0x128
[ 97.432778] lr : smp_shutdown_nonboot_cpus+0x11c/0x128
[ 97.438028] sp : ffff800097c6b9a0
[ 97.441411] x29: ffff800097c6b9a0 x28: ffff0000a099d800 x27: 0000000000000000
[ 97.448708] x26: 0000000000000000 x25: 0000000000000000 x24: ffffb94aaaa8f218
[ 97.456004] x23: ffffb94aaaabaae0 x22: ffffb94aaaa8f018 x21: 0000000000000000
[ 97.463301] x20: ffffb94aaaa8fc10 x19: 000000000000008f x18: 00000000fffffffe
[ 97.470598] x17: 0000000000000000 x16: ffffb94aa958fcd0 x15: ffff103acfca0b64
[ 97.477894] x14: ffff800097c6b520 x13: 36312d3d726f7272 x12: ffff103acfc6ffa8
[ 97.485191] x11: ffff103acf6f0000 x10: ffff103bc085c400 x9 : ffffb94aa88a0eb0
[ 97.492488] x8 : 0000000000000001 x7 : 000000000017ffe8 x6 : c0000000fffeffff
[ 97.499784] x5 : ffff003bdf62b408 x4 : 0000000000000000 x3 : 0000000000000000
[ 97.507081] x2 : 0000000000000000 x1 : ffff0000a099d800 x0 : 0000000000000002
[ 97.514379] Call trace:
[ 97.516874] smp_shutdown_nonboot_cpus+0x104/0x128
[ 97.521769] machine_shutdown+0x20/0x38
[ 97.525693] kernel_kexec+0xc4/0xf0
[ 97.529260] __do_sys_reboot+0x24c/0x278
[ 97.533272] __arm64_sys_reboot+0x2c/0x40
[ 97.537370] invoke_syscall.constprop.0+0x74/0xd0
[ 97.542179] do_el0_svc+0xb0/0xe8
[ 97.545562] el0_svc+0x44/0x1d0
[ 97.548772] el0t_64_sync_handler+0x120/0x130
[ 97.553222] el0t_64_sync+0x1a4/0x1a8
[ 97.556963] Code: a94363f7 a8c47bfd d50323bf d65f03c0 (d4210000)
[ 97.563191] ---[ end trace 0000000000000000 ]---
[ 97.595854] Kernel panic - not syncing: Oops - BUG: Fatal exception
[ 97.602275] Kernel Offset: 0x394a28600000 from 0xffff800080000000
[ 97.608502] PHYS_OFFSET: 0x80000000
[ 97.612062] CPU features: 0x10,0000000d,002a6928,5667fea7
[ 97.617580] Memory Limit: none
[ 97.648626] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]
Tracking down this issue, I found that dl_bw_deactivate() returned
-EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
When a CPU is inactive, its rd is set to def_root_domain. For a
blocked-state deadline task (in this case, "cppc_fie"), the task was not
migrated to CPU0, so its task_rq() information is stale. As a result,
its bandwidth is wrongly accounted to def_root_domain during the domain
rebuild.
The key point is that root_domain is only tracked through active rq->rd.
To avoid using a global data structure to track all root_domains in the
system, we need a way to locate an active CPU within the corresponding
root_domain.
The following rules hold for the deadline sub-system and help locate the
active CPU:
-1. Any CPU belongs to a unique root domain at a given time.
-2. The DL bandwidth checker ensures that the root domain has active CPUs.
Now, let's examine the blocked-state task P.
If P is attached to a cpuset that is a partition root, it is
straightforward to find an active CPU.
If P is attached to a cpuset that has changed from 'root' to 'member',
the active CPUs are grouped into the parent root domain. Naturally, the
CPUs' capacity and reserved DL bandwidth are taken into account in the
ancestor root domain. (In practice, it may be unsafe to attach P to an
arbitrary root domain, since that domain may lack sufficient DL
bandwidth for P.) Again, it is straightforward to find an active CPU in
the ancestor root domain.
This patch groups CPUs into isolated and housekeeping sets. For the
housekeeping group, it walks up the cpuset hierarchy to find active CPUs
in P's root domain and retrieves the valid rd from cpu_rq(cpu)->rd.
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Pierre Gondois <pierre.gondois@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
To: cgroups@vger.kernel.org
To: linux-kernel@vger.kernel.org
---
include/linux/cpuset.h | 18 ++++++++++++++++++
kernel/cgroup/cpuset.c | 27 +++++++++++++++++++++++++++
kernel/sched/deadline.c | 30 ++++++++++++++++++++++++------
3 files changed, 69 insertions(+), 6 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 2ddb256187b51..7c00ebcdf85d9 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -130,6 +130,7 @@ extern void rebuild_sched_domains(void);
extern void cpuset_print_current_mems_allowed(void);
extern void cpuset_reset_sched_domains(void);
+extern void task_get_rd_effective_cpus(struct task_struct *p, struct cpumask *cpus);
/*
* read_mems_allowed_begin is required when making decisions involving
@@ -276,6 +277,23 @@ static inline void cpuset_reset_sched_domains(void)
partition_sched_domains(1, NULL, NULL);
}
+static inline void task_get_rd_effective_cpus(struct task_struct *p,
+ struct cpumask *cpus)
+{
+ const struct cpumask *hk_msk;
+ struct cpumask msk;
+
+ hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
+ if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
+ if (!cpumask_and(&msk, p->cpus_ptr, hk_msk)) {
+ /* isolated cpus belong to a root domain */
+ cpumask_andnot(cpus, cpu_active_mask, hk_msk);
+ return;
+ }
+ }
+ cpumask_and(cpus, cpu_active_mask, hk_msk);
+}
+
static inline void cpuset_print_current_mems_allowed(void)
{
}
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 27adb04df675d..f7b18892ed093 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1102,6 +1102,33 @@ void cpuset_reset_sched_domains(void)
mutex_unlock(&cpuset_mutex);
}
+/* The caller must hold the RCU read lock */
+void task_get_rd_effective_cpus(struct task_struct *p, struct cpumask *cpus)
+{
+ const struct cpumask *hk_msk;
+ struct cpumask msk;
+ struct cpuset *cs;
+
+ hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
+ if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
+ if (!cpumask_and(&msk, p->cpus_ptr, hk_msk)) {
+ /* isolated cpus belong to a root domain */
+ cpumask_andnot(cpus, cpu_active_mask, hk_msk);
+ return;
+ }
+ }
+ /* In HK_TYPE_DOMAIN, cpuset can be applied */
+ cs = task_cs(p);
+ while (cs != &top_cpuset) {
+ if (is_sched_load_balance(cs))
+ break;
+ cs = parent_cs(cs);
+ }
+
+ /* For top_cpuset, its effective_cpus does not exclude isolated cpu */
+ cpumask_and(cpus, cs->effective_cpus, hk_msk);
+}
+
/**
* cpuset_update_tasks_cpumask - Update the cpumasks of tasks in the cpuset.
* @cs: the cpuset in which each task's cpus_allowed mask needs to be changed
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 72c1f72463c75..0a35b165d42a0 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2884,6 +2884,8 @@ void dl_add_task_root_domain(struct task_struct *p)
struct rq_flags rf;
struct rq *rq;
struct dl_bw *dl_b;
+ unsigned int cpu;
+ struct cpumask msk;
raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
@@ -2891,16 +2893,32 @@ void dl_add_task_root_domain(struct task_struct *p)
return;
}
- rq = __task_rq_lock(p, &rf);
-
+ /* Prevent races between cpu hotplug and partition_root_state changes */
+ lockdep_assert_cpus_held();
+ /*
+ * If @p is in a blocked state, task_cpu() may not be active. In that
+ * case, rq->rd does not point to a valid root_domain. On the other
+ * hand, @p must belong to a root_domain at any given time, and that
+ * root_domain must have an active rq, whose rq->rd is valid.
+ */
+ task_get_rd_effective_cpus(p, &msk);
+ cpu = cpumask_first_and(cpu_active_mask, &msk);
+ /*
+ * If a root domain reserves bandwidth for a DL task, the DL bandwidth
+ * check prevents CPU hot removal from deactivating all CPUs in that
+ * domain.
+ */
+ BUG_ON(cpu >= nr_cpu_ids);
+ rq = cpu_rq(cpu);
+ /*
+ * This point is under the protection of cpu_hotplug_lock. Hence
+ * rq->rd is stable.
+ */
dl_b = &rq->rd->dl_bw;
raw_spin_lock(&dl_b->lock);
-
__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
-
raw_spin_unlock(&dl_b->lock);
-
- task_rq_unlock(rq, p, &rf);
+ raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
}
void dl_clear_root_domain(struct root_domain *rd)
--
2.49.0
Hi Pingfan,
kernel test robot noticed the following build errors:
[auto build test ERROR on tj-cgroup/for-next]
[also build test ERROR on tip/sched/core tip/master linus/master v6.18-rc1 next-20251017]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Pingfan-Liu/sched-deadline-Walk-up-cpuset-hierarchy-to-decide-root-domain-when-hot-unplug/20251017-202902
base: https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link: https://lore.kernel.org/r/20251017122636.17671-1-piliu%40redhat.com
patch subject: [PATCHv3] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
config: i386-randconfig-141-20251018 (https://download.01.org/0day-ci/archive/20251018/202510181259.vccVb2DD-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251018/202510181259.vccVb2DD-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510181259.vccVb2DD-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from init/main.c:53:
include/linux/cpuset.h: In function 'task_get_rd_effective_cpus':
include/linux/cpuset.h:286:18: error: implicit declaration of function 'housekeeping_cpumask' [-Wimplicit-function-declaration]
286 | hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
| ^~~~~~~~~~~~~~~~~~~~
include/linux/cpuset.h:286:39: error: 'HK_TYPE_DOMAIN' undeclared (first use in this function)
286 | hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
| ^~~~~~~~~~~~~~
include/linux/cpuset.h:286:39: note: each undeclared identifier is reported only once for each function it appears in
include/linux/cpuset.h:287:13: error: implicit declaration of function 'housekeeping_enabled' [-Wimplicit-function-declaration]
287 | if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
| ^~~~~~~~~~~~~~~~~~~~
In file included from init/main.c:57:
include/linux/sched/isolation.h: At top level:
>> include/linux/sched/isolation.h:43:37: error: conflicting types for 'housekeeping_cpumask'; have 'const struct cpumask *(enum hk_type)'
43 | static inline const struct cpumask *housekeeping_cpumask(enum hk_type type)
| ^~~~~~~~~~~~~~~~~~~~
include/linux/cpuset.h:286:18: note: previous implicit declaration of 'housekeeping_cpumask' with type 'int()'
286 | hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
| ^~~~~~~~~~~~~~~~~~~~
>> include/linux/sched/isolation.h:48:20: error: conflicting types for 'housekeeping_enabled'; have 'bool(enum hk_type)' {aka '_Bool(enum hk_type)'}
48 | static inline bool housekeeping_enabled(enum hk_type type)
| ^~~~~~~~~~~~~~~~~~~~
include/linux/cpuset.h:287:13: note: previous implicit declaration of 'housekeeping_enabled' with type 'int()'
287 | if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
| ^~~~~~~~~~~~~~~~~~~~
--
In file included from include/linux/sched/isolation.h:5,
from kernel/cpu.c:13:
include/linux/cpuset.h: In function 'task_get_rd_effective_cpus':
include/linux/cpuset.h:286:18: error: implicit declaration of function 'housekeeping_cpumask' [-Wimplicit-function-declaration]
286 | hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
| ^~~~~~~~~~~~~~~~~~~~
include/linux/cpuset.h:286:39: error: 'HK_TYPE_DOMAIN' undeclared (first use in this function); did you mean 'TOPO_TILE_DOMAIN'?
286 | hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
| ^~~~~~~~~~~~~~
| TOPO_TILE_DOMAIN
include/linux/cpuset.h:286:39: note: each undeclared identifier is reported only once for each function it appears in
include/linux/cpuset.h:287:13: error: implicit declaration of function 'housekeeping_enabled' [-Wimplicit-function-declaration]
287 | if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
| ^~~~~~~~~~~~~~~~~~~~
include/linux/sched/isolation.h: At top level:
>> include/linux/sched/isolation.h:43:37: error: conflicting types for 'housekeeping_cpumask'; have 'const struct cpumask *(enum hk_type)'
43 | static inline const struct cpumask *housekeeping_cpumask(enum hk_type type)
| ^~~~~~~~~~~~~~~~~~~~
include/linux/cpuset.h:286:18: note: previous implicit declaration of 'housekeeping_cpumask' with type 'int()'
286 | hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
| ^~~~~~~~~~~~~~~~~~~~
>> include/linux/sched/isolation.h:48:20: error: conflicting types for 'housekeeping_enabled'; have 'bool(enum hk_type)' {aka '_Bool(enum hk_type)'}
48 | static inline bool housekeeping_enabled(enum hk_type type)
| ^~~~~~~~~~~~~~~~~~~~
include/linux/cpuset.h:287:13: note: previous implicit declaration of 'housekeeping_enabled' with type 'int()'
287 | if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
| ^~~~~~~~~~~~~~~~~~~~
vim +43 include/linux/sched/isolation.h
7863406143d8bb Frederic Weisbecker 2017-10-27 42
04d4e665a60902 Frederic Weisbecker 2022-02-07 @43 static inline const struct cpumask *housekeeping_cpumask(enum hk_type type)
7863406143d8bb Frederic Weisbecker 2017-10-27 44 {
7863406143d8bb Frederic Weisbecker 2017-10-27 45 return cpu_possible_mask;
7863406143d8bb Frederic Weisbecker 2017-10-27 46 }
7863406143d8bb Frederic Weisbecker 2017-10-27 47
04d4e665a60902 Frederic Weisbecker 2022-02-07 @48 static inline bool housekeeping_enabled(enum hk_type type)
0c5f81dad46c90 Wanpeng Li 2019-07-06 49 {
0c5f81dad46c90 Wanpeng Li 2019-07-06 50 return false;
0c5f81dad46c90 Wanpeng Li 2019-07-06 51 }
0c5f81dad46c90 Wanpeng Li 2019-07-06 52
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On 10/17/25 8:26 AM, Pingfan Liu wrote:
> When testing kexec-reboot on a 144 cpus machine with
> isolcpus=managed_irq,domain,1-71,73-143 in kernel command line, I
> encounter the following bug:
>
> [ 97.114759] psci: CPU142 killed (polled 0 ms)
> [ 97.333236] Failed to offline CPU143 - error=-16
> [ 97.333246] ------------[ cut here ]------------
> [ 97.342682] kernel BUG at kernel/cpu.c:1569!
> [ 97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> [ 97.353281] Modules linked in: rfkill sunrpc dax_hmem cxl_acpi cxl_port cxl_core einj vfat fat arm_smmuv3_pmu nvidia_cspmu arm_spe_pmu coresight_trbe arm_cspmu_module rndis_host ipmi_ssif cdc_ether i2c_smbus spi_nor usbnet ast coresight_tmc mii ixgbe i2c_algo_bit mdio mtd coresight_funnel coresight_stm stm_core coresight_etm4x coresight cppc_cpufreq loop fuse nfnetlink xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt nvme nvme_core nvme_auth i2c_tegra acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler dm_mirror dm_region_hash dm_log dm_mod
> [ 97.404119] CPU: 0 UID: 0 PID: 2583 Comm: kexec Kdump: loaded Not tainted 6.12.0-41.el10.aarch64 #1
> [ 97.413371] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 2.0 07/12/2024
> [ 97.420400] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> [ 97.427518] pc : smp_shutdown_nonboot_cpus+0x104/0x128
> [ 97.432778] lr : smp_shutdown_nonboot_cpus+0x11c/0x128
> [ 97.438028] sp : ffff800097c6b9a0
> [ 97.441411] x29: ffff800097c6b9a0 x28: ffff0000a099d800 x27: 0000000000000000
> [ 97.448708] x26: 0000000000000000 x25: 0000000000000000 x24: ffffb94aaaa8f218
> [ 97.456004] x23: ffffb94aaaabaae0 x22: ffffb94aaaa8f018 x21: 0000000000000000
> [ 97.463301] x20: ffffb94aaaa8fc10 x19: 000000000000008f x18: 00000000fffffffe
> [ 97.470598] x17: 0000000000000000 x16: ffffb94aa958fcd0 x15: ffff103acfca0b64
> [ 97.477894] x14: ffff800097c6b520 x13: 36312d3d726f7272 x12: ffff103acfc6ffa8
> [ 97.485191] x11: ffff103acf6f0000 x10: ffff103bc085c400 x9 : ffffb94aa88a0eb0
> [ 97.492488] x8 : 0000000000000001 x7 : 000000000017ffe8 x6 : c0000000fffeffff
> [ 97.499784] x5 : ffff003bdf62b408 x4 : 0000000000000000 x3 : 0000000000000000
> [ 97.507081] x2 : 0000000000000000 x1 : ffff0000a099d800 x0 : 0000000000000002
> [ 97.514379] Call trace:
> [ 97.516874] smp_shutdown_nonboot_cpus+0x104/0x128
> [ 97.521769] machine_shutdown+0x20/0x38
> [ 97.525693] kernel_kexec+0xc4/0xf0
> [ 97.529260] __do_sys_reboot+0x24c/0x278
> [ 97.533272] __arm64_sys_reboot+0x2c/0x40
> [ 97.537370] invoke_syscall.constprop.0+0x74/0xd0
> [ 97.542179] do_el0_svc+0xb0/0xe8
> [ 97.545562] el0_svc+0x44/0x1d0
> [ 97.548772] el0t_64_sync_handler+0x120/0x130
> [ 97.553222] el0t_64_sync+0x1a4/0x1a8
> [ 97.556963] Code: a94363f7 a8c47bfd d50323bf d65f03c0 (d4210000)
> [ 97.563191] ---[ end trace 0000000000000000 ]---
> [ 97.595854] Kernel panic - not syncing: Oops - BUG: Fatal exception
> [ 97.602275] Kernel Offset: 0x394a28600000 from 0xffff800080000000
> [ 97.608502] PHYS_OFFSET: 0x80000000
> [ 97.612062] CPU features: 0x10,0000000d,002a6928,5667fea7
> [ 97.617580] Memory Limit: none
> [ 97.648626] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]
>
> Tracking down this issue, I found that dl_bw_deactivate() returned
> -EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
> When a CPU is inactive, its rd is set to def_root_domain. For an
> blocked-state deadline task (in this case, "cppc_fie"), it was not
> migrated to CPU0, and its task_rq() information is stale. As a result,
> its bandwidth is wrongly accounted into def_root_domain during domain
> rebuild.
First of all, in an emergency situation when we need to shut down the
kernel, does it really matter if dl_bw_deactivate() returns -EBUSY? Should
we just go ahead and ignore this dl_bw-generated error?
> The key point is that root_domain is only tracked through active rq->rd.
> To avoid using a global data structure to track all root_domains in the
> system, we need a way to locate an active CPU within the corresponding
> root_domain.
>
> The following rules stand for deadline sub-system and help locating the
> active cpu
> -1.any cpu belongs to a unique root domain at a given time
> -2.DL bandwidth checker ensures that the root domain has active cpus.
>
> Now, let's examine the blocked-state task P.
> If P is attached to a cpuset that is a partition root, it is
> straightforward to find an active CPU.
> If P is attached to a cpuset that has changed from 'root' to 'member',
> the active CPUs are grouped into the parent root domain. Naturally, the
> CPUs' capacity and reserved DL bandwidth are taken into account in the
> ancestor root domain. (In practice, it may be unsafe to attach P to an
> arbitrary root domain, since that domain may lack sufficient DL
> bandwidth for P.) Again, it is straightforward to find an active CPU in
> the ancestor root domain.
>
> This patch groups CPUs into isolated and housekeeping sets. For the
> housekeeping group, it walks up the cpuset hierarchy to find active CPUs
> in P's root domain and retrieves the valid rd from cpu_rq(cpu)->rd.
>
> Signed-off-by: Pingfan Liu <piliu@redhat.com>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: "Michal Koutný" <mkoutny@suse.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Pierre Gondois <pierre.gondois@arm.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Valentin Schneider <vschneid@redhat.com>
> To: cgroups@vger.kernel.org
> To: linux-kernel@vger.kernel.org
> ---
> include/linux/cpuset.h | 18 ++++++++++++++++++
> kernel/cgroup/cpuset.c | 27 +++++++++++++++++++++++++++
> kernel/sched/deadline.c | 30 ++++++++++++++++++++++++------
> 3 files changed, 69 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 2ddb256187b51..7c00ebcdf85d9 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -130,6 +130,7 @@ extern void rebuild_sched_domains(void);
>
> extern void cpuset_print_current_mems_allowed(void);
> extern void cpuset_reset_sched_domains(void);
> +extern void task_get_rd_effective_cpus(struct task_struct *p, struct cpumask *cpus);
>
> /*
> * read_mems_allowed_begin is required when making decisions involving
> @@ -276,6 +277,23 @@ static inline void cpuset_reset_sched_domains(void)
> partition_sched_domains(1, NULL, NULL);
> }
>
> +static inline void task_get_rd_effective_cpus(struct task_struct *p,
> + struct cpumask *cpus)
> +{
> + const struct cpumask *hk_msk;
> + struct cpumask msk;
> +
> + hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
> + if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
> + if (!cpumask_and(&msk, p->cpus_ptr, hk_msk)) {
> + /* isolated cpus belong to a root domain */
> + cpumask_andnot(cpus, cpu_active_mask, hk_msk);
> + return;
> + }
> + }
> + cpumask_and(cpus, cpu_active_mask, hk_msk);
> +}
The size of struct cpumask can be large depending on the value of
NR_CPUS. For an x86-64 RHEL kernel, it is over 1 kbyte. We can actually
eliminate the on-stack struct cpumask variable by replacing
cpumask_and() with cpumask_intersects().
You said that isolated CPUs belong to a root domain. In the case of CPUs
within an isolated partition, the CPUs are in a null root domain; I
don't know whether that is problematic or not.
We usually prefix an externally visible function from cpuset with the
cpuset prefix to avoid namespace collision. You should consider doing
that for this function.
Also, I am still not very clear about the exact purpose of this function.
You should probably add a comment explaining it.
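Combined, the two suggestions could be sketched roughly as follows for
the CONFIG_CPUSETS=n stub (untested; the cpuset_ prefix and the
cpumask_intersects() rework are only an illustration of the review
feedback, not code from the patch):

```c
/* Sketch: cpuset_-prefixed name, and cpumask_intersects() instead of
 * cpumask_and() into a potentially large on-stack struct cpumask.
 */
static inline void cpuset_task_get_rd_effective_cpus(struct task_struct *p,
						     struct cpumask *cpus)
{
	const struct cpumask *hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);

	if (housekeeping_enabled(HK_TYPE_DOMAIN) &&
	    !cpumask_intersects(p->cpus_ptr, hk_msk)) {
		/* isolated cpus belong to a root domain */
		cpumask_andnot(cpus, cpu_active_mask, hk_msk);
		return;
	}
	cpumask_and(cpus, cpu_active_mask, hk_msk);
}
```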
> +
> static inline void cpuset_print_current_mems_allowed(void)
> {
> }
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 27adb04df675d..f7b18892ed093 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1102,6 +1102,33 @@ void cpuset_reset_sched_domains(void)
> mutex_unlock(&cpuset_mutex);
> }
>
> +/* caller hold RCU read lock */
> +void task_get_rd_effective_cpus(struct task_struct *p, struct cpumask *cpus)
> +{
> + const struct cpumask *hk_msk;
> + struct cpumask msk;
> + struct cpuset *cs;
> +
> + hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
> + if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
> + if (!cpumask_and(&msk, p->cpus_ptr, hk_msk)) {
> + /* isolated cpus belong to a root domain */
> + cpumask_andnot(cpus, cpu_active_mask, hk_msk);
> + return;
> + }
> + }
> + /* In HK_TYPE_DOMAIN, cpuset can be applied */
> + cs = task_cs(p);
> + while (cs != &top_cpuset) {
> + if (is_sched_load_balance(cs))
> + break;
> + cs = parent_cs(cs);
> + }
> +
> + /* For top_cpuset, its effective_cpus does not exclude isolated cpu */
> + cpumask_and(cpus, cs->effective_cpus, hk_msk);
> +}
> +
Similar problems with the non-CONFIG_CPUSETS version in cpuset.h.
Cheers,
Longman
Hi Waiman,
I appreciate your time in reviewing my patch. Please see the comments
below.
On Fri, Oct 17, 2025 at 01:52:45PM -0400, Waiman Long wrote:
> On 10/17/25 8:26 AM, Pingfan Liu wrote:
> > When testing kexec-reboot on a 144 cpus machine with
> > isolcpus=managed_irq,domain,1-71,73-143 in kernel command line, I
> > encounter the following bug:
> >
> > [ 97.114759] psci: CPU142 killed (polled 0 ms)
> > [ 97.333236] Failed to offline CPU143 - error=-16
> > [ 97.333246] ------------[ cut here ]------------
> > [ 97.342682] kernel BUG at kernel/cpu.c:1569!
> > [ 97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> > [ 97.353281] Modules linked in: rfkill sunrpc dax_hmem cxl_acpi cxl_port cxl_core einj vfat fat arm_smmuv3_pmu nvidia_cspmu arm_spe_pmu coresight_trbe arm_cspmu_module rndis_host ipmi_ssif cdc_ether i2c_smbus spi_nor usbnet ast coresight_tmc mii ixgbe i2c_algo_bit mdio mtd coresight_funnel coresight_stm stm_core coresight_etm4x coresight cppc_cpufreq loop fuse nfnetlink xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt nvme nvme_core nvme_auth i2c_tegra acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler dm_mirror dm_region_hash dm_log dm_mod
> > [ 97.404119] CPU: 0 UID: 0 PID: 2583 Comm: kexec Kdump: loaded Not tainted 6.12.0-41.el10.aarch64 #1
> > [ 97.413371] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 2.0 07/12/2024
> > [ 97.420400] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> > [ 97.427518] pc : smp_shutdown_nonboot_cpus+0x104/0x128
> > [ 97.432778] lr : smp_shutdown_nonboot_cpus+0x11c/0x128
> > [ 97.438028] sp : ffff800097c6b9a0
> > [ 97.441411] x29: ffff800097c6b9a0 x28: ffff0000a099d800 x27: 0000000000000000
> > [ 97.448708] x26: 0000000000000000 x25: 0000000000000000 x24: ffffb94aaaa8f218
> > [ 97.456004] x23: ffffb94aaaabaae0 x22: ffffb94aaaa8f018 x21: 0000000000000000
> > [ 97.463301] x20: ffffb94aaaa8fc10 x19: 000000000000008f x18: 00000000fffffffe
> > [ 97.470598] x17: 0000000000000000 x16: ffffb94aa958fcd0 x15: ffff103acfca0b64
> > [ 97.477894] x14: ffff800097c6b520 x13: 36312d3d726f7272 x12: ffff103acfc6ffa8
> > [ 97.485191] x11: ffff103acf6f0000 x10: ffff103bc085c400 x9 : ffffb94aa88a0eb0
> > [ 97.492488] x8 : 0000000000000001 x7 : 000000000017ffe8 x6 : c0000000fffeffff
> > [ 97.499784] x5 : ffff003bdf62b408 x4 : 0000000000000000 x3 : 0000000000000000
> > [ 97.507081] x2 : 0000000000000000 x1 : ffff0000a099d800 x0 : 0000000000000002
> > [ 97.514379] Call trace:
> > [ 97.516874] smp_shutdown_nonboot_cpus+0x104/0x128
> > [ 97.521769] machine_shutdown+0x20/0x38
> > [ 97.525693] kernel_kexec+0xc4/0xf0
> > [ 97.529260] __do_sys_reboot+0x24c/0x278
> > [ 97.533272] __arm64_sys_reboot+0x2c/0x40
> > [ 97.537370] invoke_syscall.constprop.0+0x74/0xd0
> > [ 97.542179] do_el0_svc+0xb0/0xe8
> > [ 97.545562] el0_svc+0x44/0x1d0
> > [ 97.548772] el0t_64_sync_handler+0x120/0x130
> > [ 97.553222] el0t_64_sync+0x1a4/0x1a8
> > [ 97.556963] Code: a94363f7 a8c47bfd d50323bf d65f03c0 (d4210000)
> > [ 97.563191] ---[ end trace 0000000000000000 ]---
> > [ 97.595854] Kernel panic - not syncing: Oops - BUG: Fatal exception
> > [ 97.602275] Kernel Offset: 0x394a28600000 from 0xffff800080000000
> > [ 97.608502] PHYS_OFFSET: 0x80000000
> > [ 97.612062] CPU features: 0x10,0000000d,002a6928,5667fea7
> > [ 97.617580] Memory Limit: none
> > [ 97.648626] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]
> >
> > Tracking down this issue, I found that dl_bw_deactivate() returned
> > -EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
> > When a CPU is inactive, its rd is set to def_root_domain. For an
> > blocked-state deadline task (in this case, "cppc_fie"), it was not
> > migrated to CPU0, and its task_rq() information is stale. As a result,
> > its bandwidth is wrongly accounted into def_root_domain during domain
> > rebuild.
>
> First of all, in an emergency situation when we need to shutdown the kernel,
> does it really matter if dl_bw_activate() returns -EBUSY? Should we just go
> ahead and ignore this dl_bw generated error?
>
Ah, sorry - the previous test example was misleading. Let me restate it
as an equivalent operation on a system with 144 CPUs:
sudo bash -c 'taskset -cp 0 $$ && for i in {1..143}; do echo 0 > /sys/devices/system/cpu/cpu$i/online 2>/dev/null; done'
That extracts the hot-removal part, which is affected by the bug, from
the kexec reboot process. It expects that only CPU0 remains online, but
in practice CPU143 refuses to go offline because of this bug.
As for ignoring the dl_bw check during kexec, I have a dedicated draft
patch for that. I will send it out later and cc you.
>
> > The key point is that root_domain is only tracked through active rq->rd.
> > To avoid using a global data structure to track all root_domains in the
> > system, we need a way to locate an active CPU within the corresponding
> > root_domain.
> >
> > The following rules stand for deadline sub-system and help locating the
> > active cpu
> > -1.any cpu belongs to a unique root domain at a given time
> > -2.DL bandwidth checker ensures that the root domain has active cpus.
> >
> > Now, let's examine the blocked-state task P.
> > If P is attached to a cpuset that is a partition root, it is
> > straightforward to find an active CPU.
> > If P is attached to a cpuset that has changed from 'root' to 'member',
> > the active CPUs are grouped into the parent root domain. Naturally, the
> > CPUs' capacity and reserved DL bandwidth are taken into account in the
> > ancestor root domain. (In practice, it may be unsafe to attach P to an
> > arbitrary root domain, since that domain may lack sufficient DL
> > bandwidth for P.) Again, it is straightforward to find an active CPU in
> > the ancestor root domain.
> >
> > This patch groups CPUs into isolated and housekeeping sets. For the
> > housekeeping group, it walks up the cpuset hierarchy to find active CPUs
> > in P's root domain and retrieves the valid rd from cpu_rq(cpu)->rd.
> >
> > Signed-off-by: Pingfan Liu <piliu@redhat.com>
> > Cc: Waiman Long <longman@redhat.com>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: "Michal Koutný" <mkoutny@suse.com>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Juri Lelli <juri.lelli@redhat.com>
> > Cc: Pierre Gondois <pierre.gondois@arm.com>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> > To: cgroups@vger.kernel.org
> > To: linux-kernel@vger.kernel.org
> > ---
> > include/linux/cpuset.h | 18 ++++++++++++++++++
> > kernel/cgroup/cpuset.c | 27 +++++++++++++++++++++++++++
> > kernel/sched/deadline.c | 30 ++++++++++++++++++++++++------
> > 3 files changed, 69 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> > index 2ddb256187b51..7c00ebcdf85d9 100644
> > --- a/include/linux/cpuset.h
> > +++ b/include/linux/cpuset.h
> > @@ -130,6 +130,7 @@ extern void rebuild_sched_domains(void);
> > extern void cpuset_print_current_mems_allowed(void);
> > extern void cpuset_reset_sched_domains(void);
> > +extern void task_get_rd_effective_cpus(struct task_struct *p, struct cpumask *cpus);
> > /*
> > * read_mems_allowed_begin is required when making decisions involving
> > @@ -276,6 +277,23 @@ static inline void cpuset_reset_sched_domains(void)
> > partition_sched_domains(1, NULL, NULL);
> > }
> > +static inline void task_get_rd_effective_cpus(struct task_struct *p,
> > + struct cpumask *cpus)
> > +{
> > + const struct cpumask *hk_msk;
> > + struct cpumask msk;
> > +
> > + hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
> > + if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
> > + if (!cpumask_and(&msk, p->cpus_ptr, hk_msk)) {
> > + /* isolated cpus belong to a root domain */
> > + cpumask_andnot(cpus, cpu_active_mask, hk_msk);
> > + return;
> > + }
> > + }
> > + cpumask_and(cpus, cpu_active_mask, hk_msk);
> > +}
>
> The size of struct cpumask can be large depending on the value of
> NR_CPUS. For an x86-64 RHEL kernel, it is over 1 kbyte. We can actually
> eliminate the use of a struct cpumask variable by replacing cpumask_and()
> with cpumask_intersects().
>
OK.
> You said that isolated CPUs belong to a root domain. In the case of CPUs
> within an isolated partition, the CPUs are in a null root domain, and I
> don't know whether that is problematic or not.
>
If I understand correctly, during CPU hot-removal, the following rules apply:
1. Check whether the total dl_bw of all DL tasks in the affected root
   domain can be satisfied by the remaining CPUs in that root domain.
   If it can, the hot-removal proceeds; otherwise, it is rejected.
2. During the CPU hot-removal process, migratable tasks on the dying CPU
   are forcibly migrated to other CPUs in the same root domain, regardless
   of their CPU affinity.
My patch does not violate these rules.
> We usually prefix an externally visible function from cpuset with the cpuset
> prefix to avoid namespace collision. You should consider doing that for this
> function.
>
OK.
> Also I am still not very clear about the exact purpose of this function. You
> should probably add a comment about this.
>
> > +
> > static inline void cpuset_print_current_mems_allowed(void)
> > {
> > }
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index 27adb04df675d..f7b18892ed093 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -1102,6 +1102,33 @@ void cpuset_reset_sched_domains(void)
> > mutex_unlock(&cpuset_mutex);
> > }
> > +/* Caller must hold the RCU read lock */
> > +void task_get_rd_effective_cpus(struct task_struct *p, struct cpumask *cpus)
> > +{
> > + const struct cpumask *hk_msk;
> > + struct cpumask msk;
> > + struct cpuset *cs;
> > +
> > + hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
> > + if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
> > + if (!cpumask_and(&msk, p->cpus_ptr, hk_msk)) {
> > + /* isolated cpus belong to a root domain */
> > + cpumask_andnot(cpus, cpu_active_mask, hk_msk);
> > + return;
> > + }
> > + }
> > + /* In HK_TYPE_DOMAIN, cpuset can be applied */
> > + cs = task_cs(p);
> > + while (cs != &top_cpuset) {
> > + if (is_sched_load_balance(cs))
> > + break;
> > + cs = parent_cs(cs);
> > + }
> > +
> > + /* For top_cpuset, its effective_cpus does not exclude isolated cpu */
> > + cpumask_and(cpus, cs->effective_cpus, hk_msk);
> > +}
> > +
>
> Similar problems with the non-CONFIG_CPUSETS version in cpuset.h.
>
OK, I will fix it.
Thanks,
Pingfan
Hi!
On 20/10/25 11:21, Pingfan Liu wrote:
> Hi Waiman,
>
> I appreciate your time in reviewing my patch. Please see the comment
> belows.
>
> On Fri, Oct 17, 2025 at 01:52:45PM -0400, Waiman Long wrote:
> > On 10/17/25 8:26 AM, Pingfan Liu wrote:
> > > When testing kexec-reboot on a 144 cpus machine with
> > > isolcpus=managed_irq,domain,1-71,73-143 in kernel command line, I
> > > encounter the following bug:
> > >
> > > [ 97.114759] psci: CPU142 killed (polled 0 ms)
> > > [ 97.333236] Failed to offline CPU143 - error=-16
> > > [ 97.333246] ------------[ cut here ]------------
> > > [ 97.342682] kernel BUG at kernel/cpu.c:1569!
> > > [ 97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> > > [ 97.353281] Modules linked in: rfkill sunrpc dax_hmem cxl_acpi cxl_port cxl_core einj vfat fat arm_smmuv3_pmu nvidia_cspmu arm_spe_pmu coresight_trbe arm_cspmu_module rndis_host ipmi_ssif cdc_ether i2c_smbus spi_nor usbnet ast coresight_tmc mii ixgbe i2c_algo_bit mdio mtd coresight_funnel coresight_stm stm_core coresight_etm4x coresight cppc_cpufreq loop fuse nfnetlink xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt nvme nvme_core nvme_auth i2c_tegra acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler dm_mirror dm_region_hash dm_log dm_mod
> > > [ 97.404119] CPU: 0 UID: 0 PID: 2583 Comm: kexec Kdump: loaded Not tainted 6.12.0-41.el10.aarch64 #1
> > > [ 97.413371] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 2.0 07/12/2024
> > > [ 97.420400] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> > > [ 97.427518] pc : smp_shutdown_nonboot_cpus+0x104/0x128
> > > [ 97.432778] lr : smp_shutdown_nonboot_cpus+0x11c/0x128
> > > [ 97.438028] sp : ffff800097c6b9a0
> > > [ 97.441411] x29: ffff800097c6b9a0 x28: ffff0000a099d800 x27: 0000000000000000
> > > [ 97.448708] x26: 0000000000000000 x25: 0000000000000000 x24: ffffb94aaaa8f218
> > > [ 97.456004] x23: ffffb94aaaabaae0 x22: ffffb94aaaa8f018 x21: 0000000000000000
> > > [ 97.463301] x20: ffffb94aaaa8fc10 x19: 000000000000008f x18: 00000000fffffffe
> > > [ 97.470598] x17: 0000000000000000 x16: ffffb94aa958fcd0 x15: ffff103acfca0b64
> > > [ 97.477894] x14: ffff800097c6b520 x13: 36312d3d726f7272 x12: ffff103acfc6ffa8
> > > [ 97.485191] x11: ffff103acf6f0000 x10: ffff103bc085c400 x9 : ffffb94aa88a0eb0
> > > [ 97.492488] x8 : 0000000000000001 x7 : 000000000017ffe8 x6 : c0000000fffeffff
> > > [ 97.499784] x5 : ffff003bdf62b408 x4 : 0000000000000000 x3 : 0000000000000000
> > > [ 97.507081] x2 : 0000000000000000 x1 : ffff0000a099d800 x0 : 0000000000000002
> > > [ 97.514379] Call trace:
> > > [ 97.516874] smp_shutdown_nonboot_cpus+0x104/0x128
> > > [ 97.521769] machine_shutdown+0x20/0x38
> > > [ 97.525693] kernel_kexec+0xc4/0xf0
> > > [ 97.529260] __do_sys_reboot+0x24c/0x278
> > > [ 97.533272] __arm64_sys_reboot+0x2c/0x40
> > > [ 97.537370] invoke_syscall.constprop.0+0x74/0xd0
> > > [ 97.542179] do_el0_svc+0xb0/0xe8
> > > [ 97.545562] el0_svc+0x44/0x1d0
> > > [ 97.548772] el0t_64_sync_handler+0x120/0x130
> > > [ 97.553222] el0t_64_sync+0x1a4/0x1a8
> > > [ 97.556963] Code: a94363f7 a8c47bfd d50323bf d65f03c0 (d4210000)
> > > [ 97.563191] ---[ end trace 0000000000000000 ]---
> > > [ 97.595854] Kernel panic - not syncing: Oops - BUG: Fatal exception
> > > [ 97.602275] Kernel Offset: 0x394a28600000 from 0xffff800080000000
> > > [ 97.608502] PHYS_OFFSET: 0x80000000
> > > [ 97.612062] CPU features: 0x10,0000000d,002a6928,5667fea7
> > > [ 97.617580] Memory Limit: none
> > > [ 97.648626] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]
> > >
> > > Tracking down this issue, I found that dl_bw_deactivate() returned
> > > -EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
> > > When a CPU is inactive, its rd is set to def_root_domain. For a
> > > blocked-state deadline task (in this case, "cppc_fie"), it was not
> > > migrated to CPU0, and its task_rq() information is stale. As a result,
> > > its bandwidth is wrongly accounted into def_root_domain during domain
> > > rebuild.
> >
> > First of all, in an emergency situation when we need to shutdown the kernel,
> > does it really matter if dl_bw_activate() returns -EBUSY? Should we just go
> > ahead and ignore this dl_bw generated error?
> >
>
> Ah, sorry - the previous test example was misleading. Let me restate it
> as an equivalent operation on a system with 144 CPUs:
> sudo bash -c 'taskset -cp 0 $$ && for i in {1..143}; do echo 0 > /sys/devices/system/cpu/cpu$i/online 2>/dev/null; done'
>
> That extracts the hot-removal part, which is affected by the bug, from
> the kexec-reboot process. It expects that only CPU0 remains online, but
> in practice CPU143 refused to go offline due to this bug.
I confess I am still perplexed by this, considering the "particular"
nature of the cppc worker, which seems to be the only task able to
trigger this problem. First of all, is that indeed the case, or are you
able to reproduce this problem with standard (non-kthread) DEADLINE
tasks as well?
I essentially wonder how the cppc worker's affinity/migration on hotplug
is handled. With your isolcpus configuration you have one isolated root
domain per isolated CPU, so if the cppc worker is not migrated away from
(in the case above) CPU 143, then BW control might be right in saying we
can't offline that CPU, as the worker still has BW accounted there. This
is also why I first wondered (and suggested) we remove the cppc worker's
BW from the picture (make it DEADLINE-special), as we don't really seem
to have a reliable way to associate meaningful BW with it anyway.
Thanks,
Juri
Hi Juri,
Thanks for following up on this topic. Please check my comment below.
On Mon, Oct 20, 2025 at 08:03:25AM +0200, Juri Lelli wrote:
> Hi!
>
> On 20/10/25 11:21, Pingfan Liu wrote:
> > [...]
>
> I confess I am still perplexed by this, considering the "particular"
> nature of cppc worker that seems to be the only task that is able to
> trigger this problem. First of all, is that indeed the case or are you
> able to reproduce this problem with standard (non-kthread) DEADLINE
> tasks as well?
>
Yes, I can. I wrote a SCHED_DEADLINE task that waits indefinitely on a
semaphore (or, more precisely, for a very long period that may span the
entire CPU hot-removal process) to emulate waiting for an undetermined
driver input. Then I spawned multiple instances of this program to
ensure that some of them run on CPU 72. When I attempted to offline CPUs
1–143 one by one, CPU 143 failed to go offline.
> I essentially wonder how cppc worker affinity/migration on hotplug is
> handled. With your isolcpus configuration you have one isolated root
The affinity/migration handling on hotplug works fine. The key point is
that it only handles tasks on the rq; blocked-state tasks (here, the cppc
worker) are simply ignored.
Thanks,
Pingfan
> domain per isolated cpu, so if cppc worker is not migrated away from (in
> the case above) cpu 143, then BW control might be right in saying we
> can't offline that cpu, as the worker still has BW running there. This
> is also why I first wondered (and suggested) we remove cppc worker BW
> from the picture (make it DEADLINE special) as we don't really seem to
> have a reliable way to associate meaningful BW to it anyway.
>
> Thanks,
> Juri
>
On 20/10/25 21:34, Pingfan Liu wrote:
> Hi Juri,
>
> Thanks for following up on this topic. Please check my comment below.
>
> On Mon, Oct 20, 2025 at 08:03:25AM +0200, Juri Lelli wrote:
> > Hi!
> >
> > On 20/10/25 11:21, Pingfan Liu wrote:
> > > [...]
> >
> > I confess I am still perplexed by this, considering the "particular"
> > nature of cppc worker that seems to be the only task that is able to
> > trigger this problem. First of all, is that indeed the case or are you
> > able to reproduce this problem with standard (non-kthread) DEADLINE
> > tasks as well?
> >
>
> Yes, I can. I wrote a SCHED_DEADLINE task that waits indefinitely on a
> semaphore (or, more precisely, for a very long period that may span the
> entire CPU hot-removal process) to emulate waiting for an undetermined
> driver input. Then I spawned multiple instances of this program to
> ensure that some of them run on CPU 72. When I attempted to offline CPUs
> 1–143 one by one, CPU 143 failed to go offline.
>
> > I essentially wonder how cppc worker affinity/migration on hotplug is
> > handled. With your isolcpus configuration you have one isolated root
>
> The affinity/migration on hotplug work fine. The keypoint is that they
> only handle the task on rq. For the blocked-state tasks (here it is cppc
> worker), they just ignore them.
OK. Thanks for confirming/clarifying.