[PATCHv2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
Posted by Pingfan Liu 3 months, 3 weeks ago
When testing kexec-reboot on a 144-CPU machine with
isolcpus=managed_irq,domain,1-71,73-143 on the kernel command line, I
encountered the following bug:

[   97.114759] psci: CPU142 killed (polled 0 ms)
[   97.333236] Failed to offline CPU143 - error=-16
[   97.333246] ------------[ cut here ]------------
[   97.342682] kernel BUG at kernel/cpu.c:1569!
[   97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[   97.353281] Modules linked in: rfkill sunrpc dax_hmem cxl_acpi cxl_port cxl_core einj vfat fat arm_smmuv3_pmu nvidia_cspmu arm_spe_pmu coresight_trbe arm_cspmu_module rndis_host ipmi_ssif cdc_ether i2c_smbus spi_nor usbnet ast coresight_tmc mii ixgbe i2c_algo_bit mdio mtd coresight_funnel coresight_stm stm_core coresight_etm4x coresight cppc_cpufreq loop fuse nfnetlink xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt nvme nvme_core nvme_auth i2c_tegra acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler dm_mirror dm_region_hash dm_log dm_mod
[   97.404119] CPU: 0 UID: 0 PID: 2583 Comm: kexec Kdump: loaded Not tainted 6.12.0-41.el10.aarch64 #1
[   97.413371] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 2.0 07/12/2024
[   97.420400] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[   97.427518] pc : smp_shutdown_nonboot_cpus+0x104/0x128
[   97.432778] lr : smp_shutdown_nonboot_cpus+0x11c/0x128
[   97.438028] sp : ffff800097c6b9a0
[   97.441411] x29: ffff800097c6b9a0 x28: ffff0000a099d800 x27: 0000000000000000
[   97.448708] x26: 0000000000000000 x25: 0000000000000000 x24: ffffb94aaaa8f218
[   97.456004] x23: ffffb94aaaabaae0 x22: ffffb94aaaa8f018 x21: 0000000000000000
[   97.463301] x20: ffffb94aaaa8fc10 x19: 000000000000008f x18: 00000000fffffffe
[   97.470598] x17: 0000000000000000 x16: ffffb94aa958fcd0 x15: ffff103acfca0b64
[   97.477894] x14: ffff800097c6b520 x13: 36312d3d726f7272 x12: ffff103acfc6ffa8
[   97.485191] x11: ffff103acf6f0000 x10: ffff103bc085c400 x9 : ffffb94aa88a0eb0
[   97.492488] x8 : 0000000000000001 x7 : 000000000017ffe8 x6 : c0000000fffeffff
[   97.499784] x5 : ffff003bdf62b408 x4 : 0000000000000000 x3 : 0000000000000000
[   97.507081] x2 : 0000000000000000 x1 : ffff0000a099d800 x0 : 0000000000000002
[   97.514379] Call trace:
[   97.516874]  smp_shutdown_nonboot_cpus+0x104/0x128
[   97.521769]  machine_shutdown+0x20/0x38
[   97.525693]  kernel_kexec+0xc4/0xf0
[   97.529260]  __do_sys_reboot+0x24c/0x278
[   97.533272]  __arm64_sys_reboot+0x2c/0x40
[   97.537370]  invoke_syscall.constprop.0+0x74/0xd0
[   97.542179]  do_el0_svc+0xb0/0xe8
[   97.545562]  el0_svc+0x44/0x1d0
[   97.548772]  el0t_64_sync_handler+0x120/0x130
[   97.553222]  el0t_64_sync+0x1a4/0x1a8
[   97.556963] Code: a94363f7 a8c47bfd d50323bf d65f03c0 (d4210000)
[   97.563191] ---[ end trace 0000000000000000 ]---
[   97.595854] Kernel panic - not syncing: Oops - BUG: Fatal exception
[   97.602275] Kernel Offset: 0x394a28600000 from 0xffff800080000000
[   97.608502] PHYS_OFFSET: 0x80000000
[   97.612062] CPU features: 0x10,0000000d,002a6928,5667fea7
[   97.617580] Memory Limit: none
[   97.648626] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]

Tracking down this issue, I found that dl_bw_deactivate() returned
-EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
When a CPU is inactive, its rq->rd is set to def_root_domain. A
blocked-state deadline task (in this case, "cppc_fie") was not migrated
to CPU0, so its task_rq() information is stale. As a result, its
bandwidth is wrongly accounted to def_root_domain during the domain
rebuild.
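
In call-path form (reconstructed from the trace and the analysis above;
intermediate hotplug-state steps omitted), the failure is roughly:

  kernel_kexec()
    machine_shutdown()
      smp_shutdown_nonboot_cpus()   /* offlines every CPU but the boot CPU */
        sched_cpu_deactivate()      /* hotplug teardown for each CPU */
          dl_bw_deactivate()        /* -EBUSY: def_root_domain still holds
                                       the blocked task's bandwidth */

The -EBUSY leaves the last non-boot CPU online, and
smp_shutdown_nonboot_cpus() then hits the BUG_ON() at kernel/cpu.c:1569
because a CPU other than the boot CPU could not be taken offline.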

The following rules hold for the deadline sub-system:
  1. Any CPU belongs to a unique root domain at a given time.
  2. The DL bandwidth checker ensures that the root domain has active CPUs.
And for an active CPU, cpu_rq(cpu)->rd always tracks a valid root domain.

Now, let's examine a blocked-state task P.
If P is attached to a cpuset that is a partition root, it is
straightforward to find an active CPU.
If P is attached to a cpuset that has since changed from 'root' to
'member', the active CPUs are grouped into the parent root domain, and
the CPUs' capacity and reserved DL bandwidth are naturally accounted in
the parent root domain. (In practice, it may be unsafe to attach P to an
arbitrary root domain, since that domain may lack sufficient DL
bandwidth for P.) Again, it is straightforward to find an active CPU in
the parent root domain. (Here, 'parent root domain' means the root
domain of the first ancestor cpuset that is a partition root.)

This patch walks up the cpuset hierarchy to find the active CPUs in P's
root domain and retrieves a valid rd from cpu_rq(cpu)->rd.

Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Pierre Gondois <pierre.gondois@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
To: linux-kernel@vger.kernel.org
---
 include/linux/cpuset.h  |  6 ++++++
 kernel/cgroup/cpuset.c  | 15 +++++++++++++++
 kernel/sched/deadline.c | 30 ++++++++++++++++++++++++------
 3 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 2ddb256187b51..478ae68bdfc8f 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -130,6 +130,7 @@ extern void rebuild_sched_domains(void);
 
 extern void cpuset_print_current_mems_allowed(void);
 extern void cpuset_reset_sched_domains(void);
+extern struct cpumask *cpuset_task_rd_effective_cpus(struct task_struct *p);
 
 /*
  * read_mems_allowed_begin is required when making decisions involving
@@ -276,6 +277,11 @@ static inline void cpuset_reset_sched_domains(void)
 	partition_sched_domains(1, NULL, NULL);
 }
 
+static inline struct cpumask *cpuset_task_rd_effective_cpus(struct task_struct *p)
+{
+	return cpu_active_mask;
+}
+
 static inline void cpuset_print_current_mems_allowed(void)
 {
 }
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 27adb04df675d..25356d3f9d635 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1102,6 +1102,21 @@ void cpuset_reset_sched_domains(void)
 	mutex_unlock(&cpuset_mutex);
 }
 
+/* caller hold RCU read lock */
+struct cpumask *cpuset_task_rd_effective_cpus(struct task_struct *p)
+{
+	struct cpuset *cs;
+
+	cs = task_cs(p);
+	while (cs != &top_cpuset) {
+		if (is_sched_load_balance(cs))
+			break;
+		cs = parent_cs(cs);
+	}
+
+	return cs->effective_cpus;
+}
+
 /**
  * cpuset_update_tasks_cpumask - Update the cpumasks of tasks in the cpuset.
  * @cs: the cpuset in which each task's cpus_allowed mask needs to be changed
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 72c1f72463c75..fe0aec279c19a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2884,6 +2884,8 @@ void dl_add_task_root_domain(struct task_struct *p)
 	struct rq_flags rf;
 	struct rq *rq;
 	struct dl_bw *dl_b;
+	unsigned int cpu;
+	struct cpumask *msk;
 
 	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
 	if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
@@ -2891,16 +2893,32 @@ void dl_add_task_root_domain(struct task_struct *p)
 		return;
 	}
 
-	rq = __task_rq_lock(p, &rf);
-
+	/* prevent race among cpu hotplug, changing of partition_root_state */
+	lockdep_assert_cpus_held();
+	/*
+	 * If @p is in blocked state, task_cpu() may be not active. In that
+	 * case, rq->rd does not trace a correct root_domain. On the other hand,
+	 * @p must belong to an root_domain at any given time, which must have
+	 * active rq, whose rq->rd traces the valid root domain.
+	 */
+	msk = cpuset_task_rd_effective_cpus(p);
+	cpu = cpumask_first_and(cpu_active_mask, msk);
+	/*
+	 * If a root domain reserves bandwidth for a DL task, the DL bandwidth
+	 * check prevents CPU hot removal from deactivating all CPUs in that
+	 * domain.
+	 */ 
+	BUG_ON(cpu >= nr_cpu_ids);
+	rq = cpu_rq(cpu);
+	/*
+	 * This point is under the protection of cpu_hotplug_lock. Hence
+	 * rq->rd is stable.
+	 */
 	dl_b = &rq->rd->dl_bw;
 	raw_spin_lock(&dl_b->lock);
-
 	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
-
 	raw_spin_unlock(&dl_b->lock);
-
-	task_rq_unlock(rq, p, &rf);
+	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
 }
 
 void dl_clear_root_domain(struct root_domain *rd)
-- 
2.49.0
Re: [PATCHv2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
Posted by kernel test robot 3 months, 3 weeks ago
Hi Pingfan,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tj-cgroup/for-next]
[also build test WARNING on tip/sched/core tip/master linus/master v6.18-rc1 next-20251016]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Pingfan-Liu/sched-deadline-Walk-up-cpuset-hierarchy-to-decide-root-domain-when-hot-unplug/20251016-200452
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link:    https://lore.kernel.org/r/20251016120041.17990-1-piliu%40redhat.com
patch subject: [PATCHv2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
config: openrisc-allnoconfig (https://download.01.org/0day-ci/archive/20251017/202510171039.kkg2ItuG-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251017/202510171039.kkg2ItuG-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510171039.kkg2ItuG-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from include/linux/smp.h:13,
                    from include/linux/lockdep.h:14,
                    from include/linux/spinlock.h:63,
                    from include/linux/mmzone.h:8,
                    from include/linux/gfp.h:7,
                    from include/linux/umh.h:4,
                    from include/linux/kmod.h:9,
                    from include/linux/module.h:18,
                    from init/main.c:18:
   include/linux/cpuset.h: In function 'cpuset_task_rd_effective_cpus':
>> include/linux/cpumask.h:125:28: warning: return discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
     125 | #define cpu_active_mask   ((const struct cpumask *)&__cpu_active_mask)
         |                           ~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/cpuset.h:282:16: note: in expansion of macro 'cpu_active_mask'
     282 |         return cpu_active_mask;
         |                ^~~~~~~~~~~~~~~


vim +/const +125 include/linux/cpumask.h

^1da177e4c3f415 Linus Torvalds   2005-04-16   77  
^1da177e4c3f415 Linus Torvalds   2005-04-16   78  /*
^1da177e4c3f415 Linus Torvalds   2005-04-16   79   * The following particular system cpumasks and operations manage
b3199c025d1646e Rusty Russell    2008-12-30   80   * possible, present, active and online cpus.
^1da177e4c3f415 Linus Torvalds   2005-04-16   81   *
b3199c025d1646e Rusty Russell    2008-12-30   82   *     cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
b3199c025d1646e Rusty Russell    2008-12-30   83   *     cpu_present_mask - has bit 'cpu' set iff cpu is populated
4e1a7df4548003f James Morse      2024-05-29   84   *     cpu_enabled_mask - has bit 'cpu' set iff cpu can be brought online
b3199c025d1646e Rusty Russell    2008-12-30   85   *     cpu_online_mask  - has bit 'cpu' set iff cpu available to scheduler
b3199c025d1646e Rusty Russell    2008-12-30   86   *     cpu_active_mask  - has bit 'cpu' set iff cpu available to migration
^1da177e4c3f415 Linus Torvalds   2005-04-16   87   *
b3199c025d1646e Rusty Russell    2008-12-30   88   *  If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
^1da177e4c3f415 Linus Torvalds   2005-04-16   89   *
57f728d59f005df Randy Dunlap     2023-07-31   90   *  The cpu_possible_mask is fixed at boot time, as the set of CPU IDs
b3199c025d1646e Rusty Russell    2008-12-30   91   *  that it is possible might ever be plugged in at anytime during the
b3199c025d1646e Rusty Russell    2008-12-30   92   *  life of that system boot.  The cpu_present_mask is dynamic(*),
b3199c025d1646e Rusty Russell    2008-12-30   93   *  representing which CPUs are currently plugged in.  And
b3199c025d1646e Rusty Russell    2008-12-30   94   *  cpu_online_mask is the dynamic subset of cpu_present_mask,
b3199c025d1646e Rusty Russell    2008-12-30   95   *  indicating those CPUs available for scheduling.
b3199c025d1646e Rusty Russell    2008-12-30   96   *
b3199c025d1646e Rusty Russell    2008-12-30   97   *  If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
^1da177e4c3f415 Linus Torvalds   2005-04-16   98   *  depending on what ACPI reports as currently plugged in, otherwise
b3199c025d1646e Rusty Russell    2008-12-30   99   *  cpu_present_mask is just a copy of cpu_possible_mask.
^1da177e4c3f415 Linus Torvalds   2005-04-16  100   *
b3199c025d1646e Rusty Russell    2008-12-30  101   *  (*) Well, cpu_present_mask is dynamic in the hotplug case.  If not
b3199c025d1646e Rusty Russell    2008-12-30  102   *      hotplug, it's a copy of cpu_possible_mask, hence fixed at boot.
^1da177e4c3f415 Linus Torvalds   2005-04-16  103   *
^1da177e4c3f415 Linus Torvalds   2005-04-16  104   * Subtleties:
57f728d59f005df Randy Dunlap     2023-07-31  105   * 1) UP ARCHes (NR_CPUS == 1, CONFIG_SMP not defined) hardcode
^1da177e4c3f415 Linus Torvalds   2005-04-16  106   *    assumption that their single CPU is online.  The UP
b3199c025d1646e Rusty Russell    2008-12-30  107   *    cpu_{online,possible,present}_masks are placebos.  Changing them
^1da177e4c3f415 Linus Torvalds   2005-04-16  108   *    will have no useful affect on the following num_*_cpus()
^1da177e4c3f415 Linus Torvalds   2005-04-16  109   *    and cpu_*() macros in the UP case.  This ugliness is a UP
^1da177e4c3f415 Linus Torvalds   2005-04-16  110   *    optimization - don't waste any instructions or memory references
^1da177e4c3f415 Linus Torvalds   2005-04-16  111   *    asking if you're online or how many CPUs there are if there is
^1da177e4c3f415 Linus Torvalds   2005-04-16  112   *    only one CPU.
^1da177e4c3f415 Linus Torvalds   2005-04-16  113   */
^1da177e4c3f415 Linus Torvalds   2005-04-16  114  
4b804c85dc37db6 Rasmus Villemoes 2016-01-20  115  extern struct cpumask __cpu_possible_mask;
4b804c85dc37db6 Rasmus Villemoes 2016-01-20  116  extern struct cpumask __cpu_online_mask;
4e1a7df4548003f James Morse      2024-05-29  117  extern struct cpumask __cpu_enabled_mask;
4b804c85dc37db6 Rasmus Villemoes 2016-01-20  118  extern struct cpumask __cpu_present_mask;
4b804c85dc37db6 Rasmus Villemoes 2016-01-20  119  extern struct cpumask __cpu_active_mask;
e40f74c535b8a0e Peter Zijlstra   2021-01-19  120  extern struct cpumask __cpu_dying_mask;
5aec01b834fd6f8 Rasmus Villemoes 2016-01-20  121  #define cpu_possible_mask ((const struct cpumask *)&__cpu_possible_mask)
5aec01b834fd6f8 Rasmus Villemoes 2016-01-20  122  #define cpu_online_mask   ((const struct cpumask *)&__cpu_online_mask)
4e1a7df4548003f James Morse      2024-05-29  123  #define cpu_enabled_mask   ((const struct cpumask *)&__cpu_enabled_mask)
5aec01b834fd6f8 Rasmus Villemoes 2016-01-20  124  #define cpu_present_mask  ((const struct cpumask *)&__cpu_present_mask)
5aec01b834fd6f8 Rasmus Villemoes 2016-01-20 @125  #define cpu_active_mask   ((const struct cpumask *)&__cpu_active_mask)
e40f74c535b8a0e Peter Zijlstra   2021-01-19  126  #define cpu_dying_mask    ((const struct cpumask *)&__cpu_dying_mask)
b3199c025d1646e Rusty Russell    2008-12-30  127  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [PATCHv2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
Posted by kernel test robot 3 months, 3 weeks ago
Hi Pingfan,

kernel test robot noticed the following build errors:

[auto build test ERROR on tj-cgroup/for-next]
[also build test ERROR on tip/sched/core tip/master linus/master v6.18-rc1 next-20251016]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Pingfan-Liu/sched-deadline-Walk-up-cpuset-hierarchy-to-decide-root-domain-when-hot-unplug/20251016-200452
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link:    https://lore.kernel.org/r/20251016120041.17990-1-piliu%40redhat.com
patch subject: [PATCHv2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
config: s390-allnoconfig (https://download.01.org/0day-ci/archive/20251017/202510170932.nTEduJGM-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 754ebc6ebb9fb9fbee7aef33478c74ea74949853)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251017/202510170932.nTEduJGM-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510170932.nTEduJGM-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from init/main.c:53:
>> include/linux/cpuset.h:282:9: error: returning 'const struct cpumask *' from a function with result type 'struct cpumask *' discards qualifiers [-Werror,-Wincompatible-pointer-types-discards-qualifiers]
     282 |         return cpu_active_mask;
         |                ^~~~~~~~~~~~~~~
   include/linux/cpumask.h:125:27: note: expanded from macro 'cpu_active_mask'
     125 | #define cpu_active_mask   ((const struct cpumask *)&__cpu_active_mask)
         |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   1 error generated.


vim +282 include/linux/cpuset.h

   279	
   280	static inline struct cpumask *cpuset_task_rd_effective_cpus(struct task_struct *p)
   281	{
 > 282		return cpu_active_mask;
   283	}
   284	
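
A minimal way to silence this (an untested sketch, not part of the
robot report) is to constify the helper's return type, since both
cpu_active_mask and a cpuset's effective_cpus can be handed out as
const pointers:

	/* include/linux/cpuset.h -- sketch only */
	extern const struct cpumask *cpuset_task_rd_effective_cpus(struct task_struct *p);

	static inline const struct cpumask *cpuset_task_rd_effective_cpus(struct task_struct *p)
	{
		return cpu_active_mask;	/* now matches the const-qualified macro */
	}

The out-of-line definition in kernel/cgroup/cpuset.c and the 'msk'
local in dl_add_task_root_domain() would need the same const treatment;
cpumask_first_and() already takes const masks.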

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [PATCHv2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
Posted by Pierre Gondois 3 months, 3 weeks ago
Hello Pingfan,

On 10/16/25 14:00, Pingfan Liu wrote:
> When testing kexec-reboot on a 144 cpus machine with
> isolcpus=managed_irq,domain,1-71,73-143 in kernel command line, I
> encounter the following bug:
>
> [   97.114759] psci: CPU142 killed (polled 0 ms)
> [   97.333236] Failed to offline CPU143 - error=-16
> [   97.333246] ------------[ cut here ]------------
> [   97.342682] kernel BUG at kernel/cpu.c:1569!
> [   97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> [   97.353281] Modules linked in: rfkill sunrpc dax_hmem cxl_acpi cxl_port cxl_core einj vfat fat arm_smmuv3_pmu nvidia_cspmu arm_spe_pmu coresight_trbe arm_cspmu_module rndis_host ipmi_ssif cdc_ether i2c_smbus spi_nor usbnet ast coresight_tmc mii ixgbe i2c_algo_bit mdio mtd coresight_funnel coresight_stm stm_core coresight_etm4x coresight cppc_cpufreq loop fuse nfnetlink xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt nvme nvme_core nvme_auth i2c_tegra acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler dm_mirror dm_region_hash dm_log dm_mod
> [   97.404119] CPU: 0 UID: 0 PID: 2583 Comm: kexec Kdump: loaded Not tainted 6.12.0-41.el10.aarch64 #1
> [   97.413371] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 2.0 07/12/2024
> [   97.420400] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> [   97.427518] pc : smp_shutdown_nonboot_cpus+0x104/0x128
> [   97.432778] lr : smp_shutdown_nonboot_cpus+0x11c/0x128
> [   97.438028] sp : ffff800097c6b9a0
> [   97.441411] x29: ffff800097c6b9a0 x28: ffff0000a099d800 x27: 0000000000000000
> [   97.448708] x26: 0000000000000000 x25: 0000000000000000 x24: ffffb94aaaa8f218
> [   97.456004] x23: ffffb94aaaabaae0 x22: ffffb94aaaa8f018 x21: 0000000000000000
> [   97.463301] x20: ffffb94aaaa8fc10 x19: 000000000000008f x18: 00000000fffffffe
> [   97.470598] x17: 0000000000000000 x16: ffffb94aa958fcd0 x15: ffff103acfca0b64
> [   97.477894] x14: ffff800097c6b520 x13: 36312d3d726f7272 x12: ffff103acfc6ffa8
> [   97.485191] x11: ffff103acf6f0000 x10: ffff103bc085c400 x9 : ffffb94aa88a0eb0
> [   97.492488] x8 : 0000000000000001 x7 : 000000000017ffe8 x6 : c0000000fffeffff
> [   97.499784] x5 : ffff003bdf62b408 x4 : 0000000000000000 x3 : 0000000000000000
> [   97.507081] x2 : 0000000000000000 x1 : ffff0000a099d800 x0 : 0000000000000002
> [   97.514379] Call trace:
> [   97.516874]  smp_shutdown_nonboot_cpus+0x104/0x128
> [   97.521769]  machine_shutdown+0x20/0x38
> [   97.525693]  kernel_kexec+0xc4/0xf0
> [   97.529260]  __do_sys_reboot+0x24c/0x278
> [   97.533272]  __arm64_sys_reboot+0x2c/0x40
> [   97.537370]  invoke_syscall.constprop.0+0x74/0xd0
> [   97.542179]  do_el0_svc+0xb0/0xe8
> [   97.545562]  el0_svc+0x44/0x1d0
> [   97.548772]  el0t_64_sync_handler+0x120/0x130
> [   97.553222]  el0t_64_sync+0x1a4/0x1a8
> [   97.556963] Code: a94363f7 a8c47bfd d50323bf d65f03c0 (d4210000)
> [   97.563191] ---[ end trace 0000000000000000 ]---
> [   97.595854] Kernel panic - not syncing: Oops - BUG: Fatal exception
> [   97.602275] Kernel Offset: 0x394a28600000 from 0xffff800080000000
> [   97.608502] PHYS_OFFSET: 0x80000000
> [   97.612062] CPU features: 0x10,0000000d,002a6928,5667fea7
> [   97.617580] Memory Limit: none
> [   97.648626] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]
>
> Tracking down this issue, I found that dl_bw_deactivate() returned
> -EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
> When a CPU is inactive, its rd is set to def_root_domain. For an
> blocked-state deadline task (in this case, "cppc_fie"), it was not
> migrated to CPU0, and its task_rq() information is stale. As a result,
> its bandwidth is wrongly accounted into def_root_domain during domain
> rebuild.
>
> The following rules stand for deadline sub-system:
>    -1.any cpu belongs to a unique root domain at a given time
>    -2.DL bandwidth checker ensures that the root domain has active cpus.
> And for active cpu, cpu_rq(cpu)->rd always tracks a valid root domain.
>
> Now, let's examine the blocked-state task P.
> If P is attached to a cpuset that is a partition root, it is
> straightforward to find an active CPU.
> If P is attached to a cpuset which later has changed from 'root' to 'member',
> the active CPUs are grouped into the parent root domain. Naturally, the
> CPUs' capacity and reserved DL bandwidth are taken into account in the
> parent root domain. (In practice, it may be unsafe to attach P to an
> arbitrary root domain, since that domain may lack sufficient DL
> bandwidth for P.) Again, it is straightforward to find an active CPU in
> the parent root domain. (parent root domain means the first ancestor
> cpuset which is partition root)
>
> This patch walks up the cpuset hierarchy to find the active CPUs in P's
> root domain and retrieves valid rd from cpu_rq(cpu)->rd.
>
> Signed-off-by: Pingfan Liu <piliu@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Pierre Gondois <pierre.gondois@arm.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Valentin Schneider <vschneid@redhat.com>
> To: linux-kernel@vger.kernel.org
> ---
>   include/linux/cpuset.h  |  6 ++++++
>   kernel/cgroup/cpuset.c  | 15 +++++++++++++++
>   kernel/sched/deadline.c | 30 ++++++++++++++++++++++++------
>   3 files changed, 45 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 2ddb256187b51..478ae68bdfc8f 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -130,6 +130,7 @@ extern void rebuild_sched_domains(void);
>   
>   extern void cpuset_print_current_mems_allowed(void);
>   extern void cpuset_reset_sched_domains(void);
> +extern struct cpumask *cpuset_task_rd_effective_cpus(struct task_struct *p);
>   
>   /*
>    * read_mems_allowed_begin is required when making decisions involving
> @@ -276,6 +277,11 @@ static inline void cpuset_reset_sched_domains(void)
>   	partition_sched_domains(1, NULL, NULL);
>   }
>   
> +static inline struct cpumask *cpuset_task_rd_effective_cpus(struct task_struct *p)
> +{
> +	return cpu_active_mask;
> +}
> +
>   static inline void cpuset_print_current_mems_allowed(void)
>   {
>   }
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 27adb04df675d..25356d3f9d635 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1102,6 +1102,21 @@ void cpuset_reset_sched_domains(void)
>   	mutex_unlock(&cpuset_mutex);
>   }
>   
> +/* caller hold RCU read lock */
> +struct cpumask *cpuset_task_rd_effective_cpus(struct task_struct *p)
> +{
> +	struct cpuset *cs;
> +
> +	cs = task_cs(p);
> +	while (cs != &top_cpuset) {
> +		if (is_sched_load_balance(cs))
> +			break;
> +		cs = parent_cs(cs);
> +	}
> +
> +	return cs->effective_cpus;
> +}
> +
>   /**
>    * cpuset_update_tasks_cpumask - Update the cpumasks of tasks in the cpuset.
>    * @cs: the cpuset in which each task's cpus_allowed mask needs to be changed
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 72c1f72463c75..fe0aec279c19a 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2884,6 +2884,8 @@ void dl_add_task_root_domain(struct task_struct *p)
>   	struct rq_flags rf;
>   	struct rq *rq;
>   	struct dl_bw *dl_b;
> +	unsigned int cpu;
> +	struct cpumask *msk;
>   
>   	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
>   	if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
> @@ -2891,16 +2893,32 @@ void dl_add_task_root_domain(struct task_struct *p)
>   		return;
>   	}
>   
> -	rq = __task_rq_lock(p, &rf);
> -
> +	/* prevent race among cpu hotplug, changing of partition_root_state */
> +	lockdep_assert_cpus_held();
> +	/*
> +	 * If @p is in blocked state, task_cpu() may be not active. In that
> +	 * case, rq->rd does not trace a correct root_domain. On the other hand,
> +	 * @p must belong to an root_domain at any given time, which must have
> +	 * active rq, whose rq->rd traces the valid root domain.
> +	 */
> +	msk = cpuset_task_rd_effective_cpus(p);

For the cppc_fie worker, msk doesn't seem to exclude the isolated CPUs.
The patch seems to work on my setup, but only because the first active
CPU is selected. CPU0 is likely the primary CPU which is offlined last.

IIUC, this patch should work even if we select the last CPU of the
resulting mask, but it fails on my setup:

cpumask_and(msk, cpu_active_mask, msk0);
cpu = cpumask_last(msk);

------

Also, just to note (as this might be another topic): the patch doesn't
solve the case where many deadline tasks are created first:
   chrt -d -T 1000000 -P 1000000 0 yes > /dev/null &

and we then kexec to another Image

> +	cpu = cpumask_first_and(cpu_active_mask, msk);
> +	/*
> +	 * If a root domain reserves bandwidth for a DL task, the DL bandwidth
> +	 * check prevents CPU hot removal from deactivating all CPUs in that
> +	 * domain.
> +	 */
> +	BUG_ON(cpu >= nr_cpu_ids);
> +	rq = cpu_rq(cpu);
> +	/*
> +	 * This point is under the protection of cpu_hotplug_lock. Hence
> +	 * rq->rd is stable.
> +	 */
>   	dl_b = &rq->rd->dl_bw;
>   	raw_spin_lock(&dl_b->lock);
> -
>   	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
> -
>   	raw_spin_unlock(&dl_b->lock);
> -
> -	task_rq_unlock(rq, p, &rf);
> +	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
>   }
>   
>   void dl_clear_root_domain(struct root_domain *rd)
Re: [PATCHv2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
Posted by Pingfan Liu 3 months, 3 weeks ago
Hi Pierre,

Thanks for your careful test and analysis. Please see the comments below.

On Thu, Oct 16, 2025 at 11:30 PM Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> Hello Pingfan,
>
> On 10/16/25 14:00, Pingfan Liu wrote:
> > When testing kexec-reboot on a 144 cpus machine with
> > isolcpus=managed_irq,domain,1-71,73-143 in kernel command line, I
> > encounter the following bug:
> >
> > [   97.114759] psci: CPU142 killed (polled 0 ms)
> > [   97.333236] Failed to offline CPU143 - error=-16
> > [   97.333246] ------------[ cut here ]------------
> > [   97.342682] kernel BUG at kernel/cpu.c:1569!
> > [   97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> > [   97.353281] Modules linked in: rfkill sunrpc dax_hmem cxl_acpi cxl_port cxl_core einj vfat fat arm_smmuv3_pmu nvidia_cspmu arm_spe_pmu coresight_trbe arm_cspmu_module rndis_host ipmi_ssif cdc_ether i2c_smbus spi_nor usbnet ast coresight_tmc mii ixgbe i2c_algo_bit mdio mtd coresight_funnel coresight_stm stm_core coresight_etm4x coresight cppc_cpufreq loop fuse nfnetlink xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt nvme nvme_core nvme_auth i2c_tegra acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler dm_mirror dm_region_hash dm_log dm_mod
> > [   97.404119] CPU: 0 UID: 0 PID: 2583 Comm: kexec Kdump: loaded Not tainted 6.12.0-41.el10.aarch64 #1
> > [   97.413371] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 2.0 07/12/2024
> > [   97.420400] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> > [   97.427518] pc : smp_shutdown_nonboot_cpus+0x104/0x128
> > [   97.432778] lr : smp_shutdown_nonboot_cpus+0x11c/0x128
> > [   97.438028] sp : ffff800097c6b9a0
> > [   97.441411] x29: ffff800097c6b9a0 x28: ffff0000a099d800 x27: 0000000000000000
> > [   97.448708] x26: 0000000000000000 x25: 0000000000000000 x24: ffffb94aaaa8f218
> > [   97.456004] x23: ffffb94aaaabaae0 x22: ffffb94aaaa8f018 x21: 0000000000000000
> > [   97.463301] x20: ffffb94aaaa8fc10 x19: 000000000000008f x18: 00000000fffffffe
> > [   97.470598] x17: 0000000000000000 x16: ffffb94aa958fcd0 x15: ffff103acfca0b64
> > [   97.477894] x14: ffff800097c6b520 x13: 36312d3d726f7272 x12: ffff103acfc6ffa8
> > [   97.485191] x11: ffff103acf6f0000 x10: ffff103bc085c400 x9 : ffffb94aa88a0eb0
> > [   97.492488] x8 : 0000000000000001 x7 : 000000000017ffe8 x6 : c0000000fffeffff
> > [   97.499784] x5 : ffff003bdf62b408 x4 : 0000000000000000 x3 : 0000000000000000
> > [   97.507081] x2 : 0000000000000000 x1 : ffff0000a099d800 x0 : 0000000000000002
> > [   97.514379] Call trace:
> > [   97.516874]  smp_shutdown_nonboot_cpus+0x104/0x128
> > [   97.521769]  machine_shutdown+0x20/0x38
> > [   97.525693]  kernel_kexec+0xc4/0xf0
> > [   97.529260]  __do_sys_reboot+0x24c/0x278
> > [   97.533272]  __arm64_sys_reboot+0x2c/0x40
> > [   97.537370]  invoke_syscall.constprop.0+0x74/0xd0
> > [   97.542179]  do_el0_svc+0xb0/0xe8
> > [   97.545562]  el0_svc+0x44/0x1d0
> > [   97.548772]  el0t_64_sync_handler+0x120/0x130
> > [   97.553222]  el0t_64_sync+0x1a4/0x1a8
> > [   97.556963] Code: a94363f7 a8c47bfd d50323bf d65f03c0 (d4210000)
> > [   97.563191] ---[ end trace 0000000000000000 ]---
> > [   97.595854] Kernel panic - not syncing: Oops - BUG: Fatal exception
> > [   97.602275] Kernel Offset: 0x394a28600000 from 0xffff800080000000
> > [   97.608502] PHYS_OFFSET: 0x80000000
> > [   97.612062] CPU features: 0x10,0000000d,002a6928,5667fea7
> > [   97.617580] Memory Limit: none
> > [   97.648626] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]
> >
> > Tracking down this issue, I found that dl_bw_deactivate() returned
> > -EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
> > When a CPU is inactive, its rd is set to def_root_domain. For an
> > blocked-state deadline task (in this case, "cppc_fie"), it was not
> > migrated to CPU0, and its task_rq() information is stale. As a result,
> > its bandwidth is wrongly accounted into def_root_domain during domain
> > rebuild.
> >
> > The following rules stand for deadline sub-system:
> >    -1.any cpu belongs to a unique root domain at a given time
> >    -2.DL bandwidth checker ensures that the root domain has active cpus.
> > And for active cpu, cpu_rq(cpu)->rd always tracks a valid root domain.
> >
> > Now, let's examine the blocked-state task P.
> > If P is attached to a cpuset that is a partition root, it is
> > straightforward to find an active CPU.
> > If P is attached to a cpuset which later has changed from 'root' to 'member',
> > the active CPUs are grouped into the parent root domain. Naturally, the
> > CPUs' capacity and reserved DL bandwidth are taken into account in the
> > parent root domain. (In practice, it may be unsafe to attach P to an
> > arbitrary root domain, since that domain may lack sufficient DL
> > bandwidth for P.) Again, it is straightforward to find an active CPU in
> > the parent root domain. (parent root domain means the first ancestor
> > cpuset which is partition root)
> >
> > This patch walks up the cpuset hierarchy to find the active CPUs in P's
> > root domain and retrieves valid rd from cpu_rq(cpu)->rd.
> >
> > Signed-off-by: Pingfan Liu <piliu@redhat.com>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Juri Lelli <juri.lelli@redhat.com>
> > Cc: Pierre Gondois <pierre.gondois@arm.com>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> > To: linux-kernel@vger.kernel.org
> > ---
> >   include/linux/cpuset.h  |  6 ++++++
> >   kernel/cgroup/cpuset.c  | 15 +++++++++++++++
> >   kernel/sched/deadline.c | 30 ++++++++++++++++++++++++------
> >   3 files changed, 45 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> > index 2ddb256187b51..478ae68bdfc8f 100644
> > --- a/include/linux/cpuset.h
> > +++ b/include/linux/cpuset.h
> > @@ -130,6 +130,7 @@ extern void rebuild_sched_domains(void);
> >
> >   extern void cpuset_print_current_mems_allowed(void);
> >   extern void cpuset_reset_sched_domains(void);
> > +extern struct cpumask *cpuset_task_rd_effective_cpus(struct task_struct *p);
> >
> >   /*
> >    * read_mems_allowed_begin is required when making decisions involving
> > @@ -276,6 +277,11 @@ static inline void cpuset_reset_sched_domains(void)
> >       partition_sched_domains(1, NULL, NULL);
> >   }
> >
> > +static inline struct cpumask *cpuset_task_rd_effective_cpus(struct task_struct *p)
> > +{
> > +     return cpu_active_mask;
> > +}
> > +
> >   static inline void cpuset_print_current_mems_allowed(void)
> >   {
> >   }
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index 27adb04df675d..25356d3f9d635 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -1102,6 +1102,21 @@ void cpuset_reset_sched_domains(void)
> >       mutex_unlock(&cpuset_mutex);
> >   }
> >
> > +/* caller hold RCU read lock */
> > +struct cpumask *cpuset_task_rd_effective_cpus(struct task_struct *p)
> > +{
> > +     struct cpuset *cs;
> > +
> > +     cs = task_cs(p);
> > +     while (cs != &top_cpuset) {
> > +             if (is_sched_load_balance(cs))
> > +                     break;
> > +             cs = parent_cs(cs);
> > +     }
> > +
> > +     return cs->effective_cpus;
> > +}
> > +
> >   /**
> >    * cpuset_update_tasks_cpumask - Update the cpumasks of tasks in the cpuset.
> >    * @cs: the cpuset in which each task's cpus_allowed mask needs to be changed
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 72c1f72463c75..fe0aec279c19a 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -2884,6 +2884,8 @@ void dl_add_task_root_domain(struct task_struct *p)
> >       struct rq_flags rf;
> >       struct rq *rq;
> >       struct dl_bw *dl_b;
> > +     unsigned int cpu;
> > +     struct cpumask *msk;
> >
> >       raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
> >       if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
> > @@ -2891,16 +2893,32 @@ void dl_add_task_root_domain(struct task_struct *p)
> >               return;
> >       }
> >
> > -     rq = __task_rq_lock(p, &rf);
> > -
> > +     /* prevent race among cpu hotplug, changing of partition_root_state */
> > +     lockdep_assert_cpus_held();
> > +     /*
> > +      * If @p is in blocked state, task_cpu() may be not active. In that
> > +      * case, rq->rd does not trace a correct root_domain. On the other hand,
> > +      * @p must belong to an root_domain at any given time, which must have
> > +      * active rq, whose rq->rd traces the valid root domain.
> > +      */
> > +     msk = cpuset_task_rd_effective_cpus(p);
>
> For the cppc_fie worker, msk doesn't seem to exclude the isolated CPUs.

You are right. I was distracted by cpuset.partition_root_state. But a
root_domain can be created from two sources: cpuset.partition_root_state
and the isolcpus=domain command line. My patch overlooks the latter, so
in the end it returns top_cpuset.effective_cpus.

I think that, before diving into the cpuset hierarchy,
housekeeping_cpumask(HK_TYPE_DOMAIN) should be used as the first
filter. With it, the isolated CPUs are put aside and the remaining CPUs
obey the cpuset configuration.
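
Something along these lines, as a rough and untested sketch of the idea
(variable names follow the patch; the selection would live in
dl_add_task_root_domain()):

	const struct cpumask *hk = housekeeping_cpumask(HK_TYPE_DOMAIN);
	const struct cpumask *msk = cpuset_task_rd_effective_cpus(p);
	unsigned int cpu = nr_cpu_ids;
	unsigned int i;

	/*
	 * CPUs isolated via isolcpus=domain never take part in a
	 * cpuset-built root domain, so skip them up front and only then
	 * honour the cpuset's effective_cpus.
	 */
	for_each_cpu_and(i, cpu_active_mask, hk) {
		if (cpumask_test_cpu(i, msk)) {
			cpu = i;
			break;
		}
	}
	BUG_ON(cpu >= nr_cpu_ids);

This way the result should no longer depend on whether the first or the
last matching CPU is picked.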

> The patch seems to work on my setup, but only because the first active
> CPU is selected. CPU0 is likely the primary CPU which is offlined last.
>
> IIUC, this patch should work even if we select the last CPU of resulting
> mask,
> but it fails on my setup:
>
> cpumask_and(msk, cpu_active_mask, msk0);
> cpu = cpumask_last(msk);
>

Good catch. Again, I really appreciate your careful testing and analysis.

> ------
>
> Also, just to note (as this might be another topic), but the patch
> doesn't solve
> the case where many deadline tasks are created first:
>    chrt -d -T 1000000 -P 1000000 0 yes > /dev/null &
>
> and we then kexec to another Image
>

Yes, it is not fixed in this patch. And I have another draft patch for
that issue.


Thanks,

Pingfan

> > +     cpu = cpumask_first_and(cpu_active_mask, msk);
> > +     /*
> > +      * If a root domain reserves bandwidth for a DL task, the DL bandwidth
> > +      * check prevents CPU hot removal from deactivating all CPUs in that
> > +      * domain.
> > +      */
> > +     BUG_ON(cpu >= nr_cpu_ids);
> > +     rq = cpu_rq(cpu);
> > +     /*
> > +      * This point is under the protection of cpu_hotplug_lock. Hence
> > +      * rq->rd is stable.
> > +      */
> >       dl_b = &rq->rd->dl_bw;
> >       raw_spin_lock(&dl_b->lock);
> > -
> >       __dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
> > -
> >       raw_spin_unlock(&dl_b->lock);
> > -
> > -     task_rq_unlock(rq, p, &rf);
> > +     raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
> >   }
> >
> >   void dl_clear_root_domain(struct root_domain *rd)
>