[PATCH] sched/deadline: Derive root domain from active cpu in task's cpus_ptr

Pingfan Liu posted 1 patch 2 days, 7 hours ago
kernel/sched/deadline.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)
Posted by Pingfan Liu 2 days, 7 hours ago
When testing kexec-reboot on a 144-CPU machine with
isolcpus=managed_irq,domain,1-71,73-143 on the kernel command line, I
encountered the following bug:

[   97.114759] psci: CPU142 killed (polled 0 ms)
[   97.333236] Failed to offline CPU143 - error=-16
[   97.333246] ------------[ cut here ]------------
[   97.342682] kernel BUG at kernel/cpu.c:1569!
[   97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[   97.353281] Modules linked in: rfkill sunrpc dax_hmem cxl_acpi cxl_port cxl_core einj vfat fat arm_smmuv3_pmu nvidia_cspmu arm_spe_pmu coresight_trbe arm_cspmu_module rndis_host ipmi_ssif cdc_ether i2c_smbus spi_nor usbnet ast coresight_tmc mii ixgbe i2c_algo_bit mdio mtd coresight_funnel coresight_stm stm_core coresight_etm4x coresight cppc_cpufreq loop fuse nfnetlink xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt nvme nvme_core nvme_auth i2c_tegra acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler dm_mirror dm_region_hash dm_log dm_mod
[   97.404119] CPU: 0 UID: 0 PID: 2583 Comm: kexec Kdump: loaded Not tainted 6.12.0-41.el10.aarch64 #1
[   97.413371] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 2.0 07/12/2024
[   97.420400] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[   97.427518] pc : smp_shutdown_nonboot_cpus+0x104/0x128
[   97.432778] lr : smp_shutdown_nonboot_cpus+0x11c/0x128
[   97.438028] sp : ffff800097c6b9a0
[   97.441411] x29: ffff800097c6b9a0 x28: ffff0000a099d800 x27: 0000000000000000
[   97.448708] x26: 0000000000000000 x25: 0000000000000000 x24: ffffb94aaaa8f218
[   97.456004] x23: ffffb94aaaabaae0 x22: ffffb94aaaa8f018 x21: 0000000000000000
[   97.463301] x20: ffffb94aaaa8fc10 x19: 000000000000008f x18: 00000000fffffffe
[   97.470598] x17: 0000000000000000 x16: ffffb94aa958fcd0 x15: ffff103acfca0b64
[   97.477894] x14: ffff800097c6b520 x13: 36312d3d726f7272 x12: ffff103acfc6ffa8
[   97.485191] x11: ffff103acf6f0000 x10: ffff103bc085c400 x9 : ffffb94aa88a0eb0
[   97.492488] x8 : 0000000000000001 x7 : 000000000017ffe8 x6 : c0000000fffeffff
[   97.499784] x5 : ffff003bdf62b408 x4 : 0000000000000000 x3 : 0000000000000000
[   97.507081] x2 : 0000000000000000 x1 : ffff0000a099d800 x0 : 0000000000000002
[   97.514379] Call trace:
[   97.516874]  smp_shutdown_nonboot_cpus+0x104/0x128
[   97.521769]  machine_shutdown+0x20/0x38
[   97.525693]  kernel_kexec+0xc4/0xf0
[   97.529260]  __do_sys_reboot+0x24c/0x278
[   97.533272]  __arm64_sys_reboot+0x2c/0x40
[   97.537370]  invoke_syscall.constprop.0+0x74/0xd0
[   97.542179]  do_el0_svc+0xb0/0xe8
[   97.545562]  el0_svc+0x44/0x1d0
[   97.548772]  el0t_64_sync_handler+0x120/0x130
[   97.553222]  el0t_64_sync+0x1a4/0x1a8
[   97.556963] Code: a94363f7 a8c47bfd d50323bf d65f03c0 (d4210000)
[   97.563191] ---[ end trace 0000000000000000 ]---
[   97.595854] Kernel panic - not syncing: Oops - BUG: Fatal exception
[   97.602275] Kernel Offset: 0x394a28600000 from 0xffff800080000000
[   97.608502] PHYS_OFFSET: 0x80000000
[   97.612062] CPU features: 0x10,0000000d,002a6928,5667fea7
[   97.617580] Memory Limit: none
[   97.648626] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]

Tracking down this issue, I found that dl_bw_deactivate() returned
-EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
When a CPU is inactive, its rd is set to def_root_domain. A blocked
(S-state) deadline task (in this case, "cppc_fie") was not migrated to
CPU0, so its task_rq() information is stale. As a result, its bandwidth
is wrongly accounted to def_root_domain during the domain rebuild.

This patch takes the rd from the run queue of a still-active CPU in the
task's cpus_ptr to get the correct root domain.

Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
To: linux-kernel@vger.kernel.org
---
 kernel/sched/deadline.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f25301267e47..bb42b82d6366 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2913,6 +2913,7 @@ void dl_add_task_root_domain(struct task_struct *p)
 	struct rq_flags rf;
 	struct rq *rq;
 	struct dl_bw *dl_b;
+	unsigned int cpu;
 
 	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
 	if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
@@ -2920,16 +2921,23 @@ void dl_add_task_root_domain(struct task_struct *p)
 		return;
 	}
 
-	rq = __task_rq_lock(p, &rf);
-
+	lockdep_assert_cpus_held();
+	/*
+	 * If @p is not in R state, task_cpu() may not be active, and
+	 * task_rq()'s root_domain may be invalid. But the remaining active
+	 * CPUs in cpus_ptr share the same root domain.
+	 */
+	cpu = cpumask_first_and(cpu_active_mask, p->cpus_ptr);
+	rq = cpu_rq(cpu);
+	/*
+	 * This point is under the protection of cpu_hotplug_lock. Hence
+	 * rq->rd is stable.
+	 */
 	dl_b = &rq->rd->dl_bw;
 	raw_spin_lock(&dl_b->lock);
-
 	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
-
 	raw_spin_unlock(&dl_b->lock);
-
-	task_rq_unlock(rq, p, &rf);
+	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
 }
 
 void dl_clear_root_domain(struct root_domain *rd)
-- 
2.49.0
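
The CPU selection step in the hunk above can be modelled in userspace. This is a minimal sketch and not kernel code: MODEL_NR_CPUS, model_first_and() and model_pick_cpu() are hypothetical names, and plain 64-bit words stand in for cpu_active_mask and p->cpus_ptr. Like cpumask_first_and(), the helper returns an out-of-range index when the two masks do not intersect, so the sketch checks the result before it would be used to index a run queue.

```c
#include <stdint.h>

#define MODEL_NR_CPUS 8	/* hypothetical; small so a mask fits in one word */

/*
 * Userspace stand-in for cpumask_first_and(): index of the lowest bit
 * set in both masks, or MODEL_NR_CPUS when the intersection is empty.
 */
static unsigned int model_first_and(uint64_t a, uint64_t b)
{
	uint64_t both = a & b;
	unsigned int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (both & (UINT64_C(1) << cpu))
			return cpu;
	return MODEL_NR_CPUS;
}

/* Returns the chosen CPU, or -1 if no allowed CPU is still active. */
static int model_pick_cpu(uint64_t active_mask, uint64_t allowed_mask)
{
	unsigned int cpu = model_first_and(active_mask, allowed_mask);

	return cpu < MODEL_NR_CPUS ? (int)cpu : -1;
}
```

With only CPU0 active and a task allowed on CPUs 0-7, the model picks CPU0. With only CPU0 active and a task restricted to CPUs 4-7, it returns -1 rather than an out-of-range index.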
Re: [PATCH] sched/deadline: Derive root domain from active cpu in task's cpus_ptr
Posted by Juri Lelli 2 days, 6 hours ago
Hello!

On 29/09/25 21:36, Pingfan Liu wrote:
> When testing kexec-reboot on a 144 cpus machine with
> isolcpus=managed_irq,domain,1-71,73-143 in kernel command line, I
> encounter the following bug:
> 
> [   97.114759] psci: CPU142 killed (polled 0 ms)
> [   97.333236] Failed to offline CPU143 - error=-16
> [   97.333246] ------------[ cut here ]------------
> [   97.342682] kernel BUG at kernel/cpu.c:1569!
> [   97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> [   97.353281] Modules linked in: rfkill sunrpc dax_hmem cxl_acpi cxl_port cxl_core einj vfat fat arm_smmuv3_pmu nvidia_cspmu arm_spe_pmu coresight_trbe arm_cspmu_module rndis_host ipmi_ssif cdc_ether i2c_smbus spi_nor usbnet ast coresight_tmc mii ixgbe i2c_algo_bit mdio mtd coresight_funnel coresight_stm stm_core coresight_etm4x coresight cppc_cpufreq loop fuse nfnetlink xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt nvme nvme_core nvme_auth i2c_tegra acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler dm_mirror dm_region_hash dm_log dm_mod
> [   97.404119] CPU: 0 UID: 0 PID: 2583 Comm: kexec Kdump: loaded Not tainted 6.12.0-41.el10.aarch64 #1

Could you please confirm this is still reproducible with plain upstream
(e5f0a698b34e ("Linux 6.17") as of today)? I just wonder if we might be
missing some of the recent fixes around SCHED_DEADLINE.

Thanks,
Juri
Re: [PATCH] sched/deadline: Derive root domain from active cpu in task's cpus_ptr
Posted by Pingfan Liu 1 day, 19 hours ago
Hi Juri,

On Mon, Sep 29, 2025 at 10:37 PM Juri Lelli <juri.lelli@redhat.com> wrote:
>
> Hello!
>
> On 29/09/25 21:36, Pingfan Liu wrote:
> > When testing kexec-reboot on a 144 cpus machine with
> > isolcpus=managed_irq,domain,1-71,73-143 in kernel command line, I
> > encounter the following bug:
> >
> > [   97.114759] psci: CPU142 killed (polled 0 ms)
> > [   97.333236] Failed to offline CPU143 - error=-16
> > [   97.333246] ------------[ cut here ]------------
> > [   97.342682] kernel BUG at kernel/cpu.c:1569!
> > [   97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> > [   97.353281] Modules linked in: rfkill sunrpc dax_hmem cxl_acpi cxl_port cxl_core einj vfat fat arm_smmuv3_pmu nvidia_cspmu arm_spe_pmu coresight_trbe arm_cspmu_module rndis_host ipmi_ssif cdc_ether i2c_smbus spi_nor usbnet ast coresight_tmc mii ixgbe i2c_algo_bit mdio mtd coresight_funnel coresight_stm stm_core coresight_etm4x coresight cppc_cpufreq loop fuse nfnetlink xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt nvme nvme_core nvme_auth i2c_tegra acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler dm_mirror dm_region_hash dm_log dm_mod
> > [   97.404119] CPU: 0 UID: 0 PID: 2583 Comm: kexec Kdump: loaded Not tainted 6.12.0-41.el10.aarch64 #1
>
> Could you please confirm this is still reproducible with plain upstream
> (e5f0a698b34e ("Linux 6.17") as of today)? I just wonder if we might be
> missing some of the recent fixes around SCHED_DEADLINE.
>

I can reproduce this bug with 9087e52ce85e ("Linux 6.17-rc7"). I believe
the most recent SCHED_DEADLINE fix is a3a70caf79067 ("sched/deadline:
Fix dl_server behaviour"), which is included in the -rc7 tag.

Is that good enough, or should I also test against e5f0a698b34e ("Linux 6.17")?

Thanks,

Pingfan
Re: [PATCH] sched/deadline: Derive root domain from active cpu in task's cpus_ptr
Posted by Peter Zijlstra 2 days, 7 hours ago
On Mon, Sep 29, 2025 at 09:36:02PM +0800, Pingfan Liu wrote:
> When testing kexec-reboot on a 144 cpus machine with
> isolcpus=managed_irq,domain,1-71,73-143 in kernel command line, I
> encounter the following bug:
> 
> [   97.114759] psci: CPU142 killed (polled 0 ms)
> [   97.333236] Failed to offline CPU143 - error=-16
> [   97.333246] ------------[ cut here ]------------
> [   97.342682] kernel BUG at kernel/cpu.c:1569!

> [   97.514379] Call trace:
> [   97.516874]  smp_shutdown_nonboot_cpus+0x104/0x128
> [   97.521769]  machine_shutdown+0x20/0x38
> [   97.525693]  kernel_kexec+0xc4/0xf0
> [   97.529260]  __do_sys_reboot+0x24c/0x278
> [   97.533272]  __arm64_sys_reboot+0x2c/0x40

> Tracking down this issue, I found that dl_bw_deactivate() returned
> -EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
> When a CPU is inactive, its rd is set to def_root_domain. For an S-state

You mean a blocked task?

> deadline task (in this case, "cppc_fie"), it was not migrated to CPU0,
> and its task_rq() information is stale. As a result, its bandwidth is
> wrongly accounted into def_root_domain during domain rebuild.
> 
> This patch uses the rd from the run queue of still-active CPU to get the
> correct root domain.

That doesn't seem right in general. What if there are multiple root
domains; how does it know which to use?
Re: [PATCH] sched/deadline: Derive root domain from active cpu in task's cpus_ptr
Posted by Pingfan Liu 1 day, 19 hours ago
Hi Peter,

On Mon, Sep 29, 2025 at 9:54 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Sep 29, 2025 at 09:36:02PM +0800, Pingfan Liu wrote:
> > When testing kexec-reboot on a 144 cpus machine with
> > isolcpus=managed_irq,domain,1-71,73-143 in kernel command line, I
> > encounter the following bug:
> >
> > [   97.114759] psci: CPU142 killed (polled 0 ms)
> > [   97.333236] Failed to offline CPU143 - error=-16
> > [   97.333246] ------------[ cut here ]------------
> > [   97.342682] kernel BUG at kernel/cpu.c:1569!
>
> > [   97.514379] Call trace:
> > [   97.516874]  smp_shutdown_nonboot_cpus+0x104/0x128
> > [   97.521769]  machine_shutdown+0x20/0x38
> > [   97.525693]  kernel_kexec+0xc4/0xf0
> > [   97.529260]  __do_sys_reboot+0x24c/0x278
> > [   97.533272]  __arm64_sys_reboot+0x2c/0x40
>
> > Tracking down this issue, I found that dl_bw_deactivate() returned
> > -EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
> > When a CPU is inactive, its rd is set to def_root_domain. For an S-state
>
> You mean a blocked task?
>

Yes.

> > deadline task (in this case, "cppc_fie"), it was not migrated to CPU0,
> > and its task_rq() information is stale. As a result, its bandwidth is
> > wrongly accounted into def_root_domain during domain rebuild.
> >
> > This patch uses the rd from the run queue of still-active CPU to get the
> > correct root domain.
>

Sorry that I didn't explain it clearly. I mean a still-active CPU
in task->cpus_ptr.

> That doesn't seem right in general. What if there are multiple root
> domains; how does it know which to use?
>

In the case of task->cpus_ptr, there should be only one root domain, right?

Thanks,

Pingfan
Re: [PATCH] sched/deadline: Derive root domain from active cpu in task's cpus_ptr
Posted by Peter Zijlstra 1 day, 12 hours ago
On Tue, Sep 30, 2025 at 09:47:33AM +0800, Pingfan Liu wrote:

> > > This patch uses the rd from the run queue of still-active CPU to get the
> > > correct root domain.
> >
> 
> Sorry that I haven't explained it clearly. I mean the still-active CPU
> in task->cpus_ptr,
> 
> > That doesn't seem right in general. What if there are multiple root
> > domains; how does it know which to use?
> >
> 
> In the case of task->cpus_ptr, there should be only one root domain, right?

IIRC there was a corner case somewhere; something like clearing the old
cpuset load_balance flag on the root domain would not iterate all tasks
or so.

The result would be tasks with all-set cpumasks (the default value)
spread over multiple root domains. Every task would be caught in
whatever root domain it was at the time of toggle.

This might have been fixed, but I can't remember.
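
The corner case described above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not kernel code: the four-CPU layout, the rd_of_cpu table and both helper names are hypothetical. It shows that picking the first active CPU in a task's mask is unambiguous only when every CPU in that mask maps to the same root domain.

```c
#include <stdint.h>

#define RD_NR_CPUS 4	/* hypothetical machine size */

/* Hypothetical layout: CPUs 0-1 in root domain 0, CPUs 2-3 in root domain 1. */
static const int rd_of_cpu[RD_NR_CPUS] = { 0, 0, 1, 1 };

/* Root domain of the first CPU set in both masks, or -1 if there is none. */
static int rd_of_first_cpu(uint64_t active, uint64_t allowed)
{
	uint64_t both = active & allowed;
	unsigned int cpu;

	for (cpu = 0; cpu < RD_NR_CPUS; cpu++)
		if (both & (UINT64_C(1) << cpu))
			return rd_of_cpu[cpu];
	return -1;
}

/* Returns 1 when every CPU set in both masks maps to one root domain. */
static int mask_in_single_rd(uint64_t active, uint64_t allowed)
{
	uint64_t both = active & allowed;
	int rd = -1;
	unsigned int cpu;

	for (cpu = 0; cpu < RD_NR_CPUS; cpu++) {
		if (!(both & (UINT64_C(1) << cpu)))
			continue;
		if (rd < 0)
			rd = rd_of_cpu[cpu];
		else if (rd != rd_of_cpu[cpu])
			return 0;
	}
	return 1;
}
```

In this model a task restricted to CPUs 2-3 resolves to root domain 1. A task with an all-set mask always resolves to root domain 0, even if it was last placed in root domain 1, and mask_in_single_rd() reports that its mask spans more than one domain.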
Re: [PATCH] sched/deadline: Derive root domain from active cpu in task's cpus_ptr
Posted by Pingfan Liu 8 hours ago

On Tue, Sep 30, 2025 at 5:03 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Sep 30, 2025 at 09:47:33AM +0800, Pingfan Liu wrote:
>
> > > > This patch uses the rd from the run queue of still-active CPU to get the
> > > > correct root domain.
> > >
> >
> > Sorry that I haven't explained it clearly. I mean the still-active CPU
> > in task->cpus_ptr,
> >
> > > That doesn't seem right in general. What if there are multiple root
> > > domains; how does it know which to use?
> > >
> >
> > In the case of task->cpus_ptr, there should be only one root domain, right?
>
> IIRC there was a corner case somewhere; something like clearing the old
> cpuset load_balance flag on the root domain would not iterate all tasks
> or so.
>
According to the current implementation, a root domain corresponds to
the topmost cpuset, other than top_cpuset, that has the load_balance
flag set. So at the top level, there should be several disjoint CPU
sets. If a top-level cpuset's load_balance flag is cleared, the rebuilt
root domain that covers this cpuset's CPUs should be the one
corresponding to top_cpuset. If this is true, I think there is always
one root domain.

> The result would be tasks with all-set cpumasks (the default value)
> spread over multiple root domains. Every task would be caught in
> whatever root domain it was at the time of toggle.
>

If the above is true, the tasks will have top_cpuset's
root_domain->span in cpus_ptr. And this corner case will be avoided.

Does that make sense?

Thanks,

Pingfan
Re: [PATCH] sched/deadline: Derive root domain from active cpu in task's cpus_ptr
Posted by Juri Lelli 1 day, 14 hours ago
On 30/09/25 09:47, Pingfan Liu wrote:
> Hi Peter,
> 
> On Mon, Sep 29, 2025 at 9:54 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Sep 29, 2025 at 09:36:02PM +0800, Pingfan Liu wrote:
> > > When testing kexec-reboot on a 144 cpus machine with
> > > isolcpus=managed_irq,domain,1-71,73-143 in kernel command line, I
> > > encounter the following bug:
> > >
> > > [   97.114759] psci: CPU142 killed (polled 0 ms)
> > > [   97.333236] Failed to offline CPU143 - error=-16
> > > [   97.333246] ------------[ cut here ]------------
> > > [   97.342682] kernel BUG at kernel/cpu.c:1569!
> >
> > > [   97.514379] Call trace:
> > > [   97.516874]  smp_shutdown_nonboot_cpus+0x104/0x128
> > > [   97.521769]  machine_shutdown+0x20/0x38
> > > [   97.525693]  kernel_kexec+0xc4/0xf0
> > > [   97.529260]  __do_sys_reboot+0x24c/0x278
> > > [   97.533272]  __arm64_sys_reboot+0x2c/0x40
> >
> > > Tracking down this issue, I found that dl_bw_deactivate() returned
> > > -EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
> > > When a CPU is inactive, its rd is set to def_root_domain. For an S-state
> >
> > You mean a blocked task?
> >
> 
> Yes.
> 
> > > deadline task (in this case, "cppc_fie"), it was not migrated to CPU0,
> > > and its task_rq() information is stale. As a result, its bandwidth is
> > > wrongly accounted into def_root_domain during domain rebuild.
> > >
> > > This patch uses the rd from the run queue of still-active CPU to get the
> > > correct root domain.
> >
> 
> Sorry that I haven't explained it clearly. I mean the still-active CPU
> in task->cpus_ptr,
> 
> > That doesn't seem right in general. What if there are multiple root
> > domains; how does it know which to use?
> >

I actually wonder if we shouldn't make cppc_fie a "special" DEADLINE
task (like schedutil [1]). IIUC that is how it is meant to behave
already [2], but, since it's missing the SCHED_FLAG_SUGOV flag(/hack),
it is not "transparent" from a bandwidth tracking point of view.

1 - https://elixir.bootlin.com/linux/v6.17/source/kernel/sched/cpufreq_schedutil.c#L661
2 - https://elixir.bootlin.com/linux/v6.17/source/drivers/cpufreq/cppc_cpufreq.c#L198

Re: [PATCH] sched/deadline: Derive root domain from active cpu in task's cpus_ptr
Posted by Peter Zijlstra 1 day, 12 hours ago
On Tue, Sep 30, 2025 at 08:20:06AM +0100, Juri Lelli wrote:

> I actually wonder if we shouldn't make cppc_fie a "special" DEADLINE
> tasks (like schedutil [1]). IIUC that is how it is thought to behave
> already [2], but, since it's missing the SCHED_FLAG_SUGOV flag(/hack),
> it is not "transparent" from a bandwidth tracking point of view.
> 
> 1 - https://elixir.bootlin.com/linux/v6.17/source/kernel/sched/cpufreq_schedutil.c#L661
> 2 - https://elixir.bootlin.com/linux/v6.17/source/drivers/cpufreq/cppc_cpufreq.c#L198

Right, I remember that hack. Bit sad it's spreading, but this CPPC thing
is very much like the schedutil one, so might as well do that I suppose.