The scheduler already supports CPU online/offline. However, for cases
where the scheduler has to take a CPU offline only temporarily, the
online/offline cost is too high. Hence this is an attempt to come up
with a soft-offline that looks almost like a regular offline without
actually doing the full offline. Since the CPUs are unused only
temporarily, for a short duration, they continue to be part of the CPU
topology.
On soft-offline, the CPU is marked inactive, i.e. removed from
cpu_active_mask, its capacity is reduced, and non-pinned tasks are
migrated off its runqueue.
Similarly, on soft-online, the CPU is marked active again, i.e. added
back to cpu_active_mask, and its capacity is restored.
Soft-offline is almost the same as the first step of offline, except
that the sched-domains are not rebuilt. Since the remaining steps,
including the sched-domain rebuild, are skipped, the overhead of
soft-offline is lower than that of a regular offline. A new cpumask
indicates that a soft-offline is in progress and is what makes
sched_cpu_{activate,deactivate}() skip rebuilding the sched-domains.
To push tasks off the CPU, balance_push() is modified to keep pushing
tasks as long as there are runnable tasks on the runqueue or the CPU is
in the dying state.
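For illustration, a minimal sketch of how a hypothetical user of the new
interface could drive it; the steal-time monitor and the "ceded" mask
below are assumptions made up for this example, not part of this patch:

  /*
   * Hypothetical caller of set_cpu_softoffline(); not part of this patch.
   * A steal-time monitor could soft-offline the CPUs the hypervisor wants
   * back and soft-online them again later, while the CPUs stay in the CPU
   * topology throughout.
   */
  #include <linux/cpumask.h>
  #include <linux/sched/topology.h>

  static void steal_monitor_adjust(const struct cpumask *ceded)
  {
          int cpu;

          for_each_online_cpu(cpu)
                  set_cpu_softoffline(cpu, cpumask_test_cpu(cpu, ceded));
  }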
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
---
include/linux/sched/topology.h | 1 +
kernel/sched/core.c | 44 ++++++++++++++++++++++++++++++----
2 files changed, 40 insertions(+), 5 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index bbcfdf12aa6e..ed45d7db3e76 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -241,4 +241,5 @@ static inline int task_node(const struct task_struct *p)
return cpu_to_node(task_cpu(p));
}
+extern void set_cpu_softoffline(int cpu, bool soft_offline);
#endif /* _LINUX_SCHED_TOPOLOGY_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 89efff1e1ead..f66fd1e925b0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8177,13 +8177,16 @@ static void balance_push(struct rq *rq)
* Only active while going offline and when invoked on the outgoing
* CPU.
*/
- if (!cpu_dying(rq->cpu) || rq != this_rq())
+ if (cpu_active(rq->cpu) || rq != this_rq())
return;
/*
- * Ensure the thing is persistent until balance_push_set(.on = false);
+ * Unless soft-offline, ensure the thing is persistent until
+ * balance_push_set(.on = false). In case of soft-offline, it is
+ * enough to push the current non-pinned tasks out.
*/
- rq->balance_callback = &balance_push_callback;
+ if (cpu_dying(rq->cpu) || rq->nr_running)
+ rq->balance_callback = &balance_push_callback;
/*
* Both the cpu-hotplug and stop task are in this case and are
@@ -8392,6 +8395,8 @@ static inline void sched_smt_present_dec(int cpu)
#endif
}
+static struct cpumask cpu_softoffline_mask;
+
int sched_cpu_activate(unsigned int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -8411,7 +8416,10 @@ int sched_cpu_activate(unsigned int cpu)
if (sched_smp_initialized) {
sched_update_numa(cpu, true);
sched_domains_numa_masks_set(cpu);
- cpuset_cpu_active();
+
+ /* For CPU soft-offline, no need to rebuild sched-domains */
+ if (!cpumask_test_cpu(cpu, &cpu_softoffline_mask))
+ cpuset_cpu_active();
}
scx_rq_activate(rq);
@@ -8485,7 +8493,11 @@ int sched_cpu_deactivate(unsigned int cpu)
return 0;
sched_update_numa(cpu, false);
- cpuset_cpu_inactive(cpu);
+
+ /* For CPU soft-offline, no need to rebuild sched-domains */
+ if (!cpumask_test_cpu(cpu, &cpu_softoffline_mask))
+ cpuset_cpu_inactive(cpu);
+
sched_domains_numa_masks_clear(cpu);
return 0;
}
@@ -10928,3 +10940,25 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
set_next_task(rq, ctx->p);
}
#endif /* CONFIG_SCHED_CLASS_EXT */
+
+void set_cpu_softoffline(int cpu, bool soft_offline)
+{
+ struct sched_domain *sd;
+
+ if (!cpu_online(cpu))
+ return;
+
+ cpumask_set_cpu(cpu, &cpu_softoffline_mask);
+
+ rcu_read_lock();
+ for_each_domain(cpu, sd)
+ update_group_capacity(sd, cpu);
+ rcu_read_unlock();
+
+ if (soft_offline)
+ sched_cpu_deactivate(cpu);
+ else
+ sched_cpu_activate(cpu);
+
+ cpumask_clear_cpu(cpu, &cpu_softoffline_mask);
+}
--
2.43.7
On Thu, Dec 04, 2025 at 11:23:56PM +0530, Srikar Dronamraju wrote:
> Soft-offline is almost the same as the first step of offline, except
> that the sched-domains are not rebuilt. Since the remaining steps,
> including the sched-domain rebuild, are skipped, the overhead of
> soft-offline is lower than that of a regular offline. A new cpumask
> indicates that a soft-offline is in progress and is what makes
> sched_cpu_{activate,deactivate}() skip rebuilding the sched-domains.

Note that your thing still very much includes the synchronize_rcu() that
a lot of the previous 'hotplug is too slow' crowd have complained about.

So I'm taking it that your steal time thing really isn't that 'fast'.

It might be good to mention the frequency at which you expect cores to
come and go with your setup.
* Peter Zijlstra <peterz@infradead.org> [2025-12-05 17:07:23]:

> Note that your thing still very much includes the synchronize_rcu() that
> a lot of the previous 'hotplug is too slow' crowd have complained about.
>
> So I'm taking it that your steal time thing really isn't that 'fast'.

Yes, it does have synchronize_rcu().

> It might be good to mention the frequency at which you expect cores to
> come and go with your setup.

We are expecting the cores to keep changing at a 1 to 2 second frequency.

--
Thanks and Regards
Srikar Dronamraju
On Thu, Dec 04, 2025 at 11:23:56PM +0530, Srikar Dronamraju wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 89efff1e1ead..f66fd1e925b0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8177,13 +8177,16 @@ static void balance_push(struct rq *rq)
> * Only active while going offline and when invoked on the outgoing
> * CPU.
> */
> - if (!cpu_dying(rq->cpu) || rq != this_rq())
> + if (cpu_active(rq->cpu) || rq != this_rq())
> return;
>
> /*
> - * Ensure the thing is persistent until balance_push_set(.on = false);
> + * Unless soft-offline, ensure the thing is persistent until
> + * balance_push_set(.on = false). In case of soft-offline, it is
> + * enough to push the current non-pinned tasks out.
> */
> - rq->balance_callback = &balance_push_callback;
> + if (cpu_dying(rq->cpu) || rq->nr_running)
> + rq->balance_callback = &balance_push_callback;
>
> /*
> * Both the cpu-hotplug and stop task are in this case and are
> @@ -8392,6 +8395,8 @@ static inline void sched_smt_present_dec(int cpu)
> #endif
> }
>
> +static struct cpumask cpu_softoffline_mask;
> +
> int sched_cpu_activate(unsigned int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> @@ -8411,7 +8416,10 @@ int sched_cpu_activate(unsigned int cpu)
> if (sched_smp_initialized) {
> sched_update_numa(cpu, true);
> sched_domains_numa_masks_set(cpu);
> - cpuset_cpu_active();
> +
> + /* For CPU soft-offline, no need to rebuild sched-domains */
> + if (!cpumask_test_cpu(cpu, &cpu_softoffline_mask))
> + cpuset_cpu_active();
> }
>
> scx_rq_activate(rq);
> @@ -8485,7 +8493,11 @@ int sched_cpu_deactivate(unsigned int cpu)
> return 0;
>
> sched_update_numa(cpu, false);
> - cpuset_cpu_inactive(cpu);
> +
> + /* For CPU soft-offline, no need to rebuild sched-domains */
> + if (!cpumask_test_cpu(cpu, &cpu_softoffline_mask))
> + cpuset_cpu_inactive(cpu);
> +
> sched_domains_numa_masks_clear(cpu);
> return 0;
> }
> @@ -10928,3 +10940,25 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
> set_next_task(rq, ctx->p);
> }
> #endif /* CONFIG_SCHED_CLASS_EXT */
> +
> +void set_cpu_softoffline(int cpu, bool soft_offline)
> +{
> + struct sched_domain *sd;
> +
> + if (!cpu_online(cpu))
> + return;
> +
> + cpumask_set_cpu(cpu, &cpu_softoffline_mask);
> +
> + rcu_read_lock();
> + for_each_domain(cpu, sd)
> + update_group_capacity(sd, cpu);
> + rcu_read_unlock();
> +
> + if (soft_offline)
> + sched_cpu_deactivate(cpu);
> + else
> + sched_cpu_activate(cpu);
> +
> + cpumask_clear_cpu(cpu, &cpu_softoffline_mask);
> +}
What happens if you then offline one of these softoffline CPUs? Doesn't
that do sched_cpu_deactivate() again?
Also, the way this seems to use softoffline_mask is as a hidden argument
to sched_cpu_{de,}activate() instead of as an actual mask.
Moreover, there does not seem to be any sort of serialization vs
concurrent set_cpu_softoffline() callers. At the very least
update_group_capacity() would end up with indeterminate results.
This all doesn't look 'robust'.
* Peter Zijlstra <peterz@infradead.org> [2025-12-05 17:03:26]:
Hi Peter,
>
> What happens if you then offline one of these softoffline CPUs? Doesn't
> that do sched_cpu_deactivate() again?
>
> Also, the way this seems to use softoffline_mask is as a hidden argument
> to sched_cpu_{de,}activate() instead of as an actual mask.
>
> Moreover, there does not seem to be any sort of serialization vs
> concurrent set_cpu_softoffline() callers. At the very least
> update_group_capacity() would end up with indeterminate results.
>
To serialize soft-offline with actual offline, can we take
cpu_maps_update_begin()/cpu_maps_update_done()?
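Something along these lines, as a rough sketch (the wrapper name is made
up here; cpu_maps_update_begin()/cpu_maps_update_done() are the existing
helpers that cpu_up()/cpu_down() already take):

  /*
   * Sketch only: serialize soft-offline against regular CPU hotplug by
   * taking the same lock that cpu_up()/cpu_down() take. The wrapper name
   * is made up for illustration.
   */
  static void set_cpu_softoffline_locked(int cpu, bool soft_offline)
  {
          cpu_maps_update_begin();
          set_cpu_softoffline(cpu, soft_offline);
          cpu_maps_update_done();
  }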
> This all doesn't look 'robust'.
I figured this out when Shrikanth Hegde reported a warning to me this evening.
Basically, pin a task to a CPU, run a workload so that the load causes steal,
and then do a CPU offline.
Pinning just makes the window large enough to hit the case easily.
[ 804.464298] ------------[ cut here ]------------
[ 804.464325] CPU capacity asymmetry not supported on SMT
[ 804.464341] WARNING: CPU: 575 PID: 2926 at kernel/sched/topology.c:1677 sd_init+0x428/0x494
[ 804.464355] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bonding tls rfkill ip_set nf_tables nfnetlink sunrpc pseries_rng vmx_crypto drm drm_panel_orientation_quirks xfs sd_mod sg ibmvscsi scsi_transport_srp ibmveth pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
[ 804.464409] CPU: 575 UID: 0 PID: 2926 Comm: cpuhp/575 Kdump: loaded Not tainted 6.18.0-master+ #15 VOLUNTARY
[ 804.464415] Hardware name: IBM,9080-HEU Power11 (architected) 0x820200 0xf000007 of:IBM,FW1110.00 (OK1110_066) hv:phyp pSeries
[ 804.464420] NIP: c000000000215c4c LR: c000000000215c48 CTR: 00000000005d54a0
[ 804.464425] REGS: c00001801cfff3c0 TRAP: 0700 Not tainted (6.18.0-master+)
[ 804.464429] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 28828228 XER: 0000000c
[ 804.464441] CFAR: c000000000171988 IRQMASK: 0
GPR00: c000000000215c48 c00001801cfff660 c000000001c28100 000000000000002b
GPR04: 0000000000000000 c00001801cfff470 c00001801cfff468 000001fff1280000
GPR08: 0000000000000027 0000000000000000 0000000000000000 0000000000000001
GPR12: c00001ffe182ffa8 c00001fff5d43b00 c00001804e999548 0000000000000000
GPR16: 0000000000000000 c0000000015732e8 c00000000153f380 c00000012b337c18
GPR20: c000000002edb660 0000000000000239 0000000000000004 c000018029a26200
GPR24: 0000000000000000 c0000000029787c8 0000000000000002 c00000012b337c00
GPR28: c00001804e7cb948 c000000002ee06d0 c00001804e7cb800 c0000000029787c8
[ 804.464491] NIP [c000000000215c4c] sd_init+0x428/0x494
[ 804.464496] LR [c000000000215c48] sd_init+0x424/0x494
[ 804.464501] Call Trace:
[ 804.464504] [c00001801cfff660] [c000000000215c48] sd_init+0x424/0x494 (unreliable)
[ 804.464511] [c00001801cfff740] [c000000000226fd8] build_sched_domains+0x1c0/0x938
[ 804.464517] [c00001801cfff850] [c000000000228f98] partition_sched_domains_locked+0x4a8/0x688
[ 804.464523] [c00001801cfff940] [c000000000229244] partition_sched_domains+0x5c/0x84
[ 804.464528] [c00001801cfff990] [c00000000031a020] rebuild_sched_domains_locked+0x1d8/0x260
[ 804.464536] [c00001801cfff9f0] [c00000000031dde4] cpuset_handle_hotplug+0x564/0x728
[ 804.464542] [c00001801cfffd80] [c0000000001d9fa8] sched_cpu_activate+0x2d4/0x2dc
[ 804.464549] [c00001801cfffde0] [c00000000017567c] cpuhp_invoke_callback+0x26c/0xb20
[ 804.464556] [c00001801cfffec0] [c000000000177554] cpuhp_thread_fun+0x210/0x2e8
[ 804.464561] [c00001801cffff40] [c0000000001c1640] smpboot_thread_fn+0x200/0x2c0
[ 804.464568] [c00001801cffff90] [c0000000001b5758] kthread+0x134/0x164
[ 804.464575] [c00001801cffffe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
[ 804.464581] Code: 4082fe5c 3d420120 894a2525 2c0a0000 4082fe4c 3c62ff95 39200001 3d420120 38639830 992a2525 4bf5bcbd 60000000 <0fe00000> 813e003c 4bfffe24 60000000
[ 804.464598] ---[ end trace 0000000000000000 ]---
But this warning will still remain even if we take cpu_maps_update_begin().
This comes from:
WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
          (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY),
          "CPU capacity asymmetry not supported on SMT\n");
which was recently added by
Commit c744dc4ab58d ("sched/topology: Rework CPU capacity asymmetry detection")
Is there a way to tweak this WARN_ONCE?
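For example, purely as a sketch (cpu_softoffline() is a made-up accessor
for the new cpu_softoffline_mask; it does not exist in this patch),
something like:

  /*
   * Sketch only: skip the SMT-asymmetry warning in sd_init() when the
   * asymmetry comes from a soft-offlined sibling. cpu_softoffline() is a
   * hypothetical accessor for cpu_softoffline_mask.
   */
  static bool sd_span_has_softoffline(struct sched_domain *sd)
  {
          int cpu;

          for_each_cpu(cpu, sched_domain_span(sd))
                  if (cpu_softoffline(cpu))
                          return true;

          return false;
  }

  /* ... and in sd_init(): */
  WARN_ONCE(!sd_span_has_softoffline(sd) &&
            (sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
            (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY),
            "CPU capacity asymmetry not supported on SMT\n");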
--
Thanks and Regards
Srikar Dronamraju