Detailed problem statement and some of the implementation choices were
discussed earlier[1].

[1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/

This is likely the version which would be used for the LPC2025 discussion on
this topic. Feel free to provide your suggestions; hoping for a solution
that works for different architectures and their use cases.

All the existing alternatives, such as CPU hotplug and creating isolated
partitions, break the user affinity. Since the number of CPUs to use changes
depending on the steal time, it is not driven by the user, and hence it would
be wrong to break the affinity. With this series, if a task is pinned only to
paravirt CPUs, it will continue running there.

Changes compared to v3[1]:

- Introduced computation of steal time in powerpc code.
- Derive the number of CPUs to use and mark the remaining as paravirt based
  on steal values.
- Provide debugfs knobs to alter how steal time values are used.
- Removed static key check for paravirt CPUs (Yury)
- Removed preempt_disable/enable while calling stopper (Prateek)
- Made select_idle_sibling and friends aware of paravirt CPUs.
- Removed 3 unused schedstat fields and introduced 2 related to paravirt
  handling.
- Handled the nohz_full case by enabling the tick when there is CFS/RT on it.
- Updated helper patch to override arch behaviour for easier debugging
  during development.
- Kept

Changes compared to v4[2]:

- Last two patches were sent out separately instead of being with the series.
  That created confusion. Those two patches are debug patches one can use to
  check functionality across architectures. Sorry about that.
- Use DEVICE_ATTR_RW instead (greg)
- Made it PATCH since arch-specific handling completes the functionality.

[2]: https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/

TODO:

- Get performance numbers on PowerPC, x86 and s390. Hopefully by next week.
  Didn't want to hold the series till then.

- The choice of CPUs to mark as paravirt is very simple and doesn't work when
  vCPUs aren't spread out uniformly across NUMA nodes. Ideal would be to split
  the numbers based on how many CPUs each NUMA node has. It is quite tricky
  to do, especially since the cpumask can be on the stack too, given NR_CPUS
  can be 8192 and nr_possible_nodes 32. Haven't got my head into solving it
  yet. Maybe there is an easier way.

- DLPAR add/remove needs to call init of EC/VP cores (powerpc specific).

- Userspace tools awareness, such as irqbalance.

- Delve into the design of a hint from the hypervisor (HW hint), i.e. the host
  informs the guest which/how many CPUs it has to use at this moment. This
  interface should work across archs, with each arch doing its specific
  handling.

- Determine the default values for the steal time related knobs empirically
  and document them.

- Need to check safety against CPU hotplug, especially in process_steal.

Applies cleanly on tip/master:
commit c2ef745151b21d4dcc4b29a1eabf1096f5ba544b

Thanks to Srikar for providing the initial code around the powerpc steal
time handling. Thanks to all who went through and provided reviews.

PS: I haven't found a better name. Please suggest if you have any.
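To make the "derive the number of CPUs to use from steal values" step concrete,
here is a minimal illustrative sketch (not taken from the series; avg_steal_pct
and the cpu_paravirt_set() helper are hypothetical placeholders for what the
arch code would provide):

	/*
	 * Keep roughly the share of online CPUs that the hypervisor actually
	 * gives us, and mark the rest as paravirt.
	 */
	static void update_paravirt_cpus(unsigned int avg_steal_pct)
	{
		unsigned int nr_online = num_online_cpus();
		unsigned int nr_keep, kept = 0;
		int cpu;

		avg_steal_pct = min(avg_steal_pct, 99U);
		nr_keep = max(1U, nr_online * (100 - avg_steal_pct) / 100);

		for_each_online_cpu(cpu) {
			/* First nr_keep online CPUs stay usable, the rest become paravirt. */
			cpu_paravirt_set(cpu, kept >= nr_keep);
			if (kept < nr_keep)
				kept++;
		}
	}

A real implementation would additionally need to spread the kept CPUs across
NUMA nodes (see the TODO above) and avoid flip-flopping on small steal changes.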
Shrikanth Hegde (17):
  sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
  cpumask: Introduce cpu_paravirt_mask
  sched/core: Dont allow to use CPU marked as paravirt
  sched/debug: Remove unused schedstats
  sched/fair: Add paravirt movements for proc sched file
  sched/fair: Pass current cpu in select_idle_sibling
  sched/fair: Don't consider paravirt CPUs for wakeup and load balance
  sched/rt: Don't select paravirt CPU for wakeup and push/pull rt task
  sched/core: Add support for nohz_full CPUs
  sched/core: Push current task from paravirt CPU
  sysfs: Add paravirt CPU file
  powerpc: method to initialize ec and vp cores
  powerpc: enable/disable paravirt CPUs based on steal time
  powerpc: process steal values at fixed intervals
  powerpc: add debugfs file for controlling handling on steal values
  sysfs: Provide write method for paravirt
  sysfs: disable arch handling if paravirt file being written

 .../ABI/testing/sysfs-devices-system-cpu |   9 +
 Documentation/scheduler/sched-arch.rst   |  37 +++
 arch/powerpc/include/asm/smp.h           |   1 +
 arch/powerpc/kernel/smp.c                |   1 +
 arch/powerpc/platforms/pseries/lpar.c    | 223 ++++++++++++++++++
 arch/powerpc/platforms/pseries/pseries.h |   1 +
 drivers/base/cpu.c                       |  59 +++++
 include/linux/cpumask.h                  |  20 ++
 include/linux/sched.h                    |   9 +-
 kernel/sched/core.c                      | 106 ++++++++-
 kernel/sched/debug.c                     |   5 +-
 kernel/sched/fair.c                      |  42 +++-
 kernel/sched/rt.c                        |  11 +-
 kernel/sched/sched.h                     |   9 +
 14 files changed, 519 insertions(+), 14 deletions(-)

-- 
2.47.3
On 11/19/2025 6:14 PM, Shrikanth Hegde wrote:
> Detailed problem statement and some of the implementation choices were
> discussed earlier[1].
>
> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>
> This is likely the version which would be used for LPC2025 discussion on
> this topic. Feel free to provide your suggestion and hoping for a solution
> that works for different architectures and it's use cases.
>
> All the existing alternatives such as cpu hotplug, creating isolated
> partitions etc break the user affinity. Since number of CPUs to use change
> depending on the steal time, it is not driven by User. Hence it would be
> wrong to break the affinity. This series allows if the task is pinned
> only paravirt CPUs, it will continue running there.
If maintaining task affinity is the only capability that cpusets don't
offer, attached below is a very naive prototype that seems to work in
my case without hitting any obvious splats so far.
The idea is to keep task affinity untouched, but remove the CPUs from
the sched domains.
That way, all the balancing and wakeups will steer away from these
CPUs automatically, but once the CPUs are put back, the balancing will
automatically move tasks back.
I tested this with a bunch of spinners and with partitions and both
seem to work as expected. For real world VM based testing, I pinned 2
6C/12C VMs to a 8C/16T LLC with 1:1 pinning - 2 virtual cores from
either VMs pin to same set of physical cores.
Running 8 groups of perf bench sched messaging on each VM at the same
time gives the following numbers for total runtime:
All CPUs available in the VM: 88.775s & 91.002s (2 cores overlap)
Only 4 cores available in the VM: 67.365s & 73.015s (No cores overlap)
Note: The unavailable mask didn't change in my runs. I've noticed a
bit of delay before the load balancer moves the tasks to the CPU
going from unavailable to available - your mileage may vary depending
on the frequency of mask updates.
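From within the guest, the new root-only file can then be driven by whatever
decides availability. A minimal userspace sketch (illustrative only - the path
assumes cgroup v2 mounted at /sys/fs/cgroup, with the cftype below surfacing
as "cpuset.cpus.unavailable" at the root):

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		/* CPU list to mark unavailable, e.g. "6-7". */
		const char *cpulist = argc > 1 ? argv[1] : "6-7";
		int fd = open("/sys/fs/cgroup/cpuset.cpus.unavailable", O_WRONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		if (write(fd, cpulist, strlen(cpulist)) < 0)
			perror("write");
		close(fd);
		return 0;
	}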
Following is the diff on top of tip/master:
(Very raw PoC; Only fair tasks are considered for now to push away)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 2ddb256187b5..7c1cfdd7ffea 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -174,6 +174,10 @@ static inline void set_mems_allowed(nodemask_t nodemask)
}
extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
+
+void sched_fair_notify_unavaialable_cpus(struct cpumask *unavailable_mask);
+const struct cpumask *cpuset_unavailable_mask(void);
+bool cpuset_cpu_unavailable(int cpu);
#else /* !CONFIG_CPUSETS */
static inline bool cpusets_enabled(void) { return false; }
diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index 337608f408ce..170aba16141e 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -59,6 +59,7 @@ typedef enum {
FILE_EXCLUSIVE_CPULIST,
FILE_EFFECTIVE_XCPULIST,
FILE_ISOLATED_CPULIST,
+ FILE_UNAVAILABLE_CPULIST,
FILE_CPU_EXCLUSIVE,
FILE_MEM_EXCLUSIVE,
FILE_MEM_HARDWALL,
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 4aaad07b0bd1..22d38f2299c4 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -87,6 +87,19 @@ static cpumask_var_t isolated_cpus;
static cpumask_var_t boot_hk_cpus;
static bool have_boot_isolcpus;
+/*
+ * CPUs that may be unavailable to run tasks as a result of physical
+ * constraints (vCPU being preempted, pCPU handling interrupt storm).
+ *
+ * Unlike isolated_cpus, the unavailable_cpus are simply excluded from
+ * HK_TYPE_DOMAIN but leave the tasks affinity untouched. These CPUs
+ * should be avoided unless the task has specifically asked to be run
+ * only on these CPUs.
+ */
+static cpumask_var_t unavailable_cpus;
+static cpumask_var_t available_tmp_mask; /* For intermediate operations. */
+static bool cpu_turned_unavailable;
+
/* List of remote partition root children */
static struct list_head remote_children;
@@ -844,6 +857,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
}
cpumask_and(doms[0], top_cpuset.effective_cpus,
housekeeping_cpumask(HK_TYPE_DOMAIN));
+ cpumask_andnot(doms[0], doms[0], unavailable_cpus);
goto done;
}
@@ -960,11 +974,13 @@ static int generate_sched_domains(cpumask_var_t **domains,
* The top cpuset may contain some boot time isolated
* CPUs that need to be excluded from the sched domain.
*/
- if (csa[i] == &top_cpuset)
+ if (csa[i] == &top_cpuset) {
cpumask_and(doms[i], csa[i]->effective_cpus,
housekeeping_cpumask(HK_TYPE_DOMAIN));
- else
- cpumask_copy(doms[i], csa[i]->effective_cpus);
+ cpumask_andnot(doms[i], doms[i], unavailable_cpus);
+ } else {
+ cpumask_andnot(doms[i], csa[i]->effective_cpus, unavailable_cpus);
+ }
if (dattr)
dattr[i] = SD_ATTR_INIT;
}
@@ -985,6 +1001,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
}
cpumask_or(dp, dp, csa[j]->effective_cpus);
cpumask_and(dp, dp, housekeeping_cpumask(HK_TYPE_DOMAIN));
+ cpumask_andnot(dp, dp, unavailable_cpus);
if (dattr)
update_domain_attr_tree(dattr + nslot, csa[j]);
}
@@ -1418,6 +1435,17 @@ bool cpuset_cpu_is_isolated(int cpu)
}
EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
+/* Get the set of CPUs marked unavailable. */
+const struct cpumask *cpuset_unavailable_mask(void)
+{
+ return unavailable_cpus;
+}
+
+bool cpuset_cpu_unavailable(int cpu)
+{
+ return cpumask_test_cpu(cpu, unavailable_cpus);
+}
+
/**
* rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
* @parent: Parent cpuset containing all siblings
@@ -2612,6 +2640,53 @@ static int update_exclusive_cpumask(struct cpuset *cs, struct cpuset *trialcs,
return 0;
}
+/**
+ * update_exclusive_cpumask - update the exclusive_cpus mask of a cpuset
+ * @cs: the cpuset to consider
+ * @trialcs: trial cpuset
+ * @buf: buffer of cpu numbers written to this cpuset
+ *
+ * The tasks' cpumask will be updated if cs is a valid partition root.
+ */
+static int update_unavailable_cpumask(const char *buf)
+{
+ cpumask_var_t tmp;
+ int retval;
+
+ if (!alloc_cpumask_var(&tmp, GFP_KERNEL))
+ return -ENOMEM;
+
+ retval = cpulist_parse(buf, tmp);
+ if (retval < 0)
+ goto out;
+
+ /* Nothing to do if the CPUs didn't change */
+ if (cpumask_equal(tmp, unavailable_cpus))
+ goto out;
+
+ /* Save the CPUs that went unavailable to push task out. */
+ if (cpumask_andnot(available_tmp_mask, tmp, unavailable_cpus))
+ cpu_turned_unavailable = true;
+
+ cpumask_copy(unavailable_cpus, tmp);
+ cpuset_force_rebuild();
+out:
+ free_cpumask_var(tmp);
+ return retval;
+}
+
+static void cpuset_notify_unavailable_cpus(void)
+{
+ /*
+ * Prevent being preempted by the stopper if the local CPU
+ * turned unavailable.
+ */
+ guard(preempt)();
+
+ sched_fair_notify_unavaialable_cpus(available_tmp_mask);
+ cpu_turned_unavailable = false;
+}
+
/*
* Migrate memory region from one set of nodes to another. This is
* performed asynchronously as it can be called from process migration path
@@ -3302,11 +3377,16 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
struct cpuset *cs = css_cs(of_css(of));
+ int file_type = of_cft(of)->private;
struct cpuset *trialcs;
int retval = -ENODEV;
- /* root is read-only */
- if (cs == &top_cpuset)
+ /* root is read-only; except for unavailable mask */
+ if (file_type != FILE_UNAVAILABLE_CPULIST && cs == &top_cpuset)
+ return -EACCES;
+
+ /* unavailable mask can be only set on root. */
+ if (file_type == FILE_UNAVAILABLE_CPULIST && cs != &top_cpuset)
return -EACCES;
buf = strstrip(buf);
@@ -3330,6 +3410,9 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
case FILE_MEMLIST:
retval = update_nodemask(cs, trialcs, buf);
break;
+ case FILE_UNAVAILABLE_CPULIST:
+ retval = update_unavailable_cpumask(buf);
+ break;
default:
retval = -EINVAL;
break;
@@ -3338,6 +3421,8 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
free_cpuset(trialcs);
if (force_sd_rebuild)
rebuild_sched_domains_locked();
+ if (cpu_turned_unavailable)
+ cpuset_notify_unavailable_cpus();
out_unlock:
cpuset_full_unlock();
if (of_cft(of)->private == FILE_MEMLIST)
@@ -3386,6 +3471,9 @@ int cpuset_common_seq_show(struct seq_file *sf, void *v)
case FILE_ISOLATED_CPULIST:
seq_printf(sf, "%*pbl\n", cpumask_pr_args(isolated_cpus));
break;
+ case FILE_UNAVAILABLE_CPULIST:
+ seq_printf(sf, "%*pbl\n", cpumask_pr_args(unavailable_cpus));
+ break;
default:
ret = -EINVAL;
}
@@ -3524,6 +3612,15 @@ static struct cftype dfl_files[] = {
.flags = CFTYPE_ONLY_ON_ROOT,
},
+ {
+ .name = "cpus.unavailable",
+ .seq_show = cpuset_common_seq_show,
+ .write = cpuset_write_resmask,
+ .max_write_len = (100U + 6 * NR_CPUS),
+ .private = FILE_UNAVAILABLE_CPULIST,
+ .flags = CFTYPE_ONLY_ON_ROOT,
+ },
+
{ } /* terminate */
};
@@ -3814,6 +3911,8 @@ int __init cpuset_init(void)
BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
+ BUG_ON(!zalloc_cpumask_var(&unavailable_cpus, GFP_KERNEL));
+ BUG_ON(!zalloc_cpumask_var(&available_tmp_mask, GFP_KERNEL));
cpumask_setall(top_cpuset.cpus_allowed);
nodes_setall(top_cpuset.mems_allowed);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee7dfbf01792..13d0d9587aca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2396,7 +2396,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
/* Non kernel threads are not allowed during either online or offline. */
if (!(p->flags & PF_KTHREAD))
- return cpu_active(cpu);
+ return (cpu_active(cpu) && !cpuset_cpu_unavailable(cpu));
/* KTHREAD_IS_PER_CPU is always allowed. */
if (kthread_is_per_cpu(p))
@@ -3451,6 +3451,26 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
goto out;
}
+ /*
+ * Only user threads can be forced out of
+ * unavaialable CPUs.
+ */
+ if (p->flags & PF_KTHREAD)
+ goto rude;
+
+ /* Any unavailable CPUs that can run the task? */
+ for_each_cpu(dest_cpu, cpuset_unavailable_mask()) {
+ if (!task_allowed_on_cpu(p, dest_cpu))
+ continue;
+
+ /* Can we hoist this up to goto rude? */
+ if (is_migration_disabled(p))
+ continue;
+
+ if (cpu_active(dest_cpu))
+ goto out;
+ }
+rude:
/* No more Mr. Nice Guy. */
switch (state) {
case cpuset:
@@ -3766,7 +3786,7 @@ bool call_function_single_prep_ipi(int cpu)
* via sched_ttwu_wakeup() for activation so the wakee incurs the cost
* of the wakeup instead of the waker.
*/
-static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
+void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
{
struct rq *rq = cpu_rq(cpu);
@@ -5365,7 +5385,9 @@ void sched_exec(void)
int dest_cpu;
scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
- dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), WF_EXEC);
+ int wake_flags = WF_EXEC;
+
+ dest_cpu = select_task_rq(p, task_cpu(p), &wake_flags);
if (dest_cpu == smp_processor_id())
return;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..e502cccdae64 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12094,6 +12094,61 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
return ld_moved;
}
+static int unavailable_balance_cpu_stop(void *data)
+{
+ struct task_struct *p, *tmp;
+ struct rq *rq = data;
+ int this_cpu = cpu_of(rq);
+
+ guard(rq_lock_irq)(rq);
+
+ list_for_each_entry_safe(p, tmp, &rq->cfs_tasks, se.group_node) {
+ int target_cpu;
+
+ /*
+ * Bail out if a concurrent change to unavailable_mask turned
+ * this CPU available.
+ */
+ rq->unavailable_balance = cpumask_test_cpu(this_cpu, cpuset_unavailable_mask());
+ if (!rq->unavailable_balance)
+ break;
+
+ /* XXX: Does not deal with migration disabled tasks. */
+ target_cpu = cpumask_first_andnot(p->cpus_ptr, cpuset_unavailable_mask());
+ if ((unsigned int)target_cpu < nr_cpumask_bits) {
+ deactivate_task(rq, p, 0);
+ set_task_cpu(p, target_cpu);
+
+ /*
+ * Switch to move_queued_task() later.
+ * For PoC send an IPI and be done with it.
+ */
+ __ttwu_queue_wakelist(p, target_cpu, 0);
+ }
+ }
+
+ rq->unavailable_balance = 0;
+
+ return 0;
+}
+
+void sched_fair_notify_unavaialable_cpus(struct cpumask *unavailable_mask)
+{
+ int cpu, this_cpu = smp_processor_id();
+
+ for_each_cpu_wrap(cpu, unavailable_mask, this_cpu + 1) {
+ struct rq *rq = cpu_rq(cpu);
+
+ /* Balance in progress. Tasks will be pushed out. */
+ if (rq->unavailable_balance)
+ return;
+
+ stop_one_cpu_nowait(cpu, unavailable_balance_cpu_stop,
+ rq, &rq->unavailable_balance_work);
+ rq->unavailable_balance = 1;
+ }
+}
+
static inline unsigned long
get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
{
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cb80666addec..c21ffb128734 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1221,6 +1221,10 @@ struct rq {
int push_cpu;
struct cpu_stop_work active_balance_work;
+ /* For pushing out taks from unavailable CPUs. */
+ struct cpu_stop_work unavailable_balance_work;
+ int unavailable_balance;
+
/* CPU of this runqueue: */
int cpu;
int online;
@@ -2413,6 +2417,8 @@ extern const u32 sched_prio_to_wmult[40];
#define RETRY_TASK ((void *)-1UL)
+void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags);
+
struct affinity_context {
const struct cpumask *new_mask;
struct cpumask *user_mask;
base-commit: 5e8f8a25efb277ac6f61f553f0c533ff1402bd7c
--
Thanks and Regards,
Prateek
Hi Prateek.
Thank you very much for going through the series.
On 12/8/25 10:17 AM, K Prateek Nayak wrote:
> On 11/19/2025 6:14 PM, Shrikanth Hegde wrote:
>> Detailed problem statement and some of the implementation choices were
>> discussed earlier[1].
>>
>> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>>
>> This is likely the version which would be used for LPC2025 discussion on
>> this topic. Feel free to provide your suggestion and hoping for a solution
>> that works for different architectures and it's use cases.
>>
>> All the existing alternatives such as cpu hotplug, creating isolated
>> partitions etc break the user affinity. Since number of CPUs to use change
>> depending on the steal time, it is not driven by User. Hence it would be
>> wrong to break the affinity. This series allows if the task is pinned
>> only paravirt CPUs, it will continue running there.
>
> If maintaining task affinity is the only problem that cpusets don't
> offer, attached below is a very naive prototype that seems to work in
> my case without hitting any obvious splats so far.
>
> Idea is to keep task affinity untouched, but remove the CPUs from
> the sched domains.
>
> That way, all the balancing, and wakeups will steer away from these
> CPUs automatically but once the CPUs are put back, the balancing will
> automatically move tasks back.
>
> I tested this with a bunch of spinners and with partitions and both
> seem to work as expected. For real world VM based testing, I pinned 2
> 6C/12C VMs to a 8C/16T LLC with 1:1 pinning - 2 virtual cores from
> either VMs pin to same set of physical cores.
>
> Running 8 groups of perf bench sched messaging on each VM at the same
> time gives the following numbers for total runtime:
>
> All CPUs available in the VM: 88.775s & 91.002s (2 cores overlap)
> Only 4 cores available in the VM: 67.365s & 73.015s (No cores overlap)
>
> Note: The unavailable mask didn't change in my runs. I've noticed a
> bit of delay before the load balancer moves the tasks to the CPU
> going from unavailable to available - your mileage may vary depending
Depends on the scale of the system. I have seen that unfolding is slower
compared to folding on large systems.
> on the frequency of mask updates.
>
What do you mean by "The unavailable mask didn't change in my runs"?
If so, how did it take effect?
> Following is the diff on top of tip/master:
>
> (Very raw PoC; Only fair tasks are considered for now to push away)
>
I skimmed through it. It is very close to the current approach.

Advantage:
Happens immediately instead of waiting for the tick.
The current approach too can move all the tasks at one tick;
the concern would be latency being high and races around the list.

Disadvantages:
Causes a sched domain rebuild, which is known to be expensive on large systems.
But since steal time changes are not very aggressive at this point, this overhead
may be ok.

Keeping the interface in cpuset may be tricky: there could be multiple cpusets,
and complications across cgroup versions too. Especially since cpusets can be
nested. And all of this is not user driven - I think cpuset is inherently user driven.

The implementation looks more complicated, to me at least, at this point.

The current PoC needs to be enhanced to add arch-specific triggers. That is doable.
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 2ddb256187b5..7c1cfdd7ffea 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -174,6 +174,10 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> }
>
> extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
> +
> +void sched_fair_notify_unavaialable_cpus(struct cpumask *unavailable_mask);
> +const struct cpumask *cpuset_unavailable_mask(void);
> +bool cpuset_cpu_unavailable(int cpu);
> #else /* !CONFIG_CPUSETS */
>
> static inline bool cpusets_enabled(void) { return false; }
> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
> index 337608f408ce..170aba16141e 100644
> --- a/kernel/cgroup/cpuset-internal.h
> +++ b/kernel/cgroup/cpuset-internal.h
> @@ -59,6 +59,7 @@ typedef enum {
> FILE_EXCLUSIVE_CPULIST,
> FILE_EFFECTIVE_XCPULIST,
> FILE_ISOLATED_CPULIST,
> + FILE_UNAVAILABLE_CPULIST,
> FILE_CPU_EXCLUSIVE,
> FILE_MEM_EXCLUSIVE,
> FILE_MEM_HARDWALL,
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 4aaad07b0bd1..22d38f2299c4 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -87,6 +87,19 @@ static cpumask_var_t isolated_cpus;
> static cpumask_var_t boot_hk_cpus;
> static bool have_boot_isolcpus;
>
> +/*
> + * CPUs that may be unavailable to run tasks as a result of physical
> + * constraints (vCPU being preempted, pCPU handling interrupt storm).
> + *
> + * Unlike isolated_cpus, the unavailable_cpus are simply excluded from
> + * HK_TYPE_DOMAIN but leave the tasks affinity untouched. These CPUs
> + * should be avoided unless the task has specifically asked to be run
> + * only on these CPUs.
> + */
> +static cpumask_var_t unavailable_cpus;
> +static cpumask_var_t available_tmp_mask; /* For intermediate operations. */
> +static bool cpu_turned_unavailable;
> +
This "unavailable" name is probably not right. When the system boots, there is
an available CPU set which is fixed and not expected to change. It can confuse users.
> /* List of remote partition root children */
> static struct list_head remote_children;
>
> @@ -844,6 +857,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
> }
> cpumask_and(doms[0], top_cpuset.effective_cpus,
> housekeeping_cpumask(HK_TYPE_DOMAIN));
> + cpumask_andnot(doms[0], doms[0], unavailable_cpus);
>
> goto done;
> }
> @@ -960,11 +974,13 @@ static int generate_sched_domains(cpumask_var_t **domains,
> * The top cpuset may contain some boot time isolated
> * CPUs that need to be excluded from the sched domain.
> */
> - if (csa[i] == &top_cpuset)
> + if (csa[i] == &top_cpuset) {
> cpumask_and(doms[i], csa[i]->effective_cpus,
> housekeeping_cpumask(HK_TYPE_DOMAIN));
> - else
> - cpumask_copy(doms[i], csa[i]->effective_cpus);
> + cpumask_andnot(doms[i], doms[i], unavailable_cpus);
> + } else {
> + cpumask_andnot(doms[i], csa[i]->effective_cpus, unavailable_cpus);
> + }
> if (dattr)
> dattr[i] = SD_ATTR_INIT;
> }
> @@ -985,6 +1001,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
> }
> cpumask_or(dp, dp, csa[j]->effective_cpus);
> cpumask_and(dp, dp, housekeeping_cpumask(HK_TYPE_DOMAIN));
> + cpumask_andnot(dp, dp, unavailable_cpus);
> if (dattr)
> update_domain_attr_tree(dattr + nslot, csa[j]);
> }
> @@ -1418,6 +1435,17 @@ bool cpuset_cpu_is_isolated(int cpu)
> }
> EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
>
> +/* Get the set of CPUs marked unavailable. */
> +const struct cpumask *cpuset_unavailable_mask(void)
> +{
> + return unavailable_cpus;
> +}
> +
> +bool cpuset_cpu_unavailable(int cpu)
> +{
> + return cpumask_test_cpu(cpu, unavailable_cpus);
> +}
> +
> /**
> * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
> * @parent: Parent cpuset containing all siblings
> @@ -2612,6 +2640,53 @@ static int update_exclusive_cpumask(struct cpuset *cs, struct cpuset *trialcs,
> return 0;
> }
>
> +/**
> + * update_exclusive_cpumask - update the exclusive_cpus mask of a cpuset
> + * @cs: the cpuset to consider
> + * @trialcs: trial cpuset
> + * @buf: buffer of cpu numbers written to this cpuset
> + *
> + * The tasks' cpumask will be updated if cs is a valid partition root.
> + */
> +static int update_unavailable_cpumask(const char *buf)
> +{
> + cpumask_var_t tmp;
> + int retval;
> +
> + if (!alloc_cpumask_var(&tmp, GFP_KERNEL))
> + return -ENOMEM;
> +
> + retval = cpulist_parse(buf, tmp);
> + if (retval < 0)
> + goto out;
> +
> + /* Nothing to do if the CPUs didn't change */
> + if (cpumask_equal(tmp, unavailable_cpus))
> + goto out;
> +
> + /* Save the CPUs that went unavailable to push task out. */
> + if (cpumask_andnot(available_tmp_mask, tmp, unavailable_cpus))
> + cpu_turned_unavailable = true;
> +
> + cpumask_copy(unavailable_cpus, tmp);
> + cpuset_force_rebuild();
I think this rebuilding of sched domains could add quite an overhead.
> +out:
> + free_cpumask_var(tmp);
> + return retval;
> +}
> +
> +static void cpuset_notify_unavailable_cpus(void)
> +{
> + /*
> + * Prevent being preempted by the stopper if the local CPU
> + * turned unavailable.
> + */
> + guard(preempt)();
> +
> + sched_fair_notify_unavaialable_cpus(available_tmp_mask);
> + cpu_turned_unavailable = false;
> +}
> +
> /*
> * Migrate memory region from one set of nodes to another. This is
> * performed asynchronously as it can be called from process migration path
> @@ -3302,11 +3377,16 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
> char *buf, size_t nbytes, loff_t off)
> {
> struct cpuset *cs = css_cs(of_css(of));
> + int file_type = of_cft(of)->private;
> struct cpuset *trialcs;
> int retval = -ENODEV;
>
> - /* root is read-only */
> - if (cs == &top_cpuset)
> + /* root is read-only; except for unavailable mask */
> + if (file_type != FILE_UNAVAILABLE_CPULIST && cs == &top_cpuset)
> + return -EACCES;
> +
> + /* unavailable mask can be only set on root. */
> + if (file_type == FILE_UNAVAILABLE_CPULIST && cs != &top_cpuset)
> return -EACCES;
>
> buf = strstrip(buf);
> @@ -3330,6 +3410,9 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
> case FILE_MEMLIST:
> retval = update_nodemask(cs, trialcs, buf);
> break;
> + case FILE_UNAVAILABLE_CPULIST:
> + retval = update_unavailable_cpumask(buf);
> + break;
> default:
> retval = -EINVAL;
> break;
> @@ -3338,6 +3421,8 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
> free_cpuset(trialcs);
> if (force_sd_rebuild)
> rebuild_sched_domains_locked();
> + if (cpu_turned_unavailable)
> + cpuset_notify_unavailable_cpus();
> out_unlock:
> cpuset_full_unlock();
> if (of_cft(of)->private == FILE_MEMLIST)
> @@ -3386,6 +3471,9 @@ int cpuset_common_seq_show(struct seq_file *sf, void *v)
> case FILE_ISOLATED_CPULIST:
> seq_printf(sf, "%*pbl\n", cpumask_pr_args(isolated_cpus));
> break;
> + case FILE_UNAVAILABLE_CPULIST:
> + seq_printf(sf, "%*pbl\n", cpumask_pr_args(unavailable_cpus));
> + break;
> default:
> ret = -EINVAL;
> }
> @@ -3524,6 +3612,15 @@ static struct cftype dfl_files[] = {
> .flags = CFTYPE_ONLY_ON_ROOT,
> },
>
> + {
> + .name = "cpus.unavailable",
> + .seq_show = cpuset_common_seq_show,
> + .write = cpuset_write_resmask,
> + .max_write_len = (100U + 6 * NR_CPUS),
> + .private = FILE_UNAVAILABLE_CPULIST,
> + .flags = CFTYPE_ONLY_ON_ROOT,
> + },
> +
> { } /* terminate */
> };
>
> @@ -3814,6 +3911,8 @@ int __init cpuset_init(void)
> BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
> BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
> BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
> + BUG_ON(!zalloc_cpumask_var(&unavailable_cpus, GFP_KERNEL));
> + BUG_ON(!zalloc_cpumask_var(&available_tmp_mask, GFP_KERNEL));
>
> cpumask_setall(top_cpuset.cpus_allowed);
> nodes_setall(top_cpuset.mems_allowed);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ee7dfbf01792..13d0d9587aca 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2396,7 +2396,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>
> /* Non kernel threads are not allowed during either online or offline. */
> if (!(p->flags & PF_KTHREAD))
> - return cpu_active(cpu);
> + return (cpu_active(cpu) && !cpuset_cpu_unavailable(cpu));
>
> /* KTHREAD_IS_PER_CPU is always allowed. */
> if (kthread_is_per_cpu(p))
> @@ -3451,6 +3451,26 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
> goto out;
> }
>
> + /*
> + * Only user threads can be forced out of
> + * unavaialable CPUs.
> + */
> + if (p->flags & PF_KTHREAD)
> + goto rude;
> +
> + /* Any unavailable CPUs that can run the task? */
> + for_each_cpu(dest_cpu, cpuset_unavailable_mask()) {
> + if (!task_allowed_on_cpu(p, dest_cpu))
> + continue;
> +
> + /* Can we hoist this up to goto rude? */
> + if (is_migration_disabled(p))
> + continue;
> +
> + if (cpu_active(dest_cpu))
> + goto out;
> + }
> +rude:
> /* No more Mr. Nice Guy. */
> switch (state) {
> case cpuset:
> @@ -3766,7 +3786,7 @@ bool call_function_single_prep_ipi(int cpu)
> * via sched_ttwu_wakeup() for activation so the wakee incurs the cost
> * of the wakeup instead of the waker.
> */
> -static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
> +void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
> {
> struct rq *rq = cpu_rq(cpu);
>
> @@ -5365,7 +5385,9 @@ void sched_exec(void)
> int dest_cpu;
>
> scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
> - dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), WF_EXEC);
> + int wake_flags = WF_EXEC;
> +
> + dest_cpu = select_task_rq(p, task_cpu(p), &wake_flags);
What's this logic?
> if (dest_cpu == smp_processor_id())
> return;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da46c3164537..e502cccdae64 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12094,6 +12094,61 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> return ld_moved;
> }
>
> +static int unavailable_balance_cpu_stop(void *data)
> +{
> + struct task_struct *p, *tmp;
> + struct rq *rq = data;
> + int this_cpu = cpu_of(rq);
> +
> + guard(rq_lock_irq)(rq);
> +
> + list_for_each_entry_safe(p, tmp, &rq->cfs_tasks, se.group_node) {
> + int target_cpu;
> +
> + /*
> + * Bail out if a concurrent change to unavailable_mask turned
> + * this CPU available.
> + */
> + rq->unavailable_balance = cpumask_test_cpu(this_cpu, cpuset_unavailable_mask());
> + if (!rq->unavailable_balance)
> + break;
> +
> + /* XXX: Does not deal with migration disabled tasks. */
> + target_cpu = cpumask_first_andnot(p->cpus_ptr, cpuset_unavailable_mask());
This can cause tasks to always go to the first CPU and then the load balancer
has to move them later on. It should first check the NUMA node the current CPU
is on to avoid NUMA costs - e.g. something like the sketch below.
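Something like this untested sketch (pick_push_target() is just a placeholder
name, not from the PoC):

	static int pick_push_target(struct task_struct *p, int src_cpu,
				    const struct cpumask *unavail)
	{
		const struct cpumask *node_mask = cpumask_of_node(cpu_to_node(src_cpu));
		int cpu;

		/* Prefer an allowed, available CPU on the same NUMA node. */
		for_each_cpu_and(cpu, p->cpus_ptr, node_mask) {
			if (!cpumask_test_cpu(cpu, unavail))
				return cpu;
		}

		/* Fall back to any allowed CPU outside the unavailable mask. */
		return cpumask_first_andnot(p->cpus_ptr, unavail);
	}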
> + if ((unsigned int)target_cpu < nr_cpumask_bits) {
> + deactivate_task(rq, p, 0);
> + set_task_cpu(p, target_cpu);
> +
> + /*
> + * Switch to move_queued_task() later.
> + * For PoC send an IPI and be done with it.
> + */
> + __ttwu_queue_wakelist(p, target_cpu, 0);
> + }
> + }
> +
> + rq->unavailable_balance = 0;
> +
> + return 0;
> +}
> +
> +void sched_fair_notify_unavaialable_cpus(struct cpumask *unavailable_mask)
> +{
> + int cpu, this_cpu = smp_processor_id();
> +
> + for_each_cpu_wrap(cpu, unavailable_mask, this_cpu + 1) {
> + struct rq *rq = cpu_rq(cpu);
> +
> + /* Balance in progress. Tasks will be pushed out. */
> + if (rq->unavailable_balance)
> + return;
> +
The stopper needs to run only if there is an active current task; otherwise
that work can be done here itself.
> + stop_one_cpu_nowait(cpu, unavailable_balance_cpu_stop,
> + rq, &rq->unavailable_balance_work);
> + rq->unavailable_balance = 1;
> + }
> +}
> +
> static inline unsigned long
> get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
> {
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index cb80666addec..c21ffb128734 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1221,6 +1221,10 @@ struct rq {
> int push_cpu;
> struct cpu_stop_work active_balance_work;
>
> + /* For pushing out taks from unavailable CPUs. */
> + struct cpu_stop_work unavailable_balance_work;
> + int unavailable_balance;
> +
> /* CPU of this runqueue: */
> int cpu;
> int online;
> @@ -2413,6 +2417,8 @@ extern const u32 sched_prio_to_wmult[40];
>
> #define RETRY_TASK ((void *)-1UL)
>
> +void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags);
> +
> struct affinity_context {
> const struct cpumask *new_mask;
> struct cpumask *user_mask;
>
> base-commit: 5e8f8a25efb277ac6f61f553f0c533ff1402bd7c
Hello Shrikanth,
Thank you for taking a look at the PoC.
On 12/8/2025 3:27 PM, Shrikanth Hegde wrote:
> Hi Prateek.
>
> Thank you very much for going throguh the series.
>
> On 12/8/25 10:17 AM, K Prateek Nayak wrote:
>> On 11/19/2025 6:14 PM, Shrikanth Hegde wrote:
>>> Detailed problem statement and some of the implementation choices were
>>> discussed earlier[1].
>>>
>>> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>>>
>>> This is likely the version which would be used for LPC2025 discussion on
>>> this topic. Feel free to provide your suggestion and hoping for a solution
>>> that works for different architectures and it's use cases.
>>>
>>> All the existing alternatives such as cpu hotplug, creating isolated
>>> partitions etc break the user affinity. Since number of CPUs to use change
>>> depending on the steal time, it is not driven by User. Hence it would be
>>> wrong to break the affinity. This series allows if the task is pinned
>>> only paravirt CPUs, it will continue running there.
>>
>> If maintaining task affinity is the only problem that cpusets don't
>> offer, attached below is a very naive prototype that seems to work in
>> my case without hitting any obvious splats so far.
>>
>> Idea is to keep task affinity untouched, but remove the CPUs from
>> the sched domains.
>>
>> That way, all the balancing, and wakeups will steer away from these
>> CPUs automatically but once the CPUs are put back, the balancing will
>> automatically move tasks back.
>>
>> I tested this with a bunch of spinners and with partitions and both
>> seem to work as expected. For real world VM based testing, I pinned 2
>> 6C/12C VMs to a 8C/16T LLC with 1:1 pinning - 2 virtual cores from
>> either VMs pin to same set of physical cores.
>>
>> Running 8 groups of perf bench sched messaging on each VM at the same
>> time gives the following numbers for total runtime:
>>
>> All CPUs available in the VM: 88.775s & 91.002s (2 cores overlap)
>> Only 4 cores available in the VM: 67.365s & 73.015s (No cores overlap)
>>
>> Note: The unavailable mask didn't change in my runs. I've noticed a
>> bit of delay before the load balancer moves the tasks to the CPU
>> going from unavailable to available - your mileage may vary depending
>
> Depends on the scale of systems. I have seen it unfolding is slower
> compared to folding on large systems.
>
>> on the frequency of mask updates.
>>
>
> What do you mean "The unavailable mask didn't change in my runs" ?
> If so, how did it take effect?
The unavailable mask was set to the last two cores so that there
is no overlap in the pCPU usage. The mask remained the same throughout
the runtime of the benchmarks - there was no dynamism in modifying the
masks within the VM.
>
>> Following is the diff on top of tip/master:
>>
>> (Very raw PoC; Only fair tasks are considered for now to push away)
>>
>
> I skimmed through it. It is very close to the current approach.
>
> Advantage:
> Happens immediately instead of waiting for tick.
> Current approach too can move all the tasks at one tick.
> the concern could be latency being high and races around the list.
>
> Disadvantages:
>
> Causes a sched domain rebuild. Which is known to be expensive on large systems.
> But since steal time changes are not very aggressive at this point, this overhead
> maybe ok.
>
> Keeping the interface in cpuset maybe tricky. there could multiple cpusets, and different versions
> complications too. Specially you can have cpusets in nested fashion. And all of this is
> not user driven. i think cpuset is inherently user driven.
For that reason I only kept this mask for the root cgroup. Putting any
CPU in it is as good as removing it from all partitions.
>
> Impementation looks more complicated to me atleast at this point.
>
> Current poc needs to enhanced to make arch specific triggers. That is doable.
>
>> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
>> index 2ddb256187b5..7c1cfdd7ffea 100644
>> --- a/include/linux/cpuset.h
>> +++ b/include/linux/cpuset.h
>> @@ -174,6 +174,10 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>> }
>> extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
>> +
>> +void sched_fair_notify_unavaialable_cpus(struct cpumask *unavailable_mask);
>> +const struct cpumask *cpuset_unavailable_mask(void);
>> +bool cpuset_cpu_unavailable(int cpu);
>> #else /* !CONFIG_CPUSETS */
>> static inline bool cpusets_enabled(void) { return false; }
>> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
>> index 337608f408ce..170aba16141e 100644
>> --- a/kernel/cgroup/cpuset-internal.h
>> +++ b/kernel/cgroup/cpuset-internal.h
>> @@ -59,6 +59,7 @@ typedef enum {
>> FILE_EXCLUSIVE_CPULIST,
>> FILE_EFFECTIVE_XCPULIST,
>> FILE_ISOLATED_CPULIST,
>> + FILE_UNAVAILABLE_CPULIST,
>> FILE_CPU_EXCLUSIVE,
>> FILE_MEM_EXCLUSIVE,
>> FILE_MEM_HARDWALL,
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 4aaad07b0bd1..22d38f2299c4 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -87,6 +87,19 @@ static cpumask_var_t isolated_cpus;
>> static cpumask_var_t boot_hk_cpus;
>> static bool have_boot_isolcpus;
>> +/*
>> + * CPUs that may be unavailable to run tasks as a result of physical
>> + * constraints (vCPU being preempted, pCPU handling interrupt storm).
>> + *
>> + * Unlike isolated_cpus, the unavailable_cpus are simply excluded from
>> + * HK_TYPE_DOMAIN but leave the tasks affinity untouched. These CPUs
>> + * should be avoided unless the task has specifically asked to be run
>> + * only on these CPUs.
>> + */
>> +static cpumask_var_t unavailable_cpus;
>> +static cpumask_var_t available_tmp_mask; /* For intermediate operations. */
>> +static bool cpu_turned_unavailable;
>> +
>
> This unavailable name is not probably right. When system boots, there is available_cpu
> and that is fixed and not expected to change. It can confuse users.
Ack! Just some name that I thought was appropriate. Too much
thought wasn't put into it ;)
>
>> /* List of remote partition root children */
>> static struct list_head remote_children;
>> @@ -844,6 +857,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
>> }
>> cpumask_and(doms[0], top_cpuset.effective_cpus,
>> housekeeping_cpumask(HK_TYPE_DOMAIN));
>> + cpumask_andnot(doms[0], doms[0], unavailable_cpus);
>> goto done;
>> }
>> @@ -960,11 +974,13 @@ static int generate_sched_domains(cpumask_var_t **domains,
>> * The top cpuset may contain some boot time isolated
>> * CPUs that need to be excluded from the sched domain.
>> */
>> - if (csa[i] == &top_cpuset)
>> + if (csa[i] == &top_cpuset) {
>> cpumask_and(doms[i], csa[i]->effective_cpus,
>> housekeeping_cpumask(HK_TYPE_DOMAIN));
>> - else
>> - cpumask_copy(doms[i], csa[i]->effective_cpus);
>> + cpumask_andnot(doms[i], doms[i], unavailable_cpus);
>> + } else {
>> + cpumask_andnot(doms[i], csa[i]->effective_cpus, unavailable_cpus);
>> + }
>> if (dattr)
>> dattr[i] = SD_ATTR_INIT;
>> }
>> @@ -985,6 +1001,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
>> }
>> cpumask_or(dp, dp, csa[j]->effective_cpus);
>> cpumask_and(dp, dp, housekeeping_cpumask(HK_TYPE_DOMAIN));
>> + cpumask_andnot(dp, dp, unavailable_cpus);
>> if (dattr)
>> update_domain_attr_tree(dattr + nslot, csa[j]);
>> }
>> @@ -1418,6 +1435,17 @@ bool cpuset_cpu_is_isolated(int cpu)
>> }
>> EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
>> +/* Get the set of CPUs marked unavailable. */
>> +const struct cpumask *cpuset_unavailable_mask(void)
>> +{
>> + return unavailable_cpus;
>> +}
>> +
>> +bool cpuset_cpu_unavailable(int cpu)
>> +{
>> + return cpumask_test_cpu(cpu, unavailable_cpus);
>> +}
>> +
>> /**
>> * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
>> * @parent: Parent cpuset containing all siblings
>> @@ -2612,6 +2640,53 @@ static int update_exclusive_cpumask(struct cpuset *cs, struct cpuset *trialcs,
>> return 0;
>> }
>> +/**
>> + * update_exclusive_cpumask - update the exclusive_cpus mask of a cpuset
>> + * @cs: the cpuset to consider
>> + * @trialcs: trial cpuset
>> + * @buf: buffer of cpu numbers written to this cpuset
>> + *
>> + * The tasks' cpumask will be updated if cs is a valid partition root.
>> + */
>> +static int update_unavailable_cpumask(const char *buf)
>> +{
>> + cpumask_var_t tmp;
>> + int retval;
>> +
>> + if (!alloc_cpumask_var(&tmp, GFP_KERNEL))
>> + return -ENOMEM;
>> +
>> + retval = cpulist_parse(buf, tmp);
>> + if (retval < 0)
>> + goto out;
>> +
>> + /* Nothing to do if the CPUs didn't change */
>> + if (cpumask_equal(tmp, unavailable_cpus))
>> + goto out;
>> +
>> + /* Save the CPUs that went unavailable to push task out. */
>> + if (cpumask_andnot(available_tmp_mask, tmp, unavailable_cpus))
>> + cpu_turned_unavailable = true;
>> +
>> + cpumask_copy(unavailable_cpus, tmp);
>> + cpuset_force_rebuild();
>
> I think this rebuilding sched domains could add quite overhead.
I agree! But I somewhat dislike putting a cpumask_and() in a
bunch of places where we deal with the sched_domain when we can
simply adjust the sched_domain to account for it - it is
definitely not performant but IMO, it is somewhat cleaner.

But if CPUs are transitioning in and out of the paravirt mask
at such a high rate, wouldn't you just end up pushing the
tasks away only to soon pull them back?

What changes so suddenly in the hypervisor that a paravirt
CPU is now fully available after a sec or two?

On a sidenote, we do have vcpu_is_preempted() - isn't that
sufficient to steer tasks away if we start being a bit more
aggressive about it? Do we need a mask?
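For reference, the wakeup side already consults it through available_idle_cpu();
roughly (simplified from kernel/sched/core.c):

	int available_idle_cpu(int cpu)
	{
		if (!idle_cpu(cpu))
			return 0;

		/* A vCPU currently preempted by the host is not "available". */
		if (vcpu_is_preempted(cpu))
			return 0;

		return 1;
	}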
>
>> +out:
>> + free_cpumask_var(tmp);
>> + return retval;
>> +}
>> +
>> +static void cpuset_notify_unavailable_cpus(void)
>> +{
>> + /*
>> + * Prevent being preempted by the stopper if the local CPU
>> + * turned unavailable.
>> + */
>> + guard(preempt)();
>> +
>> + sched_fair_notify_unavaialable_cpus(available_tmp_mask);
>> + cpu_turned_unavailable = false;
>> +}
>> +
>> /*
>> * Migrate memory region from one set of nodes to another. This is
>> * performed asynchronously as it can be called from process migration path
>> @@ -3302,11 +3377,16 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>> char *buf, size_t nbytes, loff_t off)
>> {
>> struct cpuset *cs = css_cs(of_css(of));
>> + int file_type = of_cft(of)->private;
>> struct cpuset *trialcs;
>> int retval = -ENODEV;
>> - /* root is read-only */
>> - if (cs == &top_cpuset)
>> + /* root is read-only; except for unavailable mask */
>> + if (file_type != FILE_UNAVAILABLE_CPULIST && cs == &top_cpuset)
>> + return -EACCES;
>> +
>> + /* unavailable mask can be only set on root. */
>> + if (file_type == FILE_UNAVAILABLE_CPULIST && cs != &top_cpuset)
>> return -EACCES;
>> buf = strstrip(buf);
>> @@ -3330,6 +3410,9 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>> case FILE_MEMLIST:
>> retval = update_nodemask(cs, trialcs, buf);
>> break;
>> + case FILE_UNAVAILABLE_CPULIST:
>> + retval = update_unavailable_cpumask(buf);
>> + break;
>> default:
>> retval = -EINVAL;
>> break;
>> @@ -3338,6 +3421,8 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>> free_cpuset(trialcs);
>> if (force_sd_rebuild)
>> rebuild_sched_domains_locked();
>> + if (cpu_turned_unavailable)
>> + cpuset_notify_unavailable_cpus();
>> out_unlock:
>> cpuset_full_unlock();
>> if (of_cft(of)->private == FILE_MEMLIST)
>> @@ -3386,6 +3471,9 @@ int cpuset_common_seq_show(struct seq_file *sf, void *v)
>> case FILE_ISOLATED_CPULIST:
>> seq_printf(sf, "%*pbl\n", cpumask_pr_args(isolated_cpus));
>> break;
>> + case FILE_UNAVAILABLE_CPULIST:
>> + seq_printf(sf, "%*pbl\n", cpumask_pr_args(unavailable_cpus));
>> + break;
>> default:
>> ret = -EINVAL;
>> }
>> @@ -3524,6 +3612,15 @@ static struct cftype dfl_files[] = {
>> .flags = CFTYPE_ONLY_ON_ROOT,
>> },
>> + {
>> + .name = "cpus.unavailable",
>> + .seq_show = cpuset_common_seq_show,
>> + .write = cpuset_write_resmask,
>> + .max_write_len = (100U + 6 * NR_CPUS),
>> + .private = FILE_UNAVAILABLE_CPULIST,
>> + .flags = CFTYPE_ONLY_ON_ROOT,
>> + },
>> +
>> { } /* terminate */
>> };
>> @@ -3814,6 +3911,8 @@ int __init cpuset_init(void)
>> BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
>> BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
>> BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
>> + BUG_ON(!zalloc_cpumask_var(&unavailable_cpus, GFP_KERNEL));
>> + BUG_ON(!zalloc_cpumask_var(&available_tmp_mask, GFP_KERNEL));
>> cpumask_setall(top_cpuset.cpus_allowed);
>> nodes_setall(top_cpuset.mems_allowed);
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index ee7dfbf01792..13d0d9587aca 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2396,7 +2396,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>> /* Non kernel threads are not allowed during either online or offline. */
>> if (!(p->flags & PF_KTHREAD))
>> - return cpu_active(cpu);
>> + return (cpu_active(cpu) && !cpuset_cpu_unavailable(cpu));
>> /* KTHREAD_IS_PER_CPU is always allowed. */
>> if (kthread_is_per_cpu(p))
>> @@ -3451,6 +3451,26 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>> goto out;
>> }
>> + /*
>> + * Only user threads can be forced out of
>> + * unavaialable CPUs.
>> + */
>> + if (p->flags & PF_KTHREAD)
>> + goto rude;
>> +
>> + /* Any unavailable CPUs that can run the task? */
>> + for_each_cpu(dest_cpu, cpuset_unavailable_mask()) {
>> + if (!task_allowed_on_cpu(p, dest_cpu))
>> + continue;
>> +
>> + /* Can we hoist this up to goto rude? */
>> + if (is_migration_disabled(p))
>> + continue;
>> +
>> + if (cpu_active(dest_cpu))
>> + goto out;
>> + }
>> +rude:
>> /* No more Mr. Nice Guy. */
>> switch (state) {
>> case cpuset:
>> @@ -3766,7 +3786,7 @@ bool call_function_single_prep_ipi(int cpu)
>> * via sched_ttwu_wakeup() for activation so the wakee incurs the cost
>> * of the wakeup instead of the waker.
>> */
>> -static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
>> +void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
>> {
>> struct rq *rq = cpu_rq(cpu);
>> @@ -5365,7 +5385,9 @@ void sched_exec(void)
>> int dest_cpu;
>> scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
>> - dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), WF_EXEC);
>> + int wake_flags = WF_EXEC;
>> +
>> + dest_cpu = select_task_rq(p, task_cpu(p), &wake_flags);
>
> Whats this logic?
The WF_EXEC path would not care about the unavailable CPUs and won't run
the select_fallback_rq() path if sched_class->select_task_rq() is
called directly.
>
>> if (dest_cpu == smp_processor_id())
>> return;
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index da46c3164537..e502cccdae64 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -12094,6 +12094,61 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>> return ld_moved;
>> }
>> +static int unavailable_balance_cpu_stop(void *data)
>> +{
>> + struct task_struct *p, *tmp;
>> + struct rq *rq = data;
>> + int this_cpu = cpu_of(rq);
>> +
>> + guard(rq_lock_irq)(rq);
>> +
>> + list_for_each_entry_safe(p, tmp, &rq->cfs_tasks, se.group_node) {
>> + int target_cpu;
>> +
>> + /*
>> + * Bail out if a concurrent change to unavailable_mask turned
>> + * this CPU available.
>> + */
>> + rq->unavailable_balance = cpumask_test_cpu(this_cpu, cpuset_unavailable_mask());
>> + if (!rq->unavailable_balance)
>> + break;
>> +
>> + /* XXX: Does not deal with migration disabled tasks. */
>> + target_cpu = cpumask_first_andnot(p->cpus_ptr, cpuset_unavailable_mask());
>
> This can cause it to go first CPU always and then load balancer to move it later on.
> First should check the nodemask the current cpu is on to avoid NUMA costs.
Ack! I agree there is plenty of room for optimizations.
>
>> + if ((unsigned int)target_cpu < nr_cpumask_bits) {
>> + deactivate_task(rq, p, 0);
>> + set_task_cpu(p, target_cpu);
>> +
>> + /*
>> + * Switch to move_queued_task() later.
>> + * For PoC send an IPI and be done with it.
>> + */
>> + __ttwu_queue_wakelist(p, target_cpu, 0);
>> + }
>> + }
>> +
>> + rq->unavailable_balance = 0;
>> +
>> + return 0;
>> +}
>> +
>> +void sched_fair_notify_unavaialable_cpus(struct cpumask *unavailable_mask)
>> +{
>> + int cpu, this_cpu = smp_processor_id();
>> +
>> + for_each_cpu_wrap(cpu, unavailable_mask, this_cpu + 1) {
>> + struct rq *rq = cpu_rq(cpu);
>> +
>> + /* Balance in progress. Tasks will be pushed out. */
>> + if (rq->unavailable_balance)
>> + return;
>> +
>
> Need to run stopper, if there is active current task. otherise that work
> can be done here itself.
Ack! My thinking was to not take the rq_lock early, let the stopper
run, and then push all the queued fair tasks out with the rq_lock held.
>
>> + stop_one_cpu_nowait(cpu, unavailable_balance_cpu_stop,
>> + rq, &rq->unavailable_balance_work);
>> + rq->unavailable_balance = 1;
>> + }
>> +}
>> +
--
Thanks and Regards,
Prateek
On Wed, 2025-11-19 at 18:14 +0530, Shrikanth Hegde wrote:
> Detailed problem statement and some of the implementation choices were
> discussed earlier[1].
>
> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>
> [...]

The capability to temporarily exclude CPUs from scheduling might be
beneficial for s390x, where users often run Linux using a proprietary
hypervisor called PR/SM and with high overcommit. In these circumstances
virtual CPUs may not be scheduled by a hypervisor for a very long time.

Today we have an upstream feature called "Hiperdispatch", which determines
that this is about to happen and uses Capacity Aware Scheduling to prevent
processes from being placed on the affected CPUs. However, at least when
used for this purpose, Capacity Aware Scheduling is best effort and fails
to move tasks away from the affected CPUs under high load. Therefore I
have decided to smoke test this series.

For the purposes of smoke testing, I set up a number of KVM virtual
machines and start the same benchmark inside each one. Then I collect and
compare the aggregate throughput numbers. I have not done testing with
PR/SM yet, but I plan to do this and report back. I also have not tested
this with VMs that are not 100% utilized yet.
Benchmark parameters:

$ sysbench cpu run --threads=$(nproc) --time=10
$ schbench -r 10 --json --no-locking
$ hackbench --groups 10 --process --loops 5000
$ pgbench -h $WORKDIR --client=$(nproc) --time=10

Figures:

s390x (16 host CPUs):

Benchmark    #VMs   #CPUs/VM   ΔRPS (%)
-----------  ------ ---------- ----------
hackbench    16     4           60.58%
pgbench      16     4           50.01%
hackbench    8      8           46.18%
hackbench    4      8           43.54%
hackbench    2      16          43.23%
hackbench    12     4           42.92%
hackbench    8      4           35.53%
hackbench    4      16          30.98%
pgbench      12     4           18.41%
hackbench    2      24           7.32%
pgbench      8      4            6.84%
pgbench      2      24           3.38%
pgbench      2      16           3.02%
pgbench      4      16           2.08%
hackbench    2      32           1.46%
pgbench      4      8            1.30%
schbench     2      16           0.72%
schbench     4      8           -0.09%
schbench     4      4           -0.20%
schbench     8      8           -0.41%
sysbench     8      4           -0.46%
sysbench     4      8           -0.53%
schbench     8      4           -0.65%
sysbench     2      16          -0.76%
schbench     2      8           -0.77%
sysbench     8      8           -1.72%
schbench     2      24          -1.98%
schbench     12     4           -2.03%
sysbench     12     4           -2.13%
pgbench      2      32          -3.15%
sysbench     16     4           -3.17%
schbench     16     4           -3.50%
sysbench     2      8           -4.01%
pgbench      8      8           -4.10%
schbench     4      16          -5.93%
sysbench     4      4           -5.94%
pgbench      2      4           -6.40%
hackbench    2      8          -10.04%
hackbench    4      4          -10.91%
pgbench      4      4          -11.05%
sysbench     2      24         -13.07%
sysbench     4      16         -13.59%
hackbench    2      4          -13.96%
pgbench      2      8          -16.16%
schbench     2      4          -24.14%
schbench     2      32         -24.25%
sysbench     2      4          -24.98%
sysbench     2      32         -32.84%

x86_64 (32 host CPUs):

Benchmark    #VMs   #CPUs/VM   ΔRPS (%)
-----------  ------ ---------- ----------
hackbench    4      32          87.02%
hackbench    8      16          48.45%
hackbench    4      24          47.95%
hackbench    2      8           42.74%
hackbench    2      32          34.90%
pgbench      16     8           27.87%
pgbench      12     8           25.17%
hackbench    8      8           24.92%
hackbench    16     8           22.41%
hackbench    16     4           20.83%
pgbench      8      16          20.40%
hackbench    12     8           20.37%
hackbench    4      16          20.36%
pgbench      16     4           16.60%
pgbench      8      8           14.92%
hackbench    12     4           14.49%
pgbench      4      32           9.49%
pgbench      2      32           7.26%
hackbench    2      24           6.54%
pgbench      4      4            4.67%
pgbench      8      4            3.24%
pgbench      12     4            2.66%
hackbench    4      8            2.53%
pgbench      4      8            1.96%
hackbench    2      16           1.93%
schbench     4      32           1.24%
pgbench      2      8            0.82%
schbench     4      4            0.69%
schbench     2      32           0.44%
schbench     2      16           0.25%
schbench     12     8           -0.02%
sysbench     2      4           -0.02%
schbench     4      24          -0.12%
sysbench     2      16          -0.17%
schbench     12     4           -0.18%
schbench     2      4           -0.19%
sysbench     4      8           -0.23%
schbench     8      4           -0.24%
sysbench     2      8           -0.24%
schbench     4      8           -0.28%
sysbench     8      4           -0.30%
schbench     4      16          -0.37%
schbench     2      24          -0.39%
schbench     8      16          -0.49%
schbench     2      8           -0.67%
pgbench      4      16          -0.68%
schbench     8      8           -0.83%
sysbench     4      4           -0.92%
schbench     16     4           -0.94%
sysbench     12     4           -0.98%
sysbench     8      16          -1.52%
sysbench     16     4           -1.57%
pgbench      2      4           -1.62%
sysbench     12     8           -1.69%
schbench     16     8           -1.97%
sysbench     8      8           -2.08%
hackbench    8      4           -2.11%
pgbench      4      24          -3.20%
pgbench      2      24          -3.35%
sysbench     2      24          -3.81%
pgbench      2      16          -4.55%
sysbench     4      16          -5.10%
sysbench     16     8           -6.56%
sysbench     2      32          -8.24%
sysbench     4      32         -13.54%
sysbench     4      24         -13.62%
hackbench    2      4          -15.40%
hackbench    4      4          -17.71%

There are some huge wins, especially for hackbench, which corresponds
to Shrikanth's findings. There are some significant degradations too,
which I plan to debug. This may simply have to do with the simplistic
heuristic I am using for testing [1].

sysbench, for example, is not supposed to benefit from this series,
because it is not affected by overcommit. However, it definitely should
not degrade by 30%. Interestingly enough, this happens only with
certain combinations of VM and CPU counts, and this is reproducible.

Initially I have seen degradations as bad as -80% with schbench. It
turned out this was caused by userspace per-CPU locking it implements;
turning it off caused the degradation to go away.
To me this looks like something synthetic rather than something used by a
real-world application, but please correct me if I am wrong - then this
will have to be resolved.

One note regarding the PARAVIRT Kconfig gating: s390x does not select
PARAVIRT today. For example, we determine steal time based on CPU timers
and clocks, not on hypervisor hints. For now I had to add dummy paravirt
headers to test this series, but I would appreciate it if the Kconfig
gating was removed.

Others have already commented on the naming, and I would agree that
"paravirt" is really misleading. I cannot say that the previous
"cpu-avoid" name was perfect, but it was much better.

[1] https://github.com/iii-i/linux/commits/iii/poc/cpu-avoid/v3/
On 12/4/25 6:58 PM, Ilya Leoshkevich wrote: > On Wed, 2025-11-19 at 18:14 +0530, Shrikanth Hegde wrote: >> Detailed problem statement and some of the implementation choices >> were >> discussed earlier[1]. >> >> [1]: >> https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/ >> >> This is likely the version which would be used for LPC2025 discussion >> on >> this topic. Feel free to provide your suggestion and hoping for a >> solution >> that works for different architectures and it's use cases. >> >> All the existing alternatives such as cpu hotplug, creating isolated >> partitions etc break the user affinity. Since number of CPUs to use >> change >> depending on the steal time, it is not driven by User. Hence it would >> be >> wrong to break the affinity. This series allows if the task is pinned >> only paravirt CPUs, it will continue running there. >> >> Changes compared v3[1]: >> >> - Introduced computation of steal time in powerpc code. >> - Derive number of CPUs to use and mark the remaining as paravirt >> based >> on steal values. >> - Provide debugfs knobs to alter how steal time values being used. >> - Removed static key check for paravirt CPUs (Yury) >> - Removed preempt_disable/enable while calling stopper (Prateek) >> - Made select_idle_sibling and friends aware of paravirt CPUs. >> - Removed 3 unused schedstat fields and introduced 2 related to >> paravirt >> handling. >> - Handled nohz_full case by enabling tick on it when there is CFS/RT >> on >> it. >> - Updated helper patch to override arch behaviour for easier >> debugging >> during development. >> - Kept >> >> Changes compared to v4[2]: >> - Last two patches were sent out separate instead of being with >> series. >> That created confusion. Those two patches are debug patches one can >> make use to check functionality across acrhitectures. Sorry about >> that. >> - Use DEVICE_ATTR_RW instead (greg) >> - Made it as PATCH since arch specific handling completes the >> functionality. >> >> [2]: >> https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/ >> >> TODO: >> >> - Get performance numbers on PowerPC, x86 and S390. Hopefully by next >> week. Didn't want to hold the series till then. >> >> - The CPUs to mark as paravirt is very simple and doesn't work when >> vCPUs aren't spread out uniformly across NUMA nodes. Ideal would be >> splice >> the numbers based on how many CPUs each NUMA node has. It is quite >> tricky to do specially since cpumask can be on stack too. Given >> NR_CPUS can be 8192 and nr_possible_nodes 32. Haven't got my head >> into >> solving it yet. Maybe there is easier way. >> >> - DLPAR Add/Remove needs to call init of EC/VP cores (powerpc >> specific) >> >> - Userspace tools awareness such as irqbalance. >> >> - Delve into design of hint from Hyeprvisor(HW Hint). i.e Host >> informs >> guest which/how many CPUs it has to use at this moment. This >> interface >> should work across archs with each arch doing its specific >> handling. >> >> - Determine the default values for steal time related knobs >> empirically and document them. >> >> - Need to check safety against CPU hotplug specially in >> process_steal. >> >> >> Applies cleanly on tip/master: >> commit c2ef745151b21d4dcc4b29a1eabf1096f5ba544b >> >> >> Thanks to srikar for providing the initial code around powerpc steal >> time handling code. Thanks to all who went through and provided >> reviews. >> >> PS: I haven't found a better name. Please suggest if you have any. 
>> >> Shrikanth Hegde (17): >> sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept >> cpumask: Introduce cpu_paravirt_mask >> sched/core: Dont allow to use CPU marked as paravirt >> sched/debug: Remove unused schedstats >> sched/fair: Add paravirt movements for proc sched file >> sched/fair: Pass current cpu in select_idle_sibling >> sched/fair: Don't consider paravirt CPUs for wakeup and load >> balance >> sched/rt: Don't select paravirt CPU for wakeup and push/pull rt >> task >> sched/core: Add support for nohz_full CPUs >> sched/core: Push current task from paravirt CPU >> sysfs: Add paravirt CPU file >> powerpc: method to initialize ec and vp cores >> powerpc: enable/disable paravirt CPUs based on steal time >> powerpc: process steal values at fixed intervals >> powerpc: add debugfs file for controlling handling on steal values >> sysfs: Provide write method for paravirt >> sysfs: disable arch handling if paravirt file being written >> >> .../ABI/testing/sysfs-devices-system-cpu | 9 + >> Documentation/scheduler/sched-arch.rst | 37 +++ >> arch/powerpc/include/asm/smp.h | 1 + >> arch/powerpc/kernel/smp.c | 1 + >> arch/powerpc/platforms/pseries/lpar.c | 223 >> ++++++++++++++++++ >> arch/powerpc/platforms/pseries/pseries.h | 1 + >> drivers/base/cpu.c | 59 +++++ >> include/linux/cpumask.h | 20 ++ >> include/linux/sched.h | 9 +- >> kernel/sched/core.c | 106 ++++++++- >> kernel/sched/debug.c | 5 +- >> kernel/sched/fair.c | 42 +++- >> kernel/sched/rt.c | 11 +- >> kernel/sched/sched.h | 9 + >> 14 files changed, 519 insertions(+), 14 deletions(-) > > The capability to temporarily exclude CPUs from scheduling might be > beneficial for s390x, where users often run Linux using a proprietary > hypervisor called PR/SM and with high overcommit. In these > circumstances virtual CPUs may not be scheduled by a hypervisor for a > very long time. > > Today we have an upstream feature called "Hiperdispatch", which > determines that this is about to happen and uses Capacity Aware > Scheduling to prevent processes from being placed on the affected CPUs. > However, at least when used for this purpose, Capacity Aware Scheduling > is best effort and fails to move tasks away from the affected CPUs > under high load. > > Therefore I have decided to smoke test this series. > > For the purposes of smoke testing, I set up a number of KVM virtual > machines and start the same benchmark inside each one. Then I collect > and compare the aggregate throughput numbers. I have not done testing > with PR/SM yet, but I plan to do this and report back. I also have not > tested this with VMs that are not 100% utilized yet. > Best results would be when it works as HW hint from hypervisor. 
> Benchmark parameters: > > $ sysbench cpu run --threads=$(nproc) --time=10 > $ schbench -r 10 --json --no-locking > $ hackbench --groups 10 --process --loops 5000 > $ pgbench -h $WORKDIR --client=$(nproc) --time=10 > > Figures: > > s390x (16 host CPUs): > > Benchmark #VMs #CPUs/VM ΔRPS (%) > ----------- ------ ---------- ---------- > hackbench 16 4 60.58% > pgbench 16 4 50.01% > hackbench 8 8 46.18% > hackbench 4 8 43.54% > hackbench 2 16 43.23% > hackbench 12 4 42.92% > hackbench 8 4 35.53% > hackbench 4 16 30.98% > pgbench 12 4 18.41% > hackbench 2 24 7.32% > pgbench 8 4 6.84% > pgbench 2 24 3.38% > pgbench 2 16 3.02% > pgbench 4 16 2.08% > hackbench 2 32 1.46% > pgbench 4 8 1.30% > schbench 2 16 0.72% > schbench 4 8 -0.09% > schbench 4 4 -0.20% > schbench 8 8 -0.41% > sysbench 8 4 -0.46% > sysbench 4 8 -0.53% > schbench 8 4 -0.65% > sysbench 2 16 -0.76% > schbench 2 8 -0.77% > sysbench 8 8 -1.72% > schbench 2 24 -1.98% > schbench 12 4 -2.03% > sysbench 12 4 -2.13% > pgbench 2 32 -3.15% > sysbench 16 4 -3.17% > schbench 16 4 -3.50% > sysbench 2 8 -4.01% > pgbench 8 8 -4.10% > schbench 4 16 -5.93% > sysbench 4 4 -5.94% > pgbench 2 4 -6.40% > hackbench 2 8 -10.04% > hackbench 4 4 -10.91% > pgbench 4 4 -11.05% > sysbench 2 24 -13.07% > sysbench 4 16 -13.59% > hackbench 2 4 -13.96% > pgbench 2 8 -16.16% > schbench 2 4 -24.14% > schbench 2 32 -24.25% > sysbench 2 4 -24.98% > sysbench 2 32 -32.84% > > x86_64 (32 host CPUs): > > Benchmark #VMs #CPUs/VM ΔRPS (%) > ----------- ------ ---------- ---------- > hackbench 4 32 87.02% > hackbench 8 16 48.45% > hackbench 4 24 47.95% > hackbench 2 8 42.74% > hackbench 2 32 34.90% > pgbench 16 8 27.87% > pgbench 12 8 25.17% > hackbench 8 8 24.92% > hackbench 16 8 22.41% > hackbench 16 4 20.83% > pgbench 8 16 20.40% > hackbench 12 8 20.37% > hackbench 4 16 20.36% > pgbench 16 4 16.60% > pgbench 8 8 14.92% > hackbench 12 4 14.49% > pgbench 4 32 9.49% > pgbench 2 32 7.26% > hackbench 2 24 6.54% > pgbench 4 4 4.67% > pgbench 8 4 3.24% > pgbench 12 4 2.66% > hackbench 4 8 2.53% > pgbench 4 8 1.96% > hackbench 2 16 1.93% > schbench 4 32 1.24% > pgbench 2 8 0.82% > schbench 4 4 0.69% > schbench 2 32 0.44% > schbench 2 16 0.25% > schbench 12 8 -0.02% > sysbench 2 4 -0.02% > schbench 4 24 -0.12% > sysbench 2 16 -0.17% > schbench 12 4 -0.18% > schbench 2 4 -0.19% > sysbench 4 8 -0.23% > schbench 8 4 -0.24% > sysbench 2 8 -0.24% > schbench 4 8 -0.28% > sysbench 8 4 -0.30% > schbench 4 16 -0.37% > schbench 2 24 -0.39% > schbench 8 16 -0.49% > schbench 2 8 -0.67% > pgbench 4 16 -0.68% > schbench 8 8 -0.83% > sysbench 4 4 -0.92% > schbench 16 4 -0.94% > sysbench 12 4 -0.98% > sysbench 8 16 -1.52% > sysbench 16 4 -1.57% > pgbench 2 4 -1.62% > sysbench 12 8 -1.69% > schbench 16 8 -1.97% > sysbench 8 8 -2.08% > hackbench 8 4 -2.11% > pgbench 4 24 -3.20% > pgbench 2 24 -3.35% > sysbench 2 24 -3.81% > pgbench 2 16 -4.55% > sysbench 4 16 -5.10% > sysbench 16 8 -6.56% > sysbench 2 32 -8.24% > sysbench 4 32 -13.54% > sysbench 4 24 -13.62% > hackbench 2 4 -15.40% > hackbench 4 4 -17.71% > > There are some huge wins, especially for hackbench, which corresponds > to Shrikanth's findings. There are some significant degradations too, > which I plan to debug. This may simply have to do with the simplistic > heuristic I am using for testing [1]. > Thank you very much!! for running these numbers. > sysbench, for example, is not supposed to benefit from this series, > because it is not affected by overcommit. However, it definitely should > not degrade by 30%. 
> Interestingly enough, this happens only with certain combinations of
> VM and CPU counts, and this is reproducible.

Is the host bare metal? In those cases the cpufreq governor ramping up or
down might play a role. (speculating)

> Initially I have seen degradations as bad as -80% with schbench. It
> turned out this was caused by userspace per-CPU locking it implements;
> turning it off caused the degradation to go away. To me this looks like
> something synthetic and not something used by real-world application,
> but please correct me if I am wrong - then this will have to be
> resolved.

That's nice to hear. I was concerned about the schbench RPS numbers; now
I am a bit relieved. Is this with the schbench -L option? I ran with it,
and the regression I was seeing earlier is gone now.

> One note regarding the PARAVIRT Kconfig gating: s390x does not
> select PARAVIRT today. For example, steal time we determine based on
> CPU timers and clocks, and not hypervisor hints. For now I had to add
> dummy paravirt headers to test this series. But I would appreciate if
> Kconfig gating was removed.

Keeping the PARAVIRT checks is probably the right thing. I will wait to
see if anyone objects.

> Others have already commented on the naming, and I would agree that
> "paravirt" is really misleading. I cannot say that the previous "cpu-
> avoid" one was perfect, but it was much better.
>
> [1] https://github.com/iii-i/linux/commits/iii/poc/cpu-avoid/v3/

Will look into it. One thing to be careful about is the CPU numbers.
On Fri, Dec 05, 2025 at 11:00:18AM +0530, Shrikanth Hegde wrote:
>
>
> On 12/4/25 6:58 PM, Ilya Leoshkevich wrote:
> > On Wed, 2025-11-19 at 18:14 +0530, Shrikanth Hegde wrote:

...

> > Others have already commented on the naming, and I would agree that
> > "paravirt" is really misleading. I cannot say that the previous "cpu-
> > avoid" one was perfect, but it was much better.

It was my suggestion to switch names. cpu-avoid is definitely a no-go,
because it doesn't explain anything and only confuses.

I suggested 'paravirt' (notice - only suggested) because the patch
series is mainly discussing paravirtualized VMs. But now I'm not even
sure that:

1. The idea of the series is applicable only to paravirtualized VMs; and
2. Preemption and rescheduling throttling requires another in-kernel
   concept beyond nohz, isolcpus, cgroups and similar.

Shrikanth, can you please clarify the scope of the new feature? Would
it be useful for non-paravirtualized VMs, for example? Any other
task-cpu bonding problems?

On previous rounds you tried to implement the same with cgroups, as
far as I understood. Can you discuss that? What exactly can't be done
with the existing kernel APIs?

Thanks,
Yury

> > [1] https://github.com/iii-i/linux/commits/iii/poc/cpu-avoid/v3/
>
> Will look into it. One thing to be careful about is the CPU numbers.
Hi, sorry for the delay in response. I just landed back from LPC yesterday.
>>> Others have already commented on the naming, and I would agree that
>>> "paravirt" is really misleading. I cannot say that the previous "cpu-
>>> avoid" one was perfect, but it was much better.
>
> It was my suggestion to switch names. cpu-avoid is definitely a
> no-go. Because it doesn't explain anything and only confuses.
>
> I suggested 'paravirt' (notice - only suggested) because the patch
> series is mainly discussing paravirtualized VMs. But now I'm not even
> sure that the idea of the series is:
>
> 1. Applicable only to paravirtualized VMs; and
> 2. Preemption and rescheduling throttling requires another in-kernel
> concept other than nohs, isolcpus, cgroups and similar.
>
> Shrikanth, can you please clarify the scope of the new feature? Would
> it be useful for non-paravirtualized VMs, for example? Any other
> task-cpu bonding problems?
The current scope of the feature is virtualized environments, where the idea
is to do co-operative folding in each VM based on a hint (either a HW hint or
steal time). Seen from a macro level, this is a framework that allows one to
avoid some vCPUs (in the guest) to achieve better throughput or latency. So
one could come up with more use cases even in non-paravirtualized VMs. For
example, one crazy idea is to avoid using SMT siblings when system
utilization is low, in order to achieve a higher IPC (instructions per cycle).
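[Editorial note: to make the steal-time-driven folding concrete, here is a
minimal sketch, under assumptions, of how an architecture might size the
usable CPU set from observed steal time. avg_steal_ratio(),
mark_cpus_paravirt() and min_usable_cpus are hypothetical names invented for
the illustration, not the interface of this series (the powerpc code works on
EC/VP cores rather than single CPUs); the watermark values are simply the
knob settings reported with the PowerPC numbers later in this thread.]

	/*
	 * Illustrative sketch only: every name except num_online_cpus() is
	 * hypothetical.  Shrink the set of usable CPUs while the observed
	 * steal ratio stays above a high watermark, grow it again once the
	 * ratio drops below a low watermark.
	 */
	static unsigned int steal_ratio_high = 400;	/* assumed debugfs watermarks */
	static unsigned int steal_ratio_low = 150;
	static unsigned int min_usable_cpus = 1;
	static unsigned int nr_usable;

	unsigned int avg_steal_ratio(void);		 /* hypothetical: steal averaged per check */
	void mark_cpus_paravirt(unsigned int nr_usable); /* hypothetical: updates cpu_paravirt_mask */

	static void process_steal_sample(void)
	{
		unsigned int ratio = avg_steal_ratio();

		if (ratio > steal_ratio_high && nr_usable > min_usable_cpus)
			nr_usable--;		/* too much contention: fold one CPU */
		else if (ratio < steal_ratio_low && nr_usable < num_online_cpus())
			nr_usable++;		/* contention eased: unfold one CPU */

		/* CPUs beyond nr_usable end up set in cpu_paravirt_mask. */
		mark_cpus_paravirt(nr_usable);
	}

In this reading, the steal_check_frequency, steal_ratio_high and
steal_ratio_low knobs mentioned with the PowerPC results correspond to the
sampling interval and the two watermarks of such a loop.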
>
> On previous rounds you tried to implement the same with cgroups, as
> far as I understood. Can you discuss that? What exactly can't be done
> with the existing kernel APIs?
>
> Thanks,
> Yury
>
We discussed this in Sched-MC this year.
https://youtu.be/zf-MBoUIz1Q?t=8581
Options explored so far:

1. CPU hotplug - slow. Some efforts are underway to speed it up.
2. Creating isolated cpusets - faster, but still involves sched domain rebuilds.

The reason neither works is that they break user affinities in the guest.
That is, the guest can do "taskset -c <some_vcpus> <workload>"; when the
last vCPU in that list goes offline (guest vCPU hotplug), the affinity mask
is reset, the workload can run on any online vCPU, and the mask is not
restored to its earlier value. That is acceptable for hotplug or isolated
cpusets, since those are driven by the user in the guest, so the user is
aware of it. Here, however, the change is driven by the system rather than
by the user in the guest, so it cannot break user-space affinities.

So we need a new interface to drive this. I think it is better as a
non-cgroup-based framework, since cgroups are usually user driven.
(correct me if I am wrong). A minimal sketch of the affinity-preserving
behaviour follows below.
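[Editorial note: the following is only a sketch of the behaviour described
above, not the series' actual code. It assumes the cpu_paravirt_mask
introduced by the series can be tested with cpumask_test_cpu(); the function
name is invented.]

	/*
	 * Sketch: pick a CPU from the task's existing affinity mask,
	 * preferring CPUs not marked paravirt.  If the task is pinned only
	 * to paravirt CPUs, keep honouring the user's mask instead of
	 * rewriting it (unlike hotplug or isolated cpusets, which reset it).
	 */
	static int pick_cpu_respecting_paravirt(struct task_struct *p)
	{
		int cpu;

		for_each_cpu(cpu, p->cpus_ptr) {
			if (!cpumask_test_cpu(cpu, cpu_paravirt_mask))
				return cpu;	/* usable CPU inside the user's mask */
		}

		/* Only paravirt CPUs allowed: affinity wins, task stays there. */
		return cpumask_any(p->cpus_ptr);
	}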
PS:
There was some confusion around this affinity breaking. Note that it is the
guest vCPU being marked and the guest vCPU being hotplugged, while a
task-affined workload was running in the guest. Host CPUs (pCPUs) are not
hotplugged.
---
I had a hallway discussion with Vincent; the idea is to use the push
framework bits, set the CPU capacity to 1 (the lowest value, treated as a
special value), and use a static key check so this is done only when the HW
says to do so. For example (keeping the name paravirt):
static inline bool cpu_paravirt(int cpu)
{
	if (static_branch_unlikely(&cpu_paravirt_framework))
		return arch_scale_cpu_capacity(cpu) == 1;
	return false;
}
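[Editorial note: for context, a hedged example of where such a helper might
be consulted; the scan function below is invented for illustration and is
not part of the series.]

	/*
	 * Example only: skip capacity-folded ("paravirt") CPUs while
	 * scanning for an idle target.  With the static key disabled
	 * (e.g. bare metal), cpu_paravirt() reduces to a not-taken branch,
	 * so the common case stays effectively free.
	 */
	static int find_idle_non_paravirt_cpu(const struct cpumask *mask)
	{
		int cpu;

		for_each_cpu(cpu, mask) {
			if (cpu_paravirt(cpu))
				continue;
			if (idle_cpu(cpu))
				return cpu;
		}
		return -1;
	}

Presumably part of the appeal of the capacity == 1 encoding is that existing
capacity-aware paths would also naturally steer load away from such CPUs.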
The rest of the bits remain the same. I found an issue with the current
series where setting affinity goes wrong after a CPU is marked paravirt; I
will fix it in the next version. I will do some more testing and send the
next version in 2026.
Happy Holidays!
On 11/19/25 6:14 PM, Shrikanth Hegde wrote:
> Detailed problem statement and some of the implementation choices were
> discussed earlier[1].

Performance data on x86 and PowerPC:

++++++++++++++++++++++++++++++++++++++++++++++++
PowerPC: LPAR (VM) running on the PowerVM hypervisor
++++++++++++++++++++++++++++++++++++++++++++++++

Host: 126 cores available in the pool.
VM1: 96VP/64EC - 768 CPUs
VM2: 72VP/48EC - 576 CPUs
(VP - Virtual Processor core), (EC - Entitled Cores)

steal_check_frequency:1
steal_ratio_high:400
steal_ratio_low:150

Scenarios:

Scenario 1: (major improvement)
VM1 is running daytrader[1] and VM2 is running stress-ng --cpu=$(nproc)

Note: High gains. Upstream, the steal time was around 15%; with the series
it comes down to 3%. With further tuning it could be reduced further.

                        upstream    +series
daytrader throughput    1x          1.7x    <<- 70% gain

-----------
Scenario 2: (improves when thread_count < num_cpus)
VM1 is running schbench and VM2 is running stress-ng --cpu=$(nproc)

Note: Values are the average of 5 runs and are wakeup latencies.

schbench -t 400    upstream     +series
50.0th:            18.00        16.60
90.0th:            174.00       46.80
99.0th:            3197.60      928.80
99.9th:            6203.20      4539.20
average rps:       39665.61     42334.65

schbench -t 600    upstream     +series
50.0th:            23.80        19.80
90.0th:            917.20       439.00
99.0th:            5582.40      3869.60
99.9th:            8982.40      6574.40
average rps:       39541.00     40018.11

-----------
Scenario 3: (improves)
VM1 is running hackbench and VM2 is running stress-ng --cpu=$(nproc)

Note: Values are the average of 10 runs with 20000 loops.

Process 10 groups          2.84     2.62
Process 20 groups          5.39     4.48
Process 30 groups          7.51     6.29
Process 40 groups          9.88     7.42
Process 50 groups          12.46    9.54
Process 60 groups          14.76    12.09
thread 10 groups           2.93     2.70
thread 20 groups           5.79     4.78
Process(Pipe) 10 groups    2.31     2.18
Process(Pipe) 20 groups    3.32     3.26
Process(Pipe) 30 groups    4.19     4.14
Process(Pipe) 40 groups    5.18     5.53
Process(Pipe) 50 groups    6.57     6.80
Process(Pipe) 60 groups    8.21     8.13
thread(Pipe) 10 groups     2.42     2.24
thread(Pipe) 20 groups     3.62     3.42

-----------
Notes:
The numbers might be very favorable since VM2 is constantly running and has
some CPUs marked as paravirt whenever there is steal time; the thresholds
might also have played a role. I plan to run the same workloads, i.e.
hackbench and schbench, on both VMs and see the behavior.

VM1 has its CPUs distributed equally across NUMA nodes, while VM2 does not.
Since CPUs are marked paravirt based on core count, some nodes on VM2 would
have been left unused, and that could have added a boost to VM1 performance,
especially for daytrader.

[1]: Daytrader is a real-life benchmark which does stock trading simulation.
https://www.ibm.com/docs/en/linux-on-systems?topic=descriptions-daytrader-benchmark-application
https://cwiki.apache.org/confluence/display/GMOxDOC12/Daytrader

TODO: Get numbers with very high concurrency of hackbench/schbench.

+++++++++++++++++++++++++++++++
on x86_64 (Laptop running KVMs)
+++++++++++++++++++++++++++++++

Host: 8 CPUs. Two VMs, each spawned with -smp 8.

-----------
Scenario 1:
Both VMs are running hackbench with 10 process groups and 10000 loops.
Values are the average of 3 runs. Steal time close to 50% was seen when
running upstream, so CPUs 4-7 were marked as paravirt by writing to the
sysfs file. Since the laptop has a lot of host tasks running, there will
still be some steal time.

hackbench 10 groups    upstream    +series (4-7 marked as paravirt)
(seconds)              58          54.42

Note: Having 5 groups helps too. But when concurrency gets very high
(40 groups), it regresses.

-----------
Scenario 2:
Both VMs are running schbench. Values are the average of 2 runs.

"schbench -t 4 -r 30 -i 30" (latencies improve but rps is slightly lower)

wakeup latencies    upstream    +series (4-7 marked as paravirt)
50.0th              25.5        13.5
90.0th              70.0        30.0
99.0th              2588.0      1992.0
99.9th              3844.0      6032.0
average rps:        338         326

"schbench -t 8 -r 30 -i 30" (major degradation of rps)

wakeup latencies    upstream    +series (4-7 marked as paravirt)
50.0th              15.0        11.5
90.0th              1630.0      2844.0
99.0th              4314.0      6624.0
99.9th              8572.0      10896.0
average rps:        393         240.5

Anything higher also regresses. Need to see why that might be. Maybe there
are too many context switches, since the number of threads is high and
fewer CPUs are available.
On Wed, Nov 19, 2025 at 06:14:32PM +0530, Shrikanth Hegde wrote: > Detailed problem statement and some of the implementation choices were > discussed earlier[1]. > > [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/ > > This is likely the version which would be used for LPC2025 discussion on > this topic. Feel free to provide your suggestion and hoping for a solution > that works for different architectures and it's use cases. > > All the existing alternatives such as cpu hotplug, creating isolated > partitions etc break the user affinity. Since number of CPUs to use change > depending on the steal time, it is not driven by User. Hence it would be > wrong to break the affinity. This series allows if the task is pinned > only paravirt CPUs, it will continue running there. > > Changes compared v3[1]: There is no "v" for this series :(
Hi Greg. On 11/24/25 10:35 PM, Greg KH wrote: > On Wed, Nov 19, 2025 at 06:14:32PM +0530, Shrikanth Hegde wrote: >> Detailed problem statement and some of the implementation choices were >> discussed earlier[1]. >> >> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/ >> >> This is likely the version which would be used for LPC2025 discussion on >> this topic. Feel free to provide your suggestion and hoping for a solution >> that works for different architectures and it's use cases. >> >> All the existing alternatives such as cpu hotplug, creating isolated >> partitions etc break the user affinity. Since number of CPUs to use change >> depending on the steal time, it is not driven by User. Hence it would be >> wrong to break the affinity. This series allows if the task is pinned >> only paravirt CPUs, it will continue running there. >> >> Changes compared v3[1]: > > There is no "v" for this series :( > I thought about adding v1. I made it as PATCH from RFC PATCH since functionally it should be complete now with arch bits. Since it is v1, I remember usually people send out without adding v1. after v1 had tags such as v2. I will keep v2 for the next series.
Hi Shrikanth, Le 25/11/2025 à 03:39, Shrikanth Hegde a écrit : > Hi Greg. > > On 11/24/25 10:35 PM, Greg KH wrote: >> On Wed, Nov 19, 2025 at 06:14:32PM +0530, Shrikanth Hegde wrote: >>> Detailed problem statement and some of the implementation choices were >>> discussed earlier[1]. >>> >>> [1]: https://eur01.safelinks.protection.outlook.com/? >>> url=https%3A%2F%2Flore.kernel.org%2Fall%2F20250910174210.1969750-1- >>> sshegde%40linux.ibm.com%2F&data=05%7C02%7Cchristophe.leroy%40csgroup.eu%7Cc7e5a5830fcb4c796d4808de2bcbe09d%7C8b87af7d86474dc78df45f69a2011bb5%7C0%7C0%7C638996351808032890%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=cV8RTPdV3So1GwQ9uVYgUuGxSfxutSezpaNBq6RYn%2FI%3D&reserved=0 >>> >>> This is likely the version which would be used for LPC2025 discussion on >>> this topic. Feel free to provide your suggestion and hoping for a >>> solution >>> that works for different architectures and it's use cases. >>> >>> All the existing alternatives such as cpu hotplug, creating isolated >>> partitions etc break the user affinity. Since number of CPUs to use >>> change >>> depending on the steal time, it is not driven by User. Hence it would be >>> wrong to break the affinity. This series allows if the task is pinned >>> only paravirt CPUs, it will continue running there. >>> >>> Changes compared v3[1]: >> >> There is no "v" for this series :( >> > > I thought about adding v1. > > I made it as PATCH from RFC PATCH since functionally it should > be complete now with arch bits. Since it is v1, I remember usually > people send out without adding v1. after v1 had tags such as v2. > > I will keep v2 for the next series. > But you are listing changes compared to v3, how can it be a v1 ? Shouldn't it be a v4 ? Or in reality a v5 as you already sent a v4 here [1]. [1] https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/ Christophe
Hi Christophe, Greg

>>>
>>> There is no "v" for this series :(
>>>
>> I thought about adding v1.
>>
>> I made it as PATCH from RFC PATCH since functionally it should
>> be complete now with arch bits. Since it is v1, I remember usually
>> people send out without adding v1. after v1 had tags such as v2.
>>
>> I will keep v2 for the next series.
>>
> But you are listing changes compared to v3, how can it be a v1 ?
> Shouldn't it be a v4 ? Or in reality a v5 as you already sent a v4 here
> [1].
>
> [1] https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/
>
> Christophe

Sorry about the confusion in numbers. Hopefully the log below helps for
reviewing. If there are no objections, I will keep the next one as v2.
Please let me know.

Revision logs:

++++++++++++++++++++++++++++++++++++++
RFC PATCH v4 -> PATCH (This series)
++++++++++++++++++++++++++++++++++++++
- Last two patches were sent out separate instead of being with series.
  Sent it as part of series.
- Use DEVICE_ATTR_RW instead (greg)
- Made it as PATCH since arch specific handling completes the functionality.

+++++++++++++++++++++++++++++++++
RFC PATCH v3 -> RFC PATCH v4
+++++++++++++++++++++++++++++++++
- Introduced computation of steal time in powerpc code.
- Derive number of CPUs to use and mark the remaining as paravirt based
  on steal values.
- Provide debugfs knobs to alter how steal time values being used.
- Removed static key check for paravirt CPUs (Yury)
- Removed preempt_disable/enable while calling stopper (Prateek)
- Made select_idle_sibling and friends aware of paravirt CPUs.
- Removed 3 unused schedstat fields and introduced 2 related to paravirt
  handling.
- Handled nohz_full case by enabling tick on it when there is CFS/RT on it.
- Updated debug patch to override arch behavior for easier debugging
  during development.
- Kept the method to push only current task out instead of moving all
  task's on rq given the complexity of later.

+++++++++++++++++++++++++++++++++
RFC v2 -> RFC PATCH v3
+++++++++++++++++++++++++++++++++
- Renamed to paravirt_cpus_mask
- Folded the changes under CONFIG_PARAVIRT.
- Fixed the crash due work_buf corruption while using stop_one_cpu_nowait.
- Added sysfs documentation.
- Copy most of __balance_push_cpu_stop to new one, this helps it move the
  code out of CONFIG_HOTPLUG_CPU.
- Some of the code movement suggested.

+++++++++++++++++++++++++++++++++
RFC PATCH -> RFC v2
+++++++++++++++++++++++++++++++++
- Renamed to cpu_avoid_mask in place of cpu_parked_mask.
- Used a static key such that no impact to regular case.
- add sysfs file to show avoid CPUs.
- Make RT understand avoid CPUs.
- Add documentation patch
- Took care of reported compile error when NR_CPUS=1

PATCH        : https://lore.kernel.org/all/20251119124449.1149616-1-sshegde@linux.ibm.com/
RFC PATCH v4 : https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/#r
RFC PATCH v3 : https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/#r
RFC v2       : https://lore.kernel.org/all/20250625191108.1646208-1-sshegde@linux.ibm.com/#r
RFC PATCH    : https://lore.kernel.org/all/20250523181448.3777233-1-sshegde@linux.ibm.com/