[PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption

Shrikanth Hegde posted 17 patches 2 months, 2 weeks ago
There is a newer version of this series
[PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
Posted by Shrikanth Hegde 2 months, 2 weeks ago
Detailed problem statement and some of the implementation choices were 
discussed earlier[1].

[1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/

This is likely the version which would be used for the LPC2025 discussion on
this topic. Feel free to provide your suggestions, hoping for a solution
that works for different architectures and their use cases.

All the existing alternatives such as CPU hotplug, creating isolated
partitions etc. break the user's affinity. Since the number of CPUs to use
changes depending on the steal time, it is not driven by the user, hence it
would be wrong to break the affinity. With this series, if a task is pinned
only to paravirt CPUs, it will continue running there.
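
In other words, the rule the series aims for could be sketched roughly as
below (a simplified illustration, not the exact code from the patches;
cpu_usable_for() is just a made-up helper name here):

/*
 * Simplified illustration: avoid CPUs in cpu_paravirt_mask for task
 * placement unless the task's affinity lies entirely within it.
 */
static inline bool cpu_usable_for(struct task_struct *p, int cpu)
{
	/* Task is pinned only to paravirt CPUs: let it keep running there. */
	if (cpumask_subset(p->cpus_ptr, cpu_paravirt_mask))
		return true;

	return !cpumask_test_cpu(cpu, cpu_paravirt_mask);
}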

Changes compared to v3[1]:

- Introduced computation of steal time in powerpc code.
- Derive the number of CPUs to use and mark the remaining as paravirt based
  on steal values (a rough illustration follows after this list).
- Provide debugfs knobs to alter how the steal time values are being used.
- Removed static key check for paravirt CPUs (Yury)
- Removed preempt_disable/enable while calling stopper (Prateek)
- Made select_idle_sibling and friends aware of paravirt CPUs.
- Removed 3 unused schedstat fields and introduced 2 related to paravirt
  handling.
- Handled the nohz_full case by enabling the tick on such CPUs when there
  are CFS/RT tasks on them.
- Updated helper patch to override arch behaviour for easier debugging
  during development.
- Kept 
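
As a rough illustration of the steal-based derivation mentioned above (not
the exact formula or thresholds used in the powerpc patches; steal_pct is
an assumed averaged steal percentage):

/*
 * Illustration only: with an averaged steal percentage steal_pct (0-100),
 * keep roughly (100 - steal_pct)% of the online CPUs and mark the rest as
 * paravirt. The real derivation lives in the powerpc steal handling code.
 */
static unsigned int nr_paravirt_cpus(unsigned int steal_pct)
{
	unsigned int online = num_online_cpus();
	unsigned int to_use = online * (100 - steal_pct) / 100;

	return online - to_use;
}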

Changes compared to v4[2]:
- Last two patches were sent out separately instead of with the series.
  That created confusion. Those two patches are debug patches one can use
  to check functionality across architectures. Sorry about that.
- Used DEVICE_ATTR_RW instead (Greg).
- Made it a PATCH since the arch-specific handling completes the
  functionality.

[2]: https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/

TODO: 

- Get performance numbers on PowerPC, x86 and S390. Hopefully by next
  week. Didn't want to hold the series till then.

- The way CPUs are chosen to be marked as paravirt is very simple and
  doesn't work when vCPUs aren't spread out uniformly across NUMA nodes.
  Ideally the count would be split based on how many CPUs each NUMA node
  has. That is quite tricky to do, especially since the cpumask can be on
  the stack too, given NR_CPUS can be 8192 and nr_possible_nodes 32.
  Haven't got my head around solving it yet. Maybe there is an easier way;
  a rough sketch of the idea follows after this list.

- DLPAR Add/Remove needs to call init of EC/VP cores (powerpc specific)

- Userspace tools such as irqbalance need to be made aware.

- Delve into the design of a hint from the hypervisor (HW hint), i.e. the
  host informs the guest which/how many CPUs it should use at this moment.
  This interface should work across archs with each arch doing its specific
  handling.

- Determine the default values for steal time related knobs
  empirically and document them.

- Need to check safety against CPU hotplug, especially in process_steal.
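
Roughly what I have in mind for the per-node split (untested sketch;
mark_cpu_paravirt() is a hypothetical helper and the rounding remainder
still needs handling):

/*
 * Untested sketch: split nr_paravirt proportionally across NUMA nodes
 * based on each node's online CPU count, iterating instead of building
 * intermediate cpumasks so nothing large lands on the stack.
 */
static void mark_paravirt_per_node(unsigned int nr_paravirt)
{
	unsigned int total = num_online_cpus();
	int node;

	for_each_online_node(node) {
		const struct cpumask *node_mask = cpumask_of_node(node);
		/* This node's proportional share of the paravirt CPUs. */
		unsigned int take = nr_paravirt *
			cpumask_weight_and(node_mask, cpu_online_mask) / total;
		int cpu;

		for_each_cpu_and(cpu, node_mask, cpu_online_mask) {
			if (!take)
				break;
			mark_cpu_paravirt(cpu);	/* hypothetical helper */
			take--;
		}
	}
}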


Applies cleanly on tip/master:
commit c2ef745151b21d4dcc4b29a1eabf1096f5ba544b


Thanks to Srikar for providing the initial code around powerpc steal time
handling. Thanks to all who went through the series and provided reviews.

PS: I haven't found a better name. Please suggest if you have any.

Shrikanth Hegde (17):
  sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
  cpumask: Introduce cpu_paravirt_mask
  sched/core: Dont allow to use CPU marked as paravirt
  sched/debug: Remove unused schedstats
  sched/fair: Add paravirt movements for proc sched file
  sched/fair: Pass current cpu in select_idle_sibling
  sched/fair: Don't consider paravirt CPUs for wakeup and load balance
  sched/rt: Don't select paravirt CPU for wakeup and push/pull rt task
  sched/core: Add support for nohz_full CPUs
  sched/core: Push current task from paravirt CPU
  sysfs: Add paravirt CPU file
  powerpc: method to initialize ec and vp cores
  powerpc: enable/disable paravirt CPUs based on steal time
  powerpc: process steal values at fixed intervals
  powerpc: add debugfs file for controlling handling on steal values
  sysfs: Provide write method for paravirt
  sysfs: disable arch handling if paravirt file being written

 .../ABI/testing/sysfs-devices-system-cpu      |   9 +
 Documentation/scheduler/sched-arch.rst        |  37 +++
 arch/powerpc/include/asm/smp.h                |   1 +
 arch/powerpc/kernel/smp.c                     |   1 +
 arch/powerpc/platforms/pseries/lpar.c         | 223 ++++++++++++++++++
 arch/powerpc/platforms/pseries/pseries.h      |   1 +
 drivers/base/cpu.c                            |  59 +++++
 include/linux/cpumask.h                       |  20 ++
 include/linux/sched.h                         |   9 +-
 kernel/sched/core.c                           | 106 ++++++++-
 kernel/sched/debug.c                          |   5 +-
 kernel/sched/fair.c                           |  42 +++-
 kernel/sched/rt.c                             |  11 +-
 kernel/sched/sched.h                          |   9 +
 14 files changed, 519 insertions(+), 14 deletions(-)

-- 
2.47.3
Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
Posted by K Prateek Nayak 2 months ago
On 11/19/2025 6:14 PM, Shrikanth Hegde wrote:
> Detailed problem statement and some of the implementation choices were 
> discussed earlier[1].
> 
> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
> 
> This is likely the version which would be used for LPC2025 discussion on
> this topic. Feel free to provide your suggestion and hoping for a solution
> that works for different architectures and it's use cases.
> 
> All the existing alternatives such as cpu hotplug, creating isolated
> partitions etc break the user affinity. Since number of CPUs to use change
> depending on the steal time, it is not driven by User. Hence it would be
> wrong to break the affinity. This series allows if the task is pinned
> only paravirt CPUs, it will continue running there.

If maintaining task affinity is the only thing that cpusets don't offer,
attached below is a very naive prototype that seems to work in my case
without hitting any obvious splats so far.

The idea is to keep task affinity untouched, but remove the CPUs from
the sched domains.

That way, all the balancing and wakeups will steer away from these CPUs
automatically, but once the CPUs are put back, the balancing will
automatically move tasks back.

I tested this with a bunch of spinners and with partitions and both
seem to work as expected. For real world VM based testing, I pinned 2
6C/12C VMs to a 8C/16T LLC with 1:1 pinning - 2 virtual cores from
either VMs pin to same set of physical cores.

Running 8 groups of perf bench sched messaging on each VM at the same
time gives the following numbers for total runtime:

All CPUs available in the VM:      88.775s & 91.002s  (2 cores overlap)
Only 4 cores available in the VM:  67.365s & 73.015s  (No cores overlap)

Note: The unavailable mask didn't change in my runs. I've noticed a
bit of delay before the load balancer moves the tasks to the CPU
going from unavailable to available - your mileage may vary depending
on the frequency of mask updates.

Following is the diff on top of tip/master:

(Very raw PoC; Only fair tasks are considered for now to push away)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 2ddb256187b5..7c1cfdd7ffea 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -174,6 +174,10 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 }
 
 extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
+
+void sched_fair_notify_unavaialable_cpus(struct cpumask *unavailable_mask);
+const struct cpumask *cpuset_unavailable_mask(void);
+bool cpuset_cpu_unavailable(int cpu);
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index 337608f408ce..170aba16141e 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -59,6 +59,7 @@ typedef enum {
 	FILE_EXCLUSIVE_CPULIST,
 	FILE_EFFECTIVE_XCPULIST,
 	FILE_ISOLATED_CPULIST,
+	FILE_UNAVAILABLE_CPULIST,
 	FILE_CPU_EXCLUSIVE,
 	FILE_MEM_EXCLUSIVE,
 	FILE_MEM_HARDWALL,
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 4aaad07b0bd1..22d38f2299c4 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -87,6 +87,19 @@ static cpumask_var_t	isolated_cpus;
 static cpumask_var_t	boot_hk_cpus;
 static bool		have_boot_isolcpus;
 
+/*
+ * CPUs that may be unavailable to run tasks as a result of physical
+ * constraints (vCPU being preempted, pCPU handling interrupt storm).
+ *
+ * Unlike isolated_cpus, the unavailable_cpus are simply excluded from
+ * HK_TYPE_DOMAIN but leave the tasks affinity untouched. These CPUs
+ * should be avoided unless the task has specifically asked to be run
+ * only on these CPUs.
+ */
+static cpumask_var_t	unavailable_cpus;
+static cpumask_var_t	available_tmp_mask;	/* For intermediate operations. */
+static bool 		cpu_turned_unavailable;
+
 /* List of remote partition root children */
 static struct list_head remote_children;
 
@@ -844,6 +857,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
 		}
 		cpumask_and(doms[0], top_cpuset.effective_cpus,
 			    housekeeping_cpumask(HK_TYPE_DOMAIN));
+		cpumask_andnot(doms[0], doms[0], unavailable_cpus);
 
 		goto done;
 	}
@@ -960,11 +974,13 @@ static int generate_sched_domains(cpumask_var_t **domains,
 			 * The top cpuset may contain some boot time isolated
 			 * CPUs that need to be excluded from the sched domain.
 			 */
-			if (csa[i] == &top_cpuset)
+			if (csa[i] == &top_cpuset) {
 				cpumask_and(doms[i], csa[i]->effective_cpus,
 					    housekeeping_cpumask(HK_TYPE_DOMAIN));
-			else
-				cpumask_copy(doms[i], csa[i]->effective_cpus);
+				cpumask_andnot(doms[i], doms[i], unavailable_cpus);
+			 } else {
+				cpumask_andnot(doms[i], csa[i]->effective_cpus, unavailable_cpus);
+			 }
 			if (dattr)
 				dattr[i] = SD_ATTR_INIT;
 		}
@@ -985,6 +1001,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
 				}
 				cpumask_or(dp, dp, csa[j]->effective_cpus);
 				cpumask_and(dp, dp, housekeeping_cpumask(HK_TYPE_DOMAIN));
+				cpumask_andnot(dp, dp, unavailable_cpus);
 				if (dattr)
 					update_domain_attr_tree(dattr + nslot, csa[j]);
 			}
@@ -1418,6 +1435,17 @@ bool cpuset_cpu_is_isolated(int cpu)
 }
 EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
 
+/* Get the set of CPUs marked unavailable. */
+const struct cpumask *cpuset_unavailable_mask(void)
+{
+	return unavailable_cpus;
+}
+
+bool cpuset_cpu_unavailable(int cpu)
+{
+	return  cpumask_test_cpu(cpu, unavailable_cpus);
+}
+
 /**
  * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
  * @parent: Parent cpuset containing all siblings
@@ -2612,6 +2640,53 @@ static int update_exclusive_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	return 0;
 }
 
+/**
+ * update_exclusive_cpumask - update the exclusive_cpus mask of a cpuset
+ * @cs: the cpuset to consider
+ * @trialcs: trial cpuset
+ * @buf: buffer of cpu numbers written to this cpuset
+ *
+ * The tasks' cpumask will be updated if cs is a valid partition root.
+ */
+static int update_unavailable_cpumask(const char *buf)
+{
+	cpumask_var_t tmp;
+	int retval;
+
+	if (!alloc_cpumask_var(&tmp, GFP_KERNEL))
+		return -ENOMEM;
+
+	retval = cpulist_parse(buf, tmp);
+	if (retval < 0)
+		goto out;
+
+	/* Nothing to do if the CPUs didn't change */
+	if (cpumask_equal(tmp, unavailable_cpus))
+		goto out;
+
+	/* Save the CPUs that went unavailable to push task out. */
+	if (cpumask_andnot(available_tmp_mask, tmp, unavailable_cpus))
+		cpu_turned_unavailable = true;
+
+	cpumask_copy(unavailable_cpus, tmp);
+	cpuset_force_rebuild();
+out:
+	free_cpumask_var(tmp);
+	return retval;
+}
+
+static void cpuset_notify_unavailable_cpus(void)
+{
+	/*
+	 * Prevent being preempted by the stopper if the local CPU
+	 * turned unavailable.
+	 */
+	guard(preempt)();
+
+	sched_fair_notify_unavaialable_cpus(available_tmp_mask);
+	cpu_turned_unavailable = false;
+}
+
 /*
  * Migrate memory region from one set of nodes to another.  This is
  * performed asynchronously as it can be called from process migration path
@@ -3302,11 +3377,16 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 				    char *buf, size_t nbytes, loff_t off)
 {
 	struct cpuset *cs = css_cs(of_css(of));
+	int file_type = of_cft(of)->private;
 	struct cpuset *trialcs;
 	int retval = -ENODEV;
 
-	/* root is read-only */
-	if (cs == &top_cpuset)
+	/* root is read-only; except for unavailable mask */
+	if (file_type != FILE_UNAVAILABLE_CPULIST && cs == &top_cpuset)
+		return -EACCES;
+
+	/* unavailable mask can be only set on root. */
+	if (file_type == FILE_UNAVAILABLE_CPULIST && cs != &top_cpuset)
 		return -EACCES;
 
 	buf = strstrip(buf);
@@ -3330,6 +3410,9 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	case FILE_MEMLIST:
 		retval = update_nodemask(cs, trialcs, buf);
 		break;
+	case FILE_UNAVAILABLE_CPULIST:
+		retval = update_unavailable_cpumask(buf);
+		break;
 	default:
 		retval = -EINVAL;
 		break;
@@ -3338,6 +3421,8 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	free_cpuset(trialcs);
 	if (force_sd_rebuild)
 		rebuild_sched_domains_locked();
+	if (cpu_turned_unavailable)
+		cpuset_notify_unavailable_cpus();
 out_unlock:
 	cpuset_full_unlock();
 	if (of_cft(of)->private == FILE_MEMLIST)
@@ -3386,6 +3471,9 @@ int cpuset_common_seq_show(struct seq_file *sf, void *v)
 	case FILE_ISOLATED_CPULIST:
 		seq_printf(sf, "%*pbl\n", cpumask_pr_args(isolated_cpus));
 		break;
+	case FILE_UNAVAILABLE_CPULIST:
+		seq_printf(sf, "%*pbl\n", cpumask_pr_args(unavailable_cpus));
+		break;
 	default:
 		ret = -EINVAL;
 	}
@@ -3524,6 +3612,15 @@ static struct cftype dfl_files[] = {
 		.flags = CFTYPE_ONLY_ON_ROOT,
 	},
 
+	{
+		.name = "cpus.unavailable",
+		.seq_show = cpuset_common_seq_show,
+		.write = cpuset_write_resmask,
+		.max_write_len = (100U + 6 * NR_CPUS),
+		.private = FILE_UNAVAILABLE_CPULIST,
+		.flags = CFTYPE_ONLY_ON_ROOT,
+	},
+
 	{ }	/* terminate */
 };
 
@@ -3814,6 +3911,8 @@ int __init cpuset_init(void)
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
+	BUG_ON(!zalloc_cpumask_var(&unavailable_cpus, GFP_KERNEL));
+	BUG_ON(!zalloc_cpumask_var(&available_tmp_mask, GFP_KERNEL));
 
 	cpumask_setall(top_cpuset.cpus_allowed);
 	nodes_setall(top_cpuset.mems_allowed);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee7dfbf01792..13d0d9587aca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2396,7 +2396,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 
 	/* Non kernel threads are not allowed during either online or offline. */
 	if (!(p->flags & PF_KTHREAD))
-		return cpu_active(cpu);
+		return (cpu_active(cpu) && !cpuset_cpu_unavailable(cpu));
 
 	/* KTHREAD_IS_PER_CPU is always allowed. */
 	if (kthread_is_per_cpu(p))
@@ -3451,6 +3451,26 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 			goto out;
 		}
 
+		/*
+		 * Only user threads can be forced out of
+		 * unavaialable CPUs.
+		 */
+		if (p->flags & PF_KTHREAD)
+			goto rude;
+
+		/* Any unavailable CPUs that can run the task? */
+		for_each_cpu(dest_cpu, cpuset_unavailable_mask()) {
+			if (!task_allowed_on_cpu(p, dest_cpu))
+				continue;
+
+			/* Can we hoist this up to goto rude? */
+			if (is_migration_disabled(p))
+				continue;
+
+			if (cpu_active(dest_cpu))
+				goto out;
+		}
+rude:
 		/* No more Mr. Nice Guy. */
 		switch (state) {
 		case cpuset:
@@ -3766,7 +3786,7 @@ bool call_function_single_prep_ipi(int cpu)
  * via sched_ttwu_wakeup() for activation so the wakee incurs the cost
  * of the wakeup instead of the waker.
  */
-static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
+void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
 	struct rq *rq = cpu_rq(cpu);
 
@@ -5365,7 +5385,9 @@ void sched_exec(void)
 	int dest_cpu;
 
 	scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
-		dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), WF_EXEC);
+		int wake_flags = WF_EXEC;
+
+		dest_cpu = select_task_rq(p, task_cpu(p), &wake_flags);
 		if (dest_cpu == smp_processor_id())
 			return;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..e502cccdae64 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12094,6 +12094,61 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 	return ld_moved;
 }
 
+static int unavailable_balance_cpu_stop(void *data)
+{
+	struct task_struct *p, *tmp;
+	struct rq *rq = data;
+	int this_cpu = cpu_of(rq);
+
+	guard(rq_lock_irq)(rq);
+
+	list_for_each_entry_safe(p, tmp, &rq->cfs_tasks, se.group_node) {
+		int target_cpu;
+
+		/*
+		 * Bail out if a concurrent change to unavailable_mask turned
+		 * this CPU available.
+		 */
+		rq->unavailable_balance = cpumask_test_cpu(this_cpu, cpuset_unavailable_mask());
+		if (!rq->unavailable_balance)
+			break;
+
+		/* XXX: Does not deal with migration disabled tasks. */
+		target_cpu = cpumask_first_andnot(p->cpus_ptr, cpuset_unavailable_mask());
+		if ((unsigned int)target_cpu < nr_cpumask_bits) {
+			deactivate_task(rq, p, 0);
+			set_task_cpu(p, target_cpu);
+
+			/*
+			 * Switch to move_queued_task() later.
+			 * For PoC send an IPI and be done with it.
+			 */
+			__ttwu_queue_wakelist(p, target_cpu, 0);
+		}
+	}
+
+	rq->unavailable_balance = 0;
+
+	return 0;
+}
+
+void sched_fair_notify_unavaialable_cpus(struct cpumask *unavailable_mask)
+{
+	int cpu, this_cpu = smp_processor_id();
+
+	for_each_cpu_wrap(cpu, unavailable_mask, this_cpu + 1) {
+		struct rq *rq = cpu_rq(cpu);
+
+		/* Balance in progress. Tasks will be pushed out. */
+		if (rq->unavailable_balance)
+			return;
+
+		stop_one_cpu_nowait(cpu, unavailable_balance_cpu_stop,
+				    rq, &rq->unavailable_balance_work);
+		rq->unavailable_balance = 1;
+	}
+}
+
 static inline unsigned long
 get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
 {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cb80666addec..c21ffb128734 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1221,6 +1221,10 @@ struct rq {
 	int			push_cpu;
 	struct cpu_stop_work	active_balance_work;
 
+	/* For pushing out taks from unavailable CPUs. */
+	struct cpu_stop_work	unavailable_balance_work;
+	int			unavailable_balance;
+
 	/* CPU of this runqueue: */
 	int			cpu;
 	int			online;
@@ -2413,6 +2417,8 @@ extern const u32		sched_prio_to_wmult[40];
 
 #define RETRY_TASK		((void *)-1UL)
 
+void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags);
+
 struct affinity_context {
 	const struct cpumask	*new_mask;
 	struct cpumask		*user_mask;

base-commit: 5e8f8a25efb277ac6f61f553f0c533ff1402bd7c
-- 
Thanks and Regards,
Prateek
Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
Posted by Shrikanth Hegde 2 months ago
Hi Prateek.

Thank you very much for going through the series.

On 12/8/25 10:17 AM, K Prateek Nayak wrote:
> On 11/19/2025 6:14 PM, Shrikanth Hegde wrote:
>> Detailed problem statement and some of the implementation choices were
>> discussed earlier[1].
>>
>> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>>
>> This is likely the version which would be used for LPC2025 discussion on
>> this topic. Feel free to provide your suggestion and hoping for a solution
>> that works for different architectures and it's use cases.
>>
>> All the existing alternatives such as cpu hotplug, creating isolated
>> partitions etc break the user affinity. Since number of CPUs to use change
>> depending on the steal time, it is not driven by User. Hence it would be
>> wrong to break the affinity. This series allows if the task is pinned
>> only paravirt CPUs, it will continue running there.
> 
> If maintaining task affinity is the only problem that cpusets don't
> offer, attached below is a very naive prototype that seems to work in
> my case without hitting any obvious splats so far.
> 
> Idea is to keep task affinity untouched, but remove the CPUs from
> the sched domains.
> 
> That way, all the balancing, and wakeups will steer away from these
> CPUs automatically but once the CPUs are put back, the balancing will
> automatically move tasks back.
> 
> I tested this with a bunch of spinners and with partitions and both
> seem to work as expected. For real world VM based testing, I pinned 2
> 6C/12C VMs to a 8C/16T LLC with 1:1 pinning - 2 virtual cores from
> either VMs pin to same set of physical cores.
> 
> Running 8 groups of perf bench sched messaging on each VM at the same
> time gives the following numbers for total runtime:
> 
> All CPUs available in the VM:      88.775s & 91.002s  (2 cores overlap)
> Only 4 cores available in the VM:  67.365s & 73.015s  (No cores overlap)
> 
> Note: The unavailable mask didn't change in my runs. I've noticed a
> bit of delay before the load balancer moves the tasks to the CPU
> going from unavailable to available - your mileage may vary depending

Depends on the scale of the system. I have seen that unfolding is slower
compared to folding on large systems.

> on the frequency of mask updates.
> 

What do you mean by "The unavailable mask didn't change in my runs"?
If so, how did it take effect?

> Following is the diff on top of tip/master:
> 
> (Very raw PoC; Only fair tasks are considered for now to push away)
> 

I skimmed through it. It is very close to the current approach.

Advantage:
It happens immediately instead of waiting for the tick.
The current approach too can move all the tasks at one tick;
the concern could be high latency and races around the list.

Disadvantages:

It causes a sched domain rebuild, which is known to be expensive on large systems.
But since steal time changes are not very aggressive at this point, this overhead
may be ok.

Keeping the interface in cpuset may be tricky. There could be multiple cpusets,
and complications with different versions too. Especially since you can have
cpusets in a nested fashion. And all of this is not user driven; I think cpuset
is inherently user driven.

The implementation looks more complicated to me, at least at this point.

The current PoC needs to be enhanced to add arch-specific triggers. That is doable.

> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 2ddb256187b5..7c1cfdd7ffea 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -174,6 +174,10 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>   }
>   
>   extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
> +
> +void sched_fair_notify_unavaialable_cpus(struct cpumask *unavailable_mask);
> +const struct cpumask *cpuset_unavailable_mask(void);
> +bool cpuset_cpu_unavailable(int cpu);
>   #else /* !CONFIG_CPUSETS */
>   
>   static inline bool cpusets_enabled(void) { return false; }
> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
> index 337608f408ce..170aba16141e 100644
> --- a/kernel/cgroup/cpuset-internal.h
> +++ b/kernel/cgroup/cpuset-internal.h
> @@ -59,6 +59,7 @@ typedef enum {
>   	FILE_EXCLUSIVE_CPULIST,
>   	FILE_EFFECTIVE_XCPULIST,
>   	FILE_ISOLATED_CPULIST,
> +	FILE_UNAVAILABLE_CPULIST,
>   	FILE_CPU_EXCLUSIVE,
>   	FILE_MEM_EXCLUSIVE,
>   	FILE_MEM_HARDWALL,
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 4aaad07b0bd1..22d38f2299c4 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -87,6 +87,19 @@ static cpumask_var_t	isolated_cpus;
>   static cpumask_var_t	boot_hk_cpus;
>   static bool		have_boot_isolcpus;
>   
> +/*
> + * CPUs that may be unavailable to run tasks as a result of physical
> + * constraints (vCPU being preempted, pCPU handling interrupt storm).
> + *
> + * Unlike isolated_cpus, the unavailable_cpus are simply excluded from
> + * HK_TYPE_DOMAIN but leave the tasks affinity untouched. These CPUs
> + * should be avoided unless the task has specifically asked to be run
> + * only on these CPUs.
> + */
> +static cpumask_var_t	unavailable_cpus;
> +static cpumask_var_t	available_tmp_mask;	/* For intermediate operations. */
> +static bool 		cpu_turned_unavailable;
> +

This 'unavailable' name is probably not right. When the system boots, there
are available CPUs and that set is fixed and not expected to change. It can
confuse users.

>   /* List of remote partition root children */
>   static struct list_head remote_children;
>   
> @@ -844,6 +857,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
>   		}
>   		cpumask_and(doms[0], top_cpuset.effective_cpus,
>   			    housekeeping_cpumask(HK_TYPE_DOMAIN));
> +		cpumask_andnot(doms[0], doms[0], unavailable_cpus);
>   
>   		goto done;
>   	}
> @@ -960,11 +974,13 @@ static int generate_sched_domains(cpumask_var_t **domains,
>   			 * The top cpuset may contain some boot time isolated
>   			 * CPUs that need to be excluded from the sched domain.
>   			 */
> -			if (csa[i] == &top_cpuset)
> +			if (csa[i] == &top_cpuset) {
>   				cpumask_and(doms[i], csa[i]->effective_cpus,
>   					    housekeeping_cpumask(HK_TYPE_DOMAIN));
> -			else
> -				cpumask_copy(doms[i], csa[i]->effective_cpus);
> +				cpumask_andnot(doms[i], doms[i], unavailable_cpus);
> +			 } else {
> +				cpumask_andnot(doms[i], csa[i]->effective_cpus, unavailable_cpus);
> +			 }
>   			if (dattr)
>   				dattr[i] = SD_ATTR_INIT;
>   		}
> @@ -985,6 +1001,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
>   				}
>   				cpumask_or(dp, dp, csa[j]->effective_cpus);
>   				cpumask_and(dp, dp, housekeeping_cpumask(HK_TYPE_DOMAIN));
> +				cpumask_andnot(dp, dp, unavailable_cpus);
>   				if (dattr)
>   					update_domain_attr_tree(dattr + nslot, csa[j]);
>   			}
> @@ -1418,6 +1435,17 @@ bool cpuset_cpu_is_isolated(int cpu)
>   }
>   EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
>   
> +/* Get the set of CPUs marked unavailable. */
> +const struct cpumask *cpuset_unavailable_mask(void)
> +{
> +	return unavailable_cpus;
> +}
> +
> +bool cpuset_cpu_unavailable(int cpu)
> +{
> +	return  cpumask_test_cpu(cpu, unavailable_cpus);
> +}
> +
>   /**
>    * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
>    * @parent: Parent cpuset containing all siblings
> @@ -2612,6 +2640,53 @@ static int update_exclusive_cpumask(struct cpuset *cs, struct cpuset *trialcs,
>   	return 0;
>   }
>   
> +/**
> + * update_exclusive_cpumask - update the exclusive_cpus mask of a cpuset
> + * @cs: the cpuset to consider
> + * @trialcs: trial cpuset
> + * @buf: buffer of cpu numbers written to this cpuset
> + *
> + * The tasks' cpumask will be updated if cs is a valid partition root.
> + */
> +static int update_unavailable_cpumask(const char *buf)
> +{
> +	cpumask_var_t tmp;
> +	int retval;
> +
> +	if (!alloc_cpumask_var(&tmp, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	retval = cpulist_parse(buf, tmp);
> +	if (retval < 0)
> +		goto out;
> +
> +	/* Nothing to do if the CPUs didn't change */
> +	if (cpumask_equal(tmp, unavailable_cpus))
> +		goto out;
> +
> +	/* Save the CPUs that went unavailable to push task out. */
> +	if (cpumask_andnot(available_tmp_mask, tmp, unavailable_cpus))
> +		cpu_turned_unavailable = true;
> +
> +	cpumask_copy(unavailable_cpus, tmp);
> +	cpuset_force_rebuild();

I think this rebuilding of sched domains could add quite a bit of overhead.

> +out:
> +	free_cpumask_var(tmp);
> +	return retval;
> +}
> +
> +static void cpuset_notify_unavailable_cpus(void)
> +{
> +	/*
> +	 * Prevent being preempted by the stopper if the local CPU
> +	 * turned unavailable.
> +	 */
> +	guard(preempt)();
> +
> +	sched_fair_notify_unavaialable_cpus(available_tmp_mask);
> +	cpu_turned_unavailable = false;
> +}
> +
>   /*
>    * Migrate memory region from one set of nodes to another.  This is
>    * performed asynchronously as it can be called from process migration path
> @@ -3302,11 +3377,16 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>   				    char *buf, size_t nbytes, loff_t off)
>   {
>   	struct cpuset *cs = css_cs(of_css(of));
> +	int file_type = of_cft(of)->private;
>   	struct cpuset *trialcs;
>   	int retval = -ENODEV;
>   
> -	/* root is read-only */
> -	if (cs == &top_cpuset)
> +	/* root is read-only; except for unavailable mask */
> +	if (file_type != FILE_UNAVAILABLE_CPULIST && cs == &top_cpuset)
> +		return -EACCES;
> +
> +	/* unavailable mask can be only set on root. */
> +	if (file_type == FILE_UNAVAILABLE_CPULIST && cs != &top_cpuset)
>   		return -EACCES;
>   
>   	buf = strstrip(buf);
> @@ -3330,6 +3410,9 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>   	case FILE_MEMLIST:
>   		retval = update_nodemask(cs, trialcs, buf);
>   		break;
> +	case FILE_UNAVAILABLE_CPULIST:
> +		retval = update_unavailable_cpumask(buf);
> +		break;
>   	default:
>   		retval = -EINVAL;
>   		break;
> @@ -3338,6 +3421,8 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>   	free_cpuset(trialcs);
>   	if (force_sd_rebuild)
>   		rebuild_sched_domains_locked();
> +	if (cpu_turned_unavailable)
> +		cpuset_notify_unavailable_cpus();
>   out_unlock:
>   	cpuset_full_unlock();
>   	if (of_cft(of)->private == FILE_MEMLIST)
> @@ -3386,6 +3471,9 @@ int cpuset_common_seq_show(struct seq_file *sf, void *v)
>   	case FILE_ISOLATED_CPULIST:
>   		seq_printf(sf, "%*pbl\n", cpumask_pr_args(isolated_cpus));
>   		break;
> +	case FILE_UNAVAILABLE_CPULIST:
> +		seq_printf(sf, "%*pbl\n", cpumask_pr_args(unavailable_cpus));
> +		break;
>   	default:
>   		ret = -EINVAL;
>   	}
> @@ -3524,6 +3612,15 @@ static struct cftype dfl_files[] = {
>   		.flags = CFTYPE_ONLY_ON_ROOT,
>   	},
>   
> +	{
> +		.name = "cpus.unavailable",
> +		.seq_show = cpuset_common_seq_show,
> +		.write = cpuset_write_resmask,
> +		.max_write_len = (100U + 6 * NR_CPUS),
> +		.private = FILE_UNAVAILABLE_CPULIST,
> +		.flags = CFTYPE_ONLY_ON_ROOT,
> +	},
> +
>   	{ }	/* terminate */
>   };
>   
> @@ -3814,6 +3911,8 @@ int __init cpuset_init(void)
>   	BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
>   	BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
>   	BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
> +	BUG_ON(!zalloc_cpumask_var(&unavailable_cpus, GFP_KERNEL));
> +	BUG_ON(!zalloc_cpumask_var(&available_tmp_mask, GFP_KERNEL));
>   
>   	cpumask_setall(top_cpuset.cpus_allowed);
>   	nodes_setall(top_cpuset.mems_allowed);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ee7dfbf01792..13d0d9587aca 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2396,7 +2396,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>   
>   	/* Non kernel threads are not allowed during either online or offline. */
>   	if (!(p->flags & PF_KTHREAD))
> -		return cpu_active(cpu);
> +		return (cpu_active(cpu) && !cpuset_cpu_unavailable(cpu));
>   
>   	/* KTHREAD_IS_PER_CPU is always allowed. */
>   	if (kthread_is_per_cpu(p))
> @@ -3451,6 +3451,26 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>   			goto out;
>   		}
>   
> +		/*
> +		 * Only user threads can be forced out of
> +		 * unavaialable CPUs.
> +		 */
> +		if (p->flags & PF_KTHREAD)
> +			goto rude;
> +
> +		/* Any unavailable CPUs that can run the task? */
> +		for_each_cpu(dest_cpu, cpuset_unavailable_mask()) {
> +			if (!task_allowed_on_cpu(p, dest_cpu))
> +				continue;
> +
> +			/* Can we hoist this up to goto rude? */
> +			if (is_migration_disabled(p))
> +				continue;
> +
> +			if (cpu_active(dest_cpu))
> +				goto out;
> +		}
> +rude:
>   		/* No more Mr. Nice Guy. */
>   		switch (state) {
>   		case cpuset:
> @@ -3766,7 +3786,7 @@ bool call_function_single_prep_ipi(int cpu)
>    * via sched_ttwu_wakeup() for activation so the wakee incurs the cost
>    * of the wakeup instead of the waker.
>    */
> -static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
> +void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
>   {
>   	struct rq *rq = cpu_rq(cpu);
>   
> @@ -5365,7 +5385,9 @@ void sched_exec(void)
>   	int dest_cpu;
>   
>   	scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
> -		dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), WF_EXEC);
> +		int wake_flags = WF_EXEC;
> +
> +		dest_cpu = select_task_rq(p, task_cpu(p), &wake_flags);

What's this logic?

>   		if (dest_cpu == smp_processor_id())
>   			return;
>   
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da46c3164537..e502cccdae64 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12094,6 +12094,61 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>   	return ld_moved;
>   }
>   
> +static int unavailable_balance_cpu_stop(void *data)
> +{
> +	struct task_struct *p, *tmp;
> +	struct rq *rq = data;
> +	int this_cpu = cpu_of(rq);
> +
> +	guard(rq_lock_irq)(rq);
> +
> +	list_for_each_entry_safe(p, tmp, &rq->cfs_tasks, se.group_node) {
> +		int target_cpu;
> +
> +		/*
> +		 * Bail out if a concurrent change to unavailable_mask turned
> +		 * this CPU available.
> +		 */
> +		rq->unavailable_balance = cpumask_test_cpu(this_cpu, cpuset_unavailable_mask());
> +		if (!rq->unavailable_balance)
> +			break;
> +
> +		/* XXX: Does not deal with migration disabled tasks. */
> +		target_cpu = cpumask_first_andnot(p->cpus_ptr, cpuset_unavailable_mask());

This can cause tasks to always go to the first CPU and then the load balancer
has to move them later on. It should first check the node the current CPU is
on to avoid NUMA costs.
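
Something along these lines maybe (untested sketch; keeps the first usable
CPU as a fallback but prefers one on the local node):

		int node = cpu_to_node(this_cpu);
		int candidate;

		target_cpu = nr_cpu_ids;
		for_each_cpu_andnot(candidate, p->cpus_ptr, cpuset_unavailable_mask()) {
			/* Remember the first usable CPU as a fallback. */
			if (target_cpu >= nr_cpu_ids)
				target_cpu = candidate;
			/* Prefer a CPU on the same NUMA node as this one. */
			if (cpu_to_node(candidate) == node) {
				target_cpu = candidate;
				break;
			}
		}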

> +		if ((unsigned int)target_cpu < nr_cpumask_bits) {
> +			deactivate_task(rq, p, 0);
> +			set_task_cpu(p, target_cpu);
> +
> +			/*
> +			 * Switch to move_queued_task() later.
> +			 * For PoC send an IPI and be done with it.
> +			 */
> +			__ttwu_queue_wakelist(p, target_cpu, 0);
> +		}
> +	}
> +
> +	rq->unavailable_balance = 0;
> +
> +	return 0;
> +}
> +
> +void sched_fair_notify_unavaialable_cpus(struct cpumask *unavailable_mask)
> +{
> +	int cpu, this_cpu = smp_processor_id();
> +
> +	for_each_cpu_wrap(cpu, unavailable_mask, this_cpu + 1) {
> +		struct rq *rq = cpu_rq(cpu);
> +
> +		/* Balance in progress. Tasks will be pushed out. */
> +		if (rq->unavailable_balance)
> +			return;
> +

Need to run the stopper only if there is an active current task; otherwise
that work can be done here itself.

> +		stop_one_cpu_nowait(cpu, unavailable_balance_cpu_stop,
> +				    rq, &rq->unavailable_balance_work);
> +		rq->unavailable_balance = 1;
> +	}
> +}
> +
>   static inline unsigned long
>   get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
>   {
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index cb80666addec..c21ffb128734 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1221,6 +1221,10 @@ struct rq {
>   	int			push_cpu;
>   	struct cpu_stop_work	active_balance_work;
>   
> +	/* For pushing out taks from unavailable CPUs. */
> +	struct cpu_stop_work	unavailable_balance_work;
> +	int			unavailable_balance;
> +
>   	/* CPU of this runqueue: */
>   	int			cpu;
>   	int			online;
> @@ -2413,6 +2417,8 @@ extern const u32		sched_prio_to_wmult[40];
>   
>   #define RETRY_TASK		((void *)-1UL)
>   
> +void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags);
> +
>   struct affinity_context {
>   	const struct cpumask	*new_mask;
>   	struct cpumask		*user_mask;
> 
> base-commit: 5e8f8a25efb277ac6f61f553f0c533ff1402bd7c
Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
Posted by K Prateek Nayak 2 months ago
Hello Shrikanth,

Thank you for taking a look at the PoC.

On 12/8/2025 3:27 PM, Shrikanth Hegde wrote:
> Hi Prateek.
> 
> Thank you very much for going throguh the series.
> 
> On 12/8/25 10:17 AM, K Prateek Nayak wrote:
>> On 11/19/2025 6:14 PM, Shrikanth Hegde wrote:
>>> Detailed problem statement and some of the implementation choices were
>>> discussed earlier[1].
>>>
>>> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>>>
>>> This is likely the version which would be used for LPC2025 discussion on
>>> this topic. Feel free to provide your suggestion and hoping for a solution
>>> that works for different architectures and it's use cases.
>>>
>>> All the existing alternatives such as cpu hotplug, creating isolated
>>> partitions etc break the user affinity. Since number of CPUs to use change
>>> depending on the steal time, it is not driven by User. Hence it would be
>>> wrong to break the affinity. This series allows if the task is pinned
>>> only paravirt CPUs, it will continue running there.
>>
>> If maintaining task affinity is the only problem that cpusets don't
>> offer, attached below is a very naive prototype that seems to work in
>> my case without hitting any obvious splats so far.
>>
>> Idea is to keep task affinity untouched, but remove the CPUs from
>> the sched domains.
>>
>> That way, all the balancing, and wakeups will steer away from these
>> CPUs automatically but once the CPUs are put back, the balancing will
>> automatically move tasks back.
>>
>> I tested this with a bunch of spinners and with partitions and both
>> seem to work as expected. For real world VM based testing, I pinned 2
>> 6C/12C VMs to a 8C/16T LLC with 1:1 pinning - 2 virtual cores from
>> either VMs pin to same set of physical cores.
>>
>> Running 8 groups of perf bench sched messaging on each VM at the same
>> time gives the following numbers for total runtime:
>>
>> All CPUs available in the VM:      88.775s & 91.002s  (2 cores overlap)
>> Only 4 cores available in the VM:  67.365s & 73.015s  (No cores overlap)
>>
>> Note: The unavailable mask didn't change in my runs. I've noticed a
>> bit of delay before the load balancer moves the tasks to the CPU
>> going from unavailable to available - your mileage may vary depending
> 
> Depends on the scale of systems. I have seen it unfolding is slower
> compared to folding on large systems.
> 
>> on the frequency of mask updates.
>>
> 
> What do you mean "The unavailable mask didn't change in my runs" ?
> If so, how did it take effect?

The unavailable mask was set to the last two cores so that there
is no overlap in the pCPU usage. The mask remained the same throughout
the runtime of the benchmarks - no dynamism in modifying the masks
within the VM.

> 
>> Following is the diff on top of tip/master:
>>
>> (Very raw PoC; Only fair tasks are considered for now to push away)
>>
> 
> I skimmed through it. It is very close to the current approach.
> 
> Advantage:
> Happens immediately instead of waiting for tick.
> Current approach too can move all the tasks at one tick.
> the concern could be latency being high and races around the list.
> 
> Disadvantages:
> 
> Causes a sched domain rebuild. Which is known to be expensive on large systems.
> But since steal time changes are not very aggressive at this point, this overhead
> maybe ok.
> 
> Keeping the interface in cpuset maybe tricky. there could multiple cpusets, and different versions
> complications too. Specially you can have cpusets in nested fashion. And all of this is
> not user driven. i think cpuset is inherently user driven.

For that reason I only kept this mask for the root cgroup. Putting any
CPU in it is as good as removing it from all partitions.

> 
> Impementation looks more complicated to me atleast at this point.
> 
> Current poc needs to enhanced to make arch specific triggers. That is doable.
> 
>> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
>> index 2ddb256187b5..7c1cfdd7ffea 100644
>> --- a/include/linux/cpuset.h
>> +++ b/include/linux/cpuset.h
>> @@ -174,6 +174,10 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>>   }
>>     extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
>> +
>> +void sched_fair_notify_unavaialable_cpus(struct cpumask *unavailable_mask);
>> +const struct cpumask *cpuset_unavailable_mask(void);
>> +bool cpuset_cpu_unavailable(int cpu);
>>   #else /* !CONFIG_CPUSETS */
>>     static inline bool cpusets_enabled(void) { return false; }
>> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
>> index 337608f408ce..170aba16141e 100644
>> --- a/kernel/cgroup/cpuset-internal.h
>> +++ b/kernel/cgroup/cpuset-internal.h
>> @@ -59,6 +59,7 @@ typedef enum {
>>       FILE_EXCLUSIVE_CPULIST,
>>       FILE_EFFECTIVE_XCPULIST,
>>       FILE_ISOLATED_CPULIST,
>> +    FILE_UNAVAILABLE_CPULIST,
>>       FILE_CPU_EXCLUSIVE,
>>       FILE_MEM_EXCLUSIVE,
>>       FILE_MEM_HARDWALL,
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 4aaad07b0bd1..22d38f2299c4 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -87,6 +87,19 @@ static cpumask_var_t    isolated_cpus;
>>   static cpumask_var_t    boot_hk_cpus;
>>   static bool        have_boot_isolcpus;
>>   +/*
>> + * CPUs that may be unavailable to run tasks as a result of physical
>> + * constraints (vCPU being preempted, pCPU handling interrupt storm).
>> + *
>> + * Unlike isolated_cpus, the unavailable_cpus are simply excluded from
>> + * HK_TYPE_DOMAIN but leave the tasks affinity untouched. These CPUs
>> + * should be avoided unless the task has specifically asked to be run
>> + * only on these CPUs.
>> + */
>> +static cpumask_var_t    unavailable_cpus;
>> +static cpumask_var_t    available_tmp_mask;    /* For intermediate operations. */
>> +static bool         cpu_turned_unavailable;
>> +
> 
> This unavailable name is not probably right. When system boots, there is available_cpu
> and that is fixed and not expected to change. It can confuse users.

Ack! Just some name that I thought was appropriate. Too much
thought wasn't put into it ;)

> 
>>   /* List of remote partition root children */
>>   static struct list_head remote_children;
>>   @@ -844,6 +857,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
>>           }
>>           cpumask_and(doms[0], top_cpuset.effective_cpus,
>>                   housekeeping_cpumask(HK_TYPE_DOMAIN));
>> +        cpumask_andnot(doms[0], doms[0], unavailable_cpus);
>>             goto done;
>>       }
>> @@ -960,11 +974,13 @@ static int generate_sched_domains(cpumask_var_t **domains,
>>                * The top cpuset may contain some boot time isolated
>>                * CPUs that need to be excluded from the sched domain.
>>                */
>> -            if (csa[i] == &top_cpuset)
>> +            if (csa[i] == &top_cpuset) {
>>                   cpumask_and(doms[i], csa[i]->effective_cpus,
>>                           housekeeping_cpumask(HK_TYPE_DOMAIN));
>> -            else
>> -                cpumask_copy(doms[i], csa[i]->effective_cpus);
>> +                cpumask_andnot(doms[i], doms[i], unavailable_cpus);
>> +             } else {
>> +                cpumask_andnot(doms[i], csa[i]->effective_cpus, unavailable_cpus);
>> +             }
>>               if (dattr)
>>                   dattr[i] = SD_ATTR_INIT;
>>           }
>> @@ -985,6 +1001,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
>>                   }
>>                   cpumask_or(dp, dp, csa[j]->effective_cpus);
>>                   cpumask_and(dp, dp, housekeeping_cpumask(HK_TYPE_DOMAIN));
>> +                cpumask_andnot(dp, dp, unavailable_cpus);
>>                   if (dattr)
>>                       update_domain_attr_tree(dattr + nslot, csa[j]);
>>               }
>> @@ -1418,6 +1435,17 @@ bool cpuset_cpu_is_isolated(int cpu)
>>   }
>>   EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
>>   +/* Get the set of CPUs marked unavailable. */
>> +const struct cpumask *cpuset_unavailable_mask(void)
>> +{
>> +    return unavailable_cpus;
>> +}
>> +
>> +bool cpuset_cpu_unavailable(int cpu)
>> +{
>> +    return  cpumask_test_cpu(cpu, unavailable_cpus);
>> +}
>> +
>>   /**
>>    * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
>>    * @parent: Parent cpuset containing all siblings
>> @@ -2612,6 +2640,53 @@ static int update_exclusive_cpumask(struct cpuset *cs, struct cpuset *trialcs,
>>       return 0;
>>   }
>>   +/**
>> + * update_exclusive_cpumask - update the exclusive_cpus mask of a cpuset
>> + * @cs: the cpuset to consider
>> + * @trialcs: trial cpuset
>> + * @buf: buffer of cpu numbers written to this cpuset
>> + *
>> + * The tasks' cpumask will be updated if cs is a valid partition root.
>> + */
>> +static int update_unavailable_cpumask(const char *buf)
>> +{
>> +    cpumask_var_t tmp;
>> +    int retval;
>> +
>> +    if (!alloc_cpumask_var(&tmp, GFP_KERNEL))
>> +        return -ENOMEM;
>> +
>> +    retval = cpulist_parse(buf, tmp);
>> +    if (retval < 0)
>> +        goto out;
>> +
>> +    /* Nothing to do if the CPUs didn't change */
>> +    if (cpumask_equal(tmp, unavailable_cpus))
>> +        goto out;
>> +
>> +    /* Save the CPUs that went unavailable to push task out. */
>> +    if (cpumask_andnot(available_tmp_mask, tmp, unavailable_cpus))
>> +        cpu_turned_unavailable = true;
>> +
>> +    cpumask_copy(unavailable_cpus, tmp);
>> +    cpuset_force_rebuild();
> 
> I think this rebuilding sched domains could add quite overhead.

I agree! But I somewhat dislike putting a cpumask_and() in a
bunch of places where we deal with sched_domain when we can
simply adjust the sched_domain to account for it - it is
definitely not performant but IMO, it is somewhat cleaner.

But if CPUs are transitioning in and out of the paravirt mask
at such a high rate, wouldn't you just end up pushing the
tasks away only to soon pull them back?

What changes so suddenly in the hypervisor that a paravirt
CPU is now fully available after a sec or two?

On a sidenote, we do have vcpu_is_preempted() - isn't that
sufficient to steer tasks away if we start being a bit more
aggressive about it? Do we need a mask?
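
Purely as an illustration of what I mean by "more aggressive" (assuming the
generic vcpu_is_preempted() hint; not suggesting this exact check):

/*
 * Hypothetical illustration only: skip a candidate CPU during wakeup
 * selection if its vCPU is currently preempted by the host, falling
 * back to the usual choice when every candidate is preempted.
 */
static inline bool wake_cpu_usable(int cpu)
{
	return cpu_active(cpu) && !vcpu_is_preempted(cpu);
}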

> 
>> +out:
>> +    free_cpumask_var(tmp);
>> +    return retval;
>> +}
>> +
>> +static void cpuset_notify_unavailable_cpus(void)
>> +{
>> +    /*
>> +     * Prevent being preempted by the stopper if the local CPU
>> +     * turned unavailable.
>> +     */
>> +    guard(preempt)();
>> +
>> +    sched_fair_notify_unavaialable_cpus(available_tmp_mask);
>> +    cpu_turned_unavailable = false;
>> +}
>> +
>>   /*
>>    * Migrate memory region from one set of nodes to another.  This is
>>    * performed asynchronously as it can be called from process migration path
>> @@ -3302,11 +3377,16 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>>                       char *buf, size_t nbytes, loff_t off)
>>   {
>>       struct cpuset *cs = css_cs(of_css(of));
>> +    int file_type = of_cft(of)->private;
>>       struct cpuset *trialcs;
>>       int retval = -ENODEV;
>>   -    /* root is read-only */
>> -    if (cs == &top_cpuset)
>> +    /* root is read-only; except for unavailable mask */
>> +    if (file_type != FILE_UNAVAILABLE_CPULIST && cs == &top_cpuset)
>> +        return -EACCES;
>> +
>> +    /* unavailable mask can be only set on root. */
>> +    if (file_type == FILE_UNAVAILABLE_CPULIST && cs != &top_cpuset)
>>           return -EACCES;
>>         buf = strstrip(buf);
>> @@ -3330,6 +3410,9 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>>       case FILE_MEMLIST:
>>           retval = update_nodemask(cs, trialcs, buf);
>>           break;
>> +    case FILE_UNAVAILABLE_CPULIST:
>> +        retval = update_unavailable_cpumask(buf);
>> +        break;
>>       default:
>>           retval = -EINVAL;
>>           break;
>> @@ -3338,6 +3421,8 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>>       free_cpuset(trialcs);
>>       if (force_sd_rebuild)
>>           rebuild_sched_domains_locked();
>> +    if (cpu_turned_unavailable)
>> +        cpuset_notify_unavailable_cpus();
>>   out_unlock:
>>       cpuset_full_unlock();
>>       if (of_cft(of)->private == FILE_MEMLIST)
>> @@ -3386,6 +3471,9 @@ int cpuset_common_seq_show(struct seq_file *sf, void *v)
>>       case FILE_ISOLATED_CPULIST:
>>           seq_printf(sf, "%*pbl\n", cpumask_pr_args(isolated_cpus));
>>           break;
>> +    case FILE_UNAVAILABLE_CPULIST:
>> +        seq_printf(sf, "%*pbl\n", cpumask_pr_args(unavailable_cpus));
>> +        break;
>>       default:
>>           ret = -EINVAL;
>>       }
>> @@ -3524,6 +3612,15 @@ static struct cftype dfl_files[] = {
>>           .flags = CFTYPE_ONLY_ON_ROOT,
>>       },
>>   +    {
>> +        .name = "cpus.unavailable",
>> +        .seq_show = cpuset_common_seq_show,
>> +        .write = cpuset_write_resmask,
>> +        .max_write_len = (100U + 6 * NR_CPUS),
>> +        .private = FILE_UNAVAILABLE_CPULIST,
>> +        .flags = CFTYPE_ONLY_ON_ROOT,
>> +    },
>> +
>>       { }    /* terminate */
>>   };
>>   @@ -3814,6 +3911,8 @@ int __init cpuset_init(void)
>>       BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
>>       BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
>>       BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
>> +    BUG_ON(!zalloc_cpumask_var(&unavailable_cpus, GFP_KERNEL));
>> +    BUG_ON(!zalloc_cpumask_var(&available_tmp_mask, GFP_KERNEL));
>>         cpumask_setall(top_cpuset.cpus_allowed);
>>       nodes_setall(top_cpuset.mems_allowed);
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index ee7dfbf01792..13d0d9587aca 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2396,7 +2396,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>>         /* Non kernel threads are not allowed during either online or offline. */
>>       if (!(p->flags & PF_KTHREAD))
>> -        return cpu_active(cpu);
>> +        return (cpu_active(cpu) && !cpuset_cpu_unavailable(cpu));
>>         /* KTHREAD_IS_PER_CPU is always allowed. */
>>       if (kthread_is_per_cpu(p))
>> @@ -3451,6 +3451,26 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>>               goto out;
>>           }
>>   +        /*
>> +         * Only user threads can be forced out of
>> +         * unavaialable CPUs.
>> +         */
>> +        if (p->flags & PF_KTHREAD)
>> +            goto rude;
>> +
>> +        /* Any unavailable CPUs that can run the task? */
>> +        for_each_cpu(dest_cpu, cpuset_unavailable_mask()) {
>> +            if (!task_allowed_on_cpu(p, dest_cpu))
>> +                continue;
>> +
>> +            /* Can we hoist this up to goto rude? */
>> +            if (is_migration_disabled(p))
>> +                continue;
>> +
>> +            if (cpu_active(dest_cpu))
>> +                goto out;
>> +        }
>> +rude:
>>           /* No more Mr. Nice Guy. */
>>           switch (state) {
>>           case cpuset:
>> @@ -3766,7 +3786,7 @@ bool call_function_single_prep_ipi(int cpu)
>>    * via sched_ttwu_wakeup() for activation so the wakee incurs the cost
>>    * of the wakeup instead of the waker.
>>    */
>> -static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
>> +void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
>>   {
>>       struct rq *rq = cpu_rq(cpu);
>>   @@ -5365,7 +5385,9 @@ void sched_exec(void)
>>       int dest_cpu;
>>         scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
>> -        dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), WF_EXEC);
>> +        int wake_flags = WF_EXEC;
>> +
>> +        dest_cpu = select_task_rq(p, task_cpu(p), &wake_flags);
> 
> Whats this logic?

The WF_EXEC path would not care about the unavailable CPUs and won't run
the select_fallback_rq() path if sched_class->select_task_rq() is called
directly.

> 
>>           if (dest_cpu == smp_processor_id())
>>               return;
>>   diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index da46c3164537..e502cccdae64 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -12094,6 +12094,61 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>>       return ld_moved;
>>   }
>>   +static int unavailable_balance_cpu_stop(void *data)
>> +{
>> +    struct task_struct *p, *tmp;
>> +    struct rq *rq = data;
>> +    int this_cpu = cpu_of(rq);
>> +
>> +    guard(rq_lock_irq)(rq);
>> +
>> +    list_for_each_entry_safe(p, tmp, &rq->cfs_tasks, se.group_node) {
>> +        int target_cpu;
>> +
>> +        /*
>> +         * Bail out if a concurrent change to unavailable_mask turned
>> +         * this CPU available.
>> +         */
>> +        rq->unavailable_balance = cpumask_test_cpu(this_cpu, cpuset_unavailable_mask());
>> +        if (!rq->unavailable_balance)
>> +            break;
>> +
>> +        /* XXX: Does not deal with migration disabled tasks. */
>> +        target_cpu = cpumask_first_andnot(p->cpus_ptr, cpuset_unavailable_mask());
> 
> This can cause it to go first CPU always and then load balancer to move it later on.
> First should check the nodemask the current cpu is on to avoid NUMA costs.

Ack! I agree there is plenty of room for optimizations.

> 
>> +        if ((unsigned int)target_cpu < nr_cpumask_bits) {
>> +            deactivate_task(rq, p, 0);
>> +            set_task_cpu(p, target_cpu);
>> +
>> +            /*
>> +             * Switch to move_queued_task() later.
>> +             * For PoC send an IPI and be done with it.
>> +             */
>> +            __ttwu_queue_wakelist(p, target_cpu, 0);
>> +        }
>> +    }
>> +
>> +    rq->unavailable_balance = 0;
>> +
>> +    return 0;
>> +}
>> +
>> +void sched_fair_notify_unavaialable_cpus(struct cpumask *unavailable_mask)
>> +{
>> +    int cpu, this_cpu = smp_processor_id();
>> +
>> +    for_each_cpu_wrap(cpu, unavailable_mask, this_cpu + 1) {
>> +        struct rq *rq = cpu_rq(cpu);
>> +
>> +        /* Balance in progress. Tasks will be pushed out. */
>> +        if (rq->unavailable_balance)
>> +            return;
>> +
> 
> Need to run stopper, if there is active current task. otherise that work
> can be done here itself.

Ack! My thinking was to not take a rq_lock early and let stopper
run and then push all queued fair tasks out with rq_lock held.

> 
>> +        stop_one_cpu_nowait(cpu, unavailable_balance_cpu_stop,
>> +                    rq, &rq->unavailable_balance_work);
>> +        rq->unavailable_balance = 1;
>> +    }
>> +}
>> +

-- 
Thanks and Regards,
Prateek

Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
Posted by Ilya Leoshkevich 2 months ago
On Wed, 2025-11-19 at 18:14 +0530, Shrikanth Hegde wrote:
> Detailed problem statement and some of the implementation choices
> were 
> discussed earlier[1].
> 
> [1]:
> https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
> 
> This is likely the version which would be used for LPC2025 discussion
> on
> this topic. Feel free to provide your suggestion and hoping for a
> solution
> that works for different architectures and it's use cases.
> 
> All the existing alternatives such as cpu hotplug, creating isolated
> partitions etc break the user affinity. Since number of CPUs to use
> change
> depending on the steal time, it is not driven by User. Hence it would
> be
> wrong to break the affinity. This series allows if the task is pinned
> only paravirt CPUs, it will continue running there.
> 
> Changes compared v3[1]:
> 
> - Introduced computation of steal time in powerpc code.
> - Derive number of CPUs to use and mark the remaining as paravirt
> based
>   on steal values. 
> - Provide debugfs knobs to alter how steal time values being used.
> - Removed static key check for paravirt CPUs (Yury)
> - Removed preempt_disable/enable while calling stopper (Prateek)
> - Made select_idle_sibling and friends aware of paravirt CPUs.
> - Removed 3 unused schedstat fields and introduced 2 related to
> paravirt
>   handling.
> - Handled nohz_full case by enabling tick on it when there is CFS/RT
> on
>   it.
> - Updated helper patch to override arch behaviour for easier
> debugging
>   during development.
> - Kept 
> 
> Changes compared to v4[2]:
> - Last two patches were sent out separate instead of being with
> series.
>   That created confusion. Those two patches are debug patches one can
>   make use to check functionality across acrhitectures. Sorry about
>   that.
> - Use DEVICE_ATTR_RW instead (greg)
> - Made it as PATCH since arch specific handling completes the
>   functionality.
> 
> [2]:
> https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/
> 
> TODO: 
> 
> - Get performance numbers on PowerPC, x86 and S390. Hopefully by next
>   week. Didn't want to hold the series till then.
> 
> - The CPUs to mark as paravirt is very simple and doesn't work when
>   vCPUs aren't spread out uniformly across NUMA nodes. Ideal would be
> splice
>   the numbers based on how many CPUs each NUMA node has. It is quite
>   tricky to do specially since cpumask can be on stack too. Given
>   NR_CPUS can be 8192 and nr_possible_nodes 32. Haven't got my head
> into
>   solving it yet. Maybe there is easier way.
> 
> - DLPAR Add/Remove needs to call init of EC/VP cores (powerpc
> specific)
> 
> - Userspace tools awareness such as irqbalance. 
> 
> - Delve into design of hint from Hyeprvisor(HW Hint). i.e Host
> informs
>   guest which/how many CPUs it has to use at this moment. This
> interface
>   should work across archs with each arch doing its specific
> handling.
> 
> - Determine the default values for steal time related knobs
>   empirically and document them.
> 
> - Need to check safety against CPU hotplug specially in
> process_steal.
> 
> 
> Applies cleanly on tip/master:
> commit c2ef745151b21d4dcc4b29a1eabf1096f5ba544b
> 
> 
> Thanks to srikar for providing the initial code around powerpc steal
> time handling code. Thanks to all who went through and provided
> reviews.
> 
> PS: I haven't found a better name. Please suggest if you have any.
> 
> Shrikanth Hegde (17):
>   sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
>   cpumask: Introduce cpu_paravirt_mask
>   sched/core: Dont allow to use CPU marked as paravirt
>   sched/debug: Remove unused schedstats
>   sched/fair: Add paravirt movements for proc sched file
>   sched/fair: Pass current cpu in select_idle_sibling
>   sched/fair: Don't consider paravirt CPUs for wakeup and load
> balance
>   sched/rt: Don't select paravirt CPU for wakeup and push/pull rt
> task
>   sched/core: Add support for nohz_full CPUs
>   sched/core: Push current task from paravirt CPU
>   sysfs: Add paravirt CPU file
>   powerpc: method to initialize ec and vp cores
>   powerpc: enable/disable paravirt CPUs based on steal time
>   powerpc: process steal values at fixed intervals
>   powerpc: add debugfs file for controlling handling on steal values
>   sysfs: Provide write method for paravirt
>   sysfs: disable arch handling if paravirt file being written
> 
>  .../ABI/testing/sysfs-devices-system-cpu      |   9 +
>  Documentation/scheduler/sched-arch.rst        |  37 +++
>  arch/powerpc/include/asm/smp.h                |   1 +
>  arch/powerpc/kernel/smp.c                     |   1 +
>  arch/powerpc/platforms/pseries/lpar.c         | 223
> ++++++++++++++++++
>  arch/powerpc/platforms/pseries/pseries.h      |   1 +
>  drivers/base/cpu.c                            |  59 +++++
>  include/linux/cpumask.h                       |  20 ++
>  include/linux/sched.h                         |   9 +-
>  kernel/sched/core.c                           | 106 ++++++++-
>  kernel/sched/debug.c                          |   5 +-
>  kernel/sched/fair.c                           |  42 +++-
>  kernel/sched/rt.c                             |  11 +-
>  kernel/sched/sched.h                          |   9 +
>  14 files changed, 519 insertions(+), 14 deletions(-)

The capability to temporarily exclude CPUs from scheduling might be
beneficial for s390x, where users often run Linux using a proprietary
hypervisor called PR/SM and with high overcommit. In these
circumstances virtual CPUs may not be scheduled by a hypervisor for a
very long time.

Today we have an upstream feature called "Hiperdispatch", which
determines that this is about to happen and uses Capacity Aware
Scheduling to prevent processes from being placed on the affected CPUs.
However, at least when used for this purpose, Capacity Aware Scheduling
is best effort and fails to move tasks away from the affected CPUs
under high load.

Therefore I have decided to smoke test this series.

For the purposes of smoke testing, I set up a number of KVM virtual
machines and start the same benchmark inside each one. Then I collect
and compare the aggregate throughput numbers. I have not done testing
with PR/SM yet, but I plan to do this and report back. I also have not
tested this with VMs that are not 100% utilized yet.

Benchmark parameters:

$ sysbench cpu run --threads=$(nproc) --time=10
$ schbench -r 10 --json --no-locking 
$ hackbench --groups 10 --process --loops 5000
$ pgbench -h $WORKDIR --client=$(nproc) --time=10

Figures:

s390x (16 host CPUs):

Benchmark      #VMs    #CPUs/VM  ΔRPS (%)
-----------  ------  ----------  ----------
hackbench        16           4  60.58%
pgbench          16           4  50.01%
hackbench         8           8  46.18%
hackbench         4           8  43.54%
hackbench         2          16  43.23%
hackbench        12           4  42.92%
hackbench         8           4  35.53%
hackbench         4          16  30.98%
pgbench          12           4  18.41%
hackbench         2          24  7.32%
pgbench           8           4  6.84%
pgbench           2          24  3.38%
pgbench           2          16  3.02%
pgbench           4          16  2.08%
hackbench         2          32  1.46%
pgbench           4           8  1.30%
schbench          2          16  0.72%
schbench          4           8  -0.09%
schbench          4           4  -0.20%
schbench          8           8  -0.41%
sysbench          8           4  -0.46%
sysbench          4           8  -0.53%
schbench          8           4  -0.65%
sysbench          2          16  -0.76%
schbench          2           8  -0.77%
sysbench          8           8  -1.72%
schbench          2          24  -1.98%
schbench         12           4  -2.03%
sysbench         12           4  -2.13%
pgbench           2          32  -3.15%
sysbench         16           4  -3.17%
schbench         16           4  -3.50%
sysbench          2           8  -4.01%
pgbench           8           8  -4.10%
schbench          4          16  -5.93%
sysbench          4           4  -5.94%
pgbench           2           4  -6.40%
hackbench         2           8  -10.04%
hackbench         4           4  -10.91%
pgbench           4           4  -11.05%
sysbench          2          24  -13.07%
sysbench          4          16  -13.59%
hackbench         2           4  -13.96%
pgbench           2           8  -16.16%
schbench          2           4  -24.14%
schbench          2          32  -24.25%
sysbench          2           4  -24.98%
sysbench          2          32  -32.84%

x86_64 (32 host CPUs):

Benchmark      #VMs    #CPUs/VM  ΔRPS (%)
-----------  ------  ----------  ----------
hackbench         4          32  87.02%
hackbench         8          16  48.45%
hackbench         4          24  47.95%
hackbench         2           8  42.74%
hackbench         2          32  34.90%
pgbench          16           8  27.87%
pgbench          12           8  25.17%
hackbench         8           8  24.92%
hackbench        16           8  22.41%
hackbench        16           4  20.83%
pgbench           8          16  20.40%
hackbench        12           8  20.37%
hackbench         4          16  20.36%
pgbench          16           4  16.60%
pgbench           8           8  14.92%
hackbench        12           4  14.49%
pgbench           4          32  9.49%
pgbench           2          32  7.26%
hackbench         2          24  6.54%
pgbench           4           4  4.67%
pgbench           8           4  3.24%
pgbench          12           4  2.66%
hackbench         4           8  2.53%
pgbench           4           8  1.96%
hackbench         2          16  1.93%
schbench          4          32  1.24%
pgbench           2           8  0.82%
schbench          4           4  0.69%
schbench          2          32  0.44%
schbench          2          16  0.25%
schbench         12           8  -0.02%
sysbench          2           4  -0.02%
schbench          4          24  -0.12%
sysbench          2          16  -0.17%
schbench         12           4  -0.18%
schbench          2           4  -0.19%
sysbench          4           8  -0.23%
schbench          8           4  -0.24%
sysbench          2           8  -0.24%
schbench          4           8  -0.28%
sysbench          8           4  -0.30%
schbench          4          16  -0.37%
schbench          2          24  -0.39%
schbench          8          16  -0.49%
schbench          2           8  -0.67%
pgbench           4          16  -0.68%
schbench          8           8  -0.83%
sysbench          4           4  -0.92%
schbench         16           4  -0.94%
sysbench         12           4  -0.98%
sysbench          8          16  -1.52%
sysbench         16           4  -1.57%
pgbench           2           4  -1.62%
sysbench         12           8  -1.69%
schbench         16           8  -1.97%
sysbench          8           8  -2.08%
hackbench         8           4  -2.11%
pgbench           4          24  -3.20%
pgbench           2          24  -3.35%
sysbench          2          24  -3.81%
pgbench           2          16  -4.55%
sysbench          4          16  -5.10%
sysbench         16           8  -6.56%
sysbench          2          32  -8.24%
sysbench          4          32  -13.54%
sysbench          4          24  -13.62%
hackbench         2           4  -15.40%
hackbench         4           4  -17.71%

There are some huge wins, especially for hackbench, which corresponds
to Shrikanth's findings. There are some significant degradations too,
which I plan to debug. This may simply have to do with the simplistic
heuristic I am using for testing [1].

sysbench, for example, is not supposed to benefit from this series,
because it is not affected by overcommit. However, it definitely should
not degrade by 30%. Interestingly enough, this happens only with
certain combinations of VM and CPU counts, and this is reproducible.

Initially I have seen degradations as bad as -80% with schbench. It
turned out this was caused by userspace per-CPU locking it implements;
turning it off caused the degradation to go away. To me this looks like
something synthetic and not something used by real-world application,
but please correct me if I am wrong - then this will have to be
resolved.


One note regarding the PARAVIRT Kconfig gating: s390x does not
select PARAVIRT today. For example, we determine steal time based on
CPU timers and clocks, not hypervisor hints. For now I had to add
dummy paravirt headers to test this series. But I would appreciate it if
the Kconfig gating was removed.

Others have already commented on the naming, and I would agree that
"paravirt" is really misleading. I cannot say that the previous "cpu-
avoid" one was perfect, but it was much better.


[1] https://github.com/iii-i/linux/commits/iii/poc/cpu-avoid/v3/
Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
Posted by Shrikanth Hegde 2 months ago

On 12/4/25 6:58 PM, Ilya Leoshkevich wrote:
> On Wed, 2025-11-19 at 18:14 +0530, Shrikanth Hegde wrote:
>> Detailed problem statement and some of the implementation choices
>> were
>> discussed earlier[1].
>>
>> [1]:
>> https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>>
>> This is likely the version which would be used for LPC2025 discussion
>> on
>> this topic. Feel free to provide your suggestion and hoping for a
>> solution
>> that works for different architectures and it's use cases.
>>
>> All the existing alternatives such as cpu hotplug, creating isolated
>> partitions etc break the user affinity. Since number of CPUs to use
>> change
>> depending on the steal time, it is not driven by User. Hence it would
>> be
>> wrong to break the affinity. This series allows if the task is pinned
>> only paravirt CPUs, it will continue running there.
>>
>> Changes compared v3[1]:
>>
>> - Introduced computation of steal time in powerpc code.
>> - Derive number of CPUs to use and mark the remaining as paravirt
>> based
>>    on steal values.
>> - Provide debugfs knobs to alter how steal time values being used.
>> - Removed static key check for paravirt CPUs (Yury)
>> - Removed preempt_disable/enable while calling stopper (Prateek)
>> - Made select_idle_sibling and friends aware of paravirt CPUs.
>> - Removed 3 unused schedstat fields and introduced 2 related to
>> paravirt
>>    handling.
>> - Handled nohz_full case by enabling tick on it when there is CFS/RT
>> on
>>    it.
>> - Updated helper patch to override arch behaviour for easier
>> debugging
>>    during development.
>> - Kept
>>
>> Changes compared to v4[2]:
>> - Last two patches were sent out separate instead of being with
>> series.
>>    That created confusion. Those two patches are debug patches one can
>>    make use to check functionality across acrhitectures. Sorry about
>>    that.
>> - Use DEVICE_ATTR_RW instead (greg)
>> - Made it as PATCH since arch specific handling completes the
>>    functionality.
>>
>> [2]:
>> https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/
>>
>> TODO:
>>
>> - Get performance numbers on PowerPC, x86 and S390. Hopefully by next
>>    week. Didn't want to hold the series till then.
>>
>> - The CPUs to mark as paravirt is very simple and doesn't work when
>>    vCPUs aren't spread out uniformly across NUMA nodes. Ideal would be
>> splice
>>    the numbers based on how many CPUs each NUMA node has. It is quite
>>    tricky to do specially since cpumask can be on stack too. Given
>>    NR_CPUS can be 8192 and nr_possible_nodes 32. Haven't got my head
>> into
>>    solving it yet. Maybe there is easier way.
>>
>> - DLPAR Add/Remove needs to call init of EC/VP cores (powerpc
>> specific)
>>
>> - Userspace tools awareness such as irqbalance.
>>
>> - Delve into design of hint from Hyeprvisor(HW Hint). i.e Host
>> informs
>>    guest which/how many CPUs it has to use at this moment. This
>> interface
>>    should work across archs with each arch doing its specific
>> handling.
>>
>> - Determine the default values for steal time related knobs
>>    empirically and document them.
>>
>> - Need to check safety against CPU hotplug specially in
>> process_steal.
>>
>>
>> Applies cleanly on tip/master:
>> commit c2ef745151b21d4dcc4b29a1eabf1096f5ba544b
>>
>>
>> Thanks to srikar for providing the initial code around powerpc steal
>> time handling code. Thanks to all who went through and provided
>> reviews.
>>
>> PS: I haven't found a better name. Please suggest if you have any.
>>
>> Shrikanth Hegde (17):
>>    sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
>>    cpumask: Introduce cpu_paravirt_mask
>>    sched/core: Dont allow to use CPU marked as paravirt
>>    sched/debug: Remove unused schedstats
>>    sched/fair: Add paravirt movements for proc sched file
>>    sched/fair: Pass current cpu in select_idle_sibling
>>    sched/fair: Don't consider paravirt CPUs for wakeup and load
>> balance
>>    sched/rt: Don't select paravirt CPU for wakeup and push/pull rt
>> task
>>    sched/core: Add support for nohz_full CPUs
>>    sched/core: Push current task from paravirt CPU
>>    sysfs: Add paravirt CPU file
>>    powerpc: method to initialize ec and vp cores
>>    powerpc: enable/disable paravirt CPUs based on steal time
>>    powerpc: process steal values at fixed intervals
>>    powerpc: add debugfs file for controlling handling on steal values
>>    sysfs: Provide write method for paravirt
>>    sysfs: disable arch handling if paravirt file being written
>>
>>   .../ABI/testing/sysfs-devices-system-cpu      |   9 +
>>   Documentation/scheduler/sched-arch.rst        |  37 +++
>>   arch/powerpc/include/asm/smp.h                |   1 +
>>   arch/powerpc/kernel/smp.c                     |   1 +
>>   arch/powerpc/platforms/pseries/lpar.c         | 223
>> ++++++++++++++++++
>>   arch/powerpc/platforms/pseries/pseries.h      |   1 +
>>   drivers/base/cpu.c                            |  59 +++++
>>   include/linux/cpumask.h                       |  20 ++
>>   include/linux/sched.h                         |   9 +-
>>   kernel/sched/core.c                           | 106 ++++++++-
>>   kernel/sched/debug.c                          |   5 +-
>>   kernel/sched/fair.c                           |  42 +++-
>>   kernel/sched/rt.c                             |  11 +-
>>   kernel/sched/sched.h                          |   9 +
>>   14 files changed, 519 insertions(+), 14 deletions(-)
> 
> The capability to temporarily exclude CPUs from scheduling might be
> beneficial for s390x, where users often run Linux using a proprietary
> hypervisor called PR/SM and with high overcommit. In these
> circumstances virtual CPUs may not be scheduled by a hypervisor for a
> very long time.
> 
> Today we have an upstream feature called "Hiperdispatch", which
> determines that this is about to happen and uses Capacity Aware
> Scheduling to prevent processes from being placed on the affected CPUs.
> However, at least when used for this purpose, Capacity Aware Scheduling
> is best effort and fails to move tasks away from the affected CPUs
> under high load.
> 
> Therefore I have decided to smoke test this series.
> 
> For the purposes of smoke testing, I set up a number of KVM virtual
> machines and start the same benchmark inside each one. Then I collect
> and compare the aggregate throughput numbers. I have not done testing
> with PR/SM yet, but I plan to do this and report back. I also have not
> tested this with VMs that are not 100% utilized yet.
> 

Best results would be when it works as a HW hint from the hypervisor.

> Benchmark parameters:
> 
> $ sysbench cpu run --threads=$(nproc) --time=10
> $ schbench -r 10 --json --no-locking
> $ hackbench --groups 10 --process --loops 5000
> $ pgbench -h $WORKDIR --client=$(nproc) --time=10
> 
> Figures:
> 
> s390x (16 host CPUs):
> 
> Benchmark      #VMs    #CPUs/VM  ΔRPS (%)
> -----------  ------  ----------  ----------
> hackbench        16           4  60.58%
> pgbench          16           4  50.01%
> hackbench         8           8  46.18%
> hackbench         4           8  43.54%
> hackbench         2          16  43.23%
> hackbench        12           4  42.92%
> hackbench         8           4  35.53%
> hackbench         4          16  30.98%
> pgbench          12           4  18.41%
> hackbench         2          24  7.32%
> pgbench           8           4  6.84%
> pgbench           2          24  3.38%
> pgbench           2          16  3.02%
> pgbench           4          16  2.08%
> hackbench         2          32  1.46%
> pgbench           4           8  1.30%
> schbench          2          16  0.72%
> schbench          4           8  -0.09%
> schbench          4           4  -0.20%
> schbench          8           8  -0.41%
> sysbench          8           4  -0.46%
> sysbench          4           8  -0.53%
> schbench          8           4  -0.65%
> sysbench          2          16  -0.76%
> schbench          2           8  -0.77%
> sysbench          8           8  -1.72%
> schbench          2          24  -1.98%
> schbench         12           4  -2.03%
> sysbench         12           4  -2.13%
> pgbench           2          32  -3.15%
> sysbench         16           4  -3.17%
> schbench         16           4  -3.50%
> sysbench          2           8  -4.01%
> pgbench           8           8  -4.10%
> schbench          4          16  -5.93%
> sysbench          4           4  -5.94%
> pgbench           2           4  -6.40%
> hackbench         2           8  -10.04%
> hackbench         4           4  -10.91%
> pgbench           4           4  -11.05%
> sysbench          2          24  -13.07%
> sysbench          4          16  -13.59%
> hackbench         2           4  -13.96%
> pgbench           2           8  -16.16%
> schbench          2           4  -24.14%
> schbench          2          32  -24.25%
> sysbench          2           4  -24.98%
> sysbench          2          32  -32.84%
> 
> x86_64 (32 host CPUs):
> 
> Benchmark      #VMs    #CPUs/VM  ΔRPS (%)
> -----------  ------  ----------  ----------
> hackbench         4          32  87.02%
> hackbench         8          16  48.45%
> hackbench         4          24  47.95%
> hackbench         2           8  42.74%
> hackbench         2          32  34.90%
> pgbench          16           8  27.87%
> pgbench          12           8  25.17%
> hackbench         8           8  24.92%
> hackbench        16           8  22.41%
> hackbench        16           4  20.83%
> pgbench           8          16  20.40%
> hackbench        12           8  20.37%
> hackbench         4          16  20.36%
> pgbench          16           4  16.60%
> pgbench           8           8  14.92%
> hackbench        12           4  14.49%
> pgbench           4          32  9.49%
> pgbench           2          32  7.26%
> hackbench         2          24  6.54%
> pgbench           4           4  4.67%
> pgbench           8           4  3.24%
> pgbench          12           4  2.66%
> hackbench         4           8  2.53%
> pgbench           4           8  1.96%
> hackbench         2          16  1.93%
> schbench          4          32  1.24%
> pgbench           2           8  0.82%
> schbench          4           4  0.69%
> schbench          2          32  0.44%
> schbench          2          16  0.25%
> schbench         12           8  -0.02%
> sysbench          2           4  -0.02%
> schbench          4          24  -0.12%
> sysbench          2          16  -0.17%
> schbench         12           4  -0.18%
> schbench          2           4  -0.19%
> sysbench          4           8  -0.23%
> schbench          8           4  -0.24%
> sysbench          2           8  -0.24%
> schbench          4           8  -0.28%
> sysbench          8           4  -0.30%
> schbench          4          16  -0.37%
> schbench          2          24  -0.39%
> schbench          8          16  -0.49%
> schbench          2           8  -0.67%
> pgbench           4          16  -0.68%
> schbench          8           8  -0.83%
> sysbench          4           4  -0.92%
> schbench         16           4  -0.94%
> sysbench         12           4  -0.98%
> sysbench          8          16  -1.52%
> sysbench         16           4  -1.57%
> pgbench           2           4  -1.62%
> sysbench         12           8  -1.69%
> schbench         16           8  -1.97%
> sysbench          8           8  -2.08%
> hackbench         8           4  -2.11%
> pgbench           4          24  -3.20%
> pgbench           2          24  -3.35%
> sysbench          2          24  -3.81%
> pgbench           2          16  -4.55%
> sysbench          4          16  -5.10%
> sysbench         16           8  -6.56%
> sysbench          2          32  -8.24%
> sysbench          4          32  -13.54%
> sysbench          4          24  -13.62%
> hackbench         2           4  -15.40%
> hackbench         4           4  -17.71%
> 
> There are some huge wins, especially for hackbench, which corresponds
> to Shrikanth's findings. There are some significant degradations too,
> which I plan to debug. This may simply have to do with the simplistic
> heuristic I am using for testing [1].
> 

Thank you very much for running these numbers!!

> sysbench, for example, is not supposed to benefit from this series,
> because it is not affected by overcommit. However, it definitely should
> not degrade by 30%. Interestingly enough, this happens only with
> certain combinations of VM and CPU counts, and this is reproducible.
> 

Is the host bare metal? In those cases the cpufreq governor ramping up or down
might play a role. (speculating)

> Initially I have seen degradations as bad as -80% with schbench. It
> turned out this was caused by userspace per-CPU locking it implements;
> turning it off caused the degradation to go away. To me this looks like
> something synthetic and not something used by real-world application,
> but please correct me if I am wrong - then this will have to be
> resolved.
> 

That's nice to hear. I was concerned about the schbench rps. Now I am a bit relieved.


Is this with the schbench -L option?
I ran with it, and the regression I was seeing earlier is gone now.

> 
> One note regarding the PARAVIRT Kconfig gating: s390x does not
> select PARAVIRT	today. For example, steal time we determine based on
> CPU timers and clocks, and not hypervisor hints. For now I had to add
> dummy paravirt headers to test this series. But I would appreciate if
> Kconfig gating was removed.
> 

Keeping the PARAVIRT checks is probably the right thing. I will wait to see if
anyone objects.

> Others have already commented on the naming, and I would agree that
> "paravirt" is really misleading. I cannot say that the previous "cpu-
> avoid" one was perfect, but it was much better.
> 
> 
> [1] https://github.com/iii-i/linux/commits/iii/poc/cpu-avoid/v3/

Will look into it. One thing to be careful about is the CPU numbers.
Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
Posted by Yury Norov 1 month, 3 weeks ago
On Fri, Dec 05, 2025 at 11:00:18AM +0530, Shrikanth Hegde wrote:
> 
> 
> On 12/4/25 6:58 PM, Ilya Leoshkevich wrote:
> > On Wed, 2025-11-19 at 18:14 +0530, Shrikanth Hegde wrote:

...

> > Others have already commented on the naming, and I would agree that
> > "paravirt" is really misleading. I cannot say that the previous "cpu-
> > avoid" one was perfect, but it was much better.
 
It was my suggestion to switch names. cpu-avoid is definitely a
no-go, because it doesn't explain anything and only confuses.

I suggested 'paravirt' (notice - only suggested) because the patch
series is mainly discussing paravirtualized VMs. But now I'm not even
sure that the idea of the series is:

1. Applicable only to paravirtualized VMs; and 
2. Preemption and rescheduling throttling requires another in-kernel
   concept other than nohz, isolcpus, cgroups and similar.

Shrikanth, can you please clarify the scope of the new feature? Would
it be useful for non-paravirtualized VMs, for example? Any other
task-cpu bonding problems?

On previous rounds you tried to implement the same with cgroups, as
far as I understood. Can you discuss that? What exactly can't be done
with the existing kernel APIs?

Thanks,
Yury

> > [1] https://github.com/iii-i/linux/commits/iii/poc/cpu-avoid/v3/
> 
> Will look into it. One thing to be careful about is the CPU numbers.
Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
Posted by Shrikanth Hegde 1 month, 3 weeks ago
Hi, sorry for the delay in response. Just landed yesterday from LPC.

>>> Others have already commented on the naming, and I would agree that
>>> "paravirt" is really misleading. I cannot say that the previous "cpu-
>>> avoid" one was perfect, but it was much better.
>   
> It was my suggestion to switch names. cpu-avoid is definitely a
> no-go. Because it doesn't explain anything and only confuses.
> 
> I suggested 'paravirt' (notice - only suggested) because the patch
> series is mainly discussing paravirtualized VMs. But now I'm not even
> sure that the idea of the series is:
> 
> 1. Applicable only to paravirtualized VMs; and
> 2. Preemption and rescheduling throttling requires another in-kernel
>     concept other than nohz, isolcpus, cgroups and similar.
> 
> Shrikanth, can you please clarify the scope of the new feature? Would
> it be useful for non-paravirtualized VMs, for example? Any other
> task-cpu bonding problems?

The current scope of the feature is virtualized environments, where the idea is
to do co-operative folding in each VM based on a hint (either a HW hint or steal time).

Seen from a macro level, this is a framework which allows one to avoid some vCPUs (in the
guest) to achieve better throughput or latency. So one could come up with more use cases
even in non-paravirtualized VMs. For example, one crazy idea is to avoid using SMT siblings
when system utilization is low to achieve a higher IPC (instructions per cycle) value.

> 
> On previous rounds you tried to implement the same with cgroups, as
> far as I understood. Can you discuss that? What exactly can't be done
> with the existing kernel APIs?
> 
> Thanks,
> Yury
> 

We discussed this in Sched-MC this year.
https://youtu.be/zf-MBoUIz1Q?t=8581


Currently explored options:

1. CPU hotplug - slow. Some efforts are underway to speed it up.
2. Creating isolated cpusets - faster, but still involves sched domain rebuilds.

The reason why they both won't work is that they break user affinities in the guest.
I.e. the guest can do "taskset -c <some_vcpus> <workload>"; when the
last vCPU in that list goes offline (guest vCPU hotplug), the affinity
mask is reset, the workload can then run on any online vCPU, and the mask
is not restored to its earlier value. That is okay for hotplug or isolated
cpusets, since those are driven by the user in the guest, so the user is
aware of it.

Whereas here, the change is driven by the system rather than the user in
the guest, so it cannot break user-space affinities. So we need a new
interface to drive this. I think it is better if it is a non-cgroup based
framework, since cgroup is usually user driven. (Correct me if I am wrong.)

PS:
There was some confusion around this affinity breaking. Note that it is the guest vCPU being marked and
the guest vCPU being hotplugged; the task-affined workload was running in the guest. Host CPUs (pCPUs) are not
hotplugged.
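
As a rough illustration of that policy (a sketch only; can_push_task()
is an illustrative name, while cpu_paravirt_mask is the mask this series
introduces):

static bool can_push_task(struct task_struct *p)
{
	/* Never widen p->cpus_ptr: if the task is pinned only to
	 * paravirt CPUs, leave it where the user pinned it. */
	return !cpumask_subset(p->cpus_ptr, cpu_paravirt_mask);
}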

---

I had a hallway discussion with Vincent; the idea is to use the push framework bits, set the
CPU capacity to 1 (the lowest value, treated as a special value), and use a static key check to do
this only when the HW says to do so.
Something like (keeping the name paravirt):

static inline bool cpu_paravirt(int cpu)
{
	if (static_branch_unlikely(&cpu_paravirt_framework))
		return arch_scale_cpu_capacity(cpu) == 1;

	return false;
}
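
A selection loop could then skip such CPUs while still falling back to
the affined mask, so user affinity is never broken. This is a sketch
only; first_usable_cpu() is an illustrative name:

static int first_usable_cpu(const struct cpumask *mask)
{
	int cpu;

	for_each_cpu(cpu, mask) {
		if (cpu_paravirt(cpu))
			continue;	/* capacity == 1, skip while folded */
		return cpu;
	}
	return cpumask_first(mask);	/* all folded: keep user affinity */
}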

The rest of the bits remain the same. I found an issue with the current series where setting affinity
goes wrong after a CPU is marked paravirt; I will fix it in the next version. Will do some more
testing and send the next version in 2026.

Happy Holidays!
Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
Posted by Shrikanth Hegde 2 months, 1 week ago

On 11/19/25 6:14 PM, Shrikanth Hegde wrote:
> Detailed problem statement and some of the implementation choices were
> discussed earlier[1].


Performance data on x86 and PowerPC:

++++++++++++++++++++++++++++++++++++++++++++++++
PowerPC: LPAR(VM) Running on powerVM hypervisor
++++++++++++++++++++++++++++++++++++++++++++++++

Host: 126 cores available in pool.
VM1: 96VP/64EC - 768 CPUs
VM2: 72VP/48EC - 576 CPUs
(VP- Virtual Processor core), (EC - Entitled Cores)
steal_check_frequency:1
steal_ratio_high:400
steal_ratio_low:150

Scenarios:
Scenario 1: (Major improvement)
VM1 is running daytrader[1] and VM2 is running stress-ng --cpu=$(nproc)
Note: High gains. With upstream the steal time was around 15%; with the series it comes down
to 3%. With further tuning it could be reduced further.

				upstream		+series
daytrader throughput		1x			  1.7x     <<- 70% gain

-----------
Scenario 2: (improves when thread_count < num_cpus)
VM1 is running schbench and VM2 is running stress-ng --cpu=$(nproc)
Note: Values are the average of 5 runs and are wakeup latencies

schbench -t 400			upstream		+series
50.0th:				  18.00			  16.60
90.0th:				 174.00			  46.80
99.0th:				3197.60                  928.80
99.9th:				6203.20                 4539.20
average rps:                   39665.61		       42334.65
  
schbench -t 600			upstream		+series
50.0th:				  23.80 		  19.80
90.0th:				 917.20                  439.00
99.0th:				5582.40                 3869.60
99.9th:				8982.40      		6574.40
average rps:		       39541.00		       40018.11

-----------
Scenario 3: (Improves)
VM1 is running hackbench and VM2 is running stress-ng --cpu=$(nproc)
Note: Values are the average of 10 runs with 20000 loops (seconds).

				upstream	   +series
Process 10 groups          	  2.84               2.62
Process 20 groups          	  5.39               4.48
Process 30 groups          	  7.51               6.29
Process 40 groups          	  9.88               7.42
Process 50 groups    	  	 12.46               9.54
Process 60 groups          	 14.76              12.09
thread  10 groups          	  2.93               2.70
thread  20 groups          	  5.79               4.78
Process(Pipe) 10 groups    	  2.31               2.18
Process(Pipe) 20 groups  	  3.32               3.26
Process(Pipe) 30 groups  	  4.19               4.14
Process(Pipe) 40 groups  	  5.18               5.53
Process(Pipe) 50 groups 	  6.57               6.80
Process(Pipe) 60 groups  	  8.21               8.13
thread(Pipe)  10 groups 	  2.42               2.24
thread(Pipe)  20 groups 	  3.62               3.42

-----------
Notes:

Numbers might be very favorable since VM2 is constantly running and has some CPUs
marked as paravirt whenever there is steal time; the thresholds also might have played a role.
I plan to run the same workloads, i.e. hackbench and schbench, on both VMs and see the behavior.

VM1 has its CPUs distributed equally across nodes, while VM2 does not. Since CPUs are marked paravirt
based on core count, some nodes on VM2 would have been left unused, and that could have added a boost to
VM1's performance, especially for daytrader.

[1]: Daytrader is a real-life benchmark which does stock trading simulation.
https://www.ibm.com/docs/en/linux-on-systems?topic=descriptions-daytrader-benchmark-application
https://cwiki.apache.org/confluence/display/GMOxDOC12/Daytrader

TODO: Get numbers with very high concurrency of hackbench/schbench.

+++++++++++++++++++++++++++++++
on x86_64 (Laptop running KVMs)
+++++++++++++++++++++++++++++++
Host: 8 CPUs.
Two VMs, each spawned with -smp 8.
-----------
Scenario 1:
Both VMs are running hackbench with 10 process groups and 10000 loops.
Values are the average of 3 runs. Steal time close to 50% was seen when
running upstream, so CPUs 4-7 were marked as paravirt by writing to the sysfs file.
Since the laptop has a lot of host tasks running, there will still be some steal time.

hackbench 10 groups		upstream		+series (4-7 marked as paravirt)
(seconds)		 	  58			   54.42			

Note: Having 5 groups helps too. But when concurrency gets very high (40 groups), it regresses.

-----------
Scenario 2:
Both VMs are running schbench. Values are the average of 2 runs.
"schbench -t 4 -r 30 -i 30" (latencies improve but rps is slightly lower)

wakeup latencies		upstream		+series(4-7 marked as paravirt)
50.0th				  25.5		  		13.5
90.0th				  70.0				30.0
99.0th				2588.0			      1992.0
99.9th				3844.0			      6032.0
average rps:			   338				326

schbench -t 8 -r 30 -i 30    (Major degradation of rps)
wakeup latencies		upstream		+series(4-7 marked as paravirt)
50.0th				  15.0				11.5
90.0th				1630.0			      2844.0
99.0th				4314.0			      6624.0
99.9th				8572.0			     10896.0
average rps:			 393			       240.5

Anything higher also regresses. Need to see why that might be. Maybe too many context
switches, since the number of threads is too high and fewer CPUs are available.
Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
Posted by Greg KH 2 months, 2 weeks ago
On Wed, Nov 19, 2025 at 06:14:32PM +0530, Shrikanth Hegde wrote:
> Detailed problem statement and some of the implementation choices were 
> discussed earlier[1].
> 
> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
> 
> This is likely the version which would be used for LPC2025 discussion on
> this topic. Feel free to provide your suggestion and hoping for a solution
> that works for different architectures and it's use cases.
> 
> All the existing alternatives such as cpu hotplug, creating isolated
> partitions etc break the user affinity. Since number of CPUs to use change
> depending on the steal time, it is not driven by User. Hence it would be
> wrong to break the affinity. This series allows if the task is pinned
> only paravirt CPUs, it will continue running there.
> 
> Changes compared v3[1]:

There is no "v" for this series :(
Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
Posted by Shrikanth Hegde 2 months, 2 weeks ago
Hi Greg.

On 11/24/25 10:35 PM, Greg KH wrote:
> On Wed, Nov 19, 2025 at 06:14:32PM +0530, Shrikanth Hegde wrote:
>> Detailed problem statement and some of the implementation choices were
>> discussed earlier[1].
>>
>> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>>
>> This is likely the version which would be used for LPC2025 discussion on
>> this topic. Feel free to provide your suggestion and hoping for a solution
>> that works for different architectures and it's use cases.
>>
>> All the existing alternatives such as cpu hotplug, creating isolated
>> partitions etc break the user affinity. Since number of CPUs to use change
>> depending on the steal time, it is not driven by User. Hence it would be
>> wrong to break the affinity. This series allows if the task is pinned
>> only paravirt CPUs, it will continue running there.
>>
>> Changes compared v3[1]:
> 
> There is no "v" for this series :(
> 

I thought about adding v1.

I made it PATCH instead of RFC PATCH since, functionally, it should
be complete now with the arch bits. Since it is a v1 - and I remember people usually
send out the first version without a tag, only adding tags such as v2 afterwards.

I will keep v2 for the next series.
Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
Posted by Christophe Leroy (CS GROUP) 2 months, 1 week ago
Hi Shrikanth,

Le 25/11/2025 à 03:39, Shrikanth Hegde a écrit :
> Hi Greg.
> 
> On 11/24/25 10:35 PM, Greg KH wrote:
>> On Wed, Nov 19, 2025 at 06:14:32PM +0530, Shrikanth Hegde wrote:
>>> Detailed problem statement and some of the implementation choices were
>>> discussed earlier[1].
>>>
>>> [1]: https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/
>>>
>>> This is likely the version which would be used for LPC2025 discussion on
>>> this topic. Feel free to provide your suggestion and hoping for a 
>>> solution
>>> that works for different architectures and it's use cases.
>>>
>>> All the existing alternatives such as cpu hotplug, creating isolated
>>> partitions etc break the user affinity. Since number of CPUs to use 
>>> change
>>> depending on the steal time, it is not driven by User. Hence it would be
>>> wrong to break the affinity. This series allows if the task is pinned
>>> only paravirt CPUs, it will continue running there.
>>>
>>> Changes compared v3[1]:
>>
>> There is no "v" for this series :(
>>
> 
> I thought about adding v1.
> 
> I made it as PATCH from RFC PATCH since functionally it should
> be complete now with arch bits. Since it is v1, I remember usually 
> people send out without adding v1. after v1 had tags such as v2.
> 
> I will keep v2 for the next series.
> 

But you are listing changes compared to v3, so how can it be a v1?
Shouldn't it be a v4? Or in reality a v5, as you already sent a v4 here [1].

[1] 
https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/

Christophe
Re: [PATCH 00/17] Paravirt CPUs and push task for less vCPU preemption
Posted by Shrikanth Hegde 2 months, 1 week ago
Hi Christophe, Greg

>>>
>>> There is no "v" for this series :(
>>>
>>
>> I thought about adding v1.
>>
>> I made it as PATCH from RFC PATCH since functionally it should
>> be complete now with arch bits. Since it is v1, I remember usually 
>> people send out without adding v1. after v1 had tags such as v2.
>>
>> I will keep v2 for the next series.
>>
> 
> But you are listing changes compared to v3, how can it be a v1 ? 
> Shouldn't it be a v4 ? Or in reality a v5 as you already sent a v4 here 
> [1].
> 
> [1] https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/
> 
> Christophe

Sorry about the confusion in version numbers. Hopefully the below helps with reviewing.
If there are no objections, I will make the next one v2. Please let me know.

Revision logs:
++++++++++++++++++++++++++++++++++++++
RFC PATCH v4 -> PATCH (This series)
++++++++++++++++++++++++++++++++++++++
- Last two patches were sent out separately instead of with the series.
   They are now sent as part of the series.
- Use DEVICE_ATTR_RW instead (greg)
- Made it as PATCH since arch specific handling completes the
   functionality.

+++++++++++++++++++++++++++++++++
RFC PATCH v3 -> RFC PATCH v4
+++++++++++++++++++++++++++++++++
- Introduced computation of steal time in powerpc code.
- Derive number of CPUs to use and mark the remaining as paravirt based
   on steal values.
- Provide debugfs knobs to alter how steal time values are used.
- Removed static key check for paravirt CPUs (Yury)
- Removed preempt_disable/enable while calling stopper (Prateek)
- Made select_idle_sibling and friends aware of paravirt CPUs.
- Removed 3 unused schedstat fields and introduced 2 related to paravirt
   handling.
- Handled nohz_full case by enabling tick on it when there is CFS/RT on
   it.
- Updated debug patch to override arch behavior for easier debugging
   during development.
- Kept the method of pushing only the current task out instead of moving all tasks
   on the rq, given the complexity of the latter.

+++++++++++++++++++++++++++++++++
RFC v2 -> RFC PATCH v3
+++++++++++++++++++++++++++++++++
- Renamed to paravirt_cpus_mask
- Folded the changes under CONFIG_PARAVIRT.
- Fixed the crash due to work_buf corruption while using
   stop_one_cpu_nowait.
- Added sysfs documentation.
- Copied most of __balance_push_cpu_stop into a new function; this helps move
   the code out of CONFIG_HOTPLUG_CPU.
- Applied some of the suggested code movement.

+++++++++++++++++++++++++++++++++
RFC PATCH -> RFC v2
+++++++++++++++++++++++++++++++++
- Renamed to cpu_avoid_mask in place of cpu_parked_mask.
- Used a static key so that there is no impact on the regular case.
- Added a sysfs file to show avoid CPUs.
- Made RT aware of avoid CPUs.
- Added a documentation patch.
- Took care of reported compile error when NR_CPUS=1


PATCH          : https://lore.kernel.org/all/20251119124449.1149616-1-sshegde@linux.ibm.com/
RFC PATCH v4   : https://lore.kernel.org/all/20251119062100.1112520-1-sshegde@linux.ibm.com/#r
RFC PATCH v3   : https://lore.kernel.org/all/20250910174210.1969750-1-sshegde@linux.ibm.com/#r
RFC v2         : https://lore.kernel.org/all/20250625191108.1646208-1-sshegde@linux.ibm.com/#r
RFC PATCH      : https://lore.kernel.org/all/20250523181448.3777233-1-sshegde@linux.ibm.com/