[RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU

Shrikanth Hegde posted 10 patches 5 months ago
[RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU
Posted by Shrikanth Hegde 5 months ago
Actively push out any task running on a paravirt CPU. Since the task is
running on the CPU, a stopper thread needs to be spawned to push the task out.

If a task is sleeping, it is expected to move out when it wakes up. In
case it still chooses a paravirt CPU, the next tick will move it out.
However, if the task is pinned only to paravirt CPUs, it will continue
running there.

Though the code is almost the same as __balance_push_cpu_stop and quite close
to push_cpu_stop, it provides a cleaner implementation w.r.t. the PARAVIRT
config.

Add a push_task_work_done flag to protect the pv_push_task_work buffer. It has
been placed in the empty slot available considering 64/128 byte
cachelines.

This currently works only for FAIR and RT.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c  | 84 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  9 ++++-
 2 files changed, 92 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 279b0dd72b5e..1f9df5b8a3a2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5629,6 +5629,10 @@ void sched_tick(void)
 
 	sched_clock_tick();
 
+	/* push the current task out if this is a paravirt CPU */
+	if (is_cpu_paravirt(cpu))
+		push_current_from_paravirt_cpu(rq);
+
 	rq_lock(rq, &rf);
 	donor = rq->donor;
 
@@ -10977,4 +10981,84 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
 struct cpumask __cpu_paravirt_mask __read_mostly;
 EXPORT_SYMBOL(__cpu_paravirt_mask);
 DEFINE_STATIC_KEY_FALSE(cpu_paravirt_push_tasks);
+
+static DEFINE_PER_CPU(struct cpu_stop_work, pv_push_task_work);
+
+static int paravirt_push_cpu_stop(void *arg)
+{
+	struct task_struct *p = arg;
+	struct rq *rq = this_rq();
+	struct rq_flags rf;
+	int cpu;
+
+	raw_spin_lock_irq(&p->pi_lock);
+	rq_lock(rq, &rf);
+	rq->push_task_work_done = 0;
+
+	update_rq_clock(rq);
+
+	if (task_rq(p) == rq && task_on_rq_queued(p)) {
+		cpu = select_fallback_rq(rq->cpu, p);
+		rq = __migrate_task(rq, &rf, p, cpu);
+	}
+
+	rq_unlock(rq, &rf);
+	raw_spin_unlock_irq(&p->pi_lock);
+	put_task_struct(p);
+
+	return 0;
+}
+
+/*
+ * A CPU is marked as paravirt when there is contention for the underlying
+ * physical CPU and using it will lead to hypervisor preemptions. It is
+ * better not to use this CPU.
+ *
+ * If any task is scheduled on such a CPU, move it out. select_fallback_rq()
+ * will choose a non-paravirt CPU, so the task shouldn't come back here.
+ */
+void push_current_from_paravirt_cpu(struct rq *rq)
+{
+	struct task_struct *push_task = rq->curr;
+	unsigned long flags;
+	struct rq_flags rf;
+
+	if (!is_cpu_paravirt(rq->cpu))
+		return;
+
+	/* Idle task can't be pushed out */
+	if (rq->curr == rq->idle)
+		return;
+
+	/* Only do this for SCHED_NORMAL and RT for now */
+	if (push_task->sched_class != &fair_sched_class &&
+	    push_task->sched_class != &rt_sched_class)
+		return;
+
+	if (kthread_is_per_cpu(push_task) ||
+	    is_migration_disabled(push_task))
+		return;
+
+	/* Is the task affined only to paravirt CPUs? */
+	if (cpumask_subset(push_task->cpus_ptr, cpu_paravirt_mask))
+		return;
+
+	/* There is already a stopper thread for this. Don't race with it */
+	if (rq->push_task_work_done == 1)
+		return;
+
+	local_irq_save(flags);
+	preempt_disable();
+
+	get_task_struct(push_task);
+
+	rq_lock(rq, &rf);
+	rq->push_task_work_done = 1;
+	rq_unlock(rq, &rf);
+
+	stop_one_cpu_nowait(rq->cpu, paravirt_push_cpu_stop, push_task,
+			    this_cpu_ptr(&pv_push_task_work));
+	preempt_enable();
+	local_irq_restore(flags);
+}
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8f9991453d36..5077a32593da 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1187,7 +1187,9 @@ struct rq {
 
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
-
+#ifdef CONFIG_PARAVIRT
+	bool			push_task_work_done;
+#endif
 	unsigned long		misfit_task_load;
 
 	/* For active balancing */
@@ -3890,11 +3892,16 @@ static inline bool is_cpu_paravirt(int cpu)
 
 	return false;
 }
+
+void push_current_from_paravirt_cpu(struct rq *rq);
+
 #else	/* !CONFIG_PARAVIRT */
 static inline bool is_cpu_paravirt(int cpu)
 {
 	return false;
 }
+
+static inline void push_current_from_paravirt_cpu(struct rq *rq) { }
 #endif	/* !CONFIG_PARAVIRT */
 
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.47.3
Re: [RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU
Posted by K Prateek Nayak 5 months ago
Hello Shrikanth,

On 9/10/2025 11:12 PM, Shrikanth Hegde wrote:
> Actively push out any task running on a paravirt CPU. Since the task is
> running on the CPU need to spawn a stopper thread and push the task out.
> 
> If task is sleeping, when it wakes up it is expected to move out. In
> case it still chooses a paravirt CPU, next tick will move it out.
> However, if the task in pinned only to paravirt CPUs, it will continue
> running there.
> 
> Though code is almost same as __balance_push_cpu_stop and quite close to
> push_cpu_stop, it provides a cleaner implementation w.r.t to PARAVIRT
> config.
> 
> Add push_task_work_done flag to protect pv_push_task_work buffer. This has
> been placed at the empty slot available considering 64/128 byte
> cacheline.
> 
> This currently works only FAIR and RT.

EXT can perhaps use the ops->cpu_{release,acquire}() if they are
interested in this.
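
For illustration, a rough sketch of how a sched_ext BPF scheduler might react
if the core were ever wired up to call cpu_release() when a CPU turns paravirt.
The cpu_release op and the scx_bpf_reenqueue_local() kfunc exist today; the
paravirt wiring itself is purely hypothetical here:

	#include <scx/common.bpf.h>

	/* Sketch only: hand back whatever was queued locally on this CPU so
	 * it can be dispatched to some other, non-paravirt CPU. */
	void BPF_STRUCT_OPS(pv_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
	{
		scx_bpf_reenqueue_local();
	}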

> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
>  kernel/sched/core.c  | 84 ++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h |  9 ++++-
>  2 files changed, 92 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 279b0dd72b5e..1f9df5b8a3a2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5629,6 +5629,10 @@ void sched_tick(void)
>  
>  	sched_clock_tick();
>  
> +	/* push the current task out if a paravirt CPU */
> +	if (is_cpu_paravirt(cpu))
> +		push_current_from_paravirt_cpu(rq);

Does this mean paravirt CPU is capable of handling an interrupt but may
not be continuously available to run a task? Or is the VMM expected to set
the CPU on the paravirt mask and give the vCPU sufficient time to move the
task before yanking it away from the pCPU?

> +
>  	rq_lock(rq, &rf);
>  	donor = rq->donor;
>  
> @@ -10977,4 +10981,84 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
>  struct cpumask __cpu_paravirt_mask __read_mostly;
>  EXPORT_SYMBOL(__cpu_paravirt_mask);
>  DEFINE_STATIC_KEY_FALSE(cpu_paravirt_push_tasks);
> +
> +static DEFINE_PER_CPU(struct cpu_stop_work, pv_push_task_work);
> +
> +static int paravirt_push_cpu_stop(void *arg)
> +{
> +	struct task_struct *p = arg;

Can we move all pushable tasks at once instead of just the rq->curr at
the time of the tick? It can also avoid keeping the reference to "p"
and only selectively pushing it. Thoughts?

> +	struct rq *rq = this_rq();
> +	struct rq_flags rf;
> +	int cpu;
> +
> +	raw_spin_lock_irq(&p->pi_lock);
> +	rq_lock(rq, &rf);
> +	rq->push_task_work_done = 0;
> +
> +	update_rq_clock(rq);
> +
> +	if (task_rq(p) == rq && task_on_rq_queued(p)) {
> +		cpu = select_fallback_rq(rq->cpu, p);
> +		rq = __migrate_task(rq, &rf, p, cpu);
> +	}
> +
> +	rq_unlock(rq, &rf);
> +	raw_spin_unlock_irq(&p->pi_lock);
> +	put_task_struct(p);
> +
> +	return 0;
> +}
> +
> +/* A CPU is marked as Paravirt when there is contention for underlying
> + * physical CPU and using this CPU will lead to hypervisor preemptions.
> + * It is better not to use this CPU.
> + *
> + * In case any task is scheduled on such CPU, move it out. In
> + * select_fallback_rq a non paravirt CPU will be chosen and henceforth
> + * task shouldn't come back to this CPU
> + */
> +void push_current_from_paravirt_cpu(struct rq *rq)
> +{
> +	struct task_struct *push_task = rq->curr;
> +	unsigned long flags;
> +	struct rq_flags rf;
> +
> +	if (!is_cpu_paravirt(rq->cpu))
> +		return;
> +
> +	/* Idle task can't be pused out */
> +	if (rq->curr == rq->idle)
> +		return;
> +
> +	/* Do for only SCHED_NORMAL AND RT for now */
> +	if (push_task->sched_class != &fair_sched_class &&
> +	    push_task->sched_class != &rt_sched_class)
> +		return;
> +
> +	if (kthread_is_per_cpu(push_task) ||
> +	    is_migration_disabled(push_task))
> +		return;
> +
> +	/* Is it affine to only paravirt cpus? */
> +	if (cpumask_subset(push_task->cpus_ptr, cpu_paravirt_mask))
> +		return;
> +
> +	/* There is already a stopper thread for this. Dont race with it */
> +	if (rq->push_task_work_done == 1)
> +		return;
> +
> +	local_irq_save(flags);
> +	preempt_disable();

Disabling IRQs implies preemption is disabled.

> +
> +	get_task_struct(push_task);
> +
> +	rq_lock(rq, &rf);
> +	rq->push_task_work_done = 1;
> +	rq_unlock(rq, &rf);
> +
> +	stop_one_cpu_nowait(rq->cpu, paravirt_push_cpu_stop, push_task,
> +			    this_cpu_ptr(&pv_push_task_work));
> +	preempt_enable();
> +	local_irq_restore(flags);
> +}
>  #endif
-- 
Thanks and Regards,
Prateek
Re: [RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU
Posted by Shrikanth Hegde 3 months ago
>> +
>> +static DEFINE_PER_CPU(struct cpu_stop_work, pv_push_task_work);
>> +
>> +static int paravirt_push_cpu_stop(void *arg)
>> +{
>> +	struct task_struct *p = arg;
> 
> Can we move all pushable tasks at once instead of just the rq->curr at
> the time of the tick? It can also avoid keeping the reference to "p"
> and only selectively pushing it. Thoughts?
> 
>> +	struct rq *rq = this_rq();
>> +	struct rq_flags rf;
>> +	int cpu;
>> +
>> +	raw_spin_lock_irq(&p->pi_lock);
>> +	rq_lock(rq, &rf);
>> +	rq->push_task_work_done = 0;
>> +
>> +	update_rq_clock(rq);
>> +
>> +	if (task_rq(p) == rq && task_on_rq_queued(p)) {
>> +		cpu = select_fallback_rq(rq->cpu, p);
>> +		rq = __migrate_task(rq, &rf, p, cpu);
>> +	}
>> +
>> +	rq_unlock(rq, &rf);
>> +	raw_spin_unlock_irq(&p->pi_lock);
>> +	put_task_struct(p);
>> +
>> +	return 0;
>> +}
>> +

Got it to work by using rt.pushable_tasks (RT) and rq->cfs_tasks (CFS).

I don't see any significant benefit from doing this. There is a slight improvement in the time
it takes to move the tasks out. This could help when there are way too many tasks on the rq.
But these days most systems run with HZ=1000, i.e. a 1 ms tick, so it shouldn't take
very long to push the current task out. Also, the rq lock likely needs to be held across
the loop to ensure the list doesn't get altered by IRQs etc.

Given the complexity, I prefer the method of pushing the current task out.
---

        /*
         * Sketch from the experiment: orig_rq is the paravirt CPU's runqueue
         * (e.g. this_rq() when run from the stopper thread).
         */
        struct task_struct *p, *tmp_p;
        struct rq *rq, *orig_rq = this_rq();
        struct rq_flags rf;
        unsigned long flags;
        int cpu;

        /* push the RT tasks out first */
        plist_for_each_entry_safe(p, tmp_p, &orig_rq->rt.pushable_tasks, pushable_tasks) {
                rq = orig_rq;

                if (kthread_is_per_cpu(p) || is_migration_disabled(p))
                        continue;

                raw_spin_lock_irqsave(&p->pi_lock, flags);
                rq_lock(rq, &rf);

                update_rq_clock(rq);

                if (task_rq(p) == rq && task_on_rq_queued(p)) {
                        cpu = select_fallback_rq(rq->cpu, p);
                        rq = __migrate_task(rq, &rf, p, cpu);
                }

                rq_unlock(rq, &rf);
                raw_spin_unlock_irqrestore(&p->pi_lock, flags);
        }
Re: [RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU
Posted by Shrikanth Hegde 5 months ago

On 9/11/25 11:10 AM, K Prateek Nayak wrote:
> Hello Shrikanth,
> 
> On 9/10/2025 11:12 PM, Shrikanth Hegde wrote:
>> Actively push out any task running on a paravirt CPU. Since the task is
>> running on the CPU need to spawn a stopper thread and push the task out.
>>
>> If task is sleeping, when it wakes up it is expected to move out. In
>> case it still chooses a paravirt CPU, next tick will move it out.
>> However, if the task in pinned only to paravirt CPUs, it will continue
>> running there.
>>
>> Though code is almost same as __balance_push_cpu_stop and quite close to
>> push_cpu_stop, it provides a cleaner implementation w.r.t to PARAVIRT
>> config.
>>
>> Add push_task_work_done flag to protect pv_push_task_work buffer. This has
>> been placed at the empty slot available considering 64/128 byte
>> cacheline.
>>
>> This currently works only FAIR and RT.
> 
> EXT can perhaps use the ops->cpu_{release,acquire}() if they are
> interested in this.
>

>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>>   kernel/sched/core.c  | 84 ++++++++++++++++++++++++++++++++++++++++++++
>>   kernel/sched/sched.h |  9 ++++-
>>   2 files changed, 92 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 279b0dd72b5e..1f9df5b8a3a2 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -5629,6 +5629,10 @@ void sched_tick(void)
>>   
>>   	sched_clock_tick();
>>   
>> +	/* push the current task out if a paravirt CPU */
>> +	if (is_cpu_paravirt(cpu))
>> +		push_current_from_paravirt_cpu(rq);
> 
> Does this mean paravirt CPU is capable of handling an interrupt but may
> not be continuously available to run a task?

When I run hackbench, which involves a fair bit of IRQ activity, it moves out.

For example,

echo 600-710 > /sys/devices/system/cpu/paravirt

11:31:54 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
11:31:57 AM  598    2.04    0.00   77.55    0.00   18.37    0.00    1.02    0.00    0.00    1.02
11:31:57 AM  599    1.01    0.00   79.80    0.00   17.17    0.00    1.01    0.00    0.00    1.01
11:31:57 AM  600    0.00    0.00    0.00    0.00    0.00    0.00    0.99    0.00    0.00   99.01
11:31:57 AM  601    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
11:31:57 AM  602    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00


There could be some workloads where IRQs don't move out, which would need an irqbalance change.
Looking into it.

> Or is the VMM expected to set
> the CPU on the paravirt mask and give the vCPU sufficient time to move the
> task before yanking it away from the pCPU?
>

If the vCPU is running something, it is going to run on a pCPU at some point;
the hypervisor will give cycles to this vCPU by preempting some other vCPU.

The point of using this infra is that there should be nothing running on a paravirt vCPU.
That way, collectively, the VMM gets only a limited number of requests for pCPUs, which it
can satisfy without vCPU preemption.

  
>> +
>>   	rq_lock(rq, &rf);
>>   	donor = rq->donor;
>>   
>> @@ -10977,4 +10981,84 @@ void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
>>   struct cpumask __cpu_paravirt_mask __read_mostly;
>>   EXPORT_SYMBOL(__cpu_paravirt_mask);
>>   DEFINE_STATIC_KEY_FALSE(cpu_paravirt_push_tasks);
>> +
>> +static DEFINE_PER_CPU(struct cpu_stop_work, pv_push_task_work);
>> +
>> +static int paravirt_push_cpu_stop(void *arg)
>> +{
>> +	struct task_struct *p = arg;
> 
> Can we move all pushable tasks at once instead of just the rq->curr at
> the time of the tick? It can also avoid keeping the reference to "p"
> and only selectively pushing it. Thoughts?
> 

I think that is doable.
We would need to pass the rq as the arg and go through all tasks on the rq in the stopper thread.

>> +	struct rq *rq = this_rq();
>> +	struct rq_flags rf;
>> +	int cpu;
>> +
>> +	raw_spin_lock_irq(&p->pi_lock);
>> +	rq_lock(rq, &rf);
>> +	rq->push_task_work_done = 0;
>> +
>> +	update_rq_clock(rq);
>> +
>> +	if (task_rq(p) == rq && task_on_rq_queued(p)) {
>> +		cpu = select_fallback_rq(rq->cpu, p);
>> +		rq = __migrate_task(rq, &rf, p, cpu);
>> +	}
>> +
>> +	rq_unlock(rq, &rf);
>> +	raw_spin_unlock_irq(&p->pi_lock);
>> +	put_task_struct(p);
>> +
>> +	return 0;
>> +}
>> +
>> +/* A CPU is marked as Paravirt when there is contention for underlying
>> + * physical CPU and using this CPU will lead to hypervisor preemptions.
>> + * It is better not to use this CPU.
>> + *
>> + * In case any task is scheduled on such CPU, move it out. In
>> + * select_fallback_rq a non paravirt CPU will be chosen and henceforth
>> + * task shouldn't come back to this CPU
>> + */
>> +void push_current_from_paravirt_cpu(struct rq *rq)
>> +{
>> +	struct task_struct *push_task = rq->curr;
>> +	unsigned long flags;
>> +	struct rq_flags rf;
>> +
>> +	if (!is_cpu_paravirt(rq->cpu))
>> +		return;
>> +
>> +	/* Idle task can't be pused out */
>> +	if (rq->curr == rq->idle)
>> +		return;
>> +
>> +	/* Do for only SCHED_NORMAL AND RT for now */
>> +	if (push_task->sched_class != &fair_sched_class &&
>> +	    push_task->sched_class != &rt_sched_class)
>> +		return;
>> +
>> +	if (kthread_is_per_cpu(push_task) ||
>> +	    is_migration_disabled(push_task))
>> +		return;
>> +
>> +	/* Is it affine to only paravirt cpus? */
>> +	if (cpumask_subset(push_task->cpus_ptr, cpu_paravirt_mask))
>> +		return;
>> +
>> +	/* There is already a stopper thread for this. Dont race with it */
>> +	if (rq->push_task_work_done == 1)
>> +		return;
>> +
>> +	local_irq_save(flags);
>> +	preempt_disable();
> 
> Disabling IRQs implies preemption is disabled.
>

In most places, stop_one_cpu_nowait() is called with preemption & IRQs disabled.
The stopper runs at the next possible opportunity.

stop_one_cpu_nowait
  ->queues the task into stopper list
     -> wake_up_process(stopper)
        -> set need_resched
          -> stopper runs as early as possible.
          
>> +
>> +	get_task_struct(push_task);
>> +
>> +	rq_lock(rq, &rf);
>> +	rq->push_task_work_done = 1;
>> +	rq_unlock(rq, &rf);
>> +
>> +	stop_one_cpu_nowait(rq->cpu, paravirt_push_cpu_stop, push_task,
>> +			    this_cpu_ptr(&pv_push_task_work));
>> +	preempt_enable();
>> +	local_irq_restore(flags);
>> +}
>>   #endif
Re: [RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU
Posted by K Prateek Nayak 5 months ago
Hello Shrikanth,

On 9/11/2025 10:22 PM, Shrikanth Hegde wrote:
>>> +    if (is_cpu_paravirt(cpu))
>>> +        push_current_from_paravirt_cpu(rq);
>>
>> Does this mean paravirt CPU is capable of handling an interrupt but may
>> not be continuously available to run a task?
> 
> When i run hackbench which involves fair bit of IRQ stuff, it moves out.
> 
> For example,
> 
> echo 600-710 > /sys/devices/system/cpu/paravirt
> 
> 11:31:54 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> 11:31:57 AM  598    2.04    0.00   77.55    0.00   18.37    0.00    1.02    0.00    0.00    1.02
> 11:31:57 AM  599    1.01    0.00   79.80    0.00   17.17    0.00    1.01    0.00    0.00    1.01
> 11:31:57 AM  600    0.00    0.00    0.00    0.00    0.00    0.00    0.99    0.00    0.00   99.01
> 11:31:57 AM  601    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> 11:31:57 AM  602    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> 
> 
> There could some workloads which doesn't move irq's out, for which needs irqbalance change.
> Looking into it.
> 
>  Or is the VMM expected to set
>> the CPU on the paravirt mask and give the vCPU sufficient time to move the
>> task before yanking it away from the pCPU?
>>
> 
> If the vCPU is running something, it is going to run at some point on pCPU.
> hypervisor will give the cycles to this vCPU by preempting some other vCPU.
> 
> It is that using this infra, there is should be nothing on that paravirt vCPU.
> That way collectively VMM gets only limited request for pCPU which it can satify
> without vCPU preemption.

Ack! Just wanted to understand the usage.

P.S. I remember discussions during last LPC where we could communicate
this unavailability via CPU capacity. Was that problematic for some
reason? Sorry if I didn't follow this discussion earlier.

[..snip..]
>>> +    local_irq_save(flags);
>>> +    preempt_disable();
>>
>> Disabling IRQs implies preemption is disabled.
>>
> 
> Most of the places stop_one_cpu_nowait called with preemption & irq disabled.
> stopper runs at the next possible opportunity.

But is there any reason to do both local_irq_save() and
preempt_disable()? include/linux/preempt.h defines preemptible() as:

    #define preemptible()   (preempt_count() == 0 && !irqs_disabled())

so disabling IRQs should be sufficient, right? Or am I missing something?

> 
> stop_one_cpu_nowait
>  ->queues the task into stopper list
>     -> wake_up_process(stopper)
>        -> set need_resched
>          -> stopper runs as early as possible.
>         
-- 
Thanks and Regards,
Prateek

Re: [RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU
Posted by Shrikanth Hegde 4 months, 4 weeks ago

On 9/11/25 10:36 PM, K Prateek Nayak wrote:
> Hello Shrikanth,
> 
> On 9/11/2025 10:22 PM, Shrikanth Hegde wrote:
>>>> +    if (is_cpu_paravirt(cpu))
>>>> +        push_current_from_paravirt_cpu(rq);
>>>
>>> Does this mean paravirt CPU is capable of handling an interrupt but may
>>> not be continuously available to run a task?
>>
>> When i run hackbench which involves fair bit of IRQ stuff, it moves out.
>>
>> For example,
>>
>> echo 600-710 > /sys/devices/system/cpu/paravirt
>>
>> 11:31:54 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>> 11:31:57 AM  598    2.04    0.00   77.55    0.00   18.37    0.00    1.02    0.00    0.00    1.02
>> 11:31:57 AM  599    1.01    0.00   79.80    0.00   17.17    0.00    1.01    0.00    0.00    1.01
>> 11:31:57 AM  600    0.00    0.00    0.00    0.00    0.00    0.00    0.99    0.00    0.00   99.01
>> 11:31:57 AM  601    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
>> 11:31:57 AM  602    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
>>
>>
>> There could some workloads which doesn't move irq's out, for which needs irqbalance change.
>> Looking into it.
>>
>>   Or is the VMM expected to set
>>> the CPU on the paravirt mask and give the vCPU sufficient time to move the
>>> task before yanking it away from the pCPU?
>>>
>>
>> If the vCPU is running something, it is going to run at some point on pCPU.
>> hypervisor will give the cycles to this vCPU by preempting some other vCPU.
>>
>> It is that using this infra, there is should be nothing on that paravirt vCPU.
>> That way collectively VMM gets only limited request for pCPU which it can satify
>> without vCPU preemption.
> 
> Ack! Just wanted to understand the usage.
> 
> P.S. I remember discussions during last LPC where we could communicate
> this unavailability via CPU capacity. Was that problematic for some
> reason? Sorry if I didn't follow this discussion earlier.
> 

Thanks for that question. It gives an opportunity to retrospect.

Yes, that's where we started, but it has a lot of implementation challenges.
Still an option though.

History up to the current state:

1. At LPC24 we presented the problem statement, and why existing approaches such as hotplug,
    cpuset cgroups or taskset are not viable solutions. Hotplug would have come in handy if the
    cost were low, but the overhead of the sched domain rebuild and the serial nature of hotplug
    make it not a viable option. One of the possible approaches was CPU capacity.

2. Issues with the CPU capacity approach:
    a. group_misfit_task needs to become the highest-priority group type. That alone would break
       big.LITTLE, since it relies on group_misfit_task while group_overloaded should have higher priority there.
    b. At high concurrency, tasks still moved to those CPUs with capacity=1.
    c. A lot of scheduler stats would need to be aware of the change in capacity, especially load balance/wakeup.
    d. In update_group_misfit(), the misfit load would need to be set based on capacity; the current
       code sets it to 0 because of the task_fits_cpu() handling.
    e. More challenges in RT.

That's when Tobias introduced a new group type called group_parked.
https://lore.kernel.org/all/20241204112149.25872-2-huschle@linux.ibm.com/

It has a relatively cleaner implementation compared to the CPU capacity approach.

It had a few disadvantages too:
1. It used to take around 8-10 seconds for tasks to move out of those CPUs. That was the main
    concern.
2. It needs a few stats-based changes in update_sg_lb_stats, which might be tricky in all scenarios.

That's when we looked at how tasks move out when a CPU goes offline. It happens quite fast too.
So we tried a similar mechanism, and this is where we are right now.

> [..snip..]
>>>> +    local_irq_save(flags);
>>>> +    preempt_disable();
>>>
>>> Disabling IRQs implies preemption is disabled.
>>>
>>
>> Most of the places stop_one_cpu_nowait called with preemption & irq disabled.
>> stopper runs at the next possible opportunity.
> 
> But is there any reason to do both local_irq_save() and
> preempt_disable()? include/linux/preempt.h defines preemptible() as:
>
>      #define preemptible()   (preempt_count() == 0 && !irqs_disabled())
> 
> so disabling IRQs should be sufficient right or am I missing something?
> 

f0498d2a54e79 (Peter Zijlstra) "sched: Fix stop_one_cpu_nowait() vs hotplug"
could be the answer you are looking for.

>>
>> stop_one_cpu_nowait
>>   ->queues the task into stopper list
>>      -> wake_up_process(stopper)
>>         -> set need_resched
>>           -> stopper runs as early as possible.
>>          

Re: [RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU
Posted by K Prateek Nayak 4 months, 4 weeks ago
Hello Shrikanth,

On 9/12/2025 10:52 AM, Shrikanth Hegde wrote:
> 
> 
> On 9/11/25 10:36 PM, K Prateek Nayak wrote:
>> Hello Shrikanth,
>>
>> On 9/11/2025 10:22 PM, Shrikanth Hegde wrote:
>>>>> +    if (is_cpu_paravirt(cpu))
>>>>> +        push_current_from_paravirt_cpu(rq);
>>>>
>>>> Does this mean paravirt CPU is capable of handling an interrupt but may
>>>> not be continuously available to run a task?
>>>
>>> When i run hackbench which involves fair bit of IRQ stuff, it moves out.
>>>
>>> For example,
>>>
>>> echo 600-710 > /sys/devices/system/cpu/paravirt
>>>
>>> 11:31:54 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>>> 11:31:57 AM  598    2.04    0.00   77.55    0.00   18.37    0.00    1.02    0.00    0.00    1.02
>>> 11:31:57 AM  599    1.01    0.00   79.80    0.00   17.17    0.00    1.01    0.00    0.00    1.01
>>> 11:31:57 AM  600    0.00    0.00    0.00    0.00    0.00    0.00    0.99    0.00    0.00   99.01
>>> 11:31:57 AM  601    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
>>> 11:31:57 AM  602    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
>>>
>>>
>>> There could some workloads which doesn't move irq's out, for which needs irqbalance change.
>>> Looking into it.
>>>
>>>   Or is the VMM expected to set
>>>> the CPU on the paravirt mask and give the vCPU sufficient time to move the
>>>> task before yanking it away from the pCPU?
>>>>
>>>
>>> If the vCPU is running something, it is going to run at some point on pCPU.
>>> hypervisor will give the cycles to this vCPU by preempting some other vCPU.
>>>
>>> It is that using this infra, there is should be nothing on that paravirt vCPU.
>>> That way collectively VMM gets only limited request for pCPU which it can satify
>>> without vCPU preemption.
>>
>> Ack! Just wanted to understand the usage.
>>
>> P.S. I remember discussions during last LPC where we could communicate
>> this unavailability via CPU capacity. Was that problematic for some
>> reason? Sorry if I didn't follow this discussion earlier.
>>
> 
> Thanks for that questions. Gives a opportunity to retrospect.
> 
> Yes. That's where we started. but that has a lot of implementation challenges.
> Still an option though.
> 
> History upto current state:
> 
> 1. At LPC24 presented the problem statement, and why existing approaches such as hotplug,
>    cpuset cgroup or taskset are not viable solution. Hotplug would have come handy if the cost was low.
>    The overhead of sched domain rebuild and serial nature of hotplug makes it not viable option.
>    One of the possible approach was CPU Capacity.

Ack. Is creating an isolated partition on the fly too expensive as well?
I don't think creation of that partition is serialized, and it should
achieve a similar result with a single sched-domain rebuild. I'm
hoping the VMM doesn't change the paravirt mask at an alarming rate.

P.S. Some stupid benchmarking on a 256CPU machine:

    mkdir /sys/fs/cgroup/isol/
    echo isolated >  /sys/fs/cgroup/isol/cpuset.cpus.partition

    time for i in {1..1000}; do \
    echo "8-15" > /sys/fs/cgroup/isol/cpuset.cpus.exclusive; \
    echo "16-23" > /sys/fs/cgroup/isol/cpuset.cpus.exclusive; \
    done

    real    2m50.016s
    user    0m0.198s
    sys     1m47.708s

So that is about (170sec / 2000) ~ 85ms per cpuset operation.
Definitely more expensive than setting the paravirt but compare that to:

    for i in {8..15}; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done; \
    for i in {8..15}; do echo 1 > /sys/devices/system/cpu/cpu$i/online; done; \
    for i in {16..23}; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done; \
    for i in {16..23}; do echo 1 > /sys/devices/system/cpu/cpu$i/online; done;

    real    0m5.046s
    user    0m0.014s
    sys     0m0.110s

Definitely less expensive than a full hotplug.

> 
> 1. Issues with CPU Capacity approach:
>    a. Need to make group_misfit_task as the highest priority. That alone will break big.LITTLE
>       since it relies on group misfit and group_overload should have higher priority there.
>    b. At high concurrency tasks still moved those CPUs with CAPACITY=1.
>    c. A lot of scheduler stats would need to be aware of change in CAPACITY specially load balance/wakeup.

Ack. Thinking out loud: Can capacity go to 0 via the H/W pressure interface?
Maybe we can toggle the "sched_asym_cpucapacity" static branch without
actually having SD_ASYM_CPUCAPACITY in these special cases to let
asym_fits_cpu() steer away from these 0-capacity CPUs.

>    d. in update_group_misfit - need to set the misfit load based on capacity. the current code sets to 0,
>       because of task_fits_cpu stuff
>    e. More challenges in RT.
> 
> That's when Tobias had introduced a new group type called group_parked.
> https://lore.kernel.org/all/20241204112149.25872-2-huschle@linux.ibm.com/
>   It has relatively cleaner implementation compared to CPU CAPACITY.
> 
> It had a few disadvantages too:
> 1. It use to take around 8-10 seconds for tasks to move out of those CPUs. That was the main
>    concern.
> 2. Needs a few stats based changes in update_sg_lb_stats. might be tricky in all scenarios.
> 
> That's when we were exploring how the tasks move out when the cpu goes offline. It happens quite fast too.
> So tried a similar mechanism and this is where we are right now.

I agree push is great from that perspective.

> 
>> [..snip..]
>>>>> +    local_irq_save(flags);
>>>>> +    preempt_disable();
>>>>
>>>> Disabling IRQs implies preemption is disabled.
>>>>
>>>
>>> Most of the places stop_one_cpu_nowait called with preemption & irq disabled.
>>> stopper runs at the next possible opportunity.
>>
>> But is there any reason to do both local_irq_save() and
>> preempt_disable()? include/linux/preempt.h defines preemptible() as:
>>
>>      #define preemptible()   (preempt_count() == 0 && !irqs_disabled())
>>
>> so disabling IRQs should be sufficient right or am I missing something?
>>
> 
> f0498d2a54e79 (Peter Zijlstra) "sched: Fix stop_one_cpu_nowait() vs hotplug"
> could be the answer you are looking for.

I think in all the cases covered by that commit, "task_rq_unlock(...)" would
have enabled interrupts, which required that specific pattern, but here we
have preempt_disable() within a local_irq_save() section, which might not be
necessary.
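
For illustration, a minimal sketch of what the queueing sequence could look
like if the extra preempt_disable()/preempt_enable() pair is indeed redundant
here (untested; only valid if nothing else ends up relying on the preempt
count in this section):

	local_irq_save(flags);
	get_task_struct(push_task);

	rq_lock(rq, &rf);
	rq->push_task_work_done = 1;
	rq_unlock(rq, &rf);

	stop_one_cpu_nowait(rq->cpu, paravirt_push_cpu_stop, push_task,
			    this_cpu_ptr(&pv_push_task_work));
	local_irq_restore(flags);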

> 
>>>
>>> stop_one_cpu_nowait
>>>   ->queues the task into stopper list
>>>      -> wake_up_process(stopper)
>>>         -> set need_resched
>>>           -> stopper runs as early as possible.
>>>          
> 

-- 
Thanks and Regards,
Prateek

Re: [RFC PATCH v3 07/10] sched/core: Push current task from paravirt CPU
Posted by Shrikanth Hegde 4 months, 4 weeks ago

On 9/12/25 2:18 PM, K Prateek Nayak wrote:
> Hello Shrikanth,
> 
> On 9/12/2025 10:52 AM, Shrikanth Hegde wrote:
>>
>>
>> On 9/11/25 10:36 PM, K Prateek Nayak wrote:
>>> Hello Shrikanth,
>>>
>>> On 9/11/2025 10:22 PM, Shrikanth Hegde wrote:
>>>>>> +    if (is_cpu_paravirt(cpu))
>>>>>> +        push_current_from_paravirt_cpu(rq);
>>>>>
>>>>> Does this mean paravirt CPU is capable of handling an interrupt but may
>>>>> not be continuously available to run a task?
>>>>
>>>> When i run hackbench which involves fair bit of IRQ stuff, it moves out.
>>>>
>>>> For example,
>>>>
>>>> echo 600-710 > /sys/devices/system/cpu/paravirt
>>>>
>>>> 11:31:54 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>>>> 11:31:57 AM  598    2.04    0.00   77.55    0.00   18.37    0.00    1.02    0.00    0.00    1.02
>>>> 11:31:57 AM  599    1.01    0.00   79.80    0.00   17.17    0.00    1.01    0.00    0.00    1.01
>>>> 11:31:57 AM  600    0.00    0.00    0.00    0.00    0.00    0.00    0.99    0.00    0.00   99.01
>>>> 11:31:57 AM  601    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
>>>> 11:31:57 AM  602    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
>>>>
>>>>
>>>> There could some workloads which doesn't move irq's out, for which needs irqbalance change.
>>>> Looking into it.
>>>>
>>>>    Or is the VMM expected to set
>>>>> the CPU on the paravirt mask and give the vCPU sufficient time to move the
>>>>> task before yanking it away from the pCPU?
>>>>>
>>>>
>>>> If the vCPU is running something, it is going to run at some point on pCPU.
>>>> hypervisor will give the cycles to this vCPU by preempting some other vCPU.
>>>>
>>>> It is that using this infra, there is should be nothing on that paravirt vCPU.
>>>> That way collectively VMM gets only limited request for pCPU which it can satify
>>>> without vCPU preemption.
>>>
>>> Ack! Just wanted to understand the usage.
>>>
>>> P.S. I remember discussions during last LPC where we could communicate
>>> this unavailability via CPU capacity. Was that problematic for some
>>> reason? Sorry if I didn't follow this discussion earlier.
>>>
>>
>> Thanks for that questions. Gives a opportunity to retrospect.
>>
>> Yes. That's where we started. but that has a lot of implementation challenges.
>> Still an option though.
>>
>> History upto current state:
>>
>> 1. At LPC24 presented the problem statement, and why existing approaches such as hotplug,
>>     cpuset cgroup or taskset are not viable solution. Hotplug would have come handy if the cost was low.
>>     The overhead of sched domain rebuild and serial nature of hotplug makes it not viable option.
>>     One of the possible approach was CPU Capacity.
> 
> Ack. Is creating an isolated partition on the fly too expensive too?
> I don't think creation of that partition is serialized and it should
> achieve a similar result with a single sched-domain rebuild and I'm
> hoping VMM doesn't change the paravirt mask at an alarming rate.
> 

That is a good idea too.

The main issue is when the workload uses taskset.

For example,
taskset -c 650-700 stress-ng --cpu=100 -t 10
echo isolated > cpuset.cpus.partition
echo 600-710 > cpuset.cpus.exclusive

Tasks move out and their CPU affinity is reset to all CPUs, similar to hotplug.
But both hotplug and the write to cpus.exclusive are triggered by the user, and hence the user
is aware of it.

I don't think it is a good idea to reset a user's CPU affinity without an action from them.

Looking at the code path of:
      cpuset_write_resmask
      update_exclusive_cpumask
    - update_parent_effective_cpumask
       + 6.16% cpuset_update_tasks_cpumask
            set_cpus_allowed_ptr
            __set_cpus_allowed_ptr
            affine_move_task

affine_move_task -> would call migration_cpu_stop -> Moves one task at a time

As you can see, we do the same/similar thing in the paravirt infra, but we don't touch/reset the task's CPU affinity.
Affined tasks continue to run if they are affined only to paravirt CPUs. If there is even one
non-paravirt CPU in their cpus_ptr, they will move there.
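
As a concrete illustration, the affinity handling boils down to this check from
push_current_from_paravirt_cpu() in the patch (sketch, surrounding locking omitted;
the taskset numbers are just the example from above):

	/* paravirt = 600-710, task bound via "taskset -c 650-700":
	 * cpus_ptr is a subset of cpu_paravirt_mask, so leave it alone.
	 * If even one allowed CPU is outside the mask, select_fallback_rq()
	 * moves the task there without touching its affinity. */
	if (cpumask_subset(p->cpus_ptr, cpu_paravirt_mask))
		return;
	cpu = select_fallback_rq(rq->cpu, p);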

> P.S. Some stupid benchmarking on a 256CPU machine:
> 
>      mkdir /sys/fs/cgroup/isol/
>      echo isolated >  /sys/fs/cgroup/isol/cpuset.cpus.partition
> 
>      time for i in {1..1000}; do \
>      echo "8-15" > /sys/fs/cgroup/isol/cpuset.cpus.exclusive; \
>      echo "16-23" > /sys/fs/cgroup/isol/cpuset.cpus.exclusive; \
>      done
> 
>      real    2m50.016s
>      user    0m0.198s
>      sys     1m47.708s
> 
> So that is about (170sec / 2000) ~ 85ms per cpuset operation.

That cost would be okay. The VMM isn't expected to change the mask at a very high rate.

> Definitely more expensive than setting the paravirt but compare that to:
> 
>      for i in {8..15}; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done; \
>      for i in {8..15}; do echo 1 > /sys/devices/system/cpu/cpu$i/online; done; \
>      for i in {16..23}; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done; \
>      for i in {16..23}; do echo 1 > /sys/devices/system/cpu/cpu$i/online; done;'
> 
>      real    0m5.046s
>      user    0m0.014s
>      sys     0m0.110s
> 
> Definitely less expensive than a full hotplug.

This happens mainly due to the synchronize_rcu there.

> 
>>
>> 1. Issues with CPU Capacity approach:
>>     a. Need to make group_misfit_task as the highest priority. That alone will break big.LITTLE
>>        since it relies on group misfit and group_overload should have higher priority there.
>>     b. At high concurrency tasks still moved those CPUs with CAPACITY=1.
>>     c. A lot of scheduler stats would need to be aware of change in CAPACITY specially load balance/wakeup.
> 
> Ack. Thinking out loud: Can capacity go to 0 via H/W pressure interface?
> Maybe we can toggle the "sched_asym_cpucapacity" static branch without
> actually having SD_ASYM_CAPACITY in these special case to enable
> asym_fits_cpu() steer away from these 0 capacity CPUs.

The bigger concern is around group_misfit_task, IMO.

> 
>>     d. in update_group_misfit - need to set the misfit load based on capacity. the current code sets to 0,
>>        because of task_fits_cpu stuff
>>     e. More challenges in RT.
>>
>> That's when Tobias had introduced a new group type called group_parked.
>> https://lore.kernel.org/all/20241204112149.25872-2-huschle@linux.ibm.com/
>>    It has relatively cleaner implementation compared to CPU CAPACITY.
>>
>> It had a few disadvantages too:
>> 1. It use to take around 8-10 seconds for tasks to move out of those CPUs. That was the main
>>     concern.
>> 2. Needs a few stats based changes in update_sg_lb_stats. might be tricky in all scenarios.
>>
>> That's when we were exploring how the tasks move out when the cpu goes offline. It happens quite fast too.
>> So tried a similar mechanism and this is where we are right now.
> 
> I agree push is great from that perspective.
> 

Yes. It is the same at the moment.

>>
>>> [..snip..]
>>>>>> +    local_irq_save(flags);
>>>>>> +    preempt_disable();
>>>>>
>>>>> Disabling IRQs implies preemption is disabled.
>>>>>
>>>>
>>>> Most of the places stop_one_cpu_nowait called with preemption & irq disabled.
>>>> stopper runs at the next possible opportunity.
>>>
>>> But is there any reason to do both local_irq_save() and
>>> preempt_disable()? include/linux/preempt.h defines preemptible() as:
>>>
>>>       #define preemptible()   (preempt_count() == 0 && !irqs_disabled())
>>>
>>> so disabling IRQs should be sufficient right or am I missing something?
>>>
>>
>> f0498d2a54e79 (Peter Zijlstra) "sched: Fix stop_one_cpu_nowait() vs hotplug"
>> could be the answer you are looking for.
> 
> I think in all the cases covered by that commit, "task_rq_unlock(...)" would
> have enabled interrupts which required that specified pattern but here we
> have preempt_disable() within a local_irq_save() section which might not be
> necessary.
> 
>>
>>>>
>>>> stop_one_cpu_nowait
>>>>    ->queues the task into stopper list
>>>>       -> wake_up_process(stopper)
>>>>          -> set need_resched
>>>>            -> stopper runs as early as possible.
>>>>           
>>
>