[v2] PREEMPT_AUTO: support lazy rescheduling

[PATCH v2 29/35] sched: handle preempt=voluntary under PREEMPT_AUTO

Posted by Ankur Arora 1 year, 8 months ago

The default preemption policy for voluntary preemption under
PREEMPT_AUTO is to schedule eagerly for tasks of higher scheduling
class, and lazily for well-behaved, non-idle tasks.

This is the same policy as preempt=none, with an eager handling of
higher priority scheduling classes.

Comparing a cyclictest workload with a background kernel load of
'stress-ng --mmap', shows that both the average and the maximum
latencies improve:

 # stress-ng --mmap 0 &
 # cyclictest --mlockall --smp --priority=80 --interval=200 --distance=0 -q -D 300

                                     Min     (  %stdev )    Act     (  %stdev )   Avg     (  %stdev )   Max      (  %stdev )

  PREEMPT_AUTO, preempt=voluntary    1.73  ( +-  25.43% )   62.16 ( +- 303.39% )  14.92 ( +-  17.96% )  2778.22 ( +-  15.04% )
  PREEMPT_DYNAMIC, preempt=voluntary 1.83  ( +-  20.76% )  253.45 ( +- 233.21% )  18.70 ( +-  15.88% )  2992.45 ( +-  15.95% )

The table above shows the aggregated latencies across all CPUs.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Ziljstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/sched/core.c  | 12 ++++++++----
 kernel/sched/sched.h |  6 ++++++
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c25cccc09b65..2bc3ae21a9d0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1052,6 +1052,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
 	if (preempt_model_preemptible())
 		return RESCHED_NOW;
 
+	if (preempt_model_voluntary() && opt == RESCHED_PRIORITY)
+		return RESCHED_NOW;
+
 	if (is_idle_task(curr))
 		return RESCHED_NOW;
 
@@ -2289,7 +2292,7 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
 	if (p->sched_class == rq->curr->sched_class)
 		rq->curr->sched_class->wakeup_preempt(rq, p, flags);
 	else if (sched_class_above(p->sched_class, rq->curr->sched_class))
-		resched_curr(rq);
+		resched_curr_priority(rq);
 
 	/*
 	 * A queue event has occurred, and we're going to schedule.  In
@@ -8989,11 +8992,11 @@ static void __sched_dynamic_update(int mode)
 	case preempt_dynamic_none:
 		if (mode != preempt_dynamic_mode)
 			pr_info("%s: none\n", PREEMPT_MODE);
-		preempt_dynamic_mode = mode;
 		break;
 
 	case preempt_dynamic_voluntary:
-		preempt_dynamic_mode = preempt_dynamic_undefined;
+		if (mode != preempt_dynamic_mode)
+			pr_info("%s: voluntary\n", PREEMPT_MODE);
 		break;
 
 	case preempt_dynamic_full:
@@ -9003,9 +9006,10 @@ static void __sched_dynamic_update(int mode)
 
 		if (mode != preempt_dynamic_mode)
 			pr_info("%s: full\n", PREEMPT_MODE);
-		preempt_dynamic_mode = mode;
 		break;
 	}
+
+	preempt_dynamic_mode = mode;
 }
 
 #endif /* CONFIG_PREEMPT_AUTO */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 107c5fc2b7bb..ee8e99a9a677 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2468,6 +2468,7 @@ enum resched_opt {
 	RESCHED_DEFAULT,
 	RESCHED_FORCE,
 	RESCHED_TICK,
+	RESCHED_PRIORITY,
 };
 
 extern void __resched_curr(struct rq *rq, enum resched_opt opt);
@@ -2482,6 +2483,11 @@ static inline void resched_curr_tick(struct rq *rq)
 	__resched_curr(rq, RESCHED_TICK);
 }
 
+static inline void resched_curr_priority(struct rq *rq)
+{
+	__resched_curr(rq, RESCHED_PRIORITY);
+}
+
 extern void resched_cpu(int cpu);
 
 extern struct rt_bandwidth def_rt_bandwidth;
-- 
2.31.1

Re: [PATCH v2 29/35] sched: handle preempt=voluntary under PREEMPT_AUTO

Posted by Tianchen Ding 1 year, 7 months ago

On 2024/5/28 08:35, Ankur Arora wrote:
> The default preemption policy for voluntary preemption under
> PREEMPT_AUTO is to schedule eagerly for tasks of higher scheduling
> class, and lazily for well-behaved, non-idle tasks.
> 
> This is the same policy as preempt=none, with an eager handling of
> higher priority scheduling classes.
> 
> Comparing a cyclictest workload with a background kernel load of
> 'stress-ng --mmap', shows that both the average and the maximum
> latencies improve:
> 
>   # stress-ng --mmap 0 &
>   # cyclictest --mlockall --smp --priority=80 --interval=200 --distance=0 -q -D 300
> 
>                                       Min     (  %stdev )    Act     (  %stdev )   Avg     (  %stdev )   Max      (  %stdev )
> 
>    PREEMPT_AUTO, preempt=voluntary    1.73  ( +-  25.43% )   62.16 ( +- 303.39% )  14.92 ( +-  17.96% )  2778.22 ( +-  15.04% )
>    PREEMPT_DYNAMIC, preempt=voluntary 1.83  ( +-  20.76% )  253.45 ( +- 233.21% )  18.70 ( +-  15.88% )  2992.45 ( +-  15.95% )
> 
> The table above shows the aggregated latencies across all CPUs.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Ziljstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Originally-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>   kernel/sched/core.c  | 12 ++++++++----
>   kernel/sched/sched.h |  6 ++++++
>   2 files changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c25cccc09b65..2bc3ae21a9d0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1052,6 +1052,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
>   	if (preempt_model_preemptible())
>   		return RESCHED_NOW;
>   
> +	if (preempt_model_voluntary() && opt == RESCHED_PRIORITY)
> +		return RESCHED_NOW;
> +
>   	if (is_idle_task(curr))
>   		return RESCHED_NOW;
>   
> @@ -2289,7 +2292,7 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
>   	if (p->sched_class == rq->curr->sched_class)
>   		rq->curr->sched_class->wakeup_preempt(rq, p, flags);
>   	else if (sched_class_above(p->sched_class, rq->curr->sched_class))
> -		resched_curr(rq);
> +		resched_curr_priority(rq);
>   
Besides the conditions about higher class, can we do resched_curr_priority() in the same class?
For example, in fair class, we can do it when SCHED_NORMAL vs SCHED_IDLE.

Maybe sth like

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41b58387023d..eedb70234bdd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8352,6 +8352,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
  	struct sched_entity *se = &curr->se, *pse = &p->se;
  	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
  	int cse_is_idle, pse_is_idle;
+	enum resched_opt opt = RESCHED_PRIORITY;
  
  	if (unlikely(se == pse))
  		return;
@@ -8385,7 +8386,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
  	/* Idle tasks are by definition preempted by non-idle tasks. */
  	if (unlikely(task_has_idle_policy(curr)) &&
  	    likely(!task_has_idle_policy(p)))
-		goto preempt;
+		goto preempt; /* RESCHED_PRIORITY */
  
  	/*
  	 * Batch and idle tasks do not preempt non-idle tasks (their preemption
@@ -8405,7 +8406,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
  	 * in the inverse case).
  	 */
  	if (cse_is_idle && !pse_is_idle)
-		goto preempt;
+		goto preempt; /* RESCHED_PRIORITY */
  	if (cse_is_idle != pse_is_idle)
  		return;
  
@@ -8415,13 +8416,15 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
  	/*
  	 * XXX pick_eevdf(cfs_rq) != se ?
  	 */
-	if (pick_eevdf(cfs_rq) == pse)
+	if (pick_eevdf(cfs_rq) == pse) {
+		opt = RESCHED_DEFAULT;
  		goto preempt;
+	}
  
  	return;
  
  preempt:
-	resched_curr(rq);
+	__resched_curr(rq, opt);
  }
  
  #ifdef CONFIG_SMP

Re: [PATCH v2 29/35] sched: handle preempt=voluntary under PREEMPT_AUTO

Posted by Ankur Arora 1 year, 7 months ago

Tianchen Ding <dtcccc@linux.alibaba.com> writes:

> On 2024/5/28 08:35, Ankur Arora wrote:
>> The default preemption policy for voluntary preemption under
>> PREEMPT_AUTO is to schedule eagerly for tasks of higher scheduling
>> class, and lazily for well-behaved, non-idle tasks.
>> This is the same policy as preempt=none, with an eager handling of
>> higher priority scheduling classes.
>> Comparing a cyclictest workload with a background kernel load of
>> 'stress-ng --mmap', shows that both the average and the maximum
>> latencies improve:
>>   # stress-ng --mmap 0 &
>>   # cyclictest --mlockall --smp --priority=80 --interval=200 --distance=0 -q -D 300
>>                                       Min     (  %stdev )    Act     (  %stdev
>> )   Avg     (  %stdev )   Max      (  %stdev )
>>    PREEMPT_AUTO, preempt=voluntary    1.73  ( +-  25.43% )   62.16 ( +-
>> 303.39% )  14.92 ( +-  17.96% )  2778.22 ( +-  15.04% )
>>    PREEMPT_DYNAMIC, preempt=voluntary 1.83  ( +-  20.76% )  253.45 ( +- 233.21% )  18.70 ( +-  15.88% )  2992.45 ( +-  15.95% )
>> The table above shows the aggregated latencies across all CPUs.
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Ziljstra <peterz@infradead.org>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Originally-by: Thomas Gleixner <tglx@linutronix.de>
>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>   kernel/sched/core.c  | 12 ++++++++----
>>   kernel/sched/sched.h |  6 ++++++
>>   2 files changed, 14 insertions(+), 4 deletions(-)
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c25cccc09b65..2bc3ae21a9d0 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1052,6 +1052,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
>>   	if (preempt_model_preemptible())
>>   		return RESCHED_NOW;
>>   +	if (preempt_model_voluntary() && opt == RESCHED_PRIORITY)
>> +		return RESCHED_NOW;
>> +
>>   	if (is_idle_task(curr))
>>   		return RESCHED_NOW;
>>   @@ -2289,7 +2292,7 @@ void wakeup_preempt(struct rq *rq, struct task_struct
>> *p, int flags)
>>   	if (p->sched_class == rq->curr->sched_class)
>>   		rq->curr->sched_class->wakeup_preempt(rq, p, flags);
>>   	else if (sched_class_above(p->sched_class, rq->curr->sched_class))
>> -		resched_curr(rq);
>> +		resched_curr_priority(rq);
>>
> Besides the conditions about higher class, can we do resched_curr_priority() in the same class?
> For example, in fair class, we can do it when SCHED_NORMAL vs SCHED_IDLE.

So, I agree about the specific case of SCHED_NORMAL vs SCHED_IDLE.
(And, that case is already handled by resched_opt_translate() explicitly
promoting idle tasks to TIF_NEED_RESCHED.)

But, on the general question of doing resched_curr_priority() in the
same class: I did consider it. But, it seemed to me that we want to
keep run to completion semantics for lazy scheduling, and so not
enforcing priority in a scheduling class was a good line.

(Note that resched_curr_priority(), at least as it stands, is going away
for v3. I'll be folding lazy scheduling as a single model under
PREEMPT_DYNAMIC. So, no separate lazy=none, lazy=voluntary.)

Thanks

--
ankur

Re: [PATCH v2 29/35] sched: handle preempt=voluntary under PREEMPT_AUTO

Posted by Tianchen Ding 1 year, 7 months ago

On 2024/6/22 02:58, Ankur Arora wrote:
> 
> Tianchen Ding <dtcccc@linux.alibaba.com> writes:
> 
>> On 2024/5/28 08:35, Ankur Arora wrote:
>>> The default preemption policy for voluntary preemption under
>>> PREEMPT_AUTO is to schedule eagerly for tasks of higher scheduling
>>> class, and lazily for well-behaved, non-idle tasks.
>>> This is the same policy as preempt=none, with an eager handling of
>>> higher priority scheduling classes.
>>> Comparing a cyclictest workload with a background kernel load of
>>> 'stress-ng --mmap', shows that both the average and the maximum
>>> latencies improve:
>>>    # stress-ng --mmap 0 &
>>>    # cyclictest --mlockall --smp --priority=80 --interval=200 --distance=0 -q -D 300
>>>                                        Min     (  %stdev )    Act     (  %stdev
>>> )   Avg     (  %stdev )   Max      (  %stdev )
>>>     PREEMPT_AUTO, preempt=voluntary    1.73  ( +-  25.43% )   62.16 ( +-
>>> 303.39% )  14.92 ( +-  17.96% )  2778.22 ( +-  15.04% )
>>>     PREEMPT_DYNAMIC, preempt=voluntary 1.83  ( +-  20.76% )  253.45 ( +- 233.21% )  18.70 ( +-  15.88% )  2992.45 ( +-  15.95% )
>>> The table above shows the aggregated latencies across all CPUs.
>>> Cc: Ingo Molnar <mingo@redhat.com>
>>> Cc: Peter Ziljstra <peterz@infradead.org>
>>> Cc: Juri Lelli <juri.lelli@redhat.com>
>>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>>> Originally-by: Thomas Gleixner <tglx@linutronix.de>
>>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>> ---
>>>    kernel/sched/core.c  | 12 ++++++++----
>>>    kernel/sched/sched.h |  6 ++++++
>>>    2 files changed, 14 insertions(+), 4 deletions(-)
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index c25cccc09b65..2bc3ae21a9d0 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -1052,6 +1052,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
>>>    	if (preempt_model_preemptible())
>>>    		return RESCHED_NOW;
>>>    +	if (preempt_model_voluntary() && opt == RESCHED_PRIORITY)
>>> +		return RESCHED_NOW;
>>> +
>>>    	if (is_idle_task(curr))
>>>    		return RESCHED_NOW;
>>>    @@ -2289,7 +2292,7 @@ void wakeup_preempt(struct rq *rq, struct task_struct
>>> *p, int flags)
>>>    	if (p->sched_class == rq->curr->sched_class)
>>>    		rq->curr->sched_class->wakeup_preempt(rq, p, flags);
>>>    	else if (sched_class_above(p->sched_class, rq->curr->sched_class))
>>> -		resched_curr(rq);
>>> +		resched_curr_priority(rq);
>>>
>> Besides the conditions about higher class, can we do resched_curr_priority() in the same class?
>> For example, in fair class, we can do it when SCHED_NORMAL vs SCHED_IDLE.
> 
> So, I agree about the specific case of SCHED_NORMAL vs SCHED_IDLE.
> (And, that case is already handled by resched_opt_translate() explicitly
> promoting idle tasks to TIF_NEED_RESCHED.)
> 
> But, on the general question of doing resched_curr_priority() in the
> same class: I did consider it. But, it seemed to me that we want to
> keep run to completion semantics for lazy scheduling, and so not
> enforcing priority in a scheduling class was a good line.
> 

OK, on general question, this is just a suggestion :-)

Actually, my key point is about SCHED_IDLE. It's not a real idle task, but a 
normal task with lowest priority. So is_idle_task() in resched_opt_translate() 
does not fit it. Should add task_has_idle_policy().

However, even using task_has_idle_policy() may be still not enough. Because 
SCHED_IDLE policy:
   1. It is the lowest priority, but still belongs to fair_sched_class, which is 
the same as SCHED_NORMAL.
   2. Not only tasks, *se of cgroup* can be SCHED_IDLE, too. (introduced by 
commit 304000390f88d)

So in the special case about SCHED_NORMAL vs SCHED_IDLE, I suggest still do some 
work in fair.c.

Thanks.

Re: [PATCH v2 29/35] sched: handle preempt=voluntary under PREEMPT_AUTO

Posted by Ankur Arora 1 year, 7 months ago

Tianchen Ding <dtcccc@linux.alibaba.com> writes:

> On 2024/6/22 02:58, Ankur Arora wrote:
>> Tianchen Ding <dtcccc@linux.alibaba.com> writes:
>>
>>> On 2024/5/28 08:35, Ankur Arora wrote:
>>>> The default preemption policy for voluntary preemption under
>>>> PREEMPT_AUTO is to schedule eagerly for tasks of higher scheduling
>>>> class, and lazily for well-behaved, non-idle tasks.
>>>> This is the same policy as preempt=none, with an eager handling of
>>>> higher priority scheduling classes.
>>>> Comparing a cyclictest workload with a background kernel load of
>>>> 'stress-ng --mmap', shows that both the average and the maximum
>>>> latencies improve:
>>>>    # stress-ng --mmap 0 &
>>>>    # cyclictest --mlockall --smp --priority=80 --interval=200 --distance=0 -q -D 300
>>>>                                        Min     (  %stdev )    Act     (  %stdev
>>>> )   Avg     (  %stdev )   Max      (  %stdev )
>>>>     PREEMPT_AUTO, preempt=voluntary    1.73  ( +-  25.43% )   62.16 ( +-
>>>> 303.39% )  14.92 ( +-  17.96% )  2778.22 ( +-  15.04% )
>>>>     PREEMPT_DYNAMIC, preempt=voluntary 1.83  ( +-  20.76% )  253.45 ( +- 233.21% )  18.70 ( +-  15.88% )  2992.45 ( +-  15.95% )
>>>> The table above shows the aggregated latencies across all CPUs.
>>>> Cc: Ingo Molnar <mingo@redhat.com>
>>>> Cc: Peter Ziljstra <peterz@infradead.org>
>>>> Cc: Juri Lelli <juri.lelli@redhat.com>
>>>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>>>> Originally-by: Thomas Gleixner <tglx@linutronix.de>
>>>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>> ---
>>>>    kernel/sched/core.c  | 12 ++++++++----
>>>>    kernel/sched/sched.h |  6 ++++++
>>>>    2 files changed, 14 insertions(+), 4 deletions(-)
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index c25cccc09b65..2bc3ae21a9d0 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -1052,6 +1052,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
>>>>    	if (preempt_model_preemptible())
>>>>    		return RESCHED_NOW;
>>>>    +	if (preempt_model_voluntary() && opt == RESCHED_PRIORITY)
>>>> +		return RESCHED_NOW;
>>>> +
>>>>    	if (is_idle_task(curr))
>>>>    		return RESCHED_NOW;
>>>>    @@ -2289,7 +2292,7 @@ void wakeup_preempt(struct rq *rq, struct task_struct
>>>> *p, int flags)
>>>>    	if (p->sched_class == rq->curr->sched_class)
>>>>    		rq->curr->sched_class->wakeup_preempt(rq, p, flags);
>>>>    	else if (sched_class_above(p->sched_class, rq->curr->sched_class))
>>>> -		resched_curr(rq);
>>>> +		resched_curr_priority(rq);
>>>>
>>> Besides the conditions about higher class, can we do resched_curr_priority() in the same class?
>>> For example, in fair class, we can do it when SCHED_NORMAL vs SCHED_IDLE.
>> So, I agree about the specific case of SCHED_NORMAL vs SCHED_IDLE.
>> (And, that case is already handled by resched_opt_translate() explicitly
>> promoting idle tasks to TIF_NEED_RESCHED.)
>> But, on the general question of doing resched_curr_priority() in the
>> same class: I did consider it. But, it seemed to me that we want to
>> keep run to completion semantics for lazy scheduling, and so not
>> enforcing priority in a scheduling class was a good line.
>>
>
> OK, on general question, this is just a suggestion :-)
>
> Actually, my key point is about SCHED_IDLE. It's not a real idle task, but a
> normal task with lowest priority. So is_idle_task() in resched_opt_translate()
> does not fit it. Should add task_has_idle_policy().
>
> However, even using task_has_idle_policy() may be still not enough. Because
> SCHED_IDLE policy:
>   1. It is the lowest priority, but still belongs to fair_sched_class, which is
>   the same as SCHED_NORMAL.
>   2. Not only tasks, *se of cgroup* can be SCHED_IDLE, too. (introduced by
>   commit 304000390f88d)

Thanks. That is useful to know. Let me see how best to incorporate this.

Side question: are there any benchmarks that would exercise various types
of sched policy, idle and otherwise?

--
ankur

Re: [PATCH v2 29/35] sched: handle preempt=voluntary under PREEMPT_AUTO

Posted by Tianchen Ding 1 year, 7 months ago

On 2024/6/25 09:12, Ankur Arora wrote:
> 
> Side question: are there any benchmarks that would exercise various types
> of sched policy, idle and otherwise?
> 

AFAIK, no.
May need to combine workloads with different sched policies set by manual...
Or let's see if anyone else can offer suggestions.

Thanks.