[PATCH 1/2] sched/fair: Record the short sleeping time of a task

Chen Yu posted 2 patches 2 years, 4 months ago
There is a newer version of this series
[PATCH 1/2] sched/fair: Record the short sleeping time of a task
Posted by Chen Yu 2 years, 4 months ago
During task wakeup, the wakee firstly checks if its previous
running CPU is idle. If yes, choose that CPU as its first
choice. However, in most cases, the wakee's previous CPU
could be chosen by someone else, which breaks the cache
locality.

Proposes a mechanism to reserve the task's previous
CPU for a short while. In this reservation period, other
tasks are not allowed to pick that CPU until a timeout.
The reservation period is defined as the average short
sleep time of the task. To be more specific, it is the
time delta between this task being dequeued and enqueued.
Only the sleep time shorter than sysctl_sched_migration_cost
will be recorded. If the sleep time is longer than
sysctl_sched_migration_cost, give the reservation period
a penalty by shrinking it to half. In this way, the 'burst'
sleeping time of the task is honored, meanwhile, if that
task becomes a long-sleeper, the reservation time of that
task is shrunk to reduce the impact on task wakeup.

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/sched.h |  3 +++
 kernel/sched/fair.c   | 21 +++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dc37ae787e33..4a0ac0276384 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -561,6 +561,9 @@ struct sched_entity {
 	u64				vruntime;
 	s64				vlag;
 	u64				slice;
+	u64				prev_dequeue_time;
+	/* the reservation period of this task during wakeup */
+	u64				sis_rsv_avg;
 
 	u64				nr_migrations;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0877878bcdb..297b9470829c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6456,6 +6456,24 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct sched_entity *se = &p->se;
 	int idle_h_nr_running = task_has_idle_policy(p);
 	int task_new = !(flags & ENQUEUE_WAKEUP);
+	u64 last_dequeue = p->se.prev_dequeue_time;
+	u64 now = sched_clock_cpu(task_cpu(p));
+
+	/*
+	 * If the task is a short-sleepting task, there is no need
+	 * to migrate it to other CPUs. Estimate the average short sleeping
+	 * time of the wakee. This sleep time is used as a hint to reserve
+	 * the dequeued task's previous CPU for a short while. During this
+	 * reservation period, select_idle_cpu() prevents other wakees from
+	 * choosing this CPU. This could bring a better cache locality.
+	 */
+	if ((flags & ENQUEUE_WAKEUP) && last_dequeue && cpu_online(task_cpu(p)) &&
+	    now > last_dequeue) {
+		if (now - last_dequeue < sysctl_sched_migration_cost)
+			update_avg(&p->se.sis_rsv_avg, now - last_dequeue);
+		else
+			p->se.sis_rsv_avg >>= 1;
+	}
 
 	/*
 	 * The code below (indirectly) updates schedutil which looks at
@@ -6550,6 +6568,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	int task_sleep = flags & DEQUEUE_SLEEP;
 	int idle_h_nr_running = task_has_idle_policy(p);
 	bool was_sched_idle = sched_idle_rq(rq);
+	u64 now;
 
 	util_est_dequeue(&rq->cfs, p);
 
@@ -6611,6 +6630,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
 	hrtick_update(rq);
+	now = sched_clock_cpu(cpu_of(rq));
+	p->se.prev_dequeue_time = task_sleep ? now : 0;
 }
 
 #ifdef CONFIG_SMP
-- 
2.25.1
Re: [PATCH 1/2] sched/fair: Record the short sleeping time of a task
Posted by Aaron Lu 2 years, 4 months ago
On Tue, Sep 26, 2023 at 01:11:02PM +0800, Chen Yu wrote:
> During task wakeup, the wakee firstly checks if its previous
> running CPU is idle. If yes, choose that CPU as its first
> choice. However, in most cases, the wakee's previous CPU
> could be chosen by someone else, which breaks the cache
> locality.
> 
> Proposes a mechanism to reserve the task's previous
> CPU for a short while. In this reservation period, other
> tasks are not allowed to pick that CPU until a timeout.
> The reservation period is defined as the average short
> sleep time of the task. To be more specific, it is the
> time delta between this task being dequeued and enqueued.
> Only the sleep time shorter than sysctl_sched_migration_cost
> will be recorded. If the sleep time is longer than
> sysctl_sched_migration_cost, give the reservation period
> a penalty by shrinking it to half. In this way, the 'burst'
> sleeping time of the task is honored, meanwhile, if that
> task becomes a long-sleeper, the reservation time of that
> task is shrunk to reduce the impact on task wakeup.
> 
> Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>  include/linux/sched.h |  3 +++
>  kernel/sched/fair.c   | 21 +++++++++++++++++++++
>  2 files changed, 24 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index dc37ae787e33..4a0ac0276384 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -561,6 +561,9 @@ struct sched_entity {
>  	u64				vruntime;
>  	s64				vlag;
>  	u64				slice;
> +	u64				prev_dequeue_time;
> +	/* the reservation period of this task during wakeup */
> +	u64				sis_rsv_avg;

Nit: these info are only relavant for task, not group so might be better
to move them to task_struct to save a little memory?

>  
>  	u64				nr_migrations;
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d0877878bcdb..297b9470829c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6456,6 +6456,24 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	struct sched_entity *se = &p->se;
>  	int idle_h_nr_running = task_has_idle_policy(p);
>  	int task_new = !(flags & ENQUEUE_WAKEUP);
> +	u64 last_dequeue = p->se.prev_dequeue_time;
> +	u64 now = sched_clock_cpu(task_cpu(p));

I think cpu_of(rq) is more clear than task_cpu(p). Using task_cpu(p)
seems to suggest task_cpu(p) can be different from cpu_of(rq).

> +
> +	/*
> +	 * If the task is a short-sleepting task, there is no need
> +	 * to migrate it to other CPUs. Estimate the average short sleeping
> +	 * time of the wakee. This sleep time is used as a hint to reserve
> +	 * the dequeued task's previous CPU for a short while. During this
> +	 * reservation period, select_idle_cpu() prevents other wakees from
> +	 * choosing this CPU. This could bring a better cache locality.
> +	 */
> +	if ((flags & ENQUEUE_WAKEUP) && last_dequeue && cpu_online(task_cpu(p)) &&

Hmm...the cpu_online() check looks weird. If the cpu is offlined, no task
will be enqueued there, right?

Thanks,
Aaron

> +	    now > last_dequeue) {
> +		if (now - last_dequeue < sysctl_sched_migration_cost)
> +			update_avg(&p->se.sis_rsv_avg, now - last_dequeue);
> +		else
> +			p->se.sis_rsv_avg >>= 1;
> +	}
>  
>  	/*
>  	 * The code below (indirectly) updates schedutil which looks at
> @@ -6550,6 +6568,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	int task_sleep = flags & DEQUEUE_SLEEP;
>  	int idle_h_nr_running = task_has_idle_policy(p);
>  	bool was_sched_idle = sched_idle_rq(rq);
> +	u64 now;
>  
>  	util_est_dequeue(&rq->cfs, p);
>  
> @@ -6611,6 +6630,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  dequeue_throttle:
>  	util_est_update(&rq->cfs, p, task_sleep);
>  	hrtick_update(rq);
> +	now = sched_clock_cpu(cpu_of(rq));
> +	p->se.prev_dequeue_time = task_sleep ? now : 0;
>  }
>  
>  #ifdef CONFIG_SMP
> -- 
> 2.25.1
>
Re: [PATCH 1/2] sched/fair: Record the short sleeping time of a task
Posted by Chen Yu 2 years, 4 months ago
Hi Aaron,

On 2023-09-27 at 15:53:33 +0800, Aaron Lu wrote:
> On Tue, Sep 26, 2023 at 01:11:02PM +0800, Chen Yu wrote:
> > During task wakeup, the wakee firstly checks if its previous
> > running CPU is idle. If yes, choose that CPU as its first
> > choice. However, in most cases, the wakee's previous CPU
> > could be chosen by someone else, which breaks the cache
> > locality.
> > 
> > Proposes a mechanism to reserve the task's previous
> > CPU for a short while. In this reservation period, other
> > tasks are not allowed to pick that CPU until a timeout.
> > The reservation period is defined as the average short
> > sleep time of the task. To be more specific, it is the
> > time delta between this task being dequeued and enqueued.
> > Only the sleep time shorter than sysctl_sched_migration_cost
> > will be recorded. If the sleep time is longer than
> > sysctl_sched_migration_cost, give the reservation period
> > a penalty by shrinking it to half. In this way, the 'burst'
> > sleeping time of the task is honored, meanwhile, if that
> > task becomes a long-sleeper, the reservation time of that
> > task is shrunk to reduce the impact on task wakeup.
> > 
> > Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > ---
> >  include/linux/sched.h |  3 +++
> >  kernel/sched/fair.c   | 21 +++++++++++++++++++++
> >  2 files changed, 24 insertions(+)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index dc37ae787e33..4a0ac0276384 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -561,6 +561,9 @@ struct sched_entity {
> >  	u64				vruntime;
> >  	s64				vlag;
> >  	u64				slice;
> > +	u64				prev_dequeue_time;
> > +	/* the reservation period of this task during wakeup */
> > +	u64				sis_rsv_avg;
> 
> Nit: these info are only relavant for task, not group so might be better
> to move them to task_struct to save a little memory?
>

Yes, I'll try to do this.
 
> >  
> >  	u64				nr_migrations;
> >  
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d0877878bcdb..297b9470829c 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6456,6 +6456,24 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  	struct sched_entity *se = &p->se;
> >  	int idle_h_nr_running = task_has_idle_policy(p);
> >  	int task_new = !(flags & ENQUEUE_WAKEUP);
> > +	u64 last_dequeue = p->se.prev_dequeue_time;
> > +	u64 now = sched_clock_cpu(task_cpu(p));
> 
> I think cpu_of(rq) is more clear than task_cpu(p). Using task_cpu(p)
> seems to suggest task_cpu(p) can be different from cpu_of(rq).
>

You are right. My original thought is to use task's previous CPU rather than
the current rq. But at this stage the task's CPU has already been updated
to the same as rq. I'll think more about how to deal with this properly.
 
> > +
> > +	/*
> > +	 * If the task is a short-sleepting task, there is no need
> > +	 * to migrate it to other CPUs. Estimate the average short sleeping
> > +	 * time of the wakee. This sleep time is used as a hint to reserve
> > +	 * the dequeued task's previous CPU for a short while. During this
> > +	 * reservation period, select_idle_cpu() prevents other wakees from
> > +	 * choosing this CPU. This could bring a better cache locality.
> > +	 */
> > +	if ((flags & ENQUEUE_WAKEUP) && last_dequeue && cpu_online(task_cpu(p)) &&
> 
> Hmm...the cpu_online() check looks weird. If the cpu is offlined, no task
> will be enqueued there, right?
>

Right. If rq and the task's CPU are the same, there is no need to check cpu_online.

thanks,
Chenyu
Re: [PATCH 1/2] sched/fair: Record the short sleeping time of a task
Posted by Mathieu Desnoyers 2 years, 4 months ago
On 9/26/23 06:11, Chen Yu wrote:
> During task wakeup, the wakee firstly checks if its previous
> running CPU is idle. If yes, choose that CPU as its first
> choice. However, in most cases, the wakee's previous CPU
> could be chosen by someone else, which breaks the cache
> locality.
> 
> Proposes a mechanism to reserve the task's previous
> CPU for a short while. In this reservation period, other
> tasks are not allowed to pick that CPU until a timeout.
> The reservation period is defined as the average short
> sleep time of the task. To be more specific, it is the
> time delta between this task being dequeued and enqueued.
> Only the sleep time shorter than sysctl_sched_migration_cost
> will be recorded. If the sleep time is longer than
> sysctl_sched_migration_cost, give the reservation period
> a penalty by shrinking it to half. In this way, the 'burst'
> sleeping time of the task is honored, meanwhile, if that
> task becomes a long-sleeper, the reservation time of that
> task is shrunk to reduce the impact on task wakeup.
> 
> Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>   include/linux/sched.h |  3 +++
>   kernel/sched/fair.c   | 21 +++++++++++++++++++++
>   2 files changed, 24 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index dc37ae787e33..4a0ac0276384 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -561,6 +561,9 @@ struct sched_entity {
>   	u64				vruntime;
>   	s64				vlag;
>   	u64				slice;
> +	u64				prev_dequeue_time;
> +	/* the reservation period of this task during wakeup */
> +	u64				sis_rsv_avg;
>   
>   	u64				nr_migrations;
>   
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d0877878bcdb..297b9470829c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6456,6 +6456,24 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>   	struct sched_entity *se = &p->se;
>   	int idle_h_nr_running = task_has_idle_policy(p);
>   	int task_new = !(flags & ENQUEUE_WAKEUP);
> +	u64 last_dequeue = p->se.prev_dequeue_time;
> +	u64 now = sched_clock_cpu(task_cpu(p));
> +
> +	/*
> +	 * If the task is a short-sleepting task, there is no need
> +	 * to migrate it to other CPUs. Estimate the average short sleeping
> +	 * time of the wakee. This sleep time is used as a hint to reserve
> +	 * the dequeued task's previous CPU for a short while. During this
> +	 * reservation period, select_idle_cpu() prevents other wakees from
> +	 * choosing this CPU. This could bring a better cache locality.

"This could bring a better cache locality." could be rephrased as
"This improves cache locality for short-sleeping tasks."

Please add my:

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> +	 */
> +	if ((flags & ENQUEUE_WAKEUP) && last_dequeue && cpu_online(task_cpu(p)) &&
> +	    now > last_dequeue) {
> +		if (now - last_dequeue < sysctl_sched_migration_cost)
> +			update_avg(&p->se.sis_rsv_avg, now - last_dequeue);
> +		else
> +			p->se.sis_rsv_avg >>= 1;
> +	}
>   
>   	/*
>   	 * The code below (indirectly) updates schedutil which looks at
> @@ -6550,6 +6568,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>   	int task_sleep = flags & DEQUEUE_SLEEP;
>   	int idle_h_nr_running = task_has_idle_policy(p);
>   	bool was_sched_idle = sched_idle_rq(rq);
> +	u64 now;
>   
>   	util_est_dequeue(&rq->cfs, p);
>   
> @@ -6611,6 +6630,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>   dequeue_throttle:
>   	util_est_update(&rq->cfs, p, task_sleep);
>   	hrtick_update(rq);
> +	now = sched_clock_cpu(cpu_of(rq));
> +	p->se.prev_dequeue_time = task_sleep ? now : 0;
>   }
>   
>   #ifdef CONFIG_SMP

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH 1/2] sched/fair: Record the short sleeping time of a task
Posted by Chen Yu 2 years, 4 months ago
Hi Mathieu,

On 2023-09-26 at 06:29:40 -0400, Mathieu Desnoyers wrote:
> On 9/26/23 06:11, Chen Yu wrote:
> > During task wakeup, the wakee firstly checks if its previous
> > running CPU is idle. If yes, choose that CPU as its first
> > choice. However, in most cases, the wakee's previous CPU
> > could be chosen by someone else, which breaks the cache
> > locality.
> > 
> > Proposes a mechanism to reserve the task's previous
> > CPU for a short while. In this reservation period, other
> > tasks are not allowed to pick that CPU until a timeout.
> > The reservation period is defined as the average short
> > sleep time of the task. To be more specific, it is the
> > time delta between this task being dequeued and enqueued.
> > Only the sleep time shorter than sysctl_sched_migration_cost
> > will be recorded. If the sleep time is longer than
> > sysctl_sched_migration_cost, give the reservation period
> > a penalty by shrinking it to half. In this way, the 'burst'
> > sleeping time of the task is honored, meanwhile, if that
> > task becomes a long-sleeper, the reservation time of that
> > task is shrunk to reduce the impact on task wakeup.
> > 
> > Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > ---
> >   include/linux/sched.h |  3 +++
> >   kernel/sched/fair.c   | 21 +++++++++++++++++++++
> >   2 files changed, 24 insertions(+)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index dc37ae787e33..4a0ac0276384 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -561,6 +561,9 @@ struct sched_entity {
> >   	u64				vruntime;
> >   	s64				vlag;
> >   	u64				slice;
> > +	u64				prev_dequeue_time;
> > +	/* the reservation period of this task during wakeup */
> > +	u64				sis_rsv_avg;
> >   	u64				nr_migrations;
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d0877878bcdb..297b9470829c 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6456,6 +6456,24 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >   	struct sched_entity *se = &p->se;
> >   	int idle_h_nr_running = task_has_idle_policy(p);
> >   	int task_new = !(flags & ENQUEUE_WAKEUP);
> > +	u64 last_dequeue = p->se.prev_dequeue_time;
> > +	u64 now = sched_clock_cpu(task_cpu(p));
> > +
> > +	/*
> > +	 * If the task is a short-sleepting task, there is no need
> > +	 * to migrate it to other CPUs. Estimate the average short sleeping
> > +	 * time of the wakee. This sleep time is used as a hint to reserve
> > +	 * the dequeued task's previous CPU for a short while. During this
> > +	 * reservation period, select_idle_cpu() prevents other wakees from
> > +	 * choosing this CPU. This could bring a better cache locality.
> 
> "This could bring a better cache locality." could be rephrased as
> "This improves cache locality for short-sleeping tasks."
>

OK, will do in the next version.
 
> Please add my:
> 
> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> 

Thanks very much!

best,
Chenyu