[PATCH v2 4/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls

Posted by David Vernet 2 years, 7 months ago
For certain workloads in CFS, CPU utilization is of the utmost
importance. For example, at Meta, our main web workload benefits from a
1 - 1.5% improvement in RPS, and a 1 - 2% improvement in p99 latency,
when CPU utilization is pushed as high as possible.

This is likely something that would be useful for any workload with long
slices, or for which avoiding migration is unlikely to result in
improved cache locality.

We will soon be enabling more aggressive load balancing via a new
feature called shared_runq, which places tasks into a FIFO queue on the
wakeup path, and then dequeues them in newidle_balance(). We don't want
to enable the feature by default, so this patch defines and declares a
new scheduler feature called SHARED_RUNQ which is disabled by default.
In addition, we add some calls to empty / skeleton functions in the
relevant fair codepaths where shared_runq will be utilized.

A set of future patches will implement these functions, and enable
shared_runq for both single and multi socket / CCX architectures.
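
For a rough idea of the intended shape, the shared queue could be as
simple as a lock-protected FIFO list. This is a sketch only; the actual
structure, locking, and per-LLC mapping land in the follow-on patches:

	struct shared_runq {
		struct list_head	list;	/* FIFO of migratable tasks */
		raw_spinlock_t		lock;	/* serializes enqueue / dequeue */
	} ____cacheline_aligned;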

Note as well that in future patches, the location of these calls may
change. For example, if we end up enqueueing tasks in a shared runqueue
any time they become runnable, we'd move the calls from
enqueue_task_fair() and pick_next_task_fair() to __enqueue_entity() and
__dequeue_entity() respectively.

Originally-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: David Vernet <void@manifault.com>
---
 kernel/sched/fair.c     | 32 ++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  1 +
 2 files changed, 33 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e882b7bf5b4..f7967be7646c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -140,6 +140,18 @@ static int __init setup_sched_thermal_decay_shift(char *str)
 __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
 
 #ifdef CONFIG_SMP
+static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
+				     int enq_flags)
+{}
+
+static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
+{
+	return 0;
+}
+
+static void shared_runq_dequeue_task(struct task_struct *p)
+{}
+
 /*
  * For asym packing, by default the lower numbered CPU has higher priority.
  */
@@ -162,6 +174,13 @@ int __weak arch_asym_cpu_priority(int cpu)
  * (default: ~5%)
  */
 #define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
+#else
+static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
+				     int enq_flags)
+{}
+
+static void shared_runq_dequeue_task(struct task_struct *p)
+{}
 #endif
 
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -6386,6 +6405,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!task_new)
 		update_overutilized_status(rq);
 
+	if (sched_feat(SHARED_RUNQ))
+		shared_runq_enqueue_task(rq, p, flags);
+
 enqueue_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
@@ -6467,6 +6489,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
 	hrtick_update(rq);
+
+	if (sched_feat(SHARED_RUNQ))
+		shared_runq_dequeue_task(p);
 }
 
 #ifdef CONFIG_SMP
@@ -8173,6 +8198,9 @@ done: __maybe_unused;
 
 	update_misfit_status(p, rq);
 
+	if (sched_feat(SHARED_RUNQ))
+		shared_runq_dequeue_task(p);
+
 	return p;
 
 idle:
@@ -11843,6 +11871,9 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	if (!cpu_active(this_cpu))
 		return 0;
 
+	if (sched_feat(SHARED_RUNQ) && shared_runq_pick_next_task(this_rq, rf))
+		return -1;
+
 	/*
 	 * We must set idle_stamp _before_ calling idle_balance(), such that we
 	 * measure the duration of idle_balance() as idle time.
@@ -12343,6 +12374,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
 
 static void switched_from_fair(struct rq *rq, struct task_struct *p)
 {
+	shared_runq_dequeue_task(p);
 	detach_task_cfs_rq(p);
 }
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..cd5db1a24181 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -101,3 +101,4 @@ SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+SCHED_FEAT(SHARED_RUNQ, false)
-- 
2.40.1
Re: [PATCH v2 4/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls
Posted by Peter Zijlstra 2 years, 7 months ago
On Mon, Jul 10, 2023 at 03:03:39PM -0500, David Vernet wrote:

> @@ -11843,6 +11871,9 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	if (!cpu_active(this_cpu))
>  		return 0;
>  
> +	if (sched_feat(SHARED_RUNQ) && shared_runq_pick_next_task(this_rq, rf))
> +		return -1;
> +

Next patch implements shared_runq_pick_next_task() with the same return
values as newidle_balance(), which would seem to suggest the following
instead:

	if (sched_feat(SHARED_RUNQ)) {
		pulled_task = shared_runq_pick_next_task(this_rq, rf);
		if (pulled_task)
			return pulled_task;
	}

Additionally, I think we want something like:

	sd = rcu_dereference_check_sched_domain(this_rq->sd);
	if (sched_feat(SHARED_RUNQ)) {
		... /* see above */

		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
		sd = sd->parent;
	}

to ensure we skip <=LLC domains, since those have already been covered
by this shiny new thing, no?
Re: [PATCH v2 4/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls
Posted by David Vernet 2 years, 7 months ago
On Tue, Jul 11, 2023 at 11:45:47AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 10, 2023 at 03:03:39PM -0500, David Vernet wrote:
> 
> > @@ -11843,6 +11871,9 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >  	if (!cpu_active(this_cpu))
> >  		return 0;
> >  
> > +	if (sched_feat(SHARED_RUNQ) && shared_runq_pick_next_task(this_rq, rf))
> > +		return -1;
> > +
> 
> Next patch implements shared_runq_pick_next_task() with the same return
> values as newidle_balance(), which would seem to suggest the following
> instead:
> 
> 	if (sched_feat(SHARED_RUNQ)) {
> 		pulled_task = shared_runq_pick_next_task(this_rq, rf);
> 		if (pulled_task)
> 			return pulled_task;
> 	}

Yep, that's cleaner.

> Additionally, I think we want something like:
> 
> 	sd = rcu_dereference_check_sched_domain(this_rq->sd);
> 	if (sched_feat(SHARED_RUNQ)) {
> 		... /* see above */
> 
> 		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> 		sd = sd->parent;
> 	}
> 
> to ensure we skip <=LLC domains, since those have already been covered
> by this shiny new thing, no?

Indeed, another nice optimization. I'll incorporate both of these into
the next version.
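
That is, roughly something like the following (an untested sketch; the
exact placement relative to the existing locking in newidle_balance()
still needs care):

	/* Early in newidle_balance(), replacing the hunk from this patch: */
	if (sched_feat(SHARED_RUNQ)) {
		pulled_task = shared_runq_pick_next_task(this_rq, rf);
		if (pulled_task)
			return pulled_task;
	}

	/* ... */

	/* And in the existing rcu_read_lock() section: */
	sd = rcu_dereference_check_sched_domain(this_rq->sd);
	if (sched_feat(SHARED_RUNQ)) {
		/* <= LLC domains are already serviced by the shared runq. */
		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
		if (sd)
			sd = sd->parent;
	}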
Re: [PATCH v2 4/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls
Posted by Abel Wu 2 years, 7 months ago
On 7/11/23 4:03 AM, David Vernet wrote:
> @@ -6467,6 +6489,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>   dequeue_throttle:
>   	util_est_update(&rq->cfs, p, task_sleep);
>   	hrtick_update(rq);
> +
> +	if (sched_feat(SHARED_RUNQ))
> +		shared_runq_dequeue_task(p);

Would disabling SHARED_RUNQ cause task list nodes to be left
in the shared stateful runqueue?
Re: [PATCH v2 4/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls
Posted by David Vernet 2 years, 7 months ago
On Wed, Jul 12, 2023 at 04:39:03PM +0800, Abel Wu wrote:
> On 7/11/23 4:03 AM, David Vernet wrote:
> > @@ -6467,6 +6489,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >   dequeue_throttle:
> >   	util_est_update(&rq->cfs, p, task_sleep);
> >   	hrtick_update(rq);
> > +
> > +	if (sched_feat(SHARED_RUNQ))
> > +		shared_runq_dequeue_task(p);
> 
> Would disabling SHARED_RUNQ cause task list nodes to be left
> in the shared stateful runqueue?

Hi Abel,

Yes, good call, there will be some stale tasks. The obvious way to get
around that would of course be to always call
shared_runq_dequeue_task(p) on the __dequeue_entity() path, but it would
be silly to tax a hot path in the scheduler in support of a feature
that's disabled by default.

At first I was thinking that the only issue would be some overhead in
clearing stale tasks once the feature was re-enabled, and that we'd be
OK because of this check in shared_runq_pick_next_task():

	if (task_on_rq_queued(p) && !task_on_cpu(rq, p)) {
		update_rq_clock(src_rq);
		src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
	}

So we wouldn't end up migrating stale tasks that were no longer
suitable. But that reasoning was obviously wrong: it's not safe to keep
stale tasks in that list for (at least) two reasons.

- A task could exit (which would be easy to fix by just adding a dequeue
  call in task_dead_fair(); see the sketch below this list)
- We could have a double-add if a task is re-enqueued in the list after
  having been previously enqueued, but then never dequeued due to the
  timing of disabling SHARED_RUNQ.
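
For the exit case, that fix would be roughly (sketch, on top of this
series):

	static void task_dead_fair(struct task_struct *p)
	{
		/* Drop the exiting task from the shared runq as well. */
		shared_runq_dequeue_task(p);
		remove_entity_load_avg(&p->se);
	}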

Not sure what the best solution is here. We could always address this by
draining the list when the feature is disabled, but there's not yet a
mechanism to hook in a callback to be invoked when a scheduler feature
is enabled/disabled. It shouldn't be too hard to add that, assuming
other sched folks are amenable to it. It should just be a matter of
adding another __SCHED_FEAT_NR-sized table of NULL-able callbacks that
are invoked on enable / disable state change, and which can be specified
via a new SCHED_FEAT_CALLBACK macro or something along those lines.
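
Roughly along these lines (a very rough, untested sketch; all names
below are placeholders):

	/* Hypothetical per-feature enable / disable hook. */
	typedef void (*sched_feat_change_fn)(bool enabled);

	/* e.g. filled in by a new SCHED_FEAT_CALLBACK(SHARED_RUNQ, false, cb) */
	static sched_feat_change_fn sched_feat_cbs[__SCHED_FEAT_NR] = {
		[__SCHED_FEAT_SHARED_RUNQ]	= shared_runq_toggle,
	};

	/* Called from the sched_features knob handling when a bit flips. */
	static void sched_feat_notify(int i, bool enabled)
	{
		if (sched_feat_cbs[i])
			sched_feat_cbs[i](enabled);
	}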

Peter -- what do you think?

Thanks,
David