[RFC][PATCH v14 5/7] sched: Add an initial sketch of the find_proxy_task() function

Posted by John Stultz 1 year, 2 months ago
Add an initial find_proxy_task() function which doesn't do much yet.

When we select a blocked task to run, we will just deactivate it
and pick again. The exception is if the task has become unblocked
after find_proxy_task() was called, in which case we leave it queued
so it can be picked and run.
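
Roughly, the pick loop in __schedule() becomes (condensed from the
diff below, not the literal code):

  pick_again:
      next = pick_next_task(rq, rq->donor, &rf);
      rq_set_donor(rq, next);
      if (task_is_blocked(next)) {
          next = find_proxy_task(rq, next, &rf);
          if (!next) {
              /* donor was deactivated or marked BO_RUNNABLE */
              zap_balance_callbacks(rq);
              goto pick_again;
          }
      }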

Greatly simplified from patch by:
  Peter Zijlstra (Intel) <peterz@infradead.org>
  Juri Lelli <juri.lelli@redhat.com>
  Valentin Schneider <valentin.schneider@arm.com>
  Connor O'Brien <connoro@google.com>

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com
[jstultz: Split out from larger proxy patch and simplified
 for review and testing.]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
* Split out from larger proxy patch
v7:
* Fixed unused function arguments, spelling nits, and tweaks for
  clarity, pointed out by Metin Kaya
* Fix build warning Reported-by: kernel test robot <lkp@intel.com>
  Closes: https://lore.kernel.org/oe-kbuild-all/202311081028.yDLmCWgr-lkp@intel.com/
v8:
* Fixed case where we might return a blocked task from find_proxy_task()
* Continued tweaks to handle avoiding returning blocked tasks
v9:
* Add zap_balance_callbacks() helper to unwind balance callbacks
  when we need to call pick_next_task() again.
* Add extra comment suggested by Metin
* Typo fixes from Metin
* Moved adding proxy_resched_idle earlier in the series, as suggested
  by Metin
* Fix to call proxy_resched_idle() *prior* to deactivating next, to avoid
  crashes caused by stale references to next
* s/PROXY/SCHED_PROXY_EXEC/ as suggested by Metin
* Number of tweaks and cleanups suggested by Metin
* Simplify proxy_deactivate as suggested by Metin
v11:
* Tweaks for earlier simplification in try_to_deactivate_task
v13:
* Rename "next" to "donor" in find_proxy_task() for clarity
* Similarly use "donor" instead of next in proxy_deactivate
* Refactor/simplify proxy_resched_idle
* Moved up a needed fix from later in the series
---
 kernel/sched/core.c  | 129 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/rt.c    |  15 ++++-
 kernel/sched/sched.h |  10 +++-
 3 files changed, 148 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f8714050b6d0d..b492506d33415 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5052,6 +5052,34 @@ static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
 	}
 }
 
+/*
+ * Only called from __schedule context
+ *
+ * There are some cases where we are going to re-do the action
+ * that added the balance callbacks. We may not be in a state
+ * where we can run them, so just zap them so they can be
+ * properly re-added on the next time around. This is similar
+ * handling to running the callbacks, except we just don't call
+ * them.
+ */
+static void zap_balance_callbacks(struct rq *rq)
+{
+	struct balance_callback *next, *head;
+	bool found = false;
+
+	lockdep_assert_rq_held(rq);
+
+	head = rq->balance_callback;
+	while (head) {
+		if (head == &balance_push_callback)
+			found = true;
+		next = head->next;
+		head->next = NULL;
+		head = next;
+	}
+	rq->balance_callback = found ? &balance_push_callback : NULL;
+}
+
 static void balance_push(struct rq *rq);
 
 /*
@@ -6592,7 +6620,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
  * Otherwise marks the task's __state as RUNNING
  */
 static bool try_to_block_task(struct rq *rq, struct task_struct *p,
-			      unsigned long task_state)
+			      unsigned long task_state, bool deactivate_cond)
 {
 	int flags = DEQUEUE_NOCLOCK;
 
@@ -6601,6 +6629,9 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 		return false;
 	}
 
+	if (!deactivate_cond)
+		return false;
+
 	p->sched_contributes_to_load =
 		(task_state & TASK_UNINTERRUPTIBLE) &&
 		!(task_state & TASK_NOLOAD) &&
@@ -6624,6 +6655,88 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	return true;
 }
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+
+static inline struct task_struct *
+proxy_resched_idle(struct rq *rq)
+{
+	put_prev_task(rq, rq->donor);
+	rq_set_donor(rq, rq->idle);
+	set_next_task(rq, rq->idle);
+	set_tsk_need_resched(rq->idle);
+	return rq->idle;
+}
+
+static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
+{
+	unsigned long state = READ_ONCE(donor->__state);
+
+	/* Don't deactivate if the state has been changed to TASK_RUNNING */
+	if (state == TASK_RUNNING)
+		return false;
+	/*
+	 * Because we got donor from pick_next_task, it is *crucial*
+	 * that we call proxy_resched_idle before we deactivate it.
+	 * As once we deactivate donor, donor->on_rq is set to zero,
+	 * which allows ttwu to immediately try to wake the task on
+	 * another rq. So we cannot use *any* references to donor
+	 * after that point. So things like cfs_rq->curr or rq->donor
+	 * need to be changed from next *before* we deactivate.
+	 */
+	proxy_resched_idle(rq);
+	return try_to_block_task(rq, donor, state, true);
+}
+
+/*
+ * Initial simple proxy that just returns the task if it's waking
+ * or deactivates the blocked task so we can pick something that
+ * isn't blocked.
+ */
+static struct task_struct *
+find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+{
+	struct task_struct *p = donor;
+	struct mutex *mutex;
+
+	mutex = p->blocked_on;
+	/* Something changed in the chain, so pick again */
+	if (!mutex)
+		return NULL;
+	/*
+	 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
+	 * and ensure @owner sticks around.
+	 */
+	raw_spin_lock(&mutex->wait_lock);
+	raw_spin_lock(&p->blocked_lock);
+
+	/* Check again that p is blocked with blocked_lock held */
+	if (!task_is_blocked(p) || mutex != get_task_blocked_on(p)) {
+		/*
+		 * Something changed in the blocked_on chain and
+		 * we don't know if only at this level. So, let's
+		 * just bail out completely and let __schedule
+		 * figure things out (pick_again loop).
+		 */
+		goto out;
+	}
+	if (!proxy_deactivate(rq, donor))
+		/* XXX: This hack won't work when we get to migrations */
+		donor->blocked_on_state = BO_RUNNABLE;
+
+out:
+	raw_spin_unlock(&p->blocked_lock);
+	raw_spin_unlock(&mutex->wait_lock);
+	return NULL;
+}
+#else /* SCHED_PROXY_EXEC */
+static struct task_struct *
+find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+{
+	WARN_ONCE(1, "This should never be called in the !SCHED_PROXY_EXEC case\n");
+	return donor;
+}
+#endif /* SCHED_PROXY_EXEC */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6732,12 +6845,22 @@ static void __sched notrace __schedule(int sched_mode)
 			goto picked;
 		}
 	} else if (!preempt && prev_state) {
-		block = try_to_block_task(rq, prev, prev_state);
+		block = try_to_block_task(rq, prev, prev_state,
+					  !task_is_blocked(prev));
 		switch_count = &prev->nvcsw;
 	}
 
-	next = pick_next_task(rq, prev, &rf);
+pick_again:
+	next = pick_next_task(rq, rq->donor, &rf);
 	rq_set_donor(rq, next);
+	if (unlikely(task_is_blocked(next))) {
+		next = find_proxy_task(rq, next, &rf);
+		if (!next) {
+			/* zap the balance_callbacks before picking again */
+			zap_balance_callbacks(rq);
+			goto pick_again;
+		}
+	}
 picked:
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index bd66a46b06aca..fa4d9bf76ad49 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1479,8 +1479,19 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 
 	enqueue_rt_entity(rt_se, flags);
 
-	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
-		enqueue_pushable_task(rq, p);
+	/*
+	 * Current can't be pushed away. Selected is tied to current,
+	 * so don't push it either.
+	 */
+	if (task_current(rq, p) || task_current_donor(rq, p))
+		return;
+	/*
+	 * Pinned tasks can't be pushed.
+	 */
+	if (p->nr_cpus_allowed == 1)
+		return;
+
+	enqueue_pushable_task(rq, p);
 }
 
 static bool dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24eae02ddc7f6..f560d1d1a7a0c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2272,6 +2272,14 @@ static inline int task_current_donor(struct rq *rq, struct task_struct *p)
 	return rq->donor == p;
 }
 
+static inline bool task_is_blocked(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return false;
+
+	return !!p->blocked_on && p->blocked_on_state != BO_RUNNABLE;
+}
+
 static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
 {
 #ifdef CONFIG_SMP
@@ -2481,7 +2489,7 @@ static inline void put_prev_set_next_task(struct rq *rq,
 					  struct task_struct *prev,
 					  struct task_struct *next)
 {
-	WARN_ON_ONCE(rq->curr != prev);
+	WARN_ON_ONCE(rq->donor != prev);
 
 	__put_prev_set_next_dl_server(rq, prev, next);
 
-- 
2.47.0.371.ga323438b13-goog
Re: [RFC][PATCH v14 5/7] sched: Add an initial sketch of the find_proxy_task() function
Posted by Peter Zijlstra 1 year, 1 month ago
On Mon, Nov 25, 2024 at 11:51:59AM -0800, John Stultz wrote:

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f8714050b6d0d..b492506d33415 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5052,6 +5052,34 @@ static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
>  	}
>  }
>  
> +/*
> + * Only called from __schedule context
> + *
> + * There are some cases where we are going to re-do the action
> + * that added the balance callbacks. We may not be in a state
> + * where we can run them, so just zap them so they can be
> + * properly re-added on the next time around. This is similar
> + * handling to running the callbacks, except we just don't call
> + * them.
> + */

Which specific callbacks are these? sched_core_balance()?

In general, shooting down all callbacks like this makes me feel somewhat
uncomfortable.

> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +
> +static inline struct task_struct *
> +proxy_resched_idle(struct rq *rq)
> +{
> +	put_prev_task(rq, rq->donor);
> +	rq_set_donor(rq, rq->idle);
> +	set_next_task(rq, rq->idle);
> +	set_tsk_need_resched(rq->idle);
> +	return rq->idle;
> +}
> +
> +static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
> +{
> +	unsigned long state = READ_ONCE(donor->__state);
> +
> +	/* Don't deactivate if the state has been changed to TASK_RUNNING */
> +	if (state == TASK_RUNNING)
> +		return false;
> +	/*
> +	 * Because we got donor from pick_next_task, it is *crucial*
> +	 * that we call proxy_resched_idle before we deactivate it.
> +	 * As once we deactivate donor, donor->on_rq is set to zero,
> +	 * which allows ttwu to immediately try to wake the task on
> +	 * another rq. So we cannot use *any* references to donor
> +	 * after that point. So things like cfs_rq->curr or rq->donor
> +	 * need to be changed from next *before* we deactivate.
> +	 */
> +	proxy_resched_idle(rq);
> +	return try_to_block_task(rq, donor, state, true);
> +}
> +
> +/*
> + * Initial simple proxy that just returns the task if it's waking
> + * or deactivates the blocked task so we can pick something that
> + * isn't blocked.
> + */
> +static struct task_struct *
> +find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> +{
> +	struct task_struct *p = donor;
> +	struct mutex *mutex;
> +
> +	mutex = p->blocked_on;
> +	/* Something changed in the chain, so pick again */
> +	if (!mutex)
> +		return NULL;
> +	/*
> +	 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
> +	 * and ensure @owner sticks around.
> +	 */
> +	raw_spin_lock(&mutex->wait_lock);
> +	raw_spin_lock(&p->blocked_lock);

I'm still wondering what this blocked_lock does; that previous patch had
it mirror wait_mutex too, and so far I don't see the point.

> +
> +	/* Check again that p is blocked with blocked_lock held */
> +	if (!task_is_blocked(p) || mutex != get_task_blocked_on(p)) {
> +		/*
> +		 * Something changed in the blocked_on chain and
> +		 * we don't know if only at this level. So, let's
> +		 * just bail out completely and let __schedule
> +		 * figure things out (pick_again loop).
> +		 */
> +		goto out;
> +	}
> +	if (!proxy_deactivate(rq, donor))
> +		/* XXX: This hack won't work when we get to migrations */
> +		donor->blocked_on_state = BO_RUNNABLE;
> +
> +out:
> +	raw_spin_unlock(&p->blocked_lock);
> +	raw_spin_unlock(&mutex->wait_lock);
> +	return NULL;
> +}
> +#else /* SCHED_PROXY_EXEC */
> +static struct task_struct *
> +find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> +{
> +	WARN_ONCE(1, "This should never be called in the !SCHED_PROXY_EXEC case\n");
> +	return donor;
> +}
> +#endif /* SCHED_PROXY_EXEC */
> +
>  /*
>   * __schedule() is the main scheduler function.
>   *
> @@ -6732,12 +6845,22 @@ static void __sched notrace __schedule(int sched_mode)
>  			goto picked;
>  		}
>  	} else if (!preempt && prev_state) {
> -		block = try_to_block_task(rq, prev, prev_state);
> +		block = try_to_block_task(rq, prev, prev_state,
> +					  !task_is_blocked(prev));
>  		switch_count = &prev->nvcsw;
>  	}
>  
> -	next = pick_next_task(rq, prev, &rf);
> +pick_again:
> +	next = pick_next_task(rq, rq->donor, &rf);
>  	rq_set_donor(rq, next);
> +	if (unlikely(task_is_blocked(next))) {
> +		next = find_proxy_task(rq, next, &rf);
> +		if (!next) {
> +			/* zap the balance_callbacks before picking again */
> +			zap_balance_callbacks(rq);
> +			goto pick_again;
> +		}
> +	}
>  picked:
>  	clear_tsk_need_resched(prev);
>  	clear_preempt_need_resched();
Re: [RFC][PATCH v14 5/7] sched: Add an initial sketch of the find_proxy_task() function
Posted by John Stultz 1 year, 1 month ago
On Fri, Dec 13, 2024 at 4:06 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Nov 25, 2024 at 11:51:59AM -0800, John Stultz wrote:
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f8714050b6d0d..b492506d33415 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5052,6 +5052,34 @@ static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
> >       }
> >  }
> >
> > +/*
> > + * Only called from __schedule context
> > + *
> > + * There are some cases where we are going to re-do the action
> > + * that added the balance callbacks. We may not be in a state
> > + * where we can run them, so just zap them so they can be
> > + * properly re-added on the next time around. This is similar
> > + * handling to running the callbacks, except we just don't call
> > + * them.
> > + */
>
> Which specific callbacks are these? sched_core_balance()?
>
> In general, shooting down all callbacks like this makes me feel somewhat
> uncomfortable.

So, if we originally picked an RT task, I believe it would set up the
push_rt_tasks callback, but if it got migrated and we needed to
pick again, we'd end up tripping on
`SCHED_WARN_ON(rq->balance_callback && rq->balance_callback !=
&balance_push_callback);`

For a while I tried to unpin and run the balance callbacks before
calling pick_again if find_proxy_task() failed, but that was running
into trouble with tasks getting unintentionally added to the rt
pushable list (this was back in ~Feb, so my memory is a little fuzzy).

So that's when I figured zapping the callbacks would be best, with the
idea being that we are starting selection over, so we effectively have
to undo any of the state that was set by pick_next_task() before
calling it again.
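
Roughly, the sequence I was hitting looked like (a from-memory sketch,
not verbatim code):

  __schedule()
    pick_next_task()      // picked an RT donor; the RT pick path queued
                          //   the push_rt_tasks balance callback
    find_proxy_task()     // donor was blocked -> deactivate, return NULL
    goto pick_again
    pick_next_task()      // re-pick with the stale callback still queued
                          //   -> trips the SCHED_WARN_ON() above

(zap_balance_callbacks() clears those stale entries but preserves
balance_push_callback, so the re-pick can queue them fresh.)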

Let me know if you have concerns with this, or suggestions for other approaches.

> > +/*
> > + * Initial simple proxy that just returns the task if it's waking
> > + * or deactivates the blocked task so we can pick something that
> > + * isn't blocked.
> > + */
> > +static struct task_struct *
> > +find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> > +{
> > +     struct task_struct *p = donor;
> > +     struct mutex *mutex;
> > +
> > +     mutex = p->blocked_on;
> > +     /* Something changed in the chain, so pick again */
> > +     if (!mutex)
> > +             return NULL;
> > +     /*
> > +      * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
> > +      * and ensure @owner sticks around.
> > +      */
> > +     raw_spin_lock(&mutex->wait_lock);
> > +     raw_spin_lock(&p->blocked_lock);
>
> I'm still wondering what this blocked_lock does; that previous patch had
> it mirror wait_mutex too, and so far I don't see the point.

Yeah, early on in the series it's maybe not as useful, but as we start
dealing with sleeping owner enqueuing, it's doing more:
  https://github.com/johnstultz-work/linux-dev/commit/d594ca8df88645aa3b2b9daa105664893818bdb7

But it is possible it is more of a crutch for me to keep straight the
locking rules as it's simpler to keep in my head. :)
Happy to think a bit more on if it can be folded together with another lock.
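
For reference, the lock nesting this patch uses (just restating the
code above, nothing new) is:

  raw_spin_lock(&mutex->wait_lock);   /* holds off mutex_unlock(); @owner sticks around */
  raw_spin_lock(&p->blocked_lock);    /* stabilizes p->blocked_on / blocked_on_state */
  ...
  raw_spin_unlock(&p->blocked_lock);
  raw_spin_unlock(&mutex->wait_lock);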

Thanks again for the review and thoughts here!

thanks
-john
Re: [RFC][PATCH v14 5/7] sched: Add an initial sketch of the find_proxy_task() function
Posted by Peter Zijlstra 1 year, 1 month ago
On Mon, Dec 16, 2024 at 09:42:31PM -0800, John Stultz wrote:
> On Fri, Dec 13, 2024 at 4:06 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Nov 25, 2024 at 11:51:59AM -0800, John Stultz wrote:
> >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index f8714050b6d0d..b492506d33415 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -5052,6 +5052,34 @@ static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
> > >       }
> > >  }
> > >
> > > +/*
> > > + * Only called from __schedule context
> > > + *
> > > + * There are some cases where we are going to re-do the action
> > > + * that added the balance callbacks. We may not be in a state
> > > + * where we can run them, so just zap them so they can be
> > > + * properly re-added on the next time around. This is similar
> > > + * handling to running the callbacks, except we just don't call
> > > + * them.
> > > + */
> >
> > Which specific callbacks are these? sched_core_balance()?
> >
> > In general, shooting down all callbacks like this makes me feel somewhat
> > uncomfortable.
> 
> So, if we originally picked an RT task, I believe it would set up the
> push_rt_tasks callback, but if it got migrated and we needed to
> pick again, we'd end up tripping on
> `SCHED_WARN_ON(rq->balance_callback && rq->balance_callback !=
> &balance_push_callback);`
> 
> For a while I tried to unpin and run the balance callbacks before
> calling pick_again if find_proxy_task() failed, but that was running
> into trouble with tasks getting unintentionally added to the rt
> pushable list (this was back in ~Feb, so my memory is a little fuzzy).
> 
> So that's when I figured zapping the callbacks would be best, with the
> idea being that we are starting selection over, so we effectively have
> to undo any of the state that was set by pick_next_task() before
> calling it again.
> 
> Let me know if you have concerns with this, or suggestions for other approaches.

For now, let's stick a coherent comment on it, explaining exactly which
callbacks and why.
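
Something like this, maybe (rough wording based on the explanation
above):

  /*
   * Only called from __schedule() context.
   *
   * If find_proxy_task() fails we will re-run pick_next_task(), and the
   * class pick will re-queue its balance callbacks (e.g. the RT pick
   * queues push_rt_tasks via rt_queue_push_tasks()). Clear the stale
   * entries, preserving balance_push_callback, so the re-pick can add
   * them back cleanly instead of tripping the SCHED_WARN_ON() in the
   * pick path.
   */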

> > > +/*
> > > + * Initial simple proxy that just returns the task if it's waking
> > > + * or deactivates the blocked task so we can pick something that
> > > + * isn't blocked.
> > > + */
> > > +static struct task_struct *
> > > +find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> > > +{
> > > +     struct task_struct *p = donor;
> > > +     struct mutex *mutex;
> > > +
> > > +     mutex = p->blocked_on;
> > > +     /* Something changed in the chain, so pick again */
> > > +     if (!mutex)
> > > +             return NULL;
> > > +     /*
> > > +      * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
> > > +      * and ensure @owner sticks around.
> > > +      */
> > > +     raw_spin_lock(&mutex->wait_lock);
> > > +     raw_spin_lock(&p->blocked_lock);
> >
> > I'm still wondering what this blocked_lock does; that previous patch had
> > it mirror wait_mutex too, and so far I don't see the point.
> 
> Yeah, early on in the series it's maybe not as useful, but as we start
> dealing with sleeping owner enqueuing, it's doing more:
>   https://github.com/johnstultz-work/linux-dev/commit/d594ca8df88645aa3b2b9daa105664893818bdb7
> 
> But it is possible it is more of a crutch for me to keep straight the
> locking rules as it's simpler to keep in my head. :)
> Happy to think a bit more on if it can be folded together with another lock.

I'm a big believer in only introducing state when we actually need it --
and I don't believe we actually need blocked_lock until we go SMP.

Anyway, I have since figured out the why of blocked_lock again; but
yeah, comments, because I'm sure to forget it again at some point.