[PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals

Posted by Mel Gorman 3 months ago
Reimplement NEXT_BUDDY preemption to take into account the deadline and
eligibility of the wakee with respect to the waker. In the event
multiple buddies could be considered, the one with the earliest deadline
is selected.

Sync wakeups are treated differently to every other type of wakeup. The
WF_SYNC assumption is that the waker promises to sleep in the very near
future. This is violated in enough cases that WF_SYNC should be treated
as a suggestion instead of a contract. If a waker does go to sleep almost
immediately then the delay in wakeup is negligible. In all other cases,
preemption is throttled based on the accumulated runtime of the waker so
there is a chance that some batched wakeups have been issued before the
waker is preempted.

For all other wakeups, preemption happens if the wakee has an earlier
deadline than the waker and is eligible to run.
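
As a rough sketch (condensed from the code below rather than the literal
implementation, which routes the decision through the next buddy and
__pick_eevdf()), the rule for a non-sync wakeup amounts to:

	/*
	 * Sketch only: the wakee preempts when it is eligible and has
	 * an earlier virtual deadline than the waker.
	 */
	static bool wakee_preempts(struct cfs_rq *cfs_rq,
				   struct sched_entity *se,  /* waker */
				   struct sched_entity *pse) /* wakee */
	{
		if (pse->sched_delayed || !entity_eligible(cfs_rq, pse))
			return false;

		return entity_before(pse, se);	/* earlier deadline */
	}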

While many workloads were tested, the two main targets were a modified
dbench4 benchmark and hackbench because they are on opposite ends of the
spectrum -- one prefers throughput by avoiding preemption and the other
relies on preemption.

First is the dbench throughput data; it is a poor metric but it is the
default one. The test machine is a 2-socket machine and the
backing filesystem is XFS as a lot of the IO work is dispatched to kernel
threads. It's important to note that these results are not representative
across all machines, especially Zen machines, as different bottlenecks
are exposed on different machines and filesystems.

dbench4 Throughput (misleading but traditional)
                            6.18-rc1               6.18-rc1
                             vanilla   sched-preemptnext-v4
Hmean     1       1268.80 (   0.00%)     1268.92 (   0.01%)
Hmean     4       3971.74 (   0.00%)     3952.44 (  -0.49%)
Hmean     7       5548.23 (   0.00%)     5375.49 *  -3.11%*
Hmean     12      7310.86 (   0.00%)     7080.80 (  -3.15%)
Hmean     21      8874.53 (   0.00%)     9044.30 (   1.91%)
Hmean     30      9361.93 (   0.00%)    10430.80 *  11.42%*
Hmean     48      9540.14 (   0.00%)    11786.28 *  23.54%*
Hmean     79      9208.74 (   0.00%)    11910.07 *  29.33%*
Hmean     110     8573.12 (   0.00%)    11571.03 *  34.97%*
Hmean     141     7791.33 (   0.00%)    11151.69 *  43.13%*
Hmean     160     7666.60 (   0.00%)    10661.53 *  39.06%*

As throughput is misleading, the benchmark is modified to use a short
loadfile and report the completion time in milliseconds.

dbench4 Loadfile Execution Time
                             6.18-rc1               6.18-rc1
                              vanilla   sched-preemptnext-v4
Amean      1         14.62 (   0.00%)       14.69 (  -0.50%)
Amean      4         18.76 (   0.00%)       18.88 (  -0.63%)
Amean      7         23.71 (   0.00%)       24.52 (  -3.43%)
Amean      12        31.25 (   0.00%)       32.20 (  -3.03%)
Amean      21        45.12 (   0.00%)       44.19 (   2.05%)
Amean      30        61.07 (   0.00%)       54.74 (  10.35%)
Amean      48        95.91 (   0.00%)       77.52 (  19.18%)
Amean      79       163.38 (   0.00%)      126.04 (  22.85%)
Amean      110      243.91 (   0.00%)      179.93 (  26.23%)
Amean      141      343.47 (   0.00%)      240.03 (  30.12%)
Amean      160      401.15 (   0.00%)      284.92 (  28.98%)

That is still looking good and the variance is reduced quite a bit with
the caveat that different CPU families may yield different results.

Fairness is a concern so the next report tracks how many milliseconds
it takes for all clients to complete a loadfile. This one is tricky
because dbench makes no effort to synchronise clients so the durations at
benchmark start time differ substantially from typical runtimes. Depending
on the configuration, the benchmark can be dominated by the timing of
fsync of different clients. This problem could be mitigated by warming up
the benchmark for a number of minutes but it's a matter of opinion whether
that counts as an evasion of inconvenient results.

dbench4 All Clients Loadfile Execution Time
                             6.18-rc1               6.18-rc1
                              vanilla   sched-preemptnext-v4
Amean      1         15.06 (   0.00%)       15.10 (  -0.27%)
Amean      4        603.81 (   0.00%)     1451.94 (-140.46%)
Amean      7        855.32 (   0.00%)     1116.68 ( -30.56%)
Amean      12      1890.02 (   0.00%)     2075.98 (  -9.84%)
Amean      21      3195.23 (   0.00%)     2508.86 (  21.48%)
Amean      30     13919.53 (   0.00%)     3324.58 (  76.12%)
Amean      48     25246.07 (   0.00%)     4146.10 (  83.58%)
Amean      79     29701.84 (   0.00%)    21338.68 (  28.16%)
Amean      110    22803.03 (   0.00%)    29502.47 ( -29.38%)
Amean      141    36356.07 (   0.00%)    25292.07 (  30.43%)
Amean      160    17046.71 (   0.00%)    27043.67 ( -58.64%)
Stddev     1          0.47 (   0.00%)        0.46 (   2.84%)
Stddev     4        395.24 (   0.00%)      782.96 ( -98.10%)
Stddev     7        467.24 (   0.00%)      738.26 ( -58.00%)
Stddev     12      1071.43 (   0.00%)     1124.78 (  -4.98%)
Stddev     21      1694.50 (   0.00%)     1463.27 (  13.65%)
Stddev     30      7945.63 (   0.00%)     1889.19 (  76.22%)
Stddev     48     14339.51 (   0.00%)     2226.59 (  84.47%)
Stddev     79     16620.91 (   0.00%)    12027.14 (  27.64%)
Stddev     110    12912.15 (   0.00%)    16601.77 ( -28.57%)
Stddev     141    20700.13 (   0.00%)    13624.50 (  34.18%)
Stddev     160     9079.16 (   0.00%)    15729.63 ( -73.25%)

This is more of a mixed bag but it at least shows that fairness
is not crippled.

The hackbench results are more neutral but this is still important.
It's possible to boost the dbench figures by a large amount but only by
crippling the performance of a workload like hackbench.

hackbench-process-pipes
                          6.18-rc1             6.18-rc1
                             vanilla   sched-preemptnext-v4
Amean     1        0.2657 (   0.00%)      0.2120 (  20.20%)
Amean     4        0.6107 (   0.00%)      0.6120 (  -0.22%)
Amean     7        0.7923 (   0.00%)      0.7307 (   7.78%)
Amean     12       1.1500 (   0.00%)      1.1480 (   0.17%)
Amean     21       1.7950 (   0.00%)      1.8207 (  -1.43%)
Amean     30       2.3207 (   0.00%)      2.5340 *  -9.19%*
Amean     48       3.5023 (   0.00%)      4.0047 * -14.34%*
Amean     79       4.8093 (   0.00%)      5.3137 * -10.49%*
Amean     110      6.1160 (   0.00%)      6.5287 *  -6.75%*
Amean     141      7.4763 (   0.00%)      7.9553 *  -6.41%*
Amean     172      8.9560 (   0.00%)      9.3977 *  -4.93%*
Amean     203     10.4783 (   0.00%)     11.0160 *  -5.13%*
Amean     234     12.4977 (   0.00%)     13.5510 (  -8.43%)
Amean     265     14.7003 (   0.00%)     15.6950 *  -6.77%*
Amean     296     16.1007 (   0.00%)     17.0860 *  -6.12%*

Processes using pipes are impacted but the variance (not presented)
is bad enough that it's close to noise. These results are not always
reproducible; across multiple reboots the test may show neutral results or
small gains, so the worst measured results are presented here.

Hackbench using sockets is more reliably neutral as the wakeup
mechanisms are different between sockets and pipes.

hackbench-process-sockets
                          6.18-rc1             6.18-rc1
                             vanilla   sched-preemptnext-v4
Amean     1        0.3073 (   0.00%)      0.3187 (  -3.69%)
Amean     4        0.7863 (   0.00%)      0.7850 (   0.17%)
Amean     7        1.3670 (   0.00%)      1.3747 (  -0.56%)
Amean     12       2.1337 (   0.00%)      2.1450 (  -0.53%)
Amean     21       3.4683 (   0.00%)      3.4293 (   1.12%)
Amean     30       4.7247 (   0.00%)      4.8000 (  -1.59%)
Amean     48       7.6097 (   0.00%)      7.6943 (  -1.11%)
Amean     79      14.7957 (   0.00%)     15.3353 (  -3.65%)
Amean     110     21.3413 (   0.00%)     21.5753 (  -1.10%)
Amean     141     29.0503 (   0.00%)     28.8740 (   0.61%)
Amean     172     36.4660 (   0.00%)     35.7207 (   2.04%)
Amean     203     39.7177 (   0.00%)     39.7343 (  -0.04%)
Amean     234     42.1120 (   0.00%)     43.0177 (  -2.15%)
Amean     265     45.7830 (   0.00%)     46.9863 (  -2.63%)
Amean     296     50.7043 (   0.00%)     50.7017 (   0.01%)

As schbench has been mentioned in numerous bugs recently, the results
are interesting. A test case that represents the default schbench
behaviour is shown below.

schbench Wakeup Latency (usec)
                                       6.18.0-rc1             6.18.0-rc1
                                          vanilla   sched-preemptnext-v4
Amean     Wakeup-50th-80          7.17 (   0.00%)        6.00 *  16.28%*
Amean     Wakeup-90th-80         46.56 (   0.00%)       19.61 *  57.88%*
Amean     Wakeup-99th-80        119.61 (   0.00%)       87.28 *  27.03%*
Amean     Wakeup-99.9th-80     3193.78 (   0.00%)      367.00 *  88.51%*
Amean     Wakeup-max-80        3874.28 (   0.00%)     3951.39 (  -1.99%)


schbench Requests Per Second (ops/sec)
                                  6.18.0-rc1             6.18.0-rc1
                                     vanilla sched-preemptnext-v4r1
Hmean     RPS-20th-80     8900.91 (   0.00%)     9167.86 *   3.00%*
Hmean     RPS-50th-80     8987.41 (   0.00%)     9213.06 *   2.51%*
Hmean     RPS-90th-80     9123.73 (   0.00%)     9263.80 (   1.54%)
Hmean     RPS-max-80      9193.50 (   0.00%)     9282.18 (   0.96%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 145 +++++++++++++++++++++++++++++++++++++-------
 1 file changed, 123 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc0b7ce8a65d..688bf010f444 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -955,6 +955,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
 	if (cfs_rq->nr_queued == 1)
 		return curr && curr->on_rq ? curr : se;
 
+	/*
+	 * Picking the ->next buddy will affect latency but not fairness.
+	 */
+	if (sched_feat(PICK_BUDDY) &&
+	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
+		/* ->next will never be delayed */
+		WARN_ON_ONCE(cfs_rq->next->sched_delayed);
+		return cfs_rq->next;
+	}
+
 	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
 		curr = NULL;
 
@@ -1193,6 +1203,8 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
 	return delta_exec;
 }
 
+static void set_next_buddy(struct sched_entity *se);
+
 /*
  * Used by other classes to account runtime.
  */
@@ -5512,16 +5524,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *se;
 
-	/*
-	 * Picking the ->next buddy will affect latency but not fairness.
-	 */
-	if (sched_feat(PICK_BUDDY) &&
-	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
-		/* ->next will never be delayed */
-		WARN_ON_ONCE(cfs_rq->next->sched_delayed);
-		return cfs_rq->next;
-	}
-
 	se = pick_eevdf(cfs_rq);
 	if (se->sched_delayed) {
 		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
@@ -7028,8 +7030,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	hrtick_update(rq);
 }
 
-static void set_next_buddy(struct sched_entity *se);
-
 /*
  * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
  * failing half-way through and resume the dequeue later.
@@ -8725,6 +8725,91 @@ static void set_next_buddy(struct sched_entity *se)
 	}
 }
 
+enum preempt_wakeup_action {
+	PREEMPT_WAKEUP_NONE,		/* No action on the buddy */
+	PREEMPT_WAKEUP_SHORT,		/* Override current's slice
+					 * protection to allow
+					 * preemption.
+					 */
+	PREEMPT_WAKEUP_NEXT,		/* Check next is most eligible
+					 * before rescheduling.
+					 */
+	PREEMPT_WAKEUP_RESCHED,		/* Plain reschedule */
+};
+
+static inline enum preempt_wakeup_action
+__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
+		 struct sched_entity *pse, struct sched_entity *se)
+{
+	bool pse_before;
+
+	/*
+	 * Ignore wakee preemption on WF_FORK as it is less likely that
+	 * there is shared data as exec often follow fork. Do not
+	 * preempt for tasks that are sched_delayed as it would violate
+	 * EEVDF to forcibly queue an ineligible task.
+	 */
+	if ((wake_flags & WF_FORK) || pse->sched_delayed)
+		return PREEMPT_WAKEUP_NONE;
+
+	/* Reschedule if waker is no longer eligible. */
+	if (in_task() && !entity_eligible(cfs_rq, se))
+		return PREEMPT_WAKEUP_RESCHED;
+
+	/*
+	 * Keep existing buddy if the deadline is sooner than pse.
+	 * The older buddy may be cache cold and completely unrelated
+	 * to the current wakeup but that is unpredictable where as
+	 * obeying the deadline is more in line with EEVDF objectives.
+	 */
+	if (cfs_rq->next && entity_before(cfs_rq->next, pse))
+		return PREEMPT_WAKEUP_NEXT;
+
+	set_next_buddy(pse);
+
+	/*
+	 * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
+	 * strictly enforced because the hint is either misunderstood or
+	 * multiple tasks must be woken up.
+	 */
+	pse_before = entity_before(pse, se);
+	if (wake_flags & WF_SYNC) {
+		u64 delta = rq_clock_task(rq) - se->exec_start;
+		u64 threshold = sysctl_sched_migration_cost;
+
+		/*
+		 * WF_SYNC without WF_TTWU is not expected so warn if it
+		 * happens even though it is likely harmless.
+		 */
+		WARN_ON_ONCE(!(wake_flags & WF_TTWU));
+
+		if ((s64)delta < 0)
+			delta = 0;
+
+		/*
+		 * WF_RQ_SELECTED implies the tasks are stacking on a
+		 * CPU when they could run on other CPUs. Reduce the
+		 * threshold before preemption is allowed to an
+		 * arbitrary lower value as it is more likely (but not
+		 * guaranteed) the waker requires the wakee to finish.
+		 */
+		if (wake_flags & WF_RQ_SELECTED)
+			threshold >>= 2;
+
+		/*
+		 * As WF_SYNC is not strictly obeyed, allow some runtime for
+		 * batch wakeups to be issued.
+		 */
+		if (pse_before && delta >= threshold)
+			return PREEMPT_WAKEUP_RESCHED;
+
+		return PREEMPT_WAKEUP_NONE;
+	}
+
+	return PREEMPT_WAKEUP_NEXT;
+}
+
+
 /*
  * Preempt the current task with a newly woken task if needed:
  */
@@ -8734,7 +8819,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	struct sched_entity *se = &donor->se, *pse = &p->se;
 	struct cfs_rq *cfs_rq = task_cfs_rq(donor);
 	int cse_is_idle, pse_is_idle;
-	bool do_preempt_short = false;
+	enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE;
 
 	if (unlikely(se == pse))
 		return;
@@ -8748,10 +8833,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	if (task_is_throttled(p))
 		return;
 
-	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
-		set_next_buddy(pse);
-	}
-
 	/*
 	 * We can come here with TIF_NEED_RESCHED already set from new task
 	 * wake up path.
@@ -8783,7 +8864,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 		 * When non-idle entity preempt an idle entity,
 		 * don't give idle entity slice protection.
 		 */
-		do_preempt_short = true;
+		preempt_action = PREEMPT_WAKEUP_SHORT;
 		goto preempt;
 	}
 
@@ -8802,21 +8883,41 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	 * If @p has a shorter slice than current and @p is eligible, override
 	 * current's slice protection in order to allow preemption.
 	 */
-	do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
+	if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
+		preempt_action = PREEMPT_WAKEUP_SHORT;
+	} else {
+		/*
+		 * If @p potentially is completing work required by current then
+		 * consider preemption.
+		 */
+		preempt_action = __do_preempt_buddy(rq, cfs_rq, wake_flags,
+						      pse, se);
+	}
+
+	switch (preempt_action) {
+	case PREEMPT_WAKEUP_NONE:
+		return;
+	case PREEMPT_WAKEUP_RESCHED:
+		goto preempt;
+	case PREEMPT_WAKEUP_SHORT:
+		fallthrough;
+	case PREEMPT_WAKEUP_NEXT:
+		break;
+	}
 
 	/*
 	 * If @p has become the most eligible task, force preemption.
 	 */
-	if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
+	if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse)
 		goto preempt;
 
-	if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
+	if (sched_feat(RUN_TO_PARITY))
 		update_protect_slice(cfs_rq, se);
 
 	return;
 
 preempt:
-	if (do_preempt_short)
+	if (preempt_action == PREEMPT_WAKEUP_SHORT)
 		cancel_protect_slice(se);
 
 	resched_curr_lazy(rq);
-- 
2.51.0
Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by Madadi Vineeth Reddy 3 months ago
Hi Mel,

On 03/11/25 16:34, Mel Gorman wrote:
> Reimplement NEXT_BUDDY preemption to take into account the deadline and
> eligibility of the wakee with respect to the waker. In the event
> multiple buddies could be considered, the one with the earliest deadline
> is selected.
> 
> Sync wakeups are treated differently to every other type of wakeup. The
> WF_SYNC assumption is that the waker promises to sleep in the very near
> future. This is violated in enough cases that WF_SYNC should be treated
> as a suggestion instead of a contract. If a waker does go to sleep almost
> immediately then the delay in wakeup is negligible. In all other cases,
> it's throttled based on the accumulated runtime of the waker so there is
> a chance that some batched wakeups have been issued before preemption.

[..snip..]

> +static inline enum preempt_wakeup_action
> +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
> +		 struct sched_entity *pse, struct sched_entity *se)
> +{
> +	bool pse_before;
> +
> +	/*
> +	 * Ignore wakee preemption on WF_FORK as it is less likely that
> +	 * there is shared data as exec often follow fork. Do not
> +	 * preempt for tasks that are sched_delayed as it would violate
> +	 * EEVDF to forcibly queue an ineligible task.
> +	 */
> +	if ((wake_flags & WF_FORK) || pse->sched_delayed)
> +		return PREEMPT_WAKEUP_NONE;
> +
> +	/* Reschedule if waker is no longer eligible. */
> +	if (in_task() && !entity_eligible(cfs_rq, se))
> +		return PREEMPT_WAKEUP_RESCHED;
> +
> +	/*
> +	 * Keep existing buddy if the deadline is sooner than pse.
> +	 * The older buddy may be cache cold and completely unrelated
> +	 * to the current wakeup but that is unpredictable where as
> +	 * obeying the deadline is more in line with EEVDF objectives.
> +	 */
> +	if (cfs_rq->next && entity_before(cfs_rq->next, pse))
> +		return PREEMPT_WAKEUP_NEXT;

IIUC, the logic attempts to maintain deadline ordering among buddies, but
this doesn't address tasks already on the runqueue.

So with frequent wakeups, queued tasks (even with earlier deadlines) may be
unfairly delayed. I understand that this would fade away quickly as the
woken up task that got to run due to buddy preference would accumulate negative
lag and would not be eligible to run again, but the starvation could be higher if
wakeups are very high.

To test this, I ran schbench (many message and worker threads) together with
stress-ng (CPU-bound), and observed stress-ng's bogo-ops throughput dropped by
around 64%.

This shows a significant regression for CPU-bound tasks under heavy wakeup loads.
Thoughts?


I also ran schbench and hackbench. All these were run on a Power11 System
containing 4 sockets and 160 CPUs spread across 4 NUMA nodes.

schbench(new) 99.0th latency (lower is better)
========
load        	baseline[pct imp](std%)       With patch[pct imp]( std%)
20mt, 10wt      1.00 [ 0.00]( 0.24)           0.97 [ +3.00]( 0.18)
20mt, 20wt      1.00 [ 0.00]( 0.33)           1.00 [  0.00]( 0.12)
20mt, 40wt      1.00 [ 0.00]( 2.84)           0.76 [ +24.0]( 0.32)
20mt, 80wt      1.00 [ 0.00]( 3.66)           0.66 [ +34.0]( 0.72)
20mt, 160wt     1.00 [ 0.00](12.92)           0.88 [ +12.0]( 6.77)

mt=message threads ; wt=worker threads

schbench, being a wakeup-sensitive workload, showed a good improvement.


hackbench (lower is better)
========
case              load        baseline[pct imp](std%)      With patch[pct imp]( std%)
process-sockets   1-groups    1.00 [ 0.00]( 5.21)            0.91 [ +9.00]( 5.50)
process-sockets   4-groups    1.00 [ 0.00]( 7.30)            1.01 [ -1.00]( 4.27)
process-sockets   12-groups   1.00 [ 0.00]( 2.44)            1.00 [  0.00]( 1.78)
process-sockets   30-groups   1.00 [ 0.00]( 2.05)            1.04 [ -4.00]( 0.86)
process-sockets   48-groups   1.00 [ 0.00]( 2.25)            1.04 [ -4.00]( 1.03)
process-sockets   79-groups   1.00 [ 0.00]( 2.28)            1.05 [ -5.00]( 1.67)
process-sockets   110-groups  1.00 [ 0.00]( 11.17)           1.04 [ -4.00]( 8.64)

process-pipe      1-groups     1.00 [ 0.00]( 8.21)            0.84 [+16.00](13.00)
process-pipe      4-groups     1.00 [ 0.00]( 5.54)            0.95 [ +5.00]( 4.21)
process-pipe      12-groups    1.00 [ 0.00]( 3.96)            1.04 [ -4.00]( 2.26)
process-pipe      30-groups    1.00 [ 0.00]( 7.64)            1.20 [ -20.0]( 3.63)
process-pipe      48-groups    1.00 [ 0.00]( 6.28)            1.04 [ -4.00]( 8.48)
process-pipe      79-groups    1.00 [ 0.00]( 6.19)            1.01 [ -1.00]( 4.36)
process-pipe      110-groups   1.00 [ 0.00]( 10.23)           0.94 [ +6.00]( 5.21)

Didn't notice significant improvement or regression in Hackbench. Mostly in the noise
range.

Thanks,
Madadi Vineeth Reddy


> +
> +	set_next_buddy(pse);
> +
> +	/*
> +	 * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
> +	 * strictly enforced because the hint is either misunderstood or
> +	 * multiple tasks must be woken up.
> +	 */
> +	pse_before = entity_before(pse, se);
> +	if (wake_flags & WF_SYNC) {
> +		u64 delta = rq_clock_task(rq) - se->exec_start;
> +		u64 threshold = sysctl_sched_migration_cost;
> +
> +		/*
> +		 * WF_SYNC without WF_TTWU is not expected so warn if it
> +		 * happens even though it is likely harmless.
> +		 */
> +		WARN_ON_ONCE(!(wake_flags & WF_TTWU));
> +
> +		if ((s64)delta < 0)
> +			delta = 0;
> +
> +		/*
> +		 * WF_RQ_SELECTED implies the tasks are stacking on a
> +		 * CPU when they could run on other CPUs. Reduce the
> +		 * threshold before preemption is allowed to an
> +		 * arbitrary lower value as it is more likely (but not
> +		 * guaranteed) the waker requires the wakee to finish.
> +		 */
> +		if (wake_flags & WF_RQ_SELECTED)
> +			threshold >>= 2;
> +
> +		/*
> +		 * As WF_SYNC is not strictly obeyed, allow some runtime for
> +		 * batch wakeups to be issued.
> +		 */
> +		if (pse_before && delta >= threshold)
> +			return PREEMPT_WAKEUP_RESCHED;
> +
> +		return PREEMPT_WAKEUP_NONE;
> +	}
> +
> +	return PREEMPT_WAKEUP_NEXT;
> +}
> +
> +
>  /*
>   * Preempt the current task with a newly woken task if needed:
>   */
> @@ -8734,7 +8819,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>  	struct sched_entity *se = &donor->se, *pse = &p->se;
>  	struct cfs_rq *cfs_rq = task_cfs_rq(donor);
>  	int cse_is_idle, pse_is_idle;
> -	bool do_preempt_short = false;
> +	enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE;
>  
>  	if (unlikely(se == pse))
>  		return;
> @@ -8748,10 +8833,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>  	if (task_is_throttled(p))
>  		return;
>  
> -	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
> -		set_next_buddy(pse);
> -	}
> -
>  	/*
>  	 * We can come here with TIF_NEED_RESCHED already set from new task
>  	 * wake up path.
> @@ -8783,7 +8864,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>  		 * When non-idle entity preempt an idle entity,
>  		 * don't give idle entity slice protection.
>  		 */
> -		do_preempt_short = true;
> +		preempt_action = PREEMPT_WAKEUP_SHORT;
>  		goto preempt;
>  	}
>  
> @@ -8802,21 +8883,41 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>  	 * If @p has a shorter slice than current and @p is eligible, override
>  	 * current's slice protection in order to allow preemption.
>  	 */
> -	do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
> +	if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
> +		preempt_action = PREEMPT_WAKEUP_SHORT;
> +	} else {
> +		/*
> +		 * If @p potentially is completing work required by current then
> +		 * consider preemption.
> +		 */
> +		preempt_action = __do_preempt_buddy(rq, cfs_rq, wake_flags,
> +						      pse, se);
> +	}
> +
> +	switch (preempt_action) {
> +	case PREEMPT_WAKEUP_NONE:
> +		return;
> +	case PREEMPT_WAKEUP_RESCHED:
> +		goto preempt;
> +	case PREEMPT_WAKEUP_SHORT:
> +		fallthrough;
> +	case PREEMPT_WAKEUP_NEXT:
> +		break;
> +	}
>  
>  	/*
>  	 * If @p has become the most eligible task, force preemption.
>  	 */
> -	if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
> +	if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse)
>  		goto preempt;
>  
> -	if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
> +	if (sched_feat(RUN_TO_PARITY))
>  		update_protect_slice(cfs_rq, se);
>  
>  	return;
>  
>  preempt:
> -	if (do_preempt_short)
> +	if (preempt_action == PREEMPT_WAKEUP_SHORT)
>  		cancel_protect_slice(se);
>  
>  	resched_curr_lazy(rq);
Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by Mel Gorman 3 months ago
On Thu, Nov 06, 2025 at 03:18:30AM +0530, Madadi Vineeth Reddy wrote:
> Hi Mel,
> 
> On 03/11/25 16:34, Mel Gorman wrote:
> > Reimplement NEXT_BUDDY preemption to take into account the deadline and
> > eligibility of the wakee with respect to the waker. In the event
> > multiple buddies could be considered, the one with the earliest deadline
> > is selected.
> > 
> > Sync wakeups are treated differently to every other type of wakeup. The
> > WF_SYNC assumption is that the waker promises to sleep in the very near
> > future. This is violated in enough cases that WF_SYNC should be treated
> > as a suggestion instead of a contract. If a waker does go to sleep almost
> > immediately then the delay in wakeup is negligible. In all other cases,
> > it's throttled based on the accumulated runtime of the waker so there is
> > a chance that some batched wakeups have been issued before preemption.
> 
> [..snip..]
> 
> > +static inline enum preempt_wakeup_action
> > +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
> > +		 struct sched_entity *pse, struct sched_entity *se)
> > +{
> > +	bool pse_before;
> > +
> > +	/*
> > +	 * Ignore wakee preemption on WF_FORK as it is less likely that
> > +	 * there is shared data as exec often follow fork. Do not
> > +	 * preempt for tasks that are sched_delayed as it would violate
> > +	 * EEVDF to forcibly queue an ineligible task.
> > +	 */
> > +	if ((wake_flags & WF_FORK) || pse->sched_delayed)
> > +		return PREEMPT_WAKEUP_NONE;
> > +
> > +	/* Reschedule if waker is no longer eligible. */
> > +	if (in_task() && !entity_eligible(cfs_rq, se))
> > +		return PREEMPT_WAKEUP_RESCHED;
> > +
> > +	/*
> > +	 * Keep existing buddy if the deadline is sooner than pse.
> > +	 * The older buddy may be cache cold and completely unrelated
> > +	 * to the current wakeup but that is unpredictable where as
> > +	 * obeying the deadline is more in line with EEVDF objectives.
> > +	 */
> > +	if (cfs_rq->next && entity_before(cfs_rq->next, pse))
> > +		return PREEMPT_WAKEUP_NEXT;
> 
> IIUC, the logic attempts to maintain deadline ordering among buddies, but
> this doesn't address tasks already on the runqueue.
> 

It's addressed in that a buddy is only selected if it is eligible to
run. Buddies in this context are receiving preferential treatment in
terms of ordering but the other tasks on the queue should not be
starved either.
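
For reference, the gate is the PICK_BUDDY check moved into __pick_eevdf()
in the first hunk of the patch (condensed here):

	if (sched_feat(PICK_BUDDY) &&
	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
		return cfs_rq->next;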

> So with frequent wakeups, queued tasks (even with earlier deadlines) may be
> unfairly delayed. I understand that this would fade away quickly as the
> woken up task that got to run due to buddy preference would accumulate negative
> lag and would not be eligible to run again, but the starvation could be higher if
> wakeups are very high.
> 

They shouldn't get starved as such, only delayed as the buddies become
eligible while other tasks on the runqueue have positive lag.

> To test this, I ran schbench (many message and worker threads) together with
> stress-ng (CPU-bound), and observed stress-ng's bogo-ops throughput dropped by
> around 64%.
> 

Stress-NG bogo-ops are, by definition, bogus ops. The amount of work
executed depends on the stressor and the timing of when they execute.
Hence, a drop of 64% may or may not matter to the general case because the
drop may be due to a different mix of "operations", some of which may take
1ms and others a minute, yet both count as "1 operation".

> This shows a significant regression for CPU-bound tasks under heavy wakeup loads.
> Thoughts?

I 100% accept that NEXT_BUDDY can affect timings of workloads but stress-ng
is not the best workload to use for performance testing at all because
the bogo-ops metric is, by definition, bogus. There may be a good reason
to revisit whether PICK_BUDDY should have been moved to __pick_eevdf or
whether PICK_BUDDY should be applied if the slice is not protected, but
stress-ng is a terrible workload to justify a decision either way.
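
To make the second option concrete, that variant would look something like
the following (illustration only; whether a protect_slice() style predicate
is available in exactly this form is an assumption):

	/*
	 * Hypothetical: only honour the buddy when the current entity's
	 * slice is not protected.
	 */
	if (sched_feat(PICK_BUDDY) && !(curr && protect_slice(curr)) &&
	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
		return cfs_rq->next;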

> I also ran schbench and hackbench. All these were run on a Power11 System
> containing 4 sockets and 160 CPUs spread across 4 NUMA nodes.
> 
> schbench(new) 99.0th latency (lower is better)
> ========
> load        	baseline[pct imp](std%)       With patch[pct imp]( std%)
> 20mt, 10wt      1.00 [ 0.00]( 0.24)           0.97 [ +3.00]( 0.18)
> 20mt, 20wt      1.00 [ 0.00]( 0.33)           1.00 [  0.00]( 0.12)
> 20mt, 40wt      1.00 [ 0.00]( 2.84)           0.76 [ +24.0]( 0.32)
> 20mt, 80wt      1.00 [ 0.00]( 3.66)           0.66 [ +34.0]( 0.72)
> 20mt, 160wt     1.00 [ 0.00](12.92)           0.88 [ +12.0]( 6.77)
> 
> mt=message threads ; wt=worker threads
> 
> schbench being a wakeup sensitive workload showed good improvement.
> 

Good news because NEXT_BUDDY is primarily about prioritising an eligible
wakee over another eligible task to preserve hotness.

> 
> hackbench (lower is better)
> ========
> case              load        baseline[pct imp](std%)      With patch[pct imp]( std%)
> process-sockets   1-groups    1.00 [ 0.00]( 5.21)            0.91 [ +9.00]( 5.50)
> process-sockets   4-groups    1.00 [ 0.00]( 7.30)            1.01 [ -1.00]( 4.27)
> process-sockets   12-groups   1.00 [ 0.00]( 2.44)            1.00 [  0.00]( 1.78)
> process-sockets   30-groups   1.00 [ 0.00]( 2.05)            1.04 [ -4.00]( 0.86)
> process-sockets   48-groups   1.00 [ 0.00]( 2.25)            1.04 [ -4.00]( 1.03)
> process-sockets   79-groups   1.00 [ 0.00]( 2.28)            1.05 [ -5.00]( 1.67)
> process-sockets   110-groups  1.00 [ 0.00]( 11.17)           1.04 [ -4.00]( 8.64)
> 
> process-pipe      1-groups     1.00 [ 0.00]( 8.21)            0.84 [+16.00](13.00)
> process-pipe      4-groups     1.00 [ 0.00]( 5.54)            0.95 [ +5.00]( 4.21)
> process-pipe      12-groups    1.00 [ 0.00]( 3.96)            1.04 [ -4.00]( 2.26)
> process-pipe      30-groups    1.00 [ 0.00]( 7.64)            1.20 [ -20.0]( 3.63)
> process-pipe      48-groups    1.00 [ 0.00]( 6.28)            1.04 [ -4.00]( 8.48)
> process-pipe      79-groups    1.00 [ 0.00]( 6.19)            1.01 [ -1.00]( 4.36)
> process-pipe      110-groups   1.00 [ 0.00]( 10.23)           0.94 [ +6.00]( 5.21)
> 
> Didn't notice significant improvement or regression in Hackbench. Mostly in the noise
> range.
> 

Expected for hackbench because the degree of overload is generally so
high and cache hotness has limited benefit for it as so little data is
shared.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by Peter Zijlstra 3 months ago
On Mon, Nov 03, 2025 at 11:04:45AM +0000, Mel Gorman wrote:

> @@ -8725,6 +8725,91 @@ static void set_next_buddy(struct sched_entity *se)
>  	}
>  }
>  
> +enum preempt_wakeup_action {
> +	PREEMPT_WAKEUP_NONE,		/* No action on the buddy */
> +	PREEMPT_WAKEUP_SHORT,		/* Override current's slice
> +					 * protection to allow
> +					 * preemption.
> +					 */
> +	PREEMPT_WAKEUP_NEXT,		/* Check next is most eligible
> +					 * before rescheduling.
> +					 */
> +	PREEMPT_WAKEUP_RESCHED,		/* Plain reschedule */
> +};

Not really a fan of that comment style. While noodling, I've managed to
'accidentally' rename NEXT to PICK, since it's more about letting
__pick_eevdf() decide.

> +static inline enum preempt_wakeup_action
> +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
> +		 struct sched_entity *pse, struct sched_entity *se)
> +{
> +	bool pse_before;
> +
> +	/*
> +	 * Ignore wakee preemption on WF_FORK as it is less likely that
> +	 * there is shared data as exec often follow fork. Do not
> +	 * preempt for tasks that are sched_delayed as it would violate
> +	 * EEVDF to forcibly queue an ineligible task.
> +	 */
> +	if ((wake_flags & WF_FORK) || pse->sched_delayed)
> +		return PREEMPT_WAKEUP_NONE;
> +
> +	/* Reschedule if waker is no longer eligible. */
> +	if (in_task() && !entity_eligible(cfs_rq, se))
> +		return PREEMPT_WAKEUP_RESCHED;
> +
> +	/*
> +	 * Keep existing buddy if the deadline is sooner than pse.
> +	 * The older buddy may be cache cold and completely unrelated
> +	 * to the current wakeup but that is unpredictable where as
> +	 * obeying the deadline is more in line with EEVDF objectives.
> +	 */
> +	if (cfs_rq->next && entity_before(cfs_rq->next, pse))
> +		return PREEMPT_WAKEUP_NEXT;
> +
> +	set_next_buddy(pse);
> +
> +	/*
> +	 * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
> +	 * strictly enforced because the hint is either misunderstood or
> +	 * multiple tasks must be woken up.
> +	 */
> +	pse_before = entity_before(pse, se);
> +	if (wake_flags & WF_SYNC) {
> +		u64 delta = rq_clock_task(rq) - se->exec_start;
> +		u64 threshold = sysctl_sched_migration_cost;
> +
> +		/*
> +		 * WF_SYNC without WF_TTWU is not expected so warn if it
> +		 * happens even though it is likely harmless.
> +		 */
> +		WARN_ON_ONCE(!(wake_flags & WF_TTWU));
> +
> +		if ((s64)delta < 0)
> +			delta = 0;
> +
> +		/*
> +		 * WF_RQ_SELECTED implies the tasks are stacking on a
> +		 * CPU when they could run on other CPUs. Reduce the
> +		 * threshold before preemption is allowed to an
> +		 * arbitrary lower value as it is more likely (but not
> +		 * guaranteed) the waker requires the wakee to finish.
> +		 */
> +		if (wake_flags & WF_RQ_SELECTED)
> +			threshold >>= 2;
> +
> +		/*
> +		 * As WF_SYNC is not strictly obeyed, allow some runtime for
> +		 * batch wakeups to be issued.
> +		 */
> +		if (pse_before && delta >= threshold)
> +			return PREEMPT_WAKEUP_RESCHED;
> +
> +		return PREEMPT_WAKEUP_NONE;
> +	}
> +
> +	return PREEMPT_WAKEUP_NEXT;
> +}

This seems to do 3 things:

 - that reschedule waker on !eligible
 - set the buddy (while losing NEXT_BUDDY)
 - the WF_SYNC thing

Let's split that out.

> @@ -8734,7 +8819,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>  	struct sched_entity *se = &donor->se, *pse = &p->se;
>  	struct cfs_rq *cfs_rq = task_cfs_rq(donor);
>  	int cse_is_idle, pse_is_idle;
> -	bool do_preempt_short = false;
> +	enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE;

I'm thinking NONE is the wrong default

>  
>  	if (unlikely(se == pse))
>  		return;
> @@ -8748,10 +8833,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>  	if (task_is_throttled(p))
>  		return;
>  
> -	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
> -		set_next_buddy(pse);
> -	}

This was the only NEXT_BUDDY site and none were returned in trade.

>  	/*
>  	 * We can come here with TIF_NEED_RESCHED already set from new task
>  	 * wake up path.
> @@ -8783,7 +8864,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>  		 * When non-idle entity preempt an idle entity,
>  		 * don't give idle entity slice protection.
>  		 */
> -		do_preempt_short = true;
> +		preempt_action = PREEMPT_WAKEUP_SHORT;
>  		goto preempt;
>  	}
>  
> @@ -8802,21 +8883,41 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>  	 * If @p has a shorter slice than current and @p is eligible, override
>  	 * current's slice protection in order to allow preemption.
>  	 */
> -	do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
> +	if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
> +		preempt_action = PREEMPT_WAKEUP_SHORT;
> +	} else {
> +		/*
> +		 * If @p potentially is completing work required by current then
> +		 * consider preemption.
> +		 */
> +		preempt_action = __do_preempt_buddy(rq, cfs_rq, wake_flags,
> +						      pse, se);
> +	}
> +
> +	switch (preempt_action) {
> +	case PREEMPT_WAKEUP_NONE:
> +		return;
> +	case PREEMPT_WAKEUP_RESCHED:
> +		goto preempt;
> +	case PREEMPT_WAKEUP_SHORT:
> +		fallthrough;
(this is redundant)
> +	case PREEMPT_WAKEUP_NEXT:
> +		break;
> +	}
>  
>  	/*
>  	 * If @p has become the most eligible task, force preemption.
>  	 */
> -	if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
> +	if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse)
>  		goto preempt;
>  
> -	if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
> +	if (sched_feat(RUN_TO_PARITY))
>  		update_protect_slice(cfs_rq, se);
>  
>  	return;
>  
>  preempt:
> -	if (do_preempt_short)
> +	if (preempt_action == PREEMPT_WAKEUP_SHORT)
>  		cancel_protect_slice(se);
>  
>  	resched_curr_lazy(rq);

Right, much better. But since I was noodling, how about something like
so on top?

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8727,23 +8727,16 @@ static void set_next_buddy(struct sched_
 }
 
 enum preempt_wakeup_action {
-	PREEMPT_WAKEUP_NONE,		/* No action on the buddy */
-	PREEMPT_WAKEUP_SHORT,		/* Override current's slice
-					 * protection to allow
-					 * preemption.
-					 */
-	PREEMPT_WAKEUP_NEXT,		/* Check next is most eligible
-					 * before rescheduling.
-					 */
-	PREEMPT_WAKEUP_RESCHED,		/* Plain reschedule */
+	PREEMPT_WAKEUP_NONE,	/* No preemption. */
+	PREEMPT_WAKEUP_SHORT,	/* Ignore slice protection. */
+	PREEMPT_WAKEUP_PICK,	/* Let __pick_eevdf() decide. */
+	PREEMPT_WAKEUP_RESCHED,	/* Force reschedule. */
 };
 
-static inline enum preempt_wakeup_action
-__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
-		 struct sched_entity *pse, struct sched_entity *se)
+static inline void
+set_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
+		  struct sched_entity *pse, struct sched_entity *se)
 {
-	bool pse_before;
-
 	/*
 	 * Ignore wakee preemption on WF_FORK as it is less likely that
 	 * there is shared data as exec often follow fork. Do not
@@ -8751,11 +8744,7 @@ __do_preempt_buddy(struct rq *rq, struct
 	 * EEVDF to forcibly queue an ineligible task.
 	 */
 	if ((wake_flags & WF_FORK) || pse->sched_delayed)
-		return PREEMPT_WAKEUP_NONE;
-
-	/* Reschedule if waker is no longer eligible. */
-	if (in_task() && !entity_eligible(cfs_rq, se))
-		return PREEMPT_WAKEUP_RESCHED;
+		return;
 
 	/*
 	 * Keep existing buddy if the deadline is sooner than pse.
@@ -8764,63 +8753,62 @@ __do_preempt_buddy(struct rq *rq, struct
 	 * obeying the deadline is more in line with EEVDF objectives.
 	 */
 	if (cfs_rq->next && entity_before(cfs_rq->next, pse))
-		return PREEMPT_WAKEUP_NEXT;
+		return;
 
 	set_next_buddy(pse);
+}
 
-	/*
-	 * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
-	 * strictly enforced because the hint is either misunderstood or
-	 * multiple tasks must be woken up.
-	 */
-	pse_before = entity_before(pse, se);
-	if (wake_flags & WF_SYNC) {
-		u64 delta = rq_clock_task(rq) - se->exec_start;
-		u64 threshold = sysctl_sched_migration_cost;
-
-		/*
-		 * WF_SYNC without WF_TTWU is not expected so warn if it
-		 * happens even though it is likely harmless.
-		 */
-		WARN_ON_ONCE(!(wake_flags & WF_TTWU));
+/*
+ * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
+ * strictly enforced because the hint is either misunderstood or
+ * multiple tasks must be woken up.
+ */
+static inline enum preempt_wakeup_action
+preempt_sync(struct rq *rq, int wake_flags,
+	     struct sched_entity *pse, struct sched_entity *se)
+{
+	u64 delta = rq_clock_task(rq) - se->exec_start;
+	u64 threshold = sysctl_sched_migration_cost;
 
-		if ((s64)delta < 0)
-			delta = 0;
+	/*
+	 * WF_SYNC without WF_TTWU is not expected so warn if it
+	 * happens even though it is likely harmless.
+	 */
+	WARN_ON_ONCE(!(wake_flags & WF_TTWU));
 
-		/*
-		 * WF_RQ_SELECTED implies the tasks are stacking on a
-		 * CPU when they could run on other CPUs. Reduce the
-		 * threshold before preemption is allowed to an
-		 * arbitrary lower value as it is more likely (but not
-		 * guaranteed) the waker requires the wakee to finish.
-		 */
-		if (wake_flags & WF_RQ_SELECTED)
-			threshold >>= 2;
+	if ((s64)delta < 0)
+		delta = 0;
 
-		/*
-		 * As WF_SYNC is not strictly obeyed, allow some runtime for
-		 * batch wakeups to be issued.
-		 */
-		if (pse_before && delta >= threshold)
-			return PREEMPT_WAKEUP_RESCHED;
+	/*
+	 * WF_RQ_SELECTED implies the tasks are stacking on a
+	 * CPU when they could run on other CPUs. Reduce the
+	 * threshold before preemption is allowed to an
+	 * arbitrary lower value as it is more likely (but not
+	 * guaranteed) the waker requires the wakee to finish.
+	 */
+	if (wake_flags & WF_RQ_SELECTED)
+		threshold >>= 2;
 
-		return PREEMPT_WAKEUP_NONE;
-	}
+	/*
+	 * As WF_SYNC is not strictly obeyed, allow some runtime for
+	 * batch wakeups to be issued.
+	 */
+	if (entity_before(pse, se) && delta >= threshold)
+		return PREEMPT_WAKEUP_RESCHED;
 
-	return PREEMPT_WAKEUP_NEXT;
+	return PREEMPT_WAKEUP_NONE;
 }
 
-
 /*
  * Preempt the current task with a newly woken task if needed:
  */
 static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
 {
+	enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
 	struct task_struct *donor = rq->donor;
 	struct sched_entity *se = &donor->se, *pse = &p->se;
 	struct cfs_rq *cfs_rq = task_cfs_rq(donor);
 	int cse_is_idle, pse_is_idle;
-	enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE;
 
 	if (unlikely(se == pse))
 		return;
@@ -8886,26 +8874,40 @@ static void check_preempt_wakeup_fair(st
 	 */
 	if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
 		preempt_action = PREEMPT_WAKEUP_SHORT;
-	} else {
-		/*
-		 * If @p potentially is completing work required by current then
-		 * consider preemption.
-		 */
-		preempt_action = __do_preempt_buddy(rq, cfs_rq, wake_flags,
-						      pse, se);
+		goto pick;
+
 	}
 
+	/*
+	 * If @p potentially is completing work required by current
+	 * then consider preemption.
+	 */
+	if (in_task() && !entity_eligible(cfs_rq, se)) {
+		/* Reschedule if waker is no longer eligible. */
+		preempt_action = PREEMPT_WAKEUP_RESCHED;
+		goto preempt;
+
+	} 
+
+	if (sched_feat(NEXT_BUDDY))
+		set_preempt_buddy(rq, cfs_rq, wake_flags, pse, se);
+
+	if (wake_flags & WF_SYNC)
+		preempt_action = preempt_sync(rq, wake_flags, pse, se);
+
 	switch (preempt_action) {
 	case PREEMPT_WAKEUP_NONE:
 		return;
+
 	case PREEMPT_WAKEUP_RESCHED:
 		goto preempt;
+
 	case PREEMPT_WAKEUP_SHORT:
-		fallthrough;
-	case PREEMPT_WAKEUP_NEXT:
-		break;
+	case PREEMPT_WAKEUP_PICK:
+		goto pick;
 	}
 
+pick:
 	/*
 	 * If @p has become the most eligible task, force preemption.
 	 */
Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by Peter Zijlstra 3 months ago
On Mon, Nov 03, 2025 at 03:07:11PM +0100, Peter Zijlstra wrote:

> > @@ -8734,7 +8819,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> >  	struct sched_entity *se = &donor->se, *pse = &p->se;
> >  	struct cfs_rq *cfs_rq = task_cfs_rq(donor);
> >  	int cse_is_idle, pse_is_idle;
> > -	bool do_preempt_short = false;
> > +	enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE;
> 
> I'm thinking NONE is the wrong default

That was unfinished, kid interrupted me at an inopportune moment and I
lost my train ;-)

Anyway, the current code defaults to what will be 'pick'. And I suppose
we could make the default depend on WAKEUP_PREEMPT but meh.
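
For illustration only (assuming the existing WAKEUP_PREEMPTION feature bit
is the one meant), that default would be something like:

	enum preempt_wakeup_action preempt_action =
		sched_feat(WAKEUP_PREEMPTION) ? PREEMPT_WAKEUP_PICK
					      : PREEMPT_WAKEUP_NONE;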