[v1] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals

[tip: sched/core] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals

Posted by tip-bot2 for Mel Gorman 2 months, 3 weeks ago

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     e837456fdca81899a3c8e47b3fd39e30eae6e291
Gitweb:        https://git.kernel.org/tip/e837456fdca81899a3c8e47b3fd39e30eae6e291
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Wed, 12 Nov 2025 12:25:21 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 17 Nov 2025 17:13:15 +01:00

sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals

Reimplement NEXT_BUDDY preemption to take into account the deadline and
eligibility of the wakee with respect to the waker. In the event
multiple buddies could be considered, the one with the earliest deadline
is selected.

Sync wakeups are treated differently to every other type of wakeup. The
WF_SYNC assumption is that the waker promises to sleep in the very near
future. This is violated in enough cases that WF_SYNC should be treated
as a suggestion instead of a contract. If a waker does go to sleep almost
immediately then the delay in wakeup is negligible. In other cases, it's
throttled based on the accumulated runtime of the waker so there is a
chance that some batched wakeups have been issued before preemption.

For all other wakeups, preemption happens if the wakee has a earlier
deadline than the waker and eligible to run.

While many workloads were tested, the two main targets were a modified
dbench4 benchmark and hackbench because the are on opposite ends of the
spectrum -- one prefers throughput by avoiding preemption and the other
relies on preemption.

First is the dbench throughput data even though it is a poor metric but
it is the default metric. The test machine is a 2-socket machine and the
backing filesystem is XFS as a lot of the IO work is dispatched to kernel
threads. It's important to note that these results are not representative
across all machines, especially Zen machines, as different bottlenecks
are exposed on different machines and filesystems.

dbench4 Throughput (misleading but traditional)
                            6.18-rc1               6.18-rc1
                             vanilla   sched-preemptnext-v5
Hmean     1       1268.80 (   0.00%)     1269.74 (   0.07%)
Hmean     4       3971.74 (   0.00%)     3950.59 (  -0.53%)
Hmean     7       5548.23 (   0.00%)     5420.08 (  -2.31%)
Hmean     12      7310.86 (   0.00%)     7165.57 (  -1.99%)
Hmean     21      8874.53 (   0.00%)     9149.04 (   3.09%)
Hmean     30      9361.93 (   0.00%)    10530.04 (  12.48%)
Hmean     48      9540.14 (   0.00%)    11820.40 (  23.90%)
Hmean     79      9208.74 (   0.00%)    12193.79 (  32.42%)
Hmean     110     8573.12 (   0.00%)    11933.72 (  39.20%)
Hmean     141     7791.33 (   0.00%)    11273.90 (  44.70%)
Hmean     160     7666.60 (   0.00%)    10768.72 (  40.46%)

As throughput is misleading, the benchmark is modified to use a short
loadfile report the completion time duration in milliseconds.

dbench4 Loadfile Execution Time
                             6.18-rc1               6.18-rc1
                              vanilla   sched-preemptnext-v5
Amean      1         14.62 (   0.00%)       14.69 (  -0.46%)
Amean      4         18.76 (   0.00%)       18.85 (  -0.45%)
Amean      7         23.71 (   0.00%)       24.38 (  -2.82%)
Amean      12        31.25 (   0.00%)       31.87 (  -1.97%)
Amean      21        45.12 (   0.00%)       43.69 (   3.16%)
Amean      30        61.07 (   0.00%)       54.33 (  11.03%)
Amean      48        95.91 (   0.00%)       77.22 (  19.49%)
Amean      79       163.38 (   0.00%)      123.08 (  24.66%)
Amean      110      243.91 (   0.00%)      175.11 (  28.21%)
Amean      141      343.47 (   0.00%)      239.10 (  30.39%)
Amean      160      401.15 (   0.00%)      283.73 (  29.27%)
Stddev     1          0.52 (   0.00%)        0.51 (   2.45%)
Stddev     4          1.36 (   0.00%)        1.30 (   4.04%)
Stddev     7          1.88 (   0.00%)        1.87 (   0.72%)
Stddev     12         3.06 (   0.00%)        2.45 (  19.83%)
Stddev     21         5.78 (   0.00%)        3.87 (  33.06%)
Stddev     30         9.85 (   0.00%)        5.25 (  46.76%)
Stddev     48        22.31 (   0.00%)        8.64 (  61.27%)
Stddev     79        35.96 (   0.00%)       18.07 (  49.76%)
Stddev     110       59.04 (   0.00%)       30.93 (  47.61%)
Stddev     141       85.38 (   0.00%)       40.93 (  52.06%)
Stddev     160       96.38 (   0.00%)       39.72 (  58.79%)

That is still looking good and the variance is reduced quite a bit.
Finally, fairness is a concern so the next report tracks how many
milliseconds does it take for all clients to complete a workfile. This
one is tricky because dbench makes to effort to synchronise clients so
the durations at benchmark start time differ substantially from typical
runtimes. This problem could be mitigated by warming up the benchmark
for a number of minutes but it's a matter of opinion whether that
counts as an evasion of inconvenient results.

dbench4 All Clients Loadfile Execution Time
                             6.18-rc1               6.18-rc1
                              vanilla   sched-preemptnext-v5
Amean      1         15.06 (   0.00%)       15.07 (  -0.03%)
Amean      4        603.81 (   0.00%)      524.29 (  13.17%)
Amean      7        855.32 (   0.00%)     1331.07 ( -55.62%)
Amean      12      1890.02 (   0.00%)     2323.97 ( -22.96%)
Amean      21      3195.23 (   0.00%)     2009.29 (  37.12%)
Amean      30     13919.53 (   0.00%)     4579.44 (  67.10%)
Amean      48     25246.07 (   0.00%)     5705.46 (  77.40%)
Amean      79     29701.84 (   0.00%)    15509.26 (  47.78%)
Amean      110    22803.03 (   0.00%)    23782.08 (  -4.29%)
Amean      141    36356.07 (   0.00%)    25074.20 (  31.03%)
Amean      160    17046.71 (   0.00%)    13247.62 (  22.29%)
Stddev     1          0.47 (   0.00%)        0.49 (  -3.74%)
Stddev     4        395.24 (   0.00%)      254.18 (  35.69%)
Stddev     7        467.24 (   0.00%)      764.42 ( -63.60%)
Stddev     12      1071.43 (   0.00%)     1395.90 ( -30.28%)
Stddev     21      1694.50 (   0.00%)     1204.89 (  28.89%)
Stddev     30      7945.63 (   0.00%)     2552.59 (  67.87%)
Stddev     48     14339.51 (   0.00%)     3227.55 (  77.49%)
Stddev     79     16620.91 (   0.00%)     8422.15 (  49.33%)
Stddev     110    12912.15 (   0.00%)    13560.95 (  -5.02%)
Stddev     141    20700.13 (   0.00%)    14544.51 (  29.74%)
Stddev     160     9079.16 (   0.00%)     7400.69 (  18.49%)

This is more of a mixed bag but it at least shows that fairness
is not crippled.

The hackbench results are more neutral but this is still important.
It's possible to boost the dbench figures by a large amount but only by
crippling the performance of a workload like hackbench. The WF_SYNC
behaviour is important for these workloads and is why the WF_SYNC
changes are not a separate patch.

hackbench-process-pipes
                          6.18-rc1             6.18-rc1
                             vanilla   sched-preemptnext-v5
Amean     1        0.2657 (   0.00%)      0.2150 (  19.07%)
Amean     4        0.6107 (   0.00%)      0.6060 (   0.76%)
Amean     7        0.7923 (   0.00%)      0.7440 (   6.10%)
Amean     12       1.1500 (   0.00%)      1.1263 (   2.06%)
Amean     21       1.7950 (   0.00%)      1.7987 (  -0.20%)
Amean     30       2.3207 (   0.00%)      2.5053 (  -7.96%)
Amean     48       3.5023 (   0.00%)      3.9197 ( -11.92%)
Amean     79       4.8093 (   0.00%)      5.2247 (  -8.64%)
Amean     110      6.1160 (   0.00%)      6.6650 (  -8.98%)
Amean     141      7.4763 (   0.00%)      7.8973 (  -5.63%)
Amean     172      8.9560 (   0.00%)      9.3593 (  -4.50%)
Amean     203     10.4783 (   0.00%)     10.8347 (  -3.40%)
Amean     234     12.4977 (   0.00%)     13.0177 (  -4.16%)
Amean     265     14.7003 (   0.00%)     15.5630 (  -5.87%)
Amean     296     16.1007 (   0.00%)     17.4023 (  -8.08%)

Processes using pipes are impacted but the variance (not presented) indicates
it's close to noise and the results are not always reproducible. If executed
across multiple reboots, it may show neutral or small gains so the worst
measured results are presented.

Hackbench using sockets is more reliably neutral as the wakeup
mechanisms are different between sockets and pipes.

hackbench-process-sockets
                          6.18-rc1             6.18-rc1
                             vanilla   sched-preemptnext-v2
Amean     1        0.3073 (   0.00%)      0.3263 (  -6.18%)
Amean     4        0.7863 (   0.00%)      0.7930 (  -0.85%)
Amean     7        1.3670 (   0.00%)      1.3537 (   0.98%)
Amean     12       2.1337 (   0.00%)      2.1903 (  -2.66%)
Amean     21       3.4683 (   0.00%)      3.4940 (  -0.74%)
Amean     30       4.7247 (   0.00%)      4.8853 (  -3.40%)
Amean     48       7.6097 (   0.00%)      7.8197 (  -2.76%)
Amean     79      14.7957 (   0.00%)     16.1000 (  -8.82%)
Amean     110     21.3413 (   0.00%)     21.9997 (  -3.08%)
Amean     141     29.0503 (   0.00%)     29.0353 (   0.05%)
Amean     172     36.4660 (   0.00%)     36.1433 (   0.88%)
Amean     203     39.7177 (   0.00%)     40.5910 (  -2.20%)
Amean     234     42.1120 (   0.00%)     43.5527 (  -3.42%)
Amean     265     45.7830 (   0.00%)     50.0560 (  -9.33%)
Amean     296     50.7043 (   0.00%)     54.3657 (  -7.22%)

As schbench has been mentioned in numerous bugs recently, the results
are interesting. A test case that represents the default schbench
behaviour is

schbench Wakeup Latency (usec)
                                       6.18.0-rc1             6.18.0-rc1
                                          vanilla   sched-preemptnext-v5
Amean     Wakeup-50th-80          7.17 (   0.00%)        6.00 (  16.28%)
Amean     Wakeup-90th-80         46.56 (   0.00%)       19.78 (  57.52%)
Amean     Wakeup-99th-80        119.61 (   0.00%)       89.94 (  24.80%)
Amean     Wakeup-99.9th-80     3193.78 (   0.00%)      328.22 (  89.72%)

schbench Requests Per Second (ops/sec)
                                  6.18.0-rc1             6.18.0-rc1
                                     vanilla   sched-preemptnext-v5
Hmean     RPS-20th-80     8900.91 (   0.00%)     9176.78 (   3.10%)
Hmean     RPS-50th-80     8987.41 (   0.00%)     9217.89 (   2.56%)
Hmean     RPS-90th-80     9123.73 (   0.00%)     9273.25 (   1.64%)
Hmean     RPS-max-80      9193.50 (   0.00%)     9301.47 (   1.17%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251112122521.1331238-3-mgorman@techsingularity.net
---
 kernel/sched/fair.c | 152 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 130 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 071e07f..c6e5c64 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -929,6 +929,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
 	if (cfs_rq->nr_queued == 1)
 		return curr && curr->on_rq ? curr : se;
 
+	/*
+	 * Picking the ->next buddy will affect latency but not fairness.
+	 */
+	if (sched_feat(PICK_BUDDY) &&
+	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
+		/* ->next will never be delayed */
+		WARN_ON_ONCE(cfs_rq->next->sched_delayed);
+		return cfs_rq->next;
+	}
+
 	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
 		curr = NULL;
 
@@ -1167,6 +1177,8 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
 	return delta_exec;
 }
 
+static void set_next_buddy(struct sched_entity *se);
+
 /*
  * Used by other classes to account runtime.
  */
@@ -5466,16 +5478,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *se;
 
-	/*
-	 * Picking the ->next buddy will affect latency but not fairness.
-	 */
-	if (sched_feat(PICK_BUDDY) &&
-	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
-		/* ->next will never be delayed */
-		WARN_ON_ONCE(cfs_rq->next->sched_delayed);
-		return cfs_rq->next;
-	}
-
 	se = pick_eevdf(cfs_rq);
 	if (se->sched_delayed) {
 		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
@@ -6988,8 +6990,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	hrtick_update(rq);
 }
 
-static void set_next_buddy(struct sched_entity *se);
-
 /*
  * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
  * failing half-way through and resume the dequeue later.
@@ -8676,16 +8676,81 @@ static void set_next_buddy(struct sched_entity *se)
 	}
 }
 
+enum preempt_wakeup_action {
+	PREEMPT_WAKEUP_NONE,	/* No preemption. */
+	PREEMPT_WAKEUP_SHORT,	/* Ignore slice protection. */
+	PREEMPT_WAKEUP_PICK,	/* Let __pick_eevdf() decide. */
+	PREEMPT_WAKEUP_RESCHED,	/* Force reschedule. */
+};
+
+static inline bool
+set_preempt_buddy(struct cfs_rq *cfs_rq, int wake_flags,
+		  struct sched_entity *pse, struct sched_entity *se)
+{
+	/*
+	 * Keep existing buddy if the deadline is sooner than pse.
+	 * The older buddy may be cache cold and completely unrelated
+	 * to the current wakeup but that is unpredictable where as
+	 * obeying the deadline is more in line with EEVDF objectives.
+	 */
+	if (cfs_rq->next && entity_before(cfs_rq->next, pse))
+		return false;
+
+	set_next_buddy(pse);
+	return true;
+}
+
+/*
+ * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
+ * strictly enforced because the hint is either misunderstood or
+ * multiple tasks must be woken up.
+ */
+static inline enum preempt_wakeup_action
+preempt_sync(struct rq *rq, int wake_flags,
+	     struct sched_entity *pse, struct sched_entity *se)
+{
+	u64 threshold, delta;
+
+	/*
+	 * WF_SYNC without WF_TTWU is not expected so warn if it happens even
+	 * though it is likely harmless.
+	 */
+	WARN_ON_ONCE(!(wake_flags & WF_TTWU));
+
+	threshold = sysctl_sched_migration_cost;
+	delta = rq_clock_task(rq) - se->exec_start;
+	if ((s64)delta < 0)
+		delta = 0;
+
+	/*
+	 * WF_RQ_SELECTED implies the tasks are stacking on a CPU when they
+	 * could run on other CPUs. Reduce the threshold before preemption is
+	 * allowed to an arbitrary lower value as it is more likely (but not
+	 * guaranteed) the waker requires the wakee to finish.
+	 */
+	if (wake_flags & WF_RQ_SELECTED)
+		threshold >>= 2;
+
+	/*
+	 * As WF_SYNC is not strictly obeyed, allow some runtime for batch
+	 * wakeups to be issued.
+	 */
+	if (entity_before(pse, se) && delta >= threshold)
+		return PREEMPT_WAKEUP_RESCHED;
+
+	return PREEMPT_WAKEUP_NONE;
+}
+
 /*
  * Preempt the current task with a newly woken task if needed:
  */
 static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
 {
+	enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
 	struct task_struct *donor = rq->donor;
 	struct sched_entity *se = &donor->se, *pse = &p->se;
 	struct cfs_rq *cfs_rq = task_cfs_rq(donor);
 	int cse_is_idle, pse_is_idle;
-	bool do_preempt_short = false;
 
 	if (unlikely(se == pse))
 		return;
@@ -8699,10 +8764,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int 
 	if (task_is_throttled(p))
 		return;
 
-	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
-		set_next_buddy(pse);
-	}
-
 	/*
 	 * We can come here with TIF_NEED_RESCHED already set from new task
 	 * wake up path.
@@ -8734,7 +8795,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int 
 		 * When non-idle entity preempt an idle entity,
 		 * don't give idle entity slice protection.
 		 */
-		do_preempt_short = true;
+		preempt_action = PREEMPT_WAKEUP_SHORT;
 		goto preempt;
 	}
 
@@ -8753,21 +8814,68 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int 
 	 * If @p has a shorter slice than current and @p is eligible, override
 	 * current's slice protection in order to allow preemption.
 	 */
-	do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
+	if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
+		preempt_action = PREEMPT_WAKEUP_SHORT;
+		goto pick;
+	}
 
 	/*
+	 * Ignore wakee preemption on WF_FORK as it is less likely that
+	 * there is shared data as exec often follow fork. Do not
+	 * preempt for tasks that are sched_delayed as it would violate
+	 * EEVDF to forcibly queue an ineligible task.
+	 */
+	if ((wake_flags & WF_FORK) || pse->sched_delayed)
+		return;
+
+	/*
+	 * If @p potentially is completing work required by current then
+	 * consider preemption.
+	 *
+	 * Reschedule if waker is no longer eligible. */
+	if (in_task() && !entity_eligible(cfs_rq, se)) {
+		preempt_action = PREEMPT_WAKEUP_RESCHED;
+		goto preempt;
+	}
+
+	/* Prefer picking wakee soon if appropriate. */
+	if (sched_feat(NEXT_BUDDY) &&
+	    set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
+
+		/*
+		 * Decide whether to obey WF_SYNC hint for a new buddy. Old
+		 * buddies are ignored as they may not be relevant to the
+		 * waker and less likely to be cache hot.
+		 */
+		if (wake_flags & WF_SYNC)
+			preempt_action = preempt_sync(rq, wake_flags, pse, se);
+	}
+
+	switch (preempt_action) {
+	case PREEMPT_WAKEUP_NONE:
+		return;
+	case PREEMPT_WAKEUP_RESCHED:
+		goto preempt;
+	case PREEMPT_WAKEUP_SHORT:
+		fallthrough;
+	case PREEMPT_WAKEUP_PICK:
+		break;
+	}
+
+pick:
+	/*
 	 * If @p has become the most eligible task, force preemption.
 	 */
-	if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
+	if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse)
 		goto preempt;
 
-	if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
+	if (sched_feat(RUN_TO_PARITY))
 		update_protect_slice(cfs_rq, se);
 
 	return;
 
 preempt:
-	if (do_preempt_short)
+	if (preempt_action == PREEMPT_WAKEUP_SHORT)
 		cancel_protect_slice(se);
 
 	resched_curr_lazy(rq);

[REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals

Posted by Ryan Roberts 1 month, 2 weeks ago

Hi Mel, Peter,

We are building out a kernel performance regression monitoring lab at Arm, and 
I've noticed some fairly large perofrmance regressions in real-world workloads, 
for which bisection has fingered this patch.

We are looking at performance changes between v6.18 and v6.19-rc1, and by 
reverting this patch on top of v6.19-rc1 many regressions are resolved. (We plan 
to move the testing to linux-next over the next couple of quarters so hopefully 
we will be able to deliver this sort of news prior to merging in future).

All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean 
statistically significant regression/improvement, where "statistically 
significant" means the 95% confidence intervals do not overlap".

The below is a large scale mysql workload, running across 2 AWS instances (a 
load generator and the mysql server). We have a partner for whom this is a very 
important workload. Performance regresses by 1.3% between 6.18 and 6.19-rc1 
(where the patch is added). By reverting the patch, the regression is not only 
fixed by performance is now nearly 6% better than v6.18:

+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
| Benchmark                       | Result Class                                       |   6-18-0 (base) |   6-19-0-rc1 | revert-next-buddy |
+=================================+====================================================+=================+==============+===================+
| repro-collection/mysql-workload | db transaction rate (transactions/min)             |       646267.33 |   (R) -1.33% |         (I) 5.87% |
|                                 | new order rate (orders/min)                        |       213256.50 |   (R) -1.32% |         (I) 5.87% |
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+


Next are a bunch of benchmarks all running on a single system. specjbb is the 
SPEC Java Business Benchmark. The mysql one is the same as above but this time 
both loadgen and server are on the same system. pgbench is the PostgreSQL 
benchmark.

I'm showing hackbench for completeness, but I don't consider it a high priority 
issue.

Interestingly, nginx improves significantly with the patch.

+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
| Benchmark                       | Result Class                                       |   6-18-0 (base) |   6-19-0-rc1 | revert-next-buddy |
+=================================+====================================================+=================+==============+===================+
| specjbb/composite               | critical-jOPS (jOPS)                               |        94700.00 |   (R) -5.10% |            -0.90% |
|                                 | max-jOPS (jOPS)                                    |       113984.50 |   (R) -3.90% |            -0.65% |
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
| repro-collection/mysql-workload | db transaction rate (transactions/min)             |       245438.25 |   (R) -3.88% |            -0.13% |
|                                 | new order rate (orders/min)                        |        80985.75 |   (R) -3.78% |            -0.07% |
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
| pts/pgbench                     | Scale: 1 Clients: 1 Read Only (TPS)                |        63124.00 |    (I) 2.90% |             0.74% |
|                                 | Scale: 1 Clients: 1 Read Only - Latency (ms)       |           0.016 |    (I) 5.49% |             1.05% |
|                                 | Scale: 1 Clients: 1 Read Write (TPS)               |          974.92 |        0.11% |            -0.08% |
|                                 | Scale: 1 Clients: 1 Read Write - Latency (ms)      |            1.03 |        0.12% |            -0.06% |
|                                 | Scale: 1 Clients: 250 Read Only (TPS)              |      1915931.58 |   (R) -2.25% |         (I) 2.12% |
|                                 | Scale: 1 Clients: 250 Read Only - Latency (ms)     |            0.13 |   (R) -2.37% |         (I) 2.09% |
|                                 | Scale: 1 Clients: 250 Read Write (TPS)             |          855.67 |       -1.36% |            -0.14% |
|                                 | Scale: 1 Clients: 250 Read Write - Latency (ms)    |          292.39 |       -1.31% |            -0.08% |
|                                 | Scale: 1 Clients: 1000 Read Only (TPS)             |      1534130.08 |  (R) -11.37% |             0.08% |
|                                 | Scale: 1 Clients: 1000 Read Only - Latency (ms)    |            0.65 |  (R) -11.38% |             0.08% |
|                                 | Scale: 1 Clients: 1000 Read Write (TPS)            |          578.75 |       -1.11% |             2.15% |
|                                 | Scale: 1 Clients: 1000 Read Write - Latency (ms)   |         1736.98 |       -1.26% |             2.47% |
|                                 | Scale: 100 Clients: 1 Read Only (TPS)              |        57170.33 |        1.68% |             0.10% |
|                                 | Scale: 100 Clients: 1 Read Only - Latency (ms)     |           0.018 |        1.94% |             0.00% |
|                                 | Scale: 100 Clients: 1 Read Write (TPS)             |          836.58 |       -0.37% |            -0.41% |
|                                 | Scale: 100 Clients: 1 Read Write - Latency (ms)    |            1.20 |       -0.37% |            -0.40% |
|                                 | Scale: 100 Clients: 250 Read Only (TPS)            |      1773440.67 |       -1.61% |             1.67% |
|                                 | Scale: 100 Clients: 250 Read Only - Latency (ms)   |            0.14 |       -1.40% |             1.56% |
|                                 | Scale: 100 Clients: 250 Read Write (TPS)           |         5505.50 |       -0.17% |            -0.86% |
|                                 | Scale: 100 Clients: 250 Read Write - Latency (ms)  |           45.42 |       -0.17% |            -0.85% |
|                                 | Scale: 100 Clients: 1000 Read Only (TPS)           |      1393037.50 |  (R) -10.31% |            -0.19% |
|                                 | Scale: 100 Clients: 1000 Read Only - Latency (ms)  |            0.72 |  (R) -10.30% |            -0.17% |
|                                 | Scale: 100 Clients: 1000 Read Write (TPS)          |         5085.92 |        0.27% |             0.07% |
|                                 | Scale: 100 Clients: 1000 Read Write - Latency (ms) |          196.79 |        0.23% |             0.05% |
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
| mmtests/hackbench               | hackbench-process-pipes-1 (seconds)                |            0.14 |       -1.51% |            -1.05% |
|                                 | hackbench-process-pipes-4 (seconds)                |            0.44 |    (I) 6.49% |         (I) 5.42% |
|                                 | hackbench-process-pipes-7 (seconds)                |            0.68 |  (R) -18.36% |         (I) 3.40% |
|                                 | hackbench-process-pipes-12 (seconds)               |            1.24 |  (R) -19.89% |            -0.45% |
|                                 | hackbench-process-pipes-21 (seconds)               |            1.81 |   (R) -8.41% |            -1.22% |
|                                 | hackbench-process-pipes-30 (seconds)               |            2.39 |   (R) -9.06% |        (R) -2.95% |
|                                 | hackbench-process-pipes-48 (seconds)               |            3.18 |  (R) -11.68% |        (R) -4.10% |
|                                 | hackbench-process-pipes-79 (seconds)               |            3.84 |   (R) -9.74% |        (R) -3.25% |
|                                 | hackbench-process-pipes-110 (seconds)              |            4.68 |   (R) -6.57% |        (R) -2.12% |
|                                 | hackbench-process-pipes-141 (seconds)              |            5.75 |   (R) -5.86% |        (R) -3.44% |
|                                 | hackbench-process-pipes-172 (seconds)              |            6.80 |   (R) -4.28% |        (R) -2.81% |
|                                 | hackbench-process-pipes-203 (seconds)              |            7.94 |   (R) -4.01% |        (R) -3.00% |
|                                 | hackbench-process-pipes-234 (seconds)              |            9.02 |   (R) -3.52% |        (R) -2.81% |
|                                 | hackbench-process-pipes-256 (seconds)              |            9.78 |   (R) -3.24% |        (R) -2.81% |
|                                 | hackbench-process-sockets-1 (seconds)              |            0.29 |        0.50% |             0.26% |
|                                 | hackbench-process-sockets-4 (seconds)              |            0.76 |   (I) 17.44% |        (I) 16.31% |
|                                 | hackbench-process-sockets-7 (seconds)              |            1.16 |   (I) 12.10% |         (I) 9.78% |
|                                 | hackbench-process-sockets-12 (seconds)             |            1.86 |   (I) 10.19% |         (I) 9.83% |
|                                 | hackbench-process-sockets-21 (seconds)             |            3.12 |    (I) 9.38% |         (I) 9.20% |
|                                 | hackbench-process-sockets-30 (seconds)             |            4.30 |    (I) 6.43% |         (I) 6.11% |
|                                 | hackbench-process-sockets-48 (seconds)             |            6.58 |    (I) 3.00% |         (I) 2.19% |
|                                 | hackbench-process-sockets-79 (seconds)             |           10.56 |    (I) 2.87% |         (I) 3.31% |
|                                 | hackbench-process-sockets-110 (seconds)            |           13.85 |       -1.15% |         (I) 2.33% |
|                                 | hackbench-process-sockets-141 (seconds)            |           19.23 |       -1.40% |        (I) 14.53% |
|                                 | hackbench-process-sockets-172 (seconds)            |           26.33 |    (I) 3.52% |        (I) 30.37% |
|                                 | hackbench-process-sockets-203 (seconds)            |           30.27 |        1.10% |        (I) 27.20% |
|                                 | hackbench-process-sockets-234 (seconds)            |           35.12 |        1.60% |        (I) 28.24% |
|                                 | hackbench-process-sockets-256 (seconds)            |           38.74 |        0.70% |        (I) 28.74% |
|                                 | hackbench-thread-pipes-1 (seconds)                 |            0.17 |       -1.32% |            -0.76% |
|                                 | hackbench-thread-pipes-4 (seconds)                 |            0.45 |    (I) 6.91% |         (I) 7.64% |
|                                 | hackbench-thread-pipes-7 (seconds)                 |            0.74 |   (R) -7.51% |         (I) 5.26% |
|                                 | hackbench-thread-pipes-12 (seconds)                |            1.32 |   (R) -8.40% |         (I) 2.32% |
|                                 | hackbench-thread-pipes-21 (seconds)                |            1.95 |   (R) -2.95% |             0.91% |
|                                 | hackbench-thread-pipes-30 (seconds)                |            2.50 |   (R) -4.61% |             1.47% |
|                                 | hackbench-thread-pipes-48 (seconds)                |            3.32 |   (R) -5.45% |         (I) 2.15% |
|                                 | hackbench-thread-pipes-79 (seconds)                |            4.04 |   (R) -5.53% |             1.85% |
|                                 | hackbench-thread-pipes-110 (seconds)               |            4.94 |   (R) -2.33% |             1.51% |
|                                 | hackbench-thread-pipes-141 (seconds)               |            6.04 |   (R) -2.47% |             1.15% |
|                                 | hackbench-thread-pipes-172 (seconds)               |            7.15 |       -0.91% |             1.48% |
|                                 | hackbench-thread-pipes-203 (seconds)               |            8.31 |       -1.29% |             0.77% |
|                                 | hackbench-thread-pipes-234 (seconds)               |            9.49 |       -1.03% |             0.77% |
|                                 | hackbench-thread-pipes-256 (seconds)               |           10.30 |       -0.80% |             0.42% |
|                                 | hackbench-thread-sockets-1 (seconds)               |            0.31 |        0.05% |            -0.05% |
|                                 | hackbench-thread-sockets-4 (seconds)               |            0.79 |   (I) 18.91% |        (I) 16.82% |
|                                 | hackbench-thread-sockets-7 (seconds)               |            1.16 |   (I) 12.57% |        (I) 10.63% |
|                                 | hackbench-thread-sockets-12 (seconds)              |            1.87 |   (I) 12.65% |        (I) 12.26% |
|                                 | hackbench-thread-sockets-21 (seconds)              |            3.16 |   (I) 11.62% |        (I) 12.74% |
|                                 | hackbench-thread-sockets-30 (seconds)              |            4.32 |    (I) 7.35% |         (I) 8.89% |
|                                 | hackbench-thread-sockets-48 (seconds)              |            6.45 |    (I) 2.69% |         (I) 3.06% |
|                                 | hackbench-thread-sockets-79 (seconds)              |           10.15 |    (I) 3.30% |             1.98% |
|                                 | hackbench-thread-sockets-110 (seconds)             |           13.45 |       -0.25% |         (I) 3.68% |
|                                 | hackbench-thread-sockets-141 (seconds)             |           17.87 |   (R) -2.18% |         (I) 8.46% |
|                                 | hackbench-thread-sockets-172 (seconds)             |           24.38 |        1.02% |        (I) 24.33% |
|                                 | hackbench-thread-sockets-203 (seconds)             |           28.38 |       -0.99% |        (I) 24.20% |
|                                 | hackbench-thread-sockets-234 (seconds)             |           32.75 |       -0.42% |        (I) 24.35% |
|                                 | hackbench-thread-sockets-256 (seconds)             |           36.49 |       -1.30% |        (I) 26.22% |
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
| pts/nginx                       | Connections: 200 (Requests Per Second)             |       252332.60 |   (I) 17.54% |            -0.53% |
|                                 | Connections: 1000 (Requests Per Second)            |       248591.29 |   (I) 20.41% |             0.10% |
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+

All of the benchmarks have been run multiple times and I have high confidence in 
the results. I can share min/mean/max/stdev/ci95 stats if that's helpful though.

I'm not providing the data, but we also see similar regressions on AmpereOne 
(another arm64 server system). And we have seen a few functional tests (kvm 
selftests) that have started to timeout due to this patch slowing things down on 
arm64.

I'm hoping you can advise on the best way to proceed? We have a bigger library 
than what I'm showing, but the only improvement I see due to this patch is 
nginx. So based on that, my preference would be to revert the patch upstream 
until the issues can be worked out. I'm guessing the story is quite different 
for x86 though?

Thanks,
Ryan



On 17/11/2025 16:23, tip-bot2 for Mel Gorman wrote:
> The following commit has been merged into the sched/core branch of tip:
> 
> Commit-ID:     e837456fdca81899a3c8e47b3fd39e30eae6e291
> Gitweb:        https://git.kernel.org/tip/e837456fdca81899a3c8e47b3fd39e30eae6e291
> Author:        Mel Gorman <mgorman@techsingularity.net>
> AuthorDate:    Wed, 12 Nov 2025 12:25:21 
> Committer:     Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Mon, 17 Nov 2025 17:13:15 +01:00
> 
> sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
> 
> Reimplement NEXT_BUDDY preemption to take into account the deadline and
> eligibility of the wakee with respect to the waker. In the event
> multiple buddies could be considered, the one with the earliest deadline
> is selected.
> 
> Sync wakeups are treated differently to every other type of wakeup. The
> WF_SYNC assumption is that the waker promises to sleep in the very near
> future. This is violated in enough cases that WF_SYNC should be treated
> as a suggestion instead of a contract. If a waker does go to sleep almost
> immediately then the delay in wakeup is negligible. In other cases, it's
> throttled based on the accumulated runtime of the waker so there is a
> chance that some batched wakeups have been issued before preemption.
> 
> For all other wakeups, preemption happens if the wakee has a earlier
> deadline than the waker and eligible to run.
> 
> While many workloads were tested, the two main targets were a modified
> dbench4 benchmark and hackbench because the are on opposite ends of the
> spectrum -- one prefers throughput by avoiding preemption and the other
> relies on preemption.
> 
> First is the dbench throughput data even though it is a poor metric but
> it is the default metric. The test machine is a 2-socket machine and the
> backing filesystem is XFS as a lot of the IO work is dispatched to kernel
> threads. It's important to note that these results are not representative
> across all machines, especially Zen machines, as different bottlenecks
> are exposed on different machines and filesystems.
> 
> dbench4 Throughput (misleading but traditional)
>                             6.18-rc1               6.18-rc1
>                              vanilla   sched-preemptnext-v5
> Hmean     1       1268.80 (   0.00%)     1269.74 (   0.07%)
> Hmean     4       3971.74 (   0.00%)     3950.59 (  -0.53%)
> Hmean     7       5548.23 (   0.00%)     5420.08 (  -2.31%)
> Hmean     12      7310.86 (   0.00%)     7165.57 (  -1.99%)
> Hmean     21      8874.53 (   0.00%)     9149.04 (   3.09%)
> Hmean     30      9361.93 (   0.00%)    10530.04 (  12.48%)
> Hmean     48      9540.14 (   0.00%)    11820.40 (  23.90%)
> Hmean     79      9208.74 (   0.00%)    12193.79 (  32.42%)
> Hmean     110     8573.12 (   0.00%)    11933.72 (  39.20%)
> Hmean     141     7791.33 (   0.00%)    11273.90 (  44.70%)
> Hmean     160     7666.60 (   0.00%)    10768.72 (  40.46%)
> 
> As throughput is misleading, the benchmark is modified to use a short
> loadfile report the completion time duration in milliseconds.
> 
> dbench4 Loadfile Execution Time
>                              6.18-rc1               6.18-rc1
>                               vanilla   sched-preemptnext-v5
> Amean      1         14.62 (   0.00%)       14.69 (  -0.46%)
> Amean      4         18.76 (   0.00%)       18.85 (  -0.45%)
> Amean      7         23.71 (   0.00%)       24.38 (  -2.82%)
> Amean      12        31.25 (   0.00%)       31.87 (  -1.97%)
> Amean      21        45.12 (   0.00%)       43.69 (   3.16%)
> Amean      30        61.07 (   0.00%)       54.33 (  11.03%)
> Amean      48        95.91 (   0.00%)       77.22 (  19.49%)
> Amean      79       163.38 (   0.00%)      123.08 (  24.66%)
> Amean      110      243.91 (   0.00%)      175.11 (  28.21%)
> Amean      141      343.47 (   0.00%)      239.10 (  30.39%)
> Amean      160      401.15 (   0.00%)      283.73 (  29.27%)
> Stddev     1          0.52 (   0.00%)        0.51 (   2.45%)
> Stddev     4          1.36 (   0.00%)        1.30 (   4.04%)
> Stddev     7          1.88 (   0.00%)        1.87 (   0.72%)
> Stddev     12         3.06 (   0.00%)        2.45 (  19.83%)
> Stddev     21         5.78 (   0.00%)        3.87 (  33.06%)
> Stddev     30         9.85 (   0.00%)        5.25 (  46.76%)
> Stddev     48        22.31 (   0.00%)        8.64 (  61.27%)
> Stddev     79        35.96 (   0.00%)       18.07 (  49.76%)
> Stddev     110       59.04 (   0.00%)       30.93 (  47.61%)
> Stddev     141       85.38 (   0.00%)       40.93 (  52.06%)
> Stddev     160       96.38 (   0.00%)       39.72 (  58.79%)
> 
> That is still looking good and the variance is reduced quite a bit.
> Finally, fairness is a concern so the next report tracks how many
> milliseconds does it take for all clients to complete a workfile. This
> one is tricky because dbench makes to effort to synchronise clients so
> the durations at benchmark start time differ substantially from typical
> runtimes. This problem could be mitigated by warming up the benchmark
> for a number of minutes but it's a matter of opinion whether that
> counts as an evasion of inconvenient results.
> 
> dbench4 All Clients Loadfile Execution Time
>                              6.18-rc1               6.18-rc1
>                               vanilla   sched-preemptnext-v5
> Amean      1         15.06 (   0.00%)       15.07 (  -0.03%)
> Amean      4        603.81 (   0.00%)      524.29 (  13.17%)
> Amean      7        855.32 (   0.00%)     1331.07 ( -55.62%)
> Amean      12      1890.02 (   0.00%)     2323.97 ( -22.96%)
> Amean      21      3195.23 (   0.00%)     2009.29 (  37.12%)
> Amean      30     13919.53 (   0.00%)     4579.44 (  67.10%)
> Amean      48     25246.07 (   0.00%)     5705.46 (  77.40%)
> Amean      79     29701.84 (   0.00%)    15509.26 (  47.78%)
> Amean      110    22803.03 (   0.00%)    23782.08 (  -4.29%)
> Amean      141    36356.07 (   0.00%)    25074.20 (  31.03%)
> Amean      160    17046.71 (   0.00%)    13247.62 (  22.29%)
> Stddev     1          0.47 (   0.00%)        0.49 (  -3.74%)
> Stddev     4        395.24 (   0.00%)      254.18 (  35.69%)
> Stddev     7        467.24 (   0.00%)      764.42 ( -63.60%)
> Stddev     12      1071.43 (   0.00%)     1395.90 ( -30.28%)
> Stddev     21      1694.50 (   0.00%)     1204.89 (  28.89%)
> Stddev     30      7945.63 (   0.00%)     2552.59 (  67.87%)
> Stddev     48     14339.51 (   0.00%)     3227.55 (  77.49%)
> Stddev     79     16620.91 (   0.00%)     8422.15 (  49.33%)
> Stddev     110    12912.15 (   0.00%)    13560.95 (  -5.02%)
> Stddev     141    20700.13 (   0.00%)    14544.51 (  29.74%)
> Stddev     160     9079.16 (   0.00%)     7400.69 (  18.49%)
> 
> This is more of a mixed bag but it at least shows that fairness
> is not crippled.
> 
> The hackbench results are more neutral but this is still important.
> It's possible to boost the dbench figures by a large amount but only by
> crippling the performance of a workload like hackbench. The WF_SYNC
> behaviour is important for these workloads and is why the WF_SYNC
> changes are not a separate patch.
> 
> hackbench-process-pipes
>                           6.18-rc1             6.18-rc1
>                              vanilla   sched-preemptnext-v5
> Amean     1        0.2657 (   0.00%)      0.2150 (  19.07%)
> Amean     4        0.6107 (   0.00%)      0.6060 (   0.76%)
> Amean     7        0.7923 (   0.00%)      0.7440 (   6.10%)
> Amean     12       1.1500 (   0.00%)      1.1263 (   2.06%)
> Amean     21       1.7950 (   0.00%)      1.7987 (  -0.20%)
> Amean     30       2.3207 (   0.00%)      2.5053 (  -7.96%)
> Amean     48       3.5023 (   0.00%)      3.9197 ( -11.92%)
> Amean     79       4.8093 (   0.00%)      5.2247 (  -8.64%)
> Amean     110      6.1160 (   0.00%)      6.6650 (  -8.98%)
> Amean     141      7.4763 (   0.00%)      7.8973 (  -5.63%)
> Amean     172      8.9560 (   0.00%)      9.3593 (  -4.50%)
> Amean     203     10.4783 (   0.00%)     10.8347 (  -3.40%)
> Amean     234     12.4977 (   0.00%)     13.0177 (  -4.16%)
> Amean     265     14.7003 (   0.00%)     15.5630 (  -5.87%)
> Amean     296     16.1007 (   0.00%)     17.4023 (  -8.08%)
> 
> Processes using pipes are impacted but the variance (not presented) indicates
> it's close to noise and the results are not always reproducible. If executed
> across multiple reboots, it may show neutral or small gains so the worst
> measured results are presented.
> 
> Hackbench using sockets is more reliably neutral as the wakeup
> mechanisms are different between sockets and pipes.
> 
> hackbench-process-sockets
>                           6.18-rc1             6.18-rc1
>                              vanilla   sched-preemptnext-v2
> Amean     1        0.3073 (   0.00%)      0.3263 (  -6.18%)
> Amean     4        0.7863 (   0.00%)      0.7930 (  -0.85%)
> Amean     7        1.3670 (   0.00%)      1.3537 (   0.98%)
> Amean     12       2.1337 (   0.00%)      2.1903 (  -2.66%)
> Amean     21       3.4683 (   0.00%)      3.4940 (  -0.74%)
> Amean     30       4.7247 (   0.00%)      4.8853 (  -3.40%)
> Amean     48       7.6097 (   0.00%)      7.8197 (  -2.76%)
> Amean     79      14.7957 (   0.00%)     16.1000 (  -8.82%)
> Amean     110     21.3413 (   0.00%)     21.9997 (  -3.08%)
> Amean     141     29.0503 (   0.00%)     29.0353 (   0.05%)
> Amean     172     36.4660 (   0.00%)     36.1433 (   0.88%)
> Amean     203     39.7177 (   0.00%)     40.5910 (  -2.20%)
> Amean     234     42.1120 (   0.00%)     43.5527 (  -3.42%)
> Amean     265     45.7830 (   0.00%)     50.0560 (  -9.33%)
> Amean     296     50.7043 (   0.00%)     54.3657 (  -7.22%)
> 
> As schbench has been mentioned in numerous bugs recently, the results
> are interesting. A test case that represents the default schbench
> behaviour is
> 
> schbench Wakeup Latency (usec)
>                                        6.18.0-rc1             6.18.0-rc1
>                                           vanilla   sched-preemptnext-v5
> Amean     Wakeup-50th-80          7.17 (   0.00%)        6.00 (  16.28%)
> Amean     Wakeup-90th-80         46.56 (   0.00%)       19.78 (  57.52%)
> Amean     Wakeup-99th-80        119.61 (   0.00%)       89.94 (  24.80%)
> Amean     Wakeup-99.9th-80     3193.78 (   0.00%)      328.22 (  89.72%)
> 
> schbench Requests Per Second (ops/sec)
>                                   6.18.0-rc1             6.18.0-rc1
>                                      vanilla   sched-preemptnext-v5
> Hmean     RPS-20th-80     8900.91 (   0.00%)     9176.78 (   3.10%)
> Hmean     RPS-50th-80     8987.41 (   0.00%)     9217.89 (   2.56%)
> Hmean     RPS-90th-80     9123.73 (   0.00%)     9273.25 (   1.64%)
> Hmean     RPS-max-80      9193.50 (   0.00%)     9301.47 (   1.17%)
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://patch.msgid.link/20251112122521.1331238-3-mgorman@techsingularity.net
> ---
>  kernel/sched/fair.c | 152 ++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 130 insertions(+), 22 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 071e07f..c6e5c64 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -929,6 +929,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
>  	if (cfs_rq->nr_queued == 1)
>  		return curr && curr->on_rq ? curr : se;
>  
> +	/*
> +	 * Picking the ->next buddy will affect latency but not fairness.
> +	 */
> +	if (sched_feat(PICK_BUDDY) &&
> +	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
> +		/* ->next will never be delayed */
> +		WARN_ON_ONCE(cfs_rq->next->sched_delayed);
> +		return cfs_rq->next;
> +	}
> +
>  	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>  		curr = NULL;
>  
> @@ -1167,6 +1177,8 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
>  	return delta_exec;
>  }
>  
> +static void set_next_buddy(struct sched_entity *se);
> +
>  /*
>   * Used by other classes to account runtime.
>   */
> @@ -5466,16 +5478,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
>  {
>  	struct sched_entity *se;
>  
> -	/*
> -	 * Picking the ->next buddy will affect latency but not fairness.
> -	 */
> -	if (sched_feat(PICK_BUDDY) &&
> -	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
> -		/* ->next will never be delayed */
> -		WARN_ON_ONCE(cfs_rq->next->sched_delayed);
> -		return cfs_rq->next;
> -	}
> -
>  	se = pick_eevdf(cfs_rq);
>  	if (se->sched_delayed) {
>  		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
> @@ -6988,8 +6990,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	hrtick_update(rq);
>  }
>  
> -static void set_next_buddy(struct sched_entity *se);
> -
>  /*
>   * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
>   * failing half-way through and resume the dequeue later.
> @@ -8676,16 +8676,81 @@ static void set_next_buddy(struct sched_entity *se)
>  	}
>  }
>  
> +enum preempt_wakeup_action {
> +	PREEMPT_WAKEUP_NONE,	/* No preemption. */
> +	PREEMPT_WAKEUP_SHORT,	/* Ignore slice protection. */
> +	PREEMPT_WAKEUP_PICK,	/* Let __pick_eevdf() decide. */
> +	PREEMPT_WAKEUP_RESCHED,	/* Force reschedule. */
> +};
> +
> +static inline bool
> +set_preempt_buddy(struct cfs_rq *cfs_rq, int wake_flags,
> +		  struct sched_entity *pse, struct sched_entity *se)
> +{
> +	/*
> +	 * Keep existing buddy if the deadline is sooner than pse.
> +	 * The older buddy may be cache cold and completely unrelated
> +	 * to the current wakeup but that is unpredictable where as
> +	 * obeying the deadline is more in line with EEVDF objectives.
> +	 */
> +	if (cfs_rq->next && entity_before(cfs_rq->next, pse))
> +		return false;
> +
> +	set_next_buddy(pse);
> +	return true;
> +}
> +
> +/*
> + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
> + * strictly enforced because the hint is either misunderstood or
> + * multiple tasks must be woken up.
> + */
> +static inline enum preempt_wakeup_action
> +preempt_sync(struct rq *rq, int wake_flags,
> +	     struct sched_entity *pse, struct sched_entity *se)
> +{
> +	u64 threshold, delta;
> +
> +	/*
> +	 * WF_SYNC without WF_TTWU is not expected so warn if it happens even
> +	 * though it is likely harmless.
> +	 */
> +	WARN_ON_ONCE(!(wake_flags & WF_TTWU));
> +
> +	threshold = sysctl_sched_migration_cost;
> +	delta = rq_clock_task(rq) - se->exec_start;
> +	if ((s64)delta < 0)
> +		delta = 0;
> +
> +	/*
> +	 * WF_RQ_SELECTED implies the tasks are stacking on a CPU when they
> +	 * could run on other CPUs. Reduce the threshold before preemption is
> +	 * allowed to an arbitrary lower value as it is more likely (but not
> +	 * guaranteed) the waker requires the wakee to finish.
> +	 */
> +	if (wake_flags & WF_RQ_SELECTED)
> +		threshold >>= 2;
> +
> +	/*
> +	 * As WF_SYNC is not strictly obeyed, allow some runtime for batch
> +	 * wakeups to be issued.
> +	 */
> +	if (entity_before(pse, se) && delta >= threshold)
> +		return PREEMPT_WAKEUP_RESCHED;
> +
> +	return PREEMPT_WAKEUP_NONE;
> +}
> +
>  /*
>   * Preempt the current task with a newly woken task if needed:
>   */
>  static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
>  {
> +	enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
>  	struct task_struct *donor = rq->donor;
>  	struct sched_entity *se = &donor->se, *pse = &p->se;
>  	struct cfs_rq *cfs_rq = task_cfs_rq(donor);
>  	int cse_is_idle, pse_is_idle;
> -	bool do_preempt_short = false;
>  
>  	if (unlikely(se == pse))
>  		return;
> @@ -8699,10 +8764,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int 
>  	if (task_is_throttled(p))
>  		return;
>  
> -	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
> -		set_next_buddy(pse);
> -	}
> -
>  	/*
>  	 * We can come here with TIF_NEED_RESCHED already set from new task
>  	 * wake up path.
> @@ -8734,7 +8795,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int 
>  		 * When non-idle entity preempt an idle entity,
>  		 * don't give idle entity slice protection.
>  		 */
> -		do_preempt_short = true;
> +		preempt_action = PREEMPT_WAKEUP_SHORT;
>  		goto preempt;
>  	}
>  
> @@ -8753,21 +8814,68 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int 
>  	 * If @p has a shorter slice than current and @p is eligible, override
>  	 * current's slice protection in order to allow preemption.
>  	 */
> -	do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
> +	if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
> +		preempt_action = PREEMPT_WAKEUP_SHORT;
> +		goto pick;
> +	}
>  
>  	/*
> +	 * Ignore wakee preemption on WF_FORK as it is less likely that
> +	 * there is shared data as exec often follow fork. Do not
> +	 * preempt for tasks that are sched_delayed as it would violate
> +	 * EEVDF to forcibly queue an ineligible task.
> +	 */
> +	if ((wake_flags & WF_FORK) || pse->sched_delayed)
> +		return;
> +
> +	/*
> +	 * If @p potentially is completing work required by current then
> +	 * consider preemption.
> +	 *
> +	 * Reschedule if waker is no longer eligible. */
> +	if (in_task() && !entity_eligible(cfs_rq, se)) {
> +		preempt_action = PREEMPT_WAKEUP_RESCHED;
> +		goto preempt;
> +	}
> +
> +	/* Prefer picking wakee soon if appropriate. */
> +	if (sched_feat(NEXT_BUDDY) &&
> +	    set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
> +
> +		/*
> +		 * Decide whether to obey WF_SYNC hint for a new buddy. Old
> +		 * buddies are ignored as they may not be relevant to the
> +		 * waker and less likely to be cache hot.
> +		 */
> +		if (wake_flags & WF_SYNC)
> +			preempt_action = preempt_sync(rq, wake_flags, pse, se);
> +	}
> +
> +	switch (preempt_action) {
> +	case PREEMPT_WAKEUP_NONE:
> +		return;
> +	case PREEMPT_WAKEUP_RESCHED:
> +		goto preempt;
> +	case PREEMPT_WAKEUP_SHORT:
> +		fallthrough;
> +	case PREEMPT_WAKEUP_PICK:
> +		break;
> +	}
> +
> +pick:
> +	/*
>  	 * If @p has become the most eligible task, force preemption.
>  	 */
> -	if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
> +	if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse)
>  		goto preempt;
>  
> -	if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
> +	if (sched_feat(RUN_TO_PARITY))
>  		update_protect_slice(cfs_rq, se);
>  
>  	return;
>  
>  preempt:
> -	if (do_preempt_short)
> +	if (preempt_action == PREEMPT_WAKEUP_SHORT)
>  		cancel_protect_slice(se);
>  
>  	resched_curr_lazy(rq);
>