[tip: sched/core] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by tip-bot2 for Mel Gorman 2 months, 3 weeks ago
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     e837456fdca81899a3c8e47b3fd39e30eae6e291
Gitweb:        https://git.kernel.org/tip/e837456fdca81899a3c8e47b3fd39e30eae6e291
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Wed, 12 Nov 2025 12:25:21 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 17 Nov 2025 17:13:15 +01:00

sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals

Reimplement NEXT_BUDDY preemption to take into account the deadline and
eligibility of the wakee with respect to the waker. In the event
multiple buddies could be considered, the one with the earliest deadline
is selected.

Sync wakeups are treated differently to every other type of wakeup. The
WF_SYNC assumption is that the waker promises to sleep in the very near
future. This is violated in enough cases that WF_SYNC should be treated
as a suggestion instead of a contract. If a waker does go to sleep almost
immediately then the delay in wakeup is negligible. In other cases,
preemption is throttled based on the accumulated runtime of the waker so
there is a chance that some batched wakeups have been issued before the
waker is preempted.

For all other wakeups, preemption happens if the wakee has an earlier
deadline than the waker and is eligible to run.
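
For reference, the decision can be condensed into the sketch below. It is
simplified from the actual patch that follows (the WF_RQ_SELECTED threshold
reduction and the slice-protection interactions are omitted);
entity_before(), entity_eligible(), rq_clock_task() and
sysctl_sched_migration_cost are the existing fair.c helpers, while
wakeup_preempt_sketch() is an illustrative name only:

	/* Condensed sketch of the new wakeup preemption policy. */
	static bool wakeup_preempt_sketch(struct rq *rq, struct cfs_rq *cfs_rq,
					  struct sched_entity *pse, /* wakee */
					  struct sched_entity *se,  /* waker */
					  int wake_flags)
	{
		/* Deadline-ordered buddy selection: keep an existing buddy
		 * that has an earlier deadline than the wakee. */
		if (cfs_rq->next && entity_before(cfs_rq->next, pse))
			return false;

		if (wake_flags & WF_SYNC) {
			/* WF_SYNC is a hint, not a contract: throttle on the
			 * waker's accumulated runtime so batched wakeups can
			 * be issued before the waker is preempted. */
			u64 delta = rq_clock_task(rq) - se->exec_start;

			return entity_before(pse, se) &&
			       delta >= sysctl_sched_migration_cost;
		}

		/* All other wakeups: preempt if the wakee is eligible and
		 * has an earlier deadline than the waker. */
		return entity_eligible(cfs_rq, pse) && entity_before(pse, se);
	}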

While many workloads were tested, the two main targets were a modified
dbench4 benchmark and hackbench because they are on opposite ends of the
spectrum -- one prefers throughput by avoiding preemption and the other
relies on preemption.

First is the dbench throughput data; while it is a poor metric, it is the
default one. The test machine is a 2-socket machine and the
backing filesystem is XFS as a lot of the IO work is dispatched to kernel
threads. It's important to note that these results are not representative
across all machines, especially Zen machines, as different bottlenecks
are exposed on different machines and filesystems.

dbench4 Throughput (misleading but traditional)
                            6.18-rc1               6.18-rc1
                             vanilla   sched-preemptnext-v5
Hmean     1       1268.80 (   0.00%)     1269.74 (   0.07%)
Hmean     4       3971.74 (   0.00%)     3950.59 (  -0.53%)
Hmean     7       5548.23 (   0.00%)     5420.08 (  -2.31%)
Hmean     12      7310.86 (   0.00%)     7165.57 (  -1.99%)
Hmean     21      8874.53 (   0.00%)     9149.04 (   3.09%)
Hmean     30      9361.93 (   0.00%)    10530.04 (  12.48%)
Hmean     48      9540.14 (   0.00%)    11820.40 (  23.90%)
Hmean     79      9208.74 (   0.00%)    12193.79 (  32.42%)
Hmean     110     8573.12 (   0.00%)    11933.72 (  39.20%)
Hmean     141     7791.33 (   0.00%)    11273.90 (  44.70%)
Hmean     160     7666.60 (   0.00%)    10768.72 (  40.46%)

As throughput is misleading, the benchmark is modified to use a short
loadfile and report the completion time in milliseconds.

dbench4 Loadfile Execution Time
                             6.18-rc1               6.18-rc1
                              vanilla   sched-preemptnext-v5
Amean      1         14.62 (   0.00%)       14.69 (  -0.46%)
Amean      4         18.76 (   0.00%)       18.85 (  -0.45%)
Amean      7         23.71 (   0.00%)       24.38 (  -2.82%)
Amean      12        31.25 (   0.00%)       31.87 (  -1.97%)
Amean      21        45.12 (   0.00%)       43.69 (   3.16%)
Amean      30        61.07 (   0.00%)       54.33 (  11.03%)
Amean      48        95.91 (   0.00%)       77.22 (  19.49%)
Amean      79       163.38 (   0.00%)      123.08 (  24.66%)
Amean      110      243.91 (   0.00%)      175.11 (  28.21%)
Amean      141      343.47 (   0.00%)      239.10 (  30.39%)
Amean      160      401.15 (   0.00%)      283.73 (  29.27%)
Stddev     1          0.52 (   0.00%)        0.51 (   2.45%)
Stddev     4          1.36 (   0.00%)        1.30 (   4.04%)
Stddev     7          1.88 (   0.00%)        1.87 (   0.72%)
Stddev     12         3.06 (   0.00%)        2.45 (  19.83%)
Stddev     21         5.78 (   0.00%)        3.87 (  33.06%)
Stddev     30         9.85 (   0.00%)        5.25 (  46.76%)
Stddev     48        22.31 (   0.00%)        8.64 (  61.27%)
Stddev     79        35.96 (   0.00%)       18.07 (  49.76%)
Stddev     110       59.04 (   0.00%)       30.93 (  47.61%)
Stddev     141       85.38 (   0.00%)       40.93 (  52.06%)
Stddev     160       96.38 (   0.00%)       39.72 (  58.79%)

That is still looking good and the variance is reduced quite a bit.
Finally, fairness is a concern so the next report tracks how many
milliseconds it takes for all clients to complete a workfile. This
one is tricky because dbench makes no effort to synchronise clients so
the durations at benchmark start time differ substantially from typical
runtimes. This problem could be mitigated by warming up the benchmark
for a number of minutes but it's a matter of opinion whether that
counts as an evasion of inconvenient results.

dbench4 All Clients Loadfile Execution Time
                             6.18-rc1               6.18-rc1
                              vanilla   sched-preemptnext-v5
Amean      1         15.06 (   0.00%)       15.07 (  -0.03%)
Amean      4        603.81 (   0.00%)      524.29 (  13.17%)
Amean      7        855.32 (   0.00%)     1331.07 ( -55.62%)
Amean      12      1890.02 (   0.00%)     2323.97 ( -22.96%)
Amean      21      3195.23 (   0.00%)     2009.29 (  37.12%)
Amean      30     13919.53 (   0.00%)     4579.44 (  67.10%)
Amean      48     25246.07 (   0.00%)     5705.46 (  77.40%)
Amean      79     29701.84 (   0.00%)    15509.26 (  47.78%)
Amean      110    22803.03 (   0.00%)    23782.08 (  -4.29%)
Amean      141    36356.07 (   0.00%)    25074.20 (  31.03%)
Amean      160    17046.71 (   0.00%)    13247.62 (  22.29%)
Stddev     1          0.47 (   0.00%)        0.49 (  -3.74%)
Stddev     4        395.24 (   0.00%)      254.18 (  35.69%)
Stddev     7        467.24 (   0.00%)      764.42 ( -63.60%)
Stddev     12      1071.43 (   0.00%)     1395.90 ( -30.28%)
Stddev     21      1694.50 (   0.00%)     1204.89 (  28.89%)
Stddev     30      7945.63 (   0.00%)     2552.59 (  67.87%)
Stddev     48     14339.51 (   0.00%)     3227.55 (  77.49%)
Stddev     79     16620.91 (   0.00%)     8422.15 (  49.33%)
Stddev     110    12912.15 (   0.00%)    13560.95 (  -5.02%)
Stddev     141    20700.13 (   0.00%)    14544.51 (  29.74%)
Stddev     160     9079.16 (   0.00%)     7400.69 (  18.49%)

This is more of a mixed bag but it at least shows that fairness
is not crippled.

The hackbench results are more neutral but this workload is still important.
It's possible to boost the dbench figures by a large amount but only by
crippling the performance of a workload like hackbench. The WF_SYNC
behaviour is important for these workloads and is why the WF_SYNC
changes are not a separate patch.

hackbench-process-pipes
                          6.18-rc1             6.18-rc1
                             vanilla   sched-preemptnext-v5
Amean     1        0.2657 (   0.00%)      0.2150 (  19.07%)
Amean     4        0.6107 (   0.00%)      0.6060 (   0.76%)
Amean     7        0.7923 (   0.00%)      0.7440 (   6.10%)
Amean     12       1.1500 (   0.00%)      1.1263 (   2.06%)
Amean     21       1.7950 (   0.00%)      1.7987 (  -0.20%)
Amean     30       2.3207 (   0.00%)      2.5053 (  -7.96%)
Amean     48       3.5023 (   0.00%)      3.9197 ( -11.92%)
Amean     79       4.8093 (   0.00%)      5.2247 (  -8.64%)
Amean     110      6.1160 (   0.00%)      6.6650 (  -8.98%)
Amean     141      7.4763 (   0.00%)      7.8973 (  -5.63%)
Amean     172      8.9560 (   0.00%)      9.3593 (  -4.50%)
Amean     203     10.4783 (   0.00%)     10.8347 (  -3.40%)
Amean     234     12.4977 (   0.00%)     13.0177 (  -4.16%)
Amean     265     14.7003 (   0.00%)     15.5630 (  -5.87%)
Amean     296     16.1007 (   0.00%)     17.4023 (  -8.08%)

Processes using pipes are impacted but the variance (not presented) indicates
the difference is close to noise and the results are not always reproducible.
If executed across multiple reboots, the comparison may show neutral or small
gains, so the worst measured results are presented.

Hackbench using sockets is more reliably neutral as the wakeup
mechanisms are different between sockets and pipes.

hackbench-process-sockets
                          6.18-rc1             6.18-rc1
                             vanilla   sched-preemptnext-v2
Amean     1        0.3073 (   0.00%)      0.3263 (  -6.18%)
Amean     4        0.7863 (   0.00%)      0.7930 (  -0.85%)
Amean     7        1.3670 (   0.00%)      1.3537 (   0.98%)
Amean     12       2.1337 (   0.00%)      2.1903 (  -2.66%)
Amean     21       3.4683 (   0.00%)      3.4940 (  -0.74%)
Amean     30       4.7247 (   0.00%)      4.8853 (  -3.40%)
Amean     48       7.6097 (   0.00%)      7.8197 (  -2.76%)
Amean     79      14.7957 (   0.00%)     16.1000 (  -8.82%)
Amean     110     21.3413 (   0.00%)     21.9997 (  -3.08%)
Amean     141     29.0503 (   0.00%)     29.0353 (   0.05%)
Amean     172     36.4660 (   0.00%)     36.1433 (   0.88%)
Amean     203     39.7177 (   0.00%)     40.5910 (  -2.20%)
Amean     234     42.1120 (   0.00%)     43.5527 (  -3.42%)
Amean     265     45.7830 (   0.00%)     50.0560 (  -9.33%)
Amean     296     50.7043 (   0.00%)     54.3657 (  -7.22%)

As schbench has been mentioned in numerous bugs recently, the results
are interesting. A test case that represents the default schbench
behaviour is:

schbench Wakeup Latency (usec)
                                       6.18.0-rc1             6.18.0-rc1
                                          vanilla   sched-preemptnext-v5
Amean     Wakeup-50th-80          7.17 (   0.00%)        6.00 (  16.28%)
Amean     Wakeup-90th-80         46.56 (   0.00%)       19.78 (  57.52%)
Amean     Wakeup-99th-80        119.61 (   0.00%)       89.94 (  24.80%)
Amean     Wakeup-99.9th-80     3193.78 (   0.00%)      328.22 (  89.72%)

schbench Requests Per Second (ops/sec)
                                  6.18.0-rc1             6.18.0-rc1
                                     vanilla   sched-preemptnext-v5
Hmean     RPS-20th-80     8900.91 (   0.00%)     9176.78 (   3.10%)
Hmean     RPS-50th-80     8987.41 (   0.00%)     9217.89 (   2.56%)
Hmean     RPS-90th-80     9123.73 (   0.00%)     9273.25 (   1.64%)
Hmean     RPS-max-80      9193.50 (   0.00%)     9301.47 (   1.17%)
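
For the tables above, Hmean is the harmonic mean across iterations (it
penalises outliers in rate metrics such as throughput and RPS) and Amean is
the arithmetic mean. A minimal sketch of how the comparison columns can be
derived, not mmtests' actual implementation; note that the sign is flipped
for lower-is-better metrics so a positive percentage is always an
improvement:

	#include <stddef.h>

	/* Harmonic mean of n samples; assumes every sample is > 0. */
	static double hmean(const double *s, size_t n)
	{
		double inv = 0.0;

		for (size_t i = 0; i < n; i++)
			inv += 1.0 / s[i];
		return (double)n / inv;
	}

	/* Percentage column for higher-is-better metrics (Hmean rows). */
	static double gain_pct(double base, double val)
	{
		return (val - base) / base * 100.0;
	}

	/* Percentage column for lower-is-better metrics (times, Stddev). */
	static double reduction_pct(double base, double val)
	{
		return (base - val) / base * 100.0;
	}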

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251112122521.1331238-3-mgorman@techsingularity.net
---
 kernel/sched/fair.c | 152 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 130 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 071e07f..c6e5c64 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -929,6 +929,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
 	if (cfs_rq->nr_queued == 1)
 		return curr && curr->on_rq ? curr : se;
 
+	/*
+	 * Picking the ->next buddy will affect latency but not fairness.
+	 */
+	if (sched_feat(PICK_BUDDY) &&
+	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
+		/* ->next will never be delayed */
+		WARN_ON_ONCE(cfs_rq->next->sched_delayed);
+		return cfs_rq->next;
+	}
+
 	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
 		curr = NULL;
 
@@ -1167,6 +1177,8 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
 	return delta_exec;
 }
 
+static void set_next_buddy(struct sched_entity *se);
+
 /*
  * Used by other classes to account runtime.
  */
@@ -5466,16 +5478,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *se;
 
-	/*
-	 * Picking the ->next buddy will affect latency but not fairness.
-	 */
-	if (sched_feat(PICK_BUDDY) &&
-	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
-		/* ->next will never be delayed */
-		WARN_ON_ONCE(cfs_rq->next->sched_delayed);
-		return cfs_rq->next;
-	}
-
 	se = pick_eevdf(cfs_rq);
 	if (se->sched_delayed) {
 		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
@@ -6988,8 +6990,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	hrtick_update(rq);
 }
 
-static void set_next_buddy(struct sched_entity *se);
-
 /*
  * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
  * failing half-way through and resume the dequeue later.
@@ -8676,16 +8676,81 @@ static void set_next_buddy(struct sched_entity *se)
 	}
 }
 
+enum preempt_wakeup_action {
+	PREEMPT_WAKEUP_NONE,	/* No preemption. */
+	PREEMPT_WAKEUP_SHORT,	/* Ignore slice protection. */
+	PREEMPT_WAKEUP_PICK,	/* Let __pick_eevdf() decide. */
+	PREEMPT_WAKEUP_RESCHED,	/* Force reschedule. */
+};
+
+static inline bool
+set_preempt_buddy(struct cfs_rq *cfs_rq, int wake_flags,
+		  struct sched_entity *pse, struct sched_entity *se)
+{
+	/*
+	 * Keep existing buddy if the deadline is sooner than pse.
+	 * The older buddy may be cache cold and completely unrelated
+	 * to the current wakeup but that is unpredictable, whereas
+	 * obeying the deadline is more in line with EEVDF objectives.
+	 */
+	if (cfs_rq->next && entity_before(cfs_rq->next, pse))
+		return false;
+
+	set_next_buddy(pse);
+	return true;
+}
+
+/*
+ * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
+ * strictly enforced because the hint is either misunderstood or
+ * multiple tasks must be woken up.
+ */
+static inline enum preempt_wakeup_action
+preempt_sync(struct rq *rq, int wake_flags,
+	     struct sched_entity *pse, struct sched_entity *se)
+{
+	u64 threshold, delta;
+
+	/*
+	 * WF_SYNC without WF_TTWU is not expected so warn if it happens even
+	 * though it is likely harmless.
+	 */
+	WARN_ON_ONCE(!(wake_flags & WF_TTWU));
+
+	threshold = sysctl_sched_migration_cost;
+	delta = rq_clock_task(rq) - se->exec_start;
+	if ((s64)delta < 0)
+		delta = 0;
+
+	/*
+	 * WF_RQ_SELECTED implies the tasks are stacking on a CPU when they
+	 * could run on other CPUs. Reduce the threshold before preemption is
+	 * allowed to an arbitrary lower value as it is more likely (but not
+	 * guaranteed) the waker requires the wakee to finish.
+	 */
+	if (wake_flags & WF_RQ_SELECTED)
+		threshold >>= 2;
+
+	/*
+	 * As WF_SYNC is not strictly obeyed, allow some runtime for batch
+	 * wakeups to be issued.
+	 */
+	if (entity_before(pse, se) && delta >= threshold)
+		return PREEMPT_WAKEUP_RESCHED;
+
+	return PREEMPT_WAKEUP_NONE;
+}
+
 /*
  * Preempt the current task with a newly woken task if needed:
  */
 static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
 {
+	enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
 	struct task_struct *donor = rq->donor;
 	struct sched_entity *se = &donor->se, *pse = &p->se;
 	struct cfs_rq *cfs_rq = task_cfs_rq(donor);
 	int cse_is_idle, pse_is_idle;
-	bool do_preempt_short = false;
 
 	if (unlikely(se == pse))
 		return;
@@ -8699,10 +8764,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int 
 	if (task_is_throttled(p))
 		return;
 
-	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
-		set_next_buddy(pse);
-	}
-
 	/*
 	 * We can come here with TIF_NEED_RESCHED already set from new task
 	 * wake up path.
@@ -8734,7 +8795,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int 
 		 * When non-idle entity preempt an idle entity,
 		 * don't give idle entity slice protection.
 		 */
-		do_preempt_short = true;
+		preempt_action = PREEMPT_WAKEUP_SHORT;
 		goto preempt;
 	}
 
@@ -8753,21 +8814,68 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int 
 	 * If @p has a shorter slice than current and @p is eligible, override
 	 * current's slice protection in order to allow preemption.
 	 */
-	do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
+	if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
+		preempt_action = PREEMPT_WAKEUP_SHORT;
+		goto pick;
+	}
 
 	/*
+	 * Ignore wakee preemption on WF_FORK as it is less likely that
+	 * there is shared data as exec often follow fork. Do not
+	 * preempt for tasks that are sched_delayed as it would violate
+	 * EEVDF to forcibly queue an ineligible task.
+	 */
+	if ((wake_flags & WF_FORK) || pse->sched_delayed)
+		return;
+
+	/*
+	 * If @p potentially is completing work required by current then
+	 * consider preemption.
+	 * Reschedule if waker is no longer eligible.
+	 */
+	if (in_task() && !entity_eligible(cfs_rq, se)) {
+		preempt_action = PREEMPT_WAKEUP_RESCHED;
+		goto preempt;
+	}
+
+	/* Prefer picking wakee soon if appropriate. */
+	if (sched_feat(NEXT_BUDDY) &&
+	    set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
+
+		/*
+		 * Decide whether to obey WF_SYNC hint for a new buddy. Old
+		 * buddies are ignored as they may not be relevant to the
+		 * waker and less likely to be cache hot.
+		 */
+		if (wake_flags & WF_SYNC)
+			preempt_action = preempt_sync(rq, wake_flags, pse, se);
+	}
+
+	switch (preempt_action) {
+	case PREEMPT_WAKEUP_NONE:
+		return;
+	case PREEMPT_WAKEUP_RESCHED:
+		goto preempt;
+	case PREEMPT_WAKEUP_SHORT:
+		fallthrough;
+	case PREEMPT_WAKEUP_PICK:
+		break;
+	}
+
+pick:
+	/*
 	 * If @p has become the most eligible task, force preemption.
 	 */
-	if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
+	if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse)
 		goto preempt;
 
-	if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
+	if (sched_feat(RUN_TO_PARITY))
 		update_protect_slice(cfs_rq, se);
 
 	return;
 
 preempt:
-	if (do_preempt_short)
+	if (preempt_action == PREEMPT_WAKEUP_SHORT)
 		cancel_protect_slice(se);
 
 	resched_curr_lazy(rq);
[REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by Ryan Roberts 1 month, 2 weeks ago
Hi Mel, Peter,

We are building out a kernel performance regression monitoring lab at Arm, and 
I've noticed some fairly large performance regressions in real-world workloads, 
for which bisection has fingered this patch.

We are looking at performance changes between v6.18 and v6.19-rc1, and by 
reverting this patch on top of v6.19-rc1 many regressions are resolved. (We plan 
to move the testing to linux-next over the next couple of quarters so hopefully 
we will be able to deliver this sort of news prior to merging in future).

All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean 
statistically significant regression/improvement, where "statistically 
significant" means the 95% confidence intervals do not overlap.
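
As a rough sketch of that criterion, assuming a normal approximation with
z = 1.96 (our actual tooling may compute the intervals differently):

	#include <math.h>
	#include <stddef.h>

	struct ci95 { double lo, hi; };

	/* 95% confidence interval for the mean: mean +/- 1.96*sd/sqrt(n). */
	static struct ci95 ci95_of(const double *s, size_t n)
	{
		double mean = 0.0, var = 0.0, half;
		size_t i;

		for (i = 0; i < n; i++)
			mean += s[i];
		mean /= n;
		for (i = 0; i < n; i++)
			var += (s[i] - mean) * (s[i] - mean);
		var /= (n - 1);			/* sample variance */

		half = 1.96 * sqrt(var / n);
		return (struct ci95){ mean - half, mean + half };
	}

	/* Flag (R)/(I) only when the two intervals are disjoint. */
	static int significant(struct ci95 a, struct ci95 b)
	{
		return a.hi < b.lo || b.hi < a.lo;
	}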

The below is a large scale mysql workload, running across 2 AWS instances (a 
load generator and the mysql server). We have a partner for whom this is a very 
important workload. Performance regresses by 1.3% between 6.18 and 6.19-rc1 
(where the patch is added). By reverting the patch, the regression is not only 
fixed but performance is now nearly 6% better than v6.18:

+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
| Benchmark                       | Result Class                                       |   6-18-0 (base) |   6-19-0-rc1 | revert-next-buddy |
+=================================+====================================================+=================+==============+===================+
| repro-collection/mysql-workload | db transaction rate (transactions/min)             |       646267.33 |   (R) -1.33% |         (I) 5.87% |
|                                 | new order rate (orders/min)                        |       213256.50 |   (R) -1.32% |         (I) 5.87% |
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+


Next are a bunch of benchmarks all running on a single system. specjbb is the 
SPEC Java Business Benchmark. The mysql one is the same as above but this time 
both loadgen and server are on the same system. pgbench is the PostgreSQL 
benchmark.

I'm showing hackbench for completeness, but I don't consider it a high priority 
issue.

Interestingly, nginx improves significantly with the patch.

+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
| Benchmark                       | Result Class                                       |   6-18-0 (base) |   6-19-0-rc1 | revert-next-buddy |
+=================================+====================================================+=================+==============+===================+
| specjbb/composite               | critical-jOPS (jOPS)                               |        94700.00 |   (R) -5.10% |            -0.90% |
|                                 | max-jOPS (jOPS)                                    |       113984.50 |   (R) -3.90% |            -0.65% |
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
| repro-collection/mysql-workload | db transaction rate (transactions/min)             |       245438.25 |   (R) -3.88% |            -0.13% |
|                                 | new order rate (orders/min)                        |        80985.75 |   (R) -3.78% |            -0.07% |
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
| pts/pgbench                     | Scale: 1 Clients: 1 Read Only (TPS)                |        63124.00 |    (I) 2.90% |             0.74% |
|                                 | Scale: 1 Clients: 1 Read Only - Latency (ms)       |           0.016 |    (I) 5.49% |             1.05% |
|                                 | Scale: 1 Clients: 1 Read Write (TPS)               |          974.92 |        0.11% |            -0.08% |
|                                 | Scale: 1 Clients: 1 Read Write - Latency (ms)      |            1.03 |        0.12% |            -0.06% |
|                                 | Scale: 1 Clients: 250 Read Only (TPS)              |      1915931.58 |   (R) -2.25% |         (I) 2.12% |
|                                 | Scale: 1 Clients: 250 Read Only - Latency (ms)     |            0.13 |   (R) -2.37% |         (I) 2.09% |
|                                 | Scale: 1 Clients: 250 Read Write (TPS)             |          855.67 |       -1.36% |            -0.14% |
|                                 | Scale: 1 Clients: 250 Read Write - Latency (ms)    |          292.39 |       -1.31% |            -0.08% |
|                                 | Scale: 1 Clients: 1000 Read Only (TPS)             |      1534130.08 |  (R) -11.37% |             0.08% |
|                                 | Scale: 1 Clients: 1000 Read Only - Latency (ms)    |            0.65 |  (R) -11.38% |             0.08% |
|                                 | Scale: 1 Clients: 1000 Read Write (TPS)            |          578.75 |       -1.11% |             2.15% |
|                                 | Scale: 1 Clients: 1000 Read Write - Latency (ms)   |         1736.98 |       -1.26% |             2.47% |
|                                 | Scale: 100 Clients: 1 Read Only (TPS)              |        57170.33 |        1.68% |             0.10% |
|                                 | Scale: 100 Clients: 1 Read Only - Latency (ms)     |           0.018 |        1.94% |             0.00% |
|                                 | Scale: 100 Clients: 1 Read Write (TPS)             |          836.58 |       -0.37% |            -0.41% |
|                                 | Scale: 100 Clients: 1 Read Write - Latency (ms)    |            1.20 |       -0.37% |            -0.40% |
|                                 | Scale: 100 Clients: 250 Read Only (TPS)            |      1773440.67 |       -1.61% |             1.67% |
|                                 | Scale: 100 Clients: 250 Read Only - Latency (ms)   |            0.14 |       -1.40% |             1.56% |
|                                 | Scale: 100 Clients: 250 Read Write (TPS)           |         5505.50 |       -0.17% |            -0.86% |
|                                 | Scale: 100 Clients: 250 Read Write - Latency (ms)  |           45.42 |       -0.17% |            -0.85% |
|                                 | Scale: 100 Clients: 1000 Read Only (TPS)           |      1393037.50 |  (R) -10.31% |            -0.19% |
|                                 | Scale: 100 Clients: 1000 Read Only - Latency (ms)  |            0.72 |  (R) -10.30% |            -0.17% |
|                                 | Scale: 100 Clients: 1000 Read Write (TPS)          |         5085.92 |        0.27% |             0.07% |
|                                 | Scale: 100 Clients: 1000 Read Write - Latency (ms) |          196.79 |        0.23% |             0.05% |
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
| mmtests/hackbench               | hackbench-process-pipes-1 (seconds)                |            0.14 |       -1.51% |            -1.05% |
|                                 | hackbench-process-pipes-4 (seconds)                |            0.44 |    (I) 6.49% |         (I) 5.42% |
|                                 | hackbench-process-pipes-7 (seconds)                |            0.68 |  (R) -18.36% |         (I) 3.40% |
|                                 | hackbench-process-pipes-12 (seconds)               |            1.24 |  (R) -19.89% |            -0.45% |
|                                 | hackbench-process-pipes-21 (seconds)               |            1.81 |   (R) -8.41% |            -1.22% |
|                                 | hackbench-process-pipes-30 (seconds)               |            2.39 |   (R) -9.06% |        (R) -2.95% |
|                                 | hackbench-process-pipes-48 (seconds)               |            3.18 |  (R) -11.68% |        (R) -4.10% |
|                                 | hackbench-process-pipes-79 (seconds)               |            3.84 |   (R) -9.74% |        (R) -3.25% |
|                                 | hackbench-process-pipes-110 (seconds)              |            4.68 |   (R) -6.57% |        (R) -2.12% |
|                                 | hackbench-process-pipes-141 (seconds)              |            5.75 |   (R) -5.86% |        (R) -3.44% |
|                                 | hackbench-process-pipes-172 (seconds)              |            6.80 |   (R) -4.28% |        (R) -2.81% |
|                                 | hackbench-process-pipes-203 (seconds)              |            7.94 |   (R) -4.01% |        (R) -3.00% |
|                                 | hackbench-process-pipes-234 (seconds)              |            9.02 |   (R) -3.52% |        (R) -2.81% |
|                                 | hackbench-process-pipes-256 (seconds)              |            9.78 |   (R) -3.24% |        (R) -2.81% |
|                                 | hackbench-process-sockets-1 (seconds)              |            0.29 |        0.50% |             0.26% |
|                                 | hackbench-process-sockets-4 (seconds)              |            0.76 |   (I) 17.44% |        (I) 16.31% |
|                                 | hackbench-process-sockets-7 (seconds)              |            1.16 |   (I) 12.10% |         (I) 9.78% |
|                                 | hackbench-process-sockets-12 (seconds)             |            1.86 |   (I) 10.19% |         (I) 9.83% |
|                                 | hackbench-process-sockets-21 (seconds)             |            3.12 |    (I) 9.38% |         (I) 9.20% |
|                                 | hackbench-process-sockets-30 (seconds)             |            4.30 |    (I) 6.43% |         (I) 6.11% |
|                                 | hackbench-process-sockets-48 (seconds)             |            6.58 |    (I) 3.00% |         (I) 2.19% |
|                                 | hackbench-process-sockets-79 (seconds)             |           10.56 |    (I) 2.87% |         (I) 3.31% |
|                                 | hackbench-process-sockets-110 (seconds)            |           13.85 |       -1.15% |         (I) 2.33% |
|                                 | hackbench-process-sockets-141 (seconds)            |           19.23 |       -1.40% |        (I) 14.53% |
|                                 | hackbench-process-sockets-172 (seconds)            |           26.33 |    (I) 3.52% |        (I) 30.37% |
|                                 | hackbench-process-sockets-203 (seconds)            |           30.27 |        1.10% |        (I) 27.20% |
|                                 | hackbench-process-sockets-234 (seconds)            |           35.12 |        1.60% |        (I) 28.24% |
|                                 | hackbench-process-sockets-256 (seconds)            |           38.74 |        0.70% |        (I) 28.74% |
|                                 | hackbench-thread-pipes-1 (seconds)                 |            0.17 |       -1.32% |            -0.76% |
|                                 | hackbench-thread-pipes-4 (seconds)                 |            0.45 |    (I) 6.91% |         (I) 7.64% |
|                                 | hackbench-thread-pipes-7 (seconds)                 |            0.74 |   (R) -7.51% |         (I) 5.26% |
|                                 | hackbench-thread-pipes-12 (seconds)                |            1.32 |   (R) -8.40% |         (I) 2.32% |
|                                 | hackbench-thread-pipes-21 (seconds)                |            1.95 |   (R) -2.95% |             0.91% |
|                                 | hackbench-thread-pipes-30 (seconds)                |            2.50 |   (R) -4.61% |             1.47% |
|                                 | hackbench-thread-pipes-48 (seconds)                |            3.32 |   (R) -5.45% |         (I) 2.15% |
|                                 | hackbench-thread-pipes-79 (seconds)                |            4.04 |   (R) -5.53% |             1.85% |
|                                 | hackbench-thread-pipes-110 (seconds)               |            4.94 |   (R) -2.33% |             1.51% |
|                                 | hackbench-thread-pipes-141 (seconds)               |            6.04 |   (R) -2.47% |             1.15% |
|                                 | hackbench-thread-pipes-172 (seconds)               |            7.15 |       -0.91% |             1.48% |
|                                 | hackbench-thread-pipes-203 (seconds)               |            8.31 |       -1.29% |             0.77% |
|                                 | hackbench-thread-pipes-234 (seconds)               |            9.49 |       -1.03% |             0.77% |
|                                 | hackbench-thread-pipes-256 (seconds)               |           10.30 |       -0.80% |             0.42% |
|                                 | hackbench-thread-sockets-1 (seconds)               |            0.31 |        0.05% |            -0.05% |
|                                 | hackbench-thread-sockets-4 (seconds)               |            0.79 |   (I) 18.91% |        (I) 16.82% |
|                                 | hackbench-thread-sockets-7 (seconds)               |            1.16 |   (I) 12.57% |        (I) 10.63% |
|                                 | hackbench-thread-sockets-12 (seconds)              |            1.87 |   (I) 12.65% |        (I) 12.26% |
|                                 | hackbench-thread-sockets-21 (seconds)              |            3.16 |   (I) 11.62% |        (I) 12.74% |
|                                 | hackbench-thread-sockets-30 (seconds)              |            4.32 |    (I) 7.35% |         (I) 8.89% |
|                                 | hackbench-thread-sockets-48 (seconds)              |            6.45 |    (I) 2.69% |         (I) 3.06% |
|                                 | hackbench-thread-sockets-79 (seconds)              |           10.15 |    (I) 3.30% |             1.98% |
|                                 | hackbench-thread-sockets-110 (seconds)             |           13.45 |       -0.25% |         (I) 3.68% |
|                                 | hackbench-thread-sockets-141 (seconds)             |           17.87 |   (R) -2.18% |         (I) 8.46% |
|                                 | hackbench-thread-sockets-172 (seconds)             |           24.38 |        1.02% |        (I) 24.33% |
|                                 | hackbench-thread-sockets-203 (seconds)             |           28.38 |       -0.99% |        (I) 24.20% |
|                                 | hackbench-thread-sockets-234 (seconds)             |           32.75 |       -0.42% |        (I) 24.35% |
|                                 | hackbench-thread-sockets-256 (seconds)             |           36.49 |       -1.30% |        (I) 26.22% |
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
| pts/nginx                       | Connections: 200 (Requests Per Second)             |       252332.60 |   (I) 17.54% |            -0.53% |
|                                 | Connections: 1000 (Requests Per Second)            |       248591.29 |   (I) 20.41% |             0.10% |
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+

All of the benchmarks have been run multiple times and I have high confidence in 
the results. I can share min/mean/max/stdev/ci95 stats if that's helpful though.

I'm not providing the data, but we also see similar regressions on AmpereOne 
(another arm64 server system). And we have seen a few functional tests (kvm 
selftests) that have started to timeout due to this patch slowing things down on 
arm64.

I'm hoping you can advise on the best way to proceed? We have a bigger library 
than what I'm showing, but the only improvement I see due to this patch is 
nginx. So based on that, my preference would be to revert the patch upstream 
until the issues can be worked out. I'm guessing the story is quite different 
for x86 though?

Thanks,
Ryan



On 17/11/2025 16:23, tip-bot2 for Mel Gorman wrote:
> The following commit has been merged into the sched/core branch of tip:
> 
> [...]
Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by Ryan Roberts 1 month ago
Hi, I appreciate I sent this report just before Xmas so most likely you haven't
had a chance to look, but I wanted to bring it back to the top of your mailbox
in case it was missed.

Happy new year!

Thanks,
Ryan

On 22/12/2025 10:57, Ryan Roberts wrote:
> Hi Mel, Peter,
> 
> [...]
> |                                 | Scale: 100 Clients: 1000 Read Only (TPS)           |      1393037.50 |  (R) -10.31% |            -0.19% |
> |                                 | Scale: 100 Clients: 1000 Read Only - Latency (ms)  |            0.72 |  (R) -10.30% |            -0.17% |
> |                                 | Scale: 100 Clients: 1000 Read Write (TPS)          |         5085.92 |        0.27% |             0.07% |
> |                                 | Scale: 100 Clients: 1000 Read Write - Latency (ms) |          196.79 |        0.23% |             0.05% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | mmtests/hackbench               | hackbench-process-pipes-1 (seconds)                |            0.14 |       -1.51% |            -1.05% |
> |                                 | hackbench-process-pipes-4 (seconds)                |            0.44 |    (I) 6.49% |         (I) 5.42% |
> |                                 | hackbench-process-pipes-7 (seconds)                |            0.68 |  (R) -18.36% |         (I) 3.40% |
> |                                 | hackbench-process-pipes-12 (seconds)               |            1.24 |  (R) -19.89% |            -0.45% |
> |                                 | hackbench-process-pipes-21 (seconds)               |            1.81 |   (R) -8.41% |            -1.22% |
> |                                 | hackbench-process-pipes-30 (seconds)               |            2.39 |   (R) -9.06% |        (R) -2.95% |
> |                                 | hackbench-process-pipes-48 (seconds)               |            3.18 |  (R) -11.68% |        (R) -4.10% |
> |                                 | hackbench-process-pipes-79 (seconds)               |            3.84 |   (R) -9.74% |        (R) -3.25% |
> |                                 | hackbench-process-pipes-110 (seconds)              |            4.68 |   (R) -6.57% |        (R) -2.12% |
> |                                 | hackbench-process-pipes-141 (seconds)              |            5.75 |   (R) -5.86% |        (R) -3.44% |
> |                                 | hackbench-process-pipes-172 (seconds)              |            6.80 |   (R) -4.28% |        (R) -2.81% |
> |                                 | hackbench-process-pipes-203 (seconds)              |            7.94 |   (R) -4.01% |        (R) -3.00% |
> |                                 | hackbench-process-pipes-234 (seconds)              |            9.02 |   (R) -3.52% |        (R) -2.81% |
> |                                 | hackbench-process-pipes-256 (seconds)              |            9.78 |   (R) -3.24% |        (R) -2.81% |
> |                                 | hackbench-process-sockets-1 (seconds)              |            0.29 |        0.50% |             0.26% |
> |                                 | hackbench-process-sockets-4 (seconds)              |            0.76 |   (I) 17.44% |        (I) 16.31% |
> |                                 | hackbench-process-sockets-7 (seconds)              |            1.16 |   (I) 12.10% |         (I) 9.78% |
> |                                 | hackbench-process-sockets-12 (seconds)             |            1.86 |   (I) 10.19% |         (I) 9.83% |
> |                                 | hackbench-process-sockets-21 (seconds)             |            3.12 |    (I) 9.38% |         (I) 9.20% |
> |                                 | hackbench-process-sockets-30 (seconds)             |            4.30 |    (I) 6.43% |         (I) 6.11% |
> |                                 | hackbench-process-sockets-48 (seconds)             |            6.58 |    (I) 3.00% |         (I) 2.19% |
> |                                 | hackbench-process-sockets-79 (seconds)             |           10.56 |    (I) 2.87% |         (I) 3.31% |
> |                                 | hackbench-process-sockets-110 (seconds)            |           13.85 |       -1.15% |         (I) 2.33% |
> |                                 | hackbench-process-sockets-141 (seconds)            |           19.23 |       -1.40% |        (I) 14.53% |
> |                                 | hackbench-process-sockets-172 (seconds)            |           26.33 |    (I) 3.52% |        (I) 30.37% |
> |                                 | hackbench-process-sockets-203 (seconds)            |           30.27 |        1.10% |        (I) 27.20% |
> |                                 | hackbench-process-sockets-234 (seconds)            |           35.12 |        1.60% |        (I) 28.24% |
> |                                 | hackbench-process-sockets-256 (seconds)            |           38.74 |        0.70% |        (I) 28.74% |
> |                                 | hackbench-thread-pipes-1 (seconds)                 |            0.17 |       -1.32% |            -0.76% |
> |                                 | hackbench-thread-pipes-4 (seconds)                 |            0.45 |    (I) 6.91% |         (I) 7.64% |
> |                                 | hackbench-thread-pipes-7 (seconds)                 |            0.74 |   (R) -7.51% |         (I) 5.26% |
> |                                 | hackbench-thread-pipes-12 (seconds)                |            1.32 |   (R) -8.40% |         (I) 2.32% |
> |                                 | hackbench-thread-pipes-21 (seconds)                |            1.95 |   (R) -2.95% |             0.91% |
> |                                 | hackbench-thread-pipes-30 (seconds)                |            2.50 |   (R) -4.61% |             1.47% |
> |                                 | hackbench-thread-pipes-48 (seconds)                |            3.32 |   (R) -5.45% |         (I) 2.15% |
> |                                 | hackbench-thread-pipes-79 (seconds)                |            4.04 |   (R) -5.53% |             1.85% |
> |                                 | hackbench-thread-pipes-110 (seconds)               |            4.94 |   (R) -2.33% |             1.51% |
> |                                 | hackbench-thread-pipes-141 (seconds)               |            6.04 |   (R) -2.47% |             1.15% |
> |                                 | hackbench-thread-pipes-172 (seconds)               |            7.15 |       -0.91% |             1.48% |
> |                                 | hackbench-thread-pipes-203 (seconds)               |            8.31 |       -1.29% |             0.77% |
> |                                 | hackbench-thread-pipes-234 (seconds)               |            9.49 |       -1.03% |             0.77% |
> |                                 | hackbench-thread-pipes-256 (seconds)               |           10.30 |       -0.80% |             0.42% |
> |                                 | hackbench-thread-sockets-1 (seconds)               |            0.31 |        0.05% |            -0.05% |
> |                                 | hackbench-thread-sockets-4 (seconds)               |            0.79 |   (I) 18.91% |        (I) 16.82% |
> |                                 | hackbench-thread-sockets-7 (seconds)               |            1.16 |   (I) 12.57% |        (I) 10.63% |
> |                                 | hackbench-thread-sockets-12 (seconds)              |            1.87 |   (I) 12.65% |        (I) 12.26% |
> |                                 | hackbench-thread-sockets-21 (seconds)              |            3.16 |   (I) 11.62% |        (I) 12.74% |
> |                                 | hackbench-thread-sockets-30 (seconds)              |            4.32 |    (I) 7.35% |         (I) 8.89% |
> |                                 | hackbench-thread-sockets-48 (seconds)              |            6.45 |    (I) 2.69% |         (I) 3.06% |
> |                                 | hackbench-thread-sockets-79 (seconds)              |           10.15 |    (I) 3.30% |             1.98% |
> |                                 | hackbench-thread-sockets-110 (seconds)             |           13.45 |       -0.25% |         (I) 3.68% |
> |                                 | hackbench-thread-sockets-141 (seconds)             |           17.87 |   (R) -2.18% |         (I) 8.46% |
> |                                 | hackbench-thread-sockets-172 (seconds)             |           24.38 |        1.02% |        (I) 24.33% |
> |                                 | hackbench-thread-sockets-203 (seconds)             |           28.38 |       -0.99% |        (I) 24.20% |
> |                                 | hackbench-thread-sockets-234 (seconds)             |           32.75 |       -0.42% |        (I) 24.35% |
> |                                 | hackbench-thread-sockets-256 (seconds)             |           36.49 |       -1.30% |        (I) 26.22% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | pts/nginx                       | Connections: 200 (Requests Per Second)             |       252332.60 |   (I) 17.54% |            -0.53% |
> |                                 | Connections: 1000 (Requests Per Second)            |       248591.29 |   (I) 20.41% |             0.10% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> 
> All of the benchmarks have been run multiple times and I have high confidence in 
> the results. I can share min/mean/max/stdev/ci95 stats if that's helpful though.
> 
> I'm not providing the data, but we also see similar regressions on AmpereOne 
> (another arm64 server system). And we have seen a few functional tests (kvm 
> selftests) that have started to time out due to this patch slowing things down on 
> arm64.
> 
> I'm hoping you can advise on the best way to proceed? We have a bigger library 
> than what I'm showing, but the only improvement I see due to this patch is 
> nginx. So based on that, my preference would be to revert the patch upstream 
> until the issues can be worked out. I'm guessing the story is quite different 
> for x86 though?
> 
> Thanks,
> Ryan
> 
> 
> 
> On 17/11/2025 16:23, tip-bot2 for Mel Gorman wrote:
>> The following commit has been merged into the sched/core branch of tip:
>>
>> Commit-ID:     e837456fdca81899a3c8e47b3fd39e30eae6e291
>> Gitweb:        https://git.kernel.org/tip/e837456fdca81899a3c8e47b3fd39e30eae6e291
>> Author:        Mel Gorman <mgorman@techsingularity.net>
>> AuthorDate:    Wed, 12 Nov 2025 12:25:21 
>> Committer:     Peter Zijlstra <peterz@infradead.org>
>> CommitterDate: Mon, 17 Nov 2025 17:13:15 +01:00
>>
>> sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
>>
>> Reimplement NEXT_BUDDY preemption to take into account the deadline and
>> eligibility of the wakee with respect to the waker. In the event
>> multiple buddies could be considered, the one with the earliest deadline
>> is selected.
>>
>> Sync wakeups are treated differently to every other type of wakeup. The
>> WF_SYNC assumption is that the waker promises to sleep in the very near
>> future. This is violated in enough cases that WF_SYNC should be treated
>> as a suggestion instead of a contract. If a waker does go to sleep almost
>> immediately then the delay in wakeup is negligible. In other cases, it's
>> throttled based on the accumulated runtime of the waker so there is a
>> chance that some batched wakeups have been issued before preemption.
>>
>> For all other wakeups, preemption happens if the wakee has an earlier
>> deadline than the waker and is eligible to run.
>>
>> While many workloads were tested, the two main targets were a modified
>> dbench4 benchmark and hackbench because they are on opposite ends of the
>> spectrum -- one prefers throughput by avoiding preemption and the other
>> relies on preemption.
>>
>> First is the dbench throughput data; it is a poor metric but it is the
>> default one. The test machine is a 2-socket machine and the
>> backing filesystem is XFS as a lot of the IO work is dispatched to kernel
>> threads. It's important to note that these results are not representative
>> across all machines, especially Zen machines, as different bottlenecks
>> are exposed on different machines and filesystems.
>>
>> dbench4 Throughput (misleading but traditional)
>>                             6.18-rc1               6.18-rc1
>>                              vanilla   sched-preemptnext-v5
>> Hmean     1       1268.80 (   0.00%)     1269.74 (   0.07%)
>> Hmean     4       3971.74 (   0.00%)     3950.59 (  -0.53%)
>> Hmean     7       5548.23 (   0.00%)     5420.08 (  -2.31%)
>> Hmean     12      7310.86 (   0.00%)     7165.57 (  -1.99%)
>> Hmean     21      8874.53 (   0.00%)     9149.04 (   3.09%)
>> Hmean     30      9361.93 (   0.00%)    10530.04 (  12.48%)
>> Hmean     48      9540.14 (   0.00%)    11820.40 (  23.90%)
>> Hmean     79      9208.74 (   0.00%)    12193.79 (  32.42%)
>> Hmean     110     8573.12 (   0.00%)    11933.72 (  39.20%)
>> Hmean     141     7791.33 (   0.00%)    11273.90 (  44.70%)
>> Hmean     160     7666.60 (   0.00%)    10768.72 (  40.46%)
>>
>> As throughput is misleading, the benchmark is modified to use a short
>> loadfile and report the completion time in milliseconds.
>>
>> dbench4 Loadfile Execution Time
>>                              6.18-rc1               6.18-rc1
>>                               vanilla   sched-preemptnext-v5
>> Amean      1         14.62 (   0.00%)       14.69 (  -0.46%)
>> Amean      4         18.76 (   0.00%)       18.85 (  -0.45%)
>> Amean      7         23.71 (   0.00%)       24.38 (  -2.82%)
>> Amean      12        31.25 (   0.00%)       31.87 (  -1.97%)
>> Amean      21        45.12 (   0.00%)       43.69 (   3.16%)
>> Amean      30        61.07 (   0.00%)       54.33 (  11.03%)
>> Amean      48        95.91 (   0.00%)       77.22 (  19.49%)
>> Amean      79       163.38 (   0.00%)      123.08 (  24.66%)
>> Amean      110      243.91 (   0.00%)      175.11 (  28.21%)
>> Amean      141      343.47 (   0.00%)      239.10 (  30.39%)
>> Amean      160      401.15 (   0.00%)      283.73 (  29.27%)
>> Stddev     1          0.52 (   0.00%)        0.51 (   2.45%)
>> Stddev     4          1.36 (   0.00%)        1.30 (   4.04%)
>> Stddev     7          1.88 (   0.00%)        1.87 (   0.72%)
>> Stddev     12         3.06 (   0.00%)        2.45 (  19.83%)
>> Stddev     21         5.78 (   0.00%)        3.87 (  33.06%)
>> Stddev     30         9.85 (   0.00%)        5.25 (  46.76%)
>> Stddev     48        22.31 (   0.00%)        8.64 (  61.27%)
>> Stddev     79        35.96 (   0.00%)       18.07 (  49.76%)
>> Stddev     110       59.04 (   0.00%)       30.93 (  47.61%)
>> Stddev     141       85.38 (   0.00%)       40.93 (  52.06%)
>> Stddev     160       96.38 (   0.00%)       39.72 (  58.79%)
>>
>> That is still looking good and the variance is reduced quite a bit.
>> Finally, fairness is a concern so the next report tracks how many
>> milliseconds it takes for all clients to complete a workfile. This
>> one is tricky because dbench makes no effort to synchronise clients so
>> the durations at benchmark start time differ substantially from typical
>> runtimes. This problem could be mitigated by warming up the benchmark
>> for a number of minutes but it's a matter of opinion whether that
>> counts as an evasion of inconvenient results.
>>
>> dbench4 All Clients Loadfile Execution Time
>>                              6.18-rc1               6.18-rc1
>>                               vanilla   sched-preemptnext-v5
>> Amean      1         15.06 (   0.00%)       15.07 (  -0.03%)
>> Amean      4        603.81 (   0.00%)      524.29 (  13.17%)
>> Amean      7        855.32 (   0.00%)     1331.07 ( -55.62%)
>> Amean      12      1890.02 (   0.00%)     2323.97 ( -22.96%)
>> Amean      21      3195.23 (   0.00%)     2009.29 (  37.12%)
>> Amean      30     13919.53 (   0.00%)     4579.44 (  67.10%)
>> Amean      48     25246.07 (   0.00%)     5705.46 (  77.40%)
>> Amean      79     29701.84 (   0.00%)    15509.26 (  47.78%)
>> Amean      110    22803.03 (   0.00%)    23782.08 (  -4.29%)
>> Amean      141    36356.07 (   0.00%)    25074.20 (  31.03%)
>> Amean      160    17046.71 (   0.00%)    13247.62 (  22.29%)
>> Stddev     1          0.47 (   0.00%)        0.49 (  -3.74%)
>> Stddev     4        395.24 (   0.00%)      254.18 (  35.69%)
>> Stddev     7        467.24 (   0.00%)      764.42 ( -63.60%)
>> Stddev     12      1071.43 (   0.00%)     1395.90 ( -30.28%)
>> Stddev     21      1694.50 (   0.00%)     1204.89 (  28.89%)
>> Stddev     30      7945.63 (   0.00%)     2552.59 (  67.87%)
>> Stddev     48     14339.51 (   0.00%)     3227.55 (  77.49%)
>> Stddev     79     16620.91 (   0.00%)     8422.15 (  49.33%)
>> Stddev     110    12912.15 (   0.00%)    13560.95 (  -5.02%)
>> Stddev     141    20700.13 (   0.00%)    14544.51 (  29.74%)
>> Stddev     160     9079.16 (   0.00%)     7400.69 (  18.49%)
>>
>> This is more of a mixed bag but it at least shows that fairness
>> is not crippled.
>>
>> The hackbench results are more neutral but this is still important.
>> It's possible to boost the dbench figures by a large amount but only by
>> crippling the performance of a workload like hackbench. The WF_SYNC
>> behaviour is important for these workloads and is why the WF_SYNC
>> changes are not a separate patch.
>>
>> hackbench-process-pipes
>>                           6.18-rc1             6.18-rc1
>>                              vanilla   sched-preemptnext-v5
>> Amean     1        0.2657 (   0.00%)      0.2150 (  19.07%)
>> Amean     4        0.6107 (   0.00%)      0.6060 (   0.76%)
>> Amean     7        0.7923 (   0.00%)      0.7440 (   6.10%)
>> Amean     12       1.1500 (   0.00%)      1.1263 (   2.06%)
>> Amean     21       1.7950 (   0.00%)      1.7987 (  -0.20%)
>> Amean     30       2.3207 (   0.00%)      2.5053 (  -7.96%)
>> Amean     48       3.5023 (   0.00%)      3.9197 ( -11.92%)
>> Amean     79       4.8093 (   0.00%)      5.2247 (  -8.64%)
>> Amean     110      6.1160 (   0.00%)      6.6650 (  -8.98%)
>> Amean     141      7.4763 (   0.00%)      7.8973 (  -5.63%)
>> Amean     172      8.9560 (   0.00%)      9.3593 (  -4.50%)
>> Amean     203     10.4783 (   0.00%)     10.8347 (  -3.40%)
>> Amean     234     12.4977 (   0.00%)     13.0177 (  -4.16%)
>> Amean     265     14.7003 (   0.00%)     15.5630 (  -5.87%)
>> Amean     296     16.1007 (   0.00%)     17.4023 (  -8.08%)
>>
>> Processes using pipes are impacted but the variance (not presented) indicates
>> it's close to noise and the results are not always reproducible. If executed
>> across multiple reboots, it may show neutral or small gains so the worst
>> measured results are presented.
>>
>> Hackbench using sockets is more reliably neutral as the wakeup
>> mechanisms are different between sockets and pipes.
>>
>> hackbench-process-sockets
>>                           6.18-rc1             6.18-rc1
>>                              vanilla   sched-preemptnext-v2
>> Amean     1        0.3073 (   0.00%)      0.3263 (  -6.18%)
>> Amean     4        0.7863 (   0.00%)      0.7930 (  -0.85%)
>> Amean     7        1.3670 (   0.00%)      1.3537 (   0.98%)
>> Amean     12       2.1337 (   0.00%)      2.1903 (  -2.66%)
>> Amean     21       3.4683 (   0.00%)      3.4940 (  -0.74%)
>> Amean     30       4.7247 (   0.00%)      4.8853 (  -3.40%)
>> Amean     48       7.6097 (   0.00%)      7.8197 (  -2.76%)
>> Amean     79      14.7957 (   0.00%)     16.1000 (  -8.82%)
>> Amean     110     21.3413 (   0.00%)     21.9997 (  -3.08%)
>> Amean     141     29.0503 (   0.00%)     29.0353 (   0.05%)
>> Amean     172     36.4660 (   0.00%)     36.1433 (   0.88%)
>> Amean     203     39.7177 (   0.00%)     40.5910 (  -2.20%)
>> Amean     234     42.1120 (   0.00%)     43.5527 (  -3.42%)
>> Amean     265     45.7830 (   0.00%)     50.0560 (  -9.33%)
>> Amean     296     50.7043 (   0.00%)     54.3657 (  -7.22%)
>>
>> As schbench has been mentioned in numerous bugs recently, the results
>> are interesting. A test case that represents the default schbench
>> behaviour is
>>
>> schbench Wakeup Latency (usec)
>>                                        6.18.0-rc1             6.18.0-rc1
>>                                           vanilla   sched-preemptnext-v5
>> Amean     Wakeup-50th-80          7.17 (   0.00%)        6.00 (  16.28%)
>> Amean     Wakeup-90th-80         46.56 (   0.00%)       19.78 (  57.52%)
>> Amean     Wakeup-99th-80        119.61 (   0.00%)       89.94 (  24.80%)
>> Amean     Wakeup-99.9th-80     3193.78 (   0.00%)      328.22 (  89.72%)
>>
>> schbench Requests Per Second (ops/sec)
>>                                   6.18.0-rc1             6.18.0-rc1
>>                                      vanilla   sched-preemptnext-v5
>> Hmean     RPS-20th-80     8900.91 (   0.00%)     9176.78 (   3.10%)
>> Hmean     RPS-50th-80     8987.41 (   0.00%)     9217.89 (   2.56%)
>> Hmean     RPS-90th-80     9123.73 (   0.00%)     9273.25 (   1.64%)
>> Hmean     RPS-max-80      9193.50 (   0.00%)     9301.47 (   1.17%)
>>
>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Link: https://patch.msgid.link/20251112122521.1331238-3-mgorman@techsingularity.net
>> ---
>>  kernel/sched/fair.c | 152 ++++++++++++++++++++++++++++++++++++-------
>>  1 file changed, 130 insertions(+), 22 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 071e07f..c6e5c64 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -929,6 +929,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
>>  	if (cfs_rq->nr_queued == 1)
>>  		return curr && curr->on_rq ? curr : se;
>>  
>> +	/*
>> +	 * Picking the ->next buddy will affect latency but not fairness.
>> +	 */
>> +	if (sched_feat(PICK_BUDDY) &&
>> +	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
>> +		/* ->next will never be delayed */
>> +		WARN_ON_ONCE(cfs_rq->next->sched_delayed);
>> +		return cfs_rq->next;
>> +	}
>> +
>>  	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>>  		curr = NULL;
>>  
>> @@ -1167,6 +1177,8 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
>>  	return delta_exec;
>>  }
>>  
>> +static void set_next_buddy(struct sched_entity *se);
>> +
>>  /*
>>   * Used by other classes to account runtime.
>>   */
>> @@ -5466,16 +5478,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
>>  {
>>  	struct sched_entity *se;
>>  
>> -	/*
>> -	 * Picking the ->next buddy will affect latency but not fairness.
>> -	 */
>> -	if (sched_feat(PICK_BUDDY) &&
>> -	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
>> -		/* ->next will never be delayed */
>> -		WARN_ON_ONCE(cfs_rq->next->sched_delayed);
>> -		return cfs_rq->next;
>> -	}
>> -
>>  	se = pick_eevdf(cfs_rq);
>>  	if (se->sched_delayed) {
>>  		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
>> @@ -6988,8 +6990,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>  	hrtick_update(rq);
>>  }
>>  
>> -static void set_next_buddy(struct sched_entity *se);
>> -
>>  /*
>>   * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
>>   * failing half-way through and resume the dequeue later.
>> @@ -8676,16 +8676,81 @@ static void set_next_buddy(struct sched_entity *se)
>>  	}
>>  }
>>  
>> +enum preempt_wakeup_action {
>> +	PREEMPT_WAKEUP_NONE,	/* No preemption. */
>> +	PREEMPT_WAKEUP_SHORT,	/* Ignore slice protection. */
>> +	PREEMPT_WAKEUP_PICK,	/* Let __pick_eevdf() decide. */
>> +	PREEMPT_WAKEUP_RESCHED,	/* Force reschedule. */
>> +};
>> +
>> +static inline bool
>> +set_preempt_buddy(struct cfs_rq *cfs_rq, int wake_flags,
>> +		  struct sched_entity *pse, struct sched_entity *se)
>> +{
>> +	/*
>> +	 * Keep the existing buddy if its deadline is sooner than pse's.
>> +	 * The older buddy may be cache cold and completely unrelated
>> +	 * to the current wakeup but that is unpredictable whereas
>> +	 * obeying the deadline is more in line with EEVDF objectives.
>> +	 */
>> +	if (cfs_rq->next && entity_before(cfs_rq->next, pse))
>> +		return false;
>> +
>> +	set_next_buddy(pse);
>> +	return true;
>> +}
>> +
>> +/*
>> + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
>> + * strictly enforced because the hint is either misunderstood or
>> + * multiple tasks must be woken up.
>> + */
>> +static inline enum preempt_wakeup_action
>> +preempt_sync(struct rq *rq, int wake_flags,
>> +	     struct sched_entity *pse, struct sched_entity *se)
>> +{
>> +	u64 threshold, delta;
>> +
>> +	/*
>> +	 * WF_SYNC without WF_TTWU is not expected so warn if it happens even
>> +	 * though it is likely harmless.
>> +	 */
>> +	WARN_ON_ONCE(!(wake_flags & WF_TTWU));
>> +
>> +	threshold = sysctl_sched_migration_cost;
>> +	delta = rq_clock_task(rq) - se->exec_start;
>> +	if ((s64)delta < 0)
>> +		delta = 0;
>> +
>> +	/*
>> +	 * WF_RQ_SELECTED implies the tasks are stacking on a CPU when they
>> +	 * could run on other CPUs. Reduce the threshold before preemption is
>> +	 * allowed to an arbitrary lower value as it is more likely (but not
>> +	 * guaranteed) the waker requires the wakee to finish.
>> +	 */
>> +	if (wake_flags & WF_RQ_SELECTED)
>> +		threshold >>= 2;
>> +
>> +	/*
>> +	 * As WF_SYNC is not strictly obeyed, allow some runtime for batch
>> +	 * wakeups to be issued.
>> +	 */
>> +	if (entity_before(pse, se) && delta >= threshold)
>> +		return PREEMPT_WAKEUP_RESCHED;
>> +
>> +	return PREEMPT_WAKEUP_NONE;
>> +}
>> +
>>  /*
>>   * Preempt the current task with a newly woken task if needed:
>>   */
>>  static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
>>  {
>> +	enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
>>  	struct task_struct *donor = rq->donor;
>>  	struct sched_entity *se = &donor->se, *pse = &p->se;
>>  	struct cfs_rq *cfs_rq = task_cfs_rq(donor);
>>  	int cse_is_idle, pse_is_idle;
>> -	bool do_preempt_short = false;
>>  
>>  	if (unlikely(se == pse))
>>  		return;
>> @@ -8699,10 +8764,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
>>  	if (task_is_throttled(p))
>>  		return;
>>  
>> -	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
>> -		set_next_buddy(pse);
>> -	}
>> -
>>  	/*
>>  	 * We can come here with TIF_NEED_RESCHED already set from new task
>>  	 * wake up path.
>> @@ -8734,7 +8795,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
>>  		 * When non-idle entity preempt an idle entity,
>>  		 * don't give idle entity slice protection.
>>  		 */
>> -		do_preempt_short = true;
>> +		preempt_action = PREEMPT_WAKEUP_SHORT;
>>  		goto preempt;
>>  	}
>>  
>> @@ -8753,21 +8814,68 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
>>  	 * If @p has a shorter slice than current and @p is eligible, override
>>  	 * current's slice protection in order to allow preemption.
>>  	 */
>> -	do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
>> +	if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
>> +		preempt_action = PREEMPT_WAKEUP_SHORT;
>> +		goto pick;
>> +	}
>>  
>>  	/*
>> +	 * Ignore wakee preemption on WF_FORK as it is less likely that
>> +	 * there is shared data as exec often follows fork. Do not
>> +	 * preempt for tasks that are sched_delayed as it would violate
>> +	 * EEVDF to forcibly queue an ineligible task.
>> +	 */
>> +	if ((wake_flags & WF_FORK) || pse->sched_delayed)
>> +		return;
>> +
>> +	/*
>> +	 * If @p is potentially completing work required by current then
>> +	 * consider preemption.
>> +	 * Reschedule if the waker is no longer eligible.
>> +	 */
>> +	if (in_task() && !entity_eligible(cfs_rq, se)) {
>> +		preempt_action = PREEMPT_WAKEUP_RESCHED;
>> +		goto preempt;
>> +	}
>> +
>> +	/* Prefer picking wakee soon if appropriate. */
>> +	if (sched_feat(NEXT_BUDDY) &&
>> +	    set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
>> +
>> +		/*
>> +		 * Decide whether to obey WF_SYNC hint for a new buddy. Old
>> +		 * Decide whether to obey the WF_SYNC hint for a new buddy. Old
>> +		 * buddies are ignored as they may not be relevant to the
>> +		 * waker and are less likely to be cache hot.
>> +		if (wake_flags & WF_SYNC)
>> +			preempt_action = preempt_sync(rq, wake_flags, pse, se);
>> +	}
>> +
>> +	switch (preempt_action) {
>> +	case PREEMPT_WAKEUP_NONE:
>> +		return;
>> +	case PREEMPT_WAKEUP_RESCHED:
>> +		goto preempt;
>> +	case PREEMPT_WAKEUP_SHORT:
>> +		fallthrough;
>> +	case PREEMPT_WAKEUP_PICK:
>> +		break;
>> +	}
>> +
>> +pick:
>> +	/*
>>  	 * If @p has become the most eligible task, force preemption.
>>  	 */
>> -	if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
>> +	if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse)
>>  		goto preempt;
>>  
>> -	if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
>> +	if (sched_feat(RUN_TO_PARITY))
>>  		update_protect_slice(cfs_rq, se);
>>  
>>  	return;
>>  
>>  preempt:
>> -	if (do_preempt_short)
>> +	if (preempt_action == PREEMPT_WAKEUP_SHORT)
>>  		cancel_protect_slice(se);
>>  
>>  	resched_curr_lazy(rq);
>>
>
Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by Dietmar Eggemann 1 month ago
On 02.01.26 13:38, Ryan Roberts wrote:
> Hi, I appreciate I sent this report just before Xmas so most likely you haven't
> had a chance to look, but wanted to bring it back to the top of your mailbox in
> case it was missed.
> 
> Happy new year!
> 
> Thanks,
> Ryan
> 
> On 22/12/2025 10:57, Ryan Roberts wrote:
>> Hi Mel, Peter,
>>
>> We are building out a kernel performance regression monitoring lab at Arm, and 
>> I've noticed some fairly large performance regressions in real-world workloads, 
>> for which bisection has fingered this patch.
>>
>> We are looking at performance changes between v6.18 and v6.19-rc1, and by 
>> reverting this patch on top of v6.19-rc1 many regressions are resolved. (We plan 
>> to move the testing to linux-next over the next couple of quarters so hopefully 
>> we will be able to deliver this sort of news prior to merging in future).
>>
>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean 
>> statistically significant regression/improvement, where "statistically 
>> significant" means the 95% confidence intervals do not overlap".

You mentioned that you reverted this patch, i.e. patch 2/2 'sched/fair:
Reimplement NEXT_BUDDY to align with EEVDF goals'.

Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?

---

Mel mentioned that he tested on a 2-socket machine. So I guess something
like my Intel Xeon Silver 4314:

cpu0 0 0
domain0 SMT 00000001,00000001
domain1 MC 55555555,55555555
domain2 NUMA ffffffff,ffffffff

node distances:
node   0   1
  0:  10  20
  1:  20  10

Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC
domain? I guess topology has influence in benchmark numbers here as well.
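
In case it helps, with SCHED_DEBUG the domain hierarchy can also be read
straight from debugfs; a minimal sketch, assuming a mounted debugfs and a
recent kernel layout:

  grep -r . /sys/kernel/debug/sched/domains/cpu0/*/name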

---

There was also a lot of improvement on schbench (wakeup latency) on
higher percentiles (>= 99.0th) on the 2-socket machine with those 2
patches. I guess you haven't seen those on Grav3?

[...]
Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by Ryan Roberts 1 month ago
On 02/01/2026 15:52, Dietmar Eggemann wrote:
> On 02.01.26 13:38, Ryan Roberts wrote:
>> Hi, I appreciate I sent this report just before Xmas so most likely you haven't
>> had a chance to look, but wanted to bring it back to the top of your mailbox in
>> case it was missed.
>>
>> Happy new year!
>>
>> Thanks,
>> Ryan
>>
>> On 22/12/2025 10:57, Ryan Roberts wrote:
>>> Hi Mel, Peter,
>>>
>>> We are building out a kernel performance regression monitoring lab at Arm, and 
>>> I've noticed some fairly large performance regressions in real-world workloads, 
>>> for which bisection has fingered this patch.
>>>
>>> We are looking at performance changes between v6.18 and v6.19-rc1, and by 
>>> reverting this patch on top of v6.19-rc1 many regressions are resolved. (We plan 
>>> to move the testing to linux-next over the next couple of quarters so hopefully 
>>> we will be able to deliver this sort of news prior to merging in future).
>>>
>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean 
>>> statistically significant regression/improvement, where "statistically 
>>> significant" means the 95% confidence intervals do not overlap".
> 
> You mentioned that you reverted this patch, i.e. patch 2/2 'sched/fair:
> Reimplement NEXT_BUDDY to align with EEVDF goals'.
> 
> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?

Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful?

> 
> ---
> 
> Mel mentioned that he tested on a 2-socket machine. So I guess something
> like my Intel Xeon Silver 4314:
> 
> cpu0 0 0
> domain0 SMT 00000001,00000001
> domain1 MC 55555555,55555555
> domain2 NUMA ffffffff,ffffffff
> 
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
> 
> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC
> domain? I guess topology has influence in benchmark numbers here as well.

I can't easily enable scheduler debugging right now (which I think is needed to 
get this info directly?). But that's what I'd expect, yes. lscpu confirms there 
is a single NUMA node and topology for cpu0 gives this if it helps:

/sys/devices/system/cpu/cpu0/topology$ grep "" -r .
./cluster_cpus:ffffffff,ffffffff
./cluster_cpus_list:0-63
./physical_package_id:0
./core_cpus_list:0
./core_siblings:ffffffff,ffffffff
./cluster_id:0
./core_siblings_list:0-63
./package_cpus:ffffffff,ffffffff
./package_cpus_list:0-63
./thread_siblings_list:0
./core_id:0
./core_cpus:00000000,00000001
./thread_siblings:00000000,00000001

> 
> ---
> 
> There was also a lot of improvement on schbench (wakeup latency) on
> higher percentiles (>= 99.0th) on the 2-socket machine with those 2
> patches. I guess you haven't seen those on Grav3?
> 

I don't have schbench results for 6.18 but I do have them for 6.19-rc1 and for 
revert-next-buddy. The means have moved a bit but there are only a couple of 
cases that we consider statistically significant (marked (R)egression / 
(I)mprovement):

+----------------------------+------------------------------------------------------+-------------+-------------------+
| Benchmark                  | Result Class                                         |  6-19-0-rc1 | revert-next-buddy |
+============================+======================================================+=============+===================+
| schbench/thread-contention | -m 16 -t 1 -r 10 -s 1000, avg_rps (req/sec)          |     1263.97 |            -6.43% |
|                            | -m 16 -t 1 -r 10 -s 1000, req_latency_p99 (usec)     |    15088.00 |            -0.28% |
|                            | -m 16 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec)  |        3.00 |             0.00% |
|                            | -m 16 -t 4 -r 10 -s 1000, avg_rps (req/sec)          |     6433.07 |           -10.99% |
|                            | -m 16 -t 4 -r 10 -s 1000, req_latency_p99 (usec)     |    15088.00 |            -0.39% |
|                            | -m 16 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec)  |        4.17 |       (R) -16.67% |
|                            | -m 16 -t 16 -r 10 -s 1000, avg_rps (req/sec)         |     1458.33 |            -1.57% |
|                            | -m 16 -t 16 -r 10 -s 1000, req_latency_p99 (usec)    |   813056.00 |            15.46% |
|                            | -m 16 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) |    14240.00 |            -5.97% |
|                            | -m 16 -t 64 -r 10 -s 1000, avg_rps (req/sec)         |      434.22 |             3.21% |
|                            | -m 16 -t 64 -r 10 -s 1000, req_latency_p99 (usec)    | 11354112.00 |             2.92% |
|                            | -m 16 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) |    63168.00 |            -2.87% |
|                            | -m 32 -t 1 -r 10 -s 1000, avg_rps (req/sec)          |     2828.63 |             2.58% |
|                            | -m 32 -t 1 -r 10 -s 1000, req_latency_p99 (usec)     |    15088.00 |             0.00% |
|                            | -m 32 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec)  |        3.00 |             0.00% |
|                            | -m 32 -t 4 -r 10 -s 1000, avg_rps (req/sec)          |     3182.15 |             5.18% |
|                            | -m 32 -t 4 -r 10 -s 1000, req_latency_p99 (usec)     |   116266.67 |             8.22% |
|                            | -m 32 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec)  |     6186.67 |        (R) -5.34% |
|                            | -m 32 -t 16 -r 10 -s 1000, avg_rps (req/sec)         |      749.20 |             2.91% |
|                            | -m 32 -t 16 -r 10 -s 1000, req_latency_p99 (usec)    |  3702784.00 |        (I) 13.76% |
|                            | -m 32 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) |    33514.67 |             0.24% |
|                            | -m 32 -t 64 -r 10 -s 1000, avg_rps (req/sec)         |      392.23 |             3.42% |
|                            | -m 32 -t 64 -r 10 -s 1000, req_latency_p99 (usec)    | 16695296.00 |         (I) 5.82% |
|                            | -m 32 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) |   120618.67 |            -3.22% |
|                            | -m 64 -t 1 -r 10 -s 1000, avg_rps (req/sec)          |     5951.15 |             5.02% |
|                            | -m 64 -t 1 -r 10 -s 1000, req_latency_p99 (usec)     |    15157.33 |             0.42% |
|                            | -m 64 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec)  |        3.67 |            -4.35% |
|                            | -m 64 -t 4 -r 10 -s 1000, avg_rps (req/sec)          |     1510.23 |            -1.38% |
|                            | -m 64 -t 4 -r 10 -s 1000, req_latency_p99 (usec)     |   802816.00 |            13.73% |
|                            | -m 64 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec)  |    14890.67 |           -10.44% |
|                            | -m 64 -t 16 -r 10 -s 1000, avg_rps (req/sec)         |      458.87 |             4.60% |
|                            | -m 64 -t 16 -r 10 -s 1000, req_latency_p99 (usec)    | 11348650.67 |         (I) 2.67% |
|                            | -m 64 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) |    63445.33 |        (R) -5.48% |
|                            | -m 64 -t 64 -r 10 -s 1000, avg_rps (req/sec)         |      541.33 |             2.65% |
|                            | -m 64 -t 64 -r 10 -s 1000, req_latency_p99 (usec)    | 36743850.67 |        (I) 10.95% |
|                            | -m 64 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) |   211370.67 |            -1.94% |
+----------------------------+------------------------------------------------------+-------------+-------------------+

I could get the results for 6.18 if useful, but I think what I have probably 
shows enough of the picture: This patch has not impacted schbench much on 
this HW.

Thanks,
Ryan
Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by Dietmar Eggemann 1 month ago
On 05.01.26 12:45, Ryan Roberts wrote:
> On 02/01/2026 15:52, Dietmar Eggemann wrote:
>> On 02.01.26 13:38, Ryan Roberts wrote:

[...]

>>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean 
>>>> statistically significant regression/improvement, where "statistically 
>>>> significant" means the 95% confidence intervals do not overlap".
>>
>> You mentioned that you reverted this patch, i.e. patch 2/2 'sched/fair:
>> Reimplement NEXT_BUDDY to align with EEVDF goals'.
>>
>> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
>> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?
> 
> Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful?

Well, I assume this would be more valuable. Before this patch-set (e.g.
v6.18), NEXT_BUDDY was disabled and this is what people are running.

Now (>= v6.19-rc1) we have NEXT_BUDDY=true (1/2) and 'NEXT_BUDDY aligned
to EEVDF' (2/2). This is what people will run when they switch to v6.19
later.

But patch 2/2 changes more than the 'if (sched_feat(NEXT_BUDDY) ...'
condition. So testing 'w/o 2/2' vs. 'w/ 2/2' and 'NEXT_BUDDY=false'
could be helpful as well.
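
For the NEXT_BUDDY=false case there should be no need to rebuild; assuming
SCHED_DEBUG and a mounted debugfs, the feature can be toggled at runtime:

  echo NO_NEXT_BUDDY > /sys/kernel/debug/sched/features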

>> ---
>>
>> Mel mentioned that he tested on a 2-socket machine. So I guess something
>> like my Intel Xeon Silver 4314:
>>
>> cpu0 0 0
>> domain0 SMT 00000001,00000001
>> domain1 MC 55555555,55555555
>> domain2 NUMA ffffffff,ffffffff
>>
>> node distances:
>> node   0   1
>>   0:  10  20
>>   1:  20  10
>>
>> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC
>> domain? I guess topology has influence in benchmark numbers here as well.
> 
> I can't easily enable scheduler debugging right now (which I think is needed to 
> get this info directly?). But that's what I'd expect, yes. lscpu confirms there 
> is a single NUMA node and topology for cpu0 gives this if it helps:
> 
> /sys/devices/system/cpu/cpu0/topology$ grep "" -r .
> ./cluster_cpus:ffffffff,ffffffff
> ./cluster_cpus_list:0-63
> ./physical_package_id:0
> ./core_cpus_list:0
> ./core_siblings:ffffffff,ffffffff
> ./cluster_id:0
> ./core_siblings_list:0-63
> ./package_cpus:ffffffff,ffffffff
> ./package_cpus_list:0-63

[...]

OK, so single (flat) MC domain with 64 CPUs.

>> There was also a lot of improvement on schbench (wakeup latency) on
>> higher percentiles (>= 99.0th) on the 2-socket machine with those 2
>> patches. I guess you haven't seen those on Grav3?
>>
> 
> I don't have schbench results for 6.18 but I do have them for 6.19-rc1 and for 
> revert-next-buddy. The means have moved a bit but there are only a couple of 
> cases that we consider statistically significant (marked (R)egression / 
> (I)mprovement):
> 
> [...]
> 
> I could get the results for 6.18 if useful, but I think what I have probably 
> shows enough of the picture: This patch has not impacted schbench much on 
> this HW.

I see. IMHO, task scheduler tests are all about putting the right amount
of stress onto the system: not too little and not too much.

I guess these ~ `-m 64 -t 4` tests should be fine for a 64-CPU system.
Not sure which parameter set Mel was using on his 2-socket machine. And
I still assume he tested w/o (base) against w/ these 2 patches.

The other test Mel was using is the modified dbench4, which prefers
throughput (less preemption). Not sure if this is part of the MMTests
suite?

It would be nice to be able to run the same tests on different machines
(with a parameter set adapted to the number of CPUs), so we have only
the arch and the topology as variables. But there is definitely more
variety (e.g. the filesystem used, etc.) ... so this is not trivial.

[...]
Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by Mel Gorman 1 month ago
On Wed, Jan 07, 2026 at 04:30:09PM +0100, Dietmar Eggemann wrote:
> On 05.01.26 12:45, Ryan Roberts wrote:
> > On 02/01/2026 15:52, Dietmar Eggemann wrote:
> >> On 02.01.26 13:38, Ryan Roberts wrote:
> 
> [...]
> 

Sorry for the slow responses. I'm only just back from holidays and
unfortunately do not have access to test machines right now, so I cannot
revalidate any of the results against 6.19-rc*.

> >>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean 
> >>>> statistically significant regression/improvement, where "statistically 
> >>>> significant" means the 95% confidence intervals do not overlap".
> >>
> >> You mentioned that you reverted this patch, i.e. patch 2/2 'sched/fair:
> >> Reimplement NEXT_BUDDY to align with EEVDF goals'.
> >>
> >> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
> >> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?
> > 
> > Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful?
> 
> Well, I assume this would be more valuable.

Agreed, because we need to know whether it's NEXT_BUDDY that is conceptually
an issue with EEVDF in these cases or the specific implementation. The
comparison would be between:

6.18A					(baseline)
6.19-rcN vanilla			(New NEXT_BUDDY implementation enabled)
6.19-rcN revert patches 1+2		(NEXT_BUDDY disabled)
6.19-rcN revert patch 2 only		(Old NEXT_BUDDY implementation enabled)

It was known that NEXT_BUDDY was always a tradeoff, but one that is workload-,
architecture- and arch-implementation-dependent. If it cannot be
sanely reconciled then it may be best to completely remove NEXT_BUDDY from
EEVDF or disable it by default again for now. I don't think NEXT_BUDDY as
it existed in CFS can be sanely implemented against EEVDF so it'll never
be equivalent. Related to that, I doubt anyone has good data on NEXT_BUDDY
vs !NEXT_BUDDY even on CFS as it was enabled for so long.
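
For anyone joining the thread, the 2/2 wakeup path boils down to roughly
the following (a condensed paraphrase of the diff quoted upthread, not the
literal code):

	/* Never wakeup-preempt on fork or for delayed-dequeue tasks. */
	if ((wake_flags & WF_FORK) || pse->sched_delayed)
		return;

	/* Reschedule immediately if the waker is no longer eligible. */
	if (in_task() && !entity_eligible(cfs_rq, se))
		goto preempt;

	/* Record the wakee as next buddy unless the existing buddy has an
	 * earlier deadline. For WF_SYNC, throttle preemption based on the
	 * waker's accumulated runtime instead of trusting the hint outright. */
	if (sched_feat(NEXT_BUDDY) && set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
		if (wake_flags & WF_SYNC)
			preempt_action = preempt_sync(rq, wake_flags, pse, se);
	}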

> >> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC
> >> domain? I guess topology has influence in benchmark numbers here as well.
> > 
> > I can't easily enable scheduler debugging right now (which I think is needed to 
> > get this info directly?). But that's what I'd expect, yes. lscpu confirms there 
> > is a single NUMA node and topology for cpu0 gives this if it helps:
> > 
> > [...]
> 
> [...]
> 
> OK, so single (flat) MC domain with 64 CPUs.
> 

That is what the OS sees, but does it reflect reality? E.g. does Graviton3
have multiple caches that are simply not advertised to the OS?

> >> There was also a lot of improvement on schbench (wakeup latency) on
> >> higher percentiles (>= 99.0th) on the 2-socket machine with those 2
> >> patches. I guess you haven't seen those on Grav3?
> >>
> > 
> > I don't have schbench results for 6.18 but I do have them for 6.19-rc1 and for 
> > revert-next-buddy. The means have moved a bit but there are only a couple of 
> > cases that we consider statistically significant (marked (R)egression / 
> > (I)mprovement):
> > 
> > [...]
> > 
> > I could get the results for 6.18 if useful, but I think what I have probably 
> > shows enough of the picture: This patch has not impacted schbench much on 
> > this HW.
> 
> I see. IMHO, task scheduler tests are all about putting the right amount
> of stress onto the system, not too little and not too much.
> 
> I guess these ~ `-m 64 -t 4` tests should be fine for a 64-CPU system.

Agreed. It's not the full picture but it's a valuable part.

> Not sure which parameter set Mel was using on his 2 socket machine. And
> I still assume he tested w/o (base) against with these 2 patches.
> 

He did. The most comparable test I used was NUM_CPUS, so 64 for Graviton
is ok.

> The other test Mel was using is this modified dbench4 (prefers
> throughput (less preemption)). Not sure if this is part of the MmTests
> suite?
> 

It is. The modifications are not extensive. dbench by default reports overall
throughput over time, which masks actual throughput at a point in time. The
new metric tracks the time taken to process "loadfiles" over time, which is
more sensible to analyse. Other metrics, such as loadfiles processed per
client, could easily be extracted but aren't at the moment, as dbench itself
is not designed for measuring fairness of forward progress as such.

> It would be nice to be able to run the same tests on different machines
> (with a parameter set adapted to the number of CPUs), so we have only
> the arch and the topology as variables. But there is definitely more
> variety (e.g. used filesystem, etc) ... so this is not trivial.
> 

From a topology perspective it is fairly trivial though. For example,
MMTESTS has a schbench configuration that runs one message thread per NUMA
node communicating with nr_cpus/nr_nodes to evaluate placement. A similar
configuration could use (nr_cpus/nr_nodes)-1 to ensure tasks are packed
properly or nr_llcs could also be used fairly trivially. You're right that
once filesystems are involved then it all gets more interesting. ext4 and
xfs use kernel threads differently (jbd vs kworkers), the underlying storage
is a factor, workset size vs RAM impacts dirty throttling and reclaim, and
NUMA sizes all play a part. dbench is useful in this regard because, while
it interacts with the filesystem and wakeups between userspace and kernel
threads get exercised, the amount of IO is relatively small.
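
As a minimal sketch (variable names are illustrative rather than the actual
MMTESTS configuration), such parameters could be derived as:

# Derive schbench parameters from the machine's topology
nr_cpus=$(nproc)
nr_nodes=$(lscpu -p=NODE | grep -v '^#' | sort -u | wc -l)
# One message thread per NUMA node, nr_cpus/nr_nodes workers each
schbench -m "$nr_nodes" -t "$(( nr_cpus / nr_nodes ))" -r 10 -s 1000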

Let's start with getting figures for 6.18, new-NEXT, old-NEXT and
no-NEXT side-by-side. Ideally I'd do the same across a range of x86-64
machines but I can't start that yet.
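
For the no-NEXT case, the feature can also be flipped at runtime rather than
reverting, assuming debugfs is mounted and the sched features interface is
available:

# Disable NEXT_BUDDY without rebuilding, then re-enable it
echo NO_NEXT_BUDDY > /sys/kernel/debug/sched/features
echo NEXT_BUDDY > /sys/kernel/debug/sched/features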

-- 
Mel Gorman
SUSE Labs
Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by Ryan Roberts 4 weeks, 1 day ago
On 08/01/2026 08:50, Mel Gorman wrote:
> On Wed, Jan 07, 2026 at 04:30:09PM +0100, Dietmar Eggemann wrote:
>> On 05.01.26 12:45, Ryan Roberts wrote:
>>> On 02/01/2026 15:52, Dietmar Eggemann wrote:
>>>> On 02.01.26 13:38, Ryan Roberts wrote:
>>
>> [...]
>>
> 
> Sorry for slow responses. I'm not fully back from holidays yet and unfortunately
> do not have access to test machines right now, so I cannot revalidate any of
> the results against 6.19-rc*.

No problem, thanks for getting back to me!

> 
>>>>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean 
>>>>>> statistically significant regression/improvement, where "statistically 
>>>>>> significant" means the 95% confidence intervals do not overlap".
>>>>
>>>> You mentioned that you reverted this patch 'patch 2/2 'sched/fair:
>>>> Reimplement NEXT_BUDDY to align with EEVDF goals'.
>>>>
>>>> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
>>>> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?
>>>
>>> Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful?
>>
>> Well, I assume this would be more valuable.
> 
> Agreed because we need to know if it's NEXT_BUDDY that is conceptually
> an issue with EEVDF in these cases or the specific implementation. The
> comparison between 
> 
> 6.18A					(baseline)
> 6.19-rcN vanilla			(New NEXT_BUDDY implementation enabled)
> 6.19-rcN revert patches 1+2		(NEXT_BUDDY disabled)
> 6.19-rcN revert patch 2 only		(Old NEXT_BUDDY implementation enabled)

OK, I've already got 1, 2 and 4. Let me grab 3 and come back to you - hopefully
tomorrow. Then we can take it from there.

I appreciate your time on this!

Thanks,
Ryan


> 
> It was known that NEXT_BUDDY was always a tradeoff but one that is workload,
> architecture and specific arch implementation dependent. If it cannot be
> sanely reconciled then it may be best to completely remove NEXT_BUDDY from
> EEVDF or disable it by default again for now. I don't think NEXT_BUDDY as
> it existed in CFS can be sanely implemented against EEVDF so it'll never
> be equivalent. Related to that, I doubt anyone has good data on NEXT_BUDDY
> vs !NEXT_BUDDY even on CFS as it was enabled for so long.
> 
>>>> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC
>>>> domain? I guess topology has influence in benchmark numbers here as well.
>>>
>>> I can't easily enable scheduler debugging right now (which I think is needed to 
>>> get this info directly?). But that's what I'd expect, yes. lscpu confirms there 
>>> is a single NUMA node and topology for cpu0 gives this if it helps:
>>>
>>> /sys/devices/system/cpu/cpu0/topology$ grep "" -r .
>>> ./cluster_cpus:ffffffff,ffffffff
>>> ./cluster_cpus_list:0-63
>>> ./physical_package_id:0
>>> ./core_cpus_list:0
>>> ./core_siblings:ffffffff,ffffffff
>>> ./cluster_id:0
>>> ./core_siblings_list:0-63
>>> ./package_cpus:ffffffff,ffffffff
>>> ./package_cpus_list:0-63
>>
>> [...]
>>
>> OK, so single (flat) MC domain with 64 CPUs.
>>
> 
> That is what the OS sees but does it reflect reality? e.g. does Graviton3
> have multiple caches that are simply not advertised to the OS?
> 
>>>> There was also a lot of improvement on schbench (wakeup latency) on
>>>> higher percentiles (>= 99.0th) on the 2-socket machine with those 2
>>>> patches. I guess you haven't seen those on Grav3?
>>>>
>>>
>>> I don't have schbench results for 6.18 but I do have them for 6.19-rc1 and for 
>>> revert-next-buddy. The means have moved a bit but there are only a couple of 
>>> cases that we consider statistically significant (marked (R)egression / 
>>> (I)mprovement):
>>>
>>> +----------------------------+------------------------------------------------------+-------------+-------------------+
>>> | Benchmark                  | Result Class                                         |  6-19-0-rc1 | revert-next-buddy |
>>> +============================+======================================================+=============+===================+
>>> | schbench/thread-contention | -m 16 -t 1 -r 10 -s 1000, avg_rps (req/sec)          |     1263.97 |            -6.43% |
>>> |                            | -m 16 -t 1 -r 10 -s 1000, req_latency_p99 (usec)     |    15088.00 |            -0.28% |
>>> |                            | -m 16 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec)  |        3.00 |             0.00% |
>>> |                            | -m 16 -t 4 -r 10 -s 1000, avg_rps (req/sec)          |     6433.07 |           -10.99% |
>>> |                            | -m 16 -t 4 -r 10 -s 1000, req_latency_p99 (usec)     |    15088.00 |            -0.39% |
>>> |                            | -m 16 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec)  |        4.17 |       (R) -16.67% |
>>> |                            | -m 16 -t 16 -r 10 -s 1000, avg_rps (req/sec)         |     1458.33 |            -1.57% |
>>> |                            | -m 16 -t 16 -r 10 -s 1000, req_latency_p99 (usec)    |   813056.00 |            15.46% |
>>> |                            | -m 16 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) |    14240.00 |            -5.97% |
>>> |                            | -m 16 -t 64 -r 10 -s 1000, avg_rps (req/sec)         |      434.22 |             3.21% |
>>> |                            | -m 16 -t 64 -r 10 -s 1000, req_latency_p99 (usec)    | 11354112.00 |             2.92% |
>>> |                            | -m 16 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) |    63168.00 |            -2.87% |
>>> |                            | -m 32 -t 1 -r 10 -s 1000, avg_rps (req/sec)          |     2828.63 |             2.58% |
>>> |                            | -m 32 -t 1 -r 10 -s 1000, req_latency_p99 (usec)     |    15088.00 |             0.00% |
>>> |                            | -m 32 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec)  |        3.00 |             0.00% |
>>> |                            | -m 32 -t 4 -r 10 -s 1000, avg_rps (req/sec)          |     3182.15 |             5.18% |
>>> |                            | -m 32 -t 4 -r 10 -s 1000, req_latency_p99 (usec)     |   116266.67 |             8.22% |
>>> |                            | -m 32 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec)  |     6186.67 |        (R) -5.34% |
>>> |                            | -m 32 -t 16 -r 10 -s 1000, avg_rps (req/sec)         |      749.20 |             2.91% |
>>> |                            | -m 32 -t 16 -r 10 -s 1000, req_latency_p99 (usec)    |  3702784.00 |        (I) 13.76% |
>>> |                            | -m 32 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) |    33514.67 |             0.24% |
>>> |                            | -m 32 -t 64 -r 10 -s 1000, avg_rps (req/sec)         |      392.23 |             3.42% |
>>> |                            | -m 32 -t 64 -r 10 -s 1000, req_latency_p99 (usec)    | 16695296.00 |         (I) 5.82% |
>>> |                            | -m 32 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) |   120618.67 |            -3.22% |
>>> |                            | -m 64 -t 1 -r 10 -s 1000, avg_rps (req/sec)          |     5951.15 |             5.02% |
>>> |                            | -m 64 -t 1 -r 10 -s 1000, req_latency_p99 (usec)     |    15157.33 |             0.42% |
>>> |                            | -m 64 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec)  |        3.67 |            -4.35% |
>>> |                            | -m 64 -t 4 -r 10 -s 1000, avg_rps (req/sec)          |     1510.23 |            -1.38% |
>>> |                            | -m 64 -t 4 -r 10 -s 1000, req_latency_p99 (usec)     |   802816.00 |            13.73% |
>>> |                            | -m 64 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec)  |    14890.67 |           -10.44% |
>>> |                            | -m 64 -t 16 -r 10 -s 1000, avg_rps (req/sec)         |      458.87 |             4.60% |
>>> |                            | -m 64 -t 16 -r 10 -s 1000, req_latency_p99 (usec)    | 11348650.67 |         (I) 2.67% |
>>> |                            | -m 64 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) |    63445.33 |        (R) -5.48% |
>>> |                            | -m 64 -t 64 -r 10 -s 1000, avg_rps (req/sec)         |      541.33 |             2.65% |
>>> |                            | -m 64 -t 64 -r 10 -s 1000, req_latency_p99 (usec)    | 36743850.67 |        (I) 10.95% |
>>> |                            | -m 64 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) |   211370.67 |            -1.94% |
>>> +----------------------------+------------------------------------------------------+-------------+-------------------+
>>>
>>> I could get the results for 6.18 if useful, but I think what I have probably 
>>> shows enough of the picture: This patch has not impacted schbench much on 
>>> this HW.
>>
>> I see. IMHO, task scheduler tests are all about putting the right amount
>> of stress onto the system, not too little and not too much.
>>
>> I guess these ~ `-m 64 -t 4` tests should be fine for a 64-CPU system.
> 
> Agreed. It's not the full picture but it's a valuable part.
> 
>> Not sure which parameter set Mel was using on his 2 socket machine. And
>> I still assume he tested w/o (base) against with these 2 patches.
>>
> 
> He did. The most comparable test I used was NUM_CPUS, so 64 for Graviton
> is ok.
> 
>> The other test Mel was using is this modified dbench4 (prefers
>> throughput (less preemption)). Not sure if this is part of the MmTests
>> suite?
>>
> 
> It is. The modifications are not extensive. dbench by default reports overall
> throughput over time, which masks actual throughput at a point in time. The
> new metric tracks the time taken to process "loadfiles" over time, which is
> more sensible to analyse. Other metrics, such as loadfiles processed per
> client, could easily be extracted but aren't at the moment, as dbench itself
> is not designed for measuring fairness of forward progress as such.
> 
>> It would be nice to be able to run the same tests on different machines
>> (with a parameter set adapted to the number of CPUs), so we have only
>> the arch and the topology as variables. But there is definitely more
>> variety (e.g. used filesystem, etc) ... so this is not trivial.
>>
> 
> From a topology perspective it is fairly trivial though. For example,
> MMTESTS has a schbench configuration that runs one message thread per NUMA
> node communicating with nr_cpus/nr_nodes to evaluate placement. A similar
> configuration could use (nr_cpus/nr_nodes)-1 to ensure tasks are packed
> properly or nr_llcs could also be used fairly trivially. You're right that
> once filesystems are involved then it all gets more interesting. ext4 and
> xfs use kernel threads differently (jbd vs kworkers), the underlying storage
> is a factor, workset size vs RAM impacts dirty throttling and reclaim, and
> NUMA sizes all play a part. dbench is useful in this regard because, while
> it interacts with the filesystem and wakeups between userspace and kernel
> threads get exercised, the amount of IO is relatively small.
> 
> Let's start with getting figures for 6.18, new-NEXT, old-NEXT and
> no-NEXT side-by-side. Ideally I'd do the same across a range of x86-64
> machines but I can't start that yet.
>
Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by Ryan Roberts 4 weeks ago
On 08/01/2026 13:15, Ryan Roberts wrote:
> On 08/01/2026 08:50, Mel Gorman wrote:
>> On Wed, Jan 07, 2026 at 04:30:09PM +0100, Dietmar Eggemann wrote:
>>> On 05.01.26 12:45, Ryan Roberts wrote:
>>>> On 02/01/2026 15:52, Dietmar Eggemann wrote:
>>>>> On 02.01.26 13:38, Ryan Roberts wrote:
>>>
>>> [...]
>>>
>>
>> Sorry for slow responses. I'm not fully back from holidays yet and unfortunately
>> do not have access to test machines right now, so I cannot revalidate any of
>> the results against 6.19-rc*.
> 
> No problem, thanks for getting back to me!
> 
>>
>>>>>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean 
>>>>>>> statistically significant regression/improvement, where "statistically 
>>>>>>> significant" means the 95% confidence intervals do not overlap.
>>>>>
>>>>> You mentioned that you reverted this patch 'patch 2/2 'sched/fair:
>>>>> Reimplement NEXT_BUDDY to align with EEVDF goals'.
>>>>>
>>>>> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
>>>>> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?
>>>>
>>>> Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful?
>>>
>>> Well, I assume this would be more valuable.
>>
>> Agreed because we need to know if it's NEXT_BUDDY that is conceptually
>> an issue with EEVDF in these cases or the specific implementation. The
>> comparison between 
>>
>> 6.18A					(baseline)
>> 6.19-rcN vanilla			(New NEXT_BUDDY implementation enabled)
>> 6.19-rcN revert patches 1+2		(NEXT_BUDDY disabled)
>> 6.19-rcN revert patch 2 only		(Old NEXT_BUDDY implementation enabled)
> 
> OK, I've already got 1, 2 and 4. Let me grab 3 and come back to you - hopefully
> tomorrow. Then we can take it from there.

Hi Mel, Dietmar,

Here are the updated results, now including a column for "revert #1 & #2".

6-18-0 (base)		(baseline)
6-19-0-rc1		(New NEXT_BUDDY implementation enabled)
revert #1 & #2		(NEXT_BUDDY disabled)
revert #2		(Old NEXT_BUDDY implementation enabled)


The regressions that are fixed by "revert #2" (as originally reported) are still 
fixed in "revert #1 & #2". Interestingly, performance actually improves further 
for the latter in the multi-node mysql benchmark (which is our VIP workload). 
There are a couple of hackbench cases (sockets with high thread counts) that 
showed an improvement with "revert #2" that is gone with "revert #1 & #2".

Let me know if I can usefully do anything else.


Multi-node SUT (workload running across 2 machines):

+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
| Benchmark                       | Result Class                                       | 6-18-0 (base) |  6-19-0-rc1 |  revert #2 | revert #1 & #2 |
+=================================+====================================================+===============+=============+============+================+
| repro-collection/mysql-workload | db transaction rate (transactions/min)             |     646267.33 |  (R) -1.33% |  (I) 5.87% |      (I) 7.63% |
|                                 | new order rate (orders/min)                        |     213256.50 |  (R) -1.32% |  (I) 5.87% |      (I) 7.64% |
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+

Single-node SUT (workload running on single machine):

+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
| Benchmark                       | Result Class                                       | 6-18-0 (base) |  6-19-0-rc1 |  revert #2 | revert #1 & #2 |
+=================================+====================================================+===============+=============+============+================+
| specjbb/composite               | critical-jOPS (jOPS)                               |      94700.00 |  (R) -5.10% |     -0.90% |         -0.37% |
|                                 | max-jOPS (jOPS)                                    |     113984.50 |  (R) -3.90% |     -0.65% |          0.65% |
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
| repro-collection/mysql-workload | db transaction rate (transactions/min)             |     245438.25 |  (R) -3.88% |     -0.13% |          0.24% |
|                                 | new order rate (orders/min)                        |      80985.75 |  (R) -3.78% |     -0.07% |          0.29% |
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
| pts/pgbench                     | Scale: 1 Clients: 1 Read Only (TPS)                |      63124.00 |   (I) 2.90% |      0.74% |          0.85% |
|                                 | Scale: 1 Clients: 1 Read Only - Latency (ms)       |         0.016 |   (I) 5.49% |      1.05% |          1.05% |
|                                 | Scale: 1 Clients: 1 Read Write (TPS)               |        974.92 |       0.11% |     -0.08% |         -0.03% |
|                                 | Scale: 1 Clients: 1 Read Write - Latency (ms)      |          1.03 |       0.12% |     -0.06% |         -0.06% |
|                                 | Scale: 1 Clients: 250 Read Only (TPS)              |    1915931.58 |  (R) -2.25% |  (I) 2.12% |          1.62% |
|                                 | Scale: 1 Clients: 250 Read Only - Latency (ms)     |          0.13 |  (R) -2.37% |  (I) 2.09% |          1.69% |
|                                 | Scale: 1 Clients: 250 Read Write (TPS)             |        855.67 |      -1.36% |     -0.14% |         -0.12% |
|                                 | Scale: 1 Clients: 250 Read Write - Latency (ms)    |        292.39 |      -1.31% |     -0.08% |         -0.08% |
|                                 | Scale: 1 Clients: 1000 Read Only (TPS)             |    1534130.08 | (R) -11.37% |      0.08% |          0.48% |
|                                 | Scale: 1 Clients: 1000 Read Only - Latency (ms)    |          0.65 | (R) -11.38% |      0.08% |          0.44% |
|                                 | Scale: 1 Clients: 1000 Read Write (TPS)            |        578.75 |      -1.11% |      2.15% |         -0.96% |
|                                 | Scale: 1 Clients: 1000 Read Write - Latency (ms)   |       1736.98 |      -1.26% |      2.47% |         -0.90% |
|                                 | Scale: 100 Clients: 1 Read Only (TPS)              |      57170.33 |       1.68% |      0.10% |          0.22% |
|                                 | Scale: 100 Clients: 1 Read Only - Latency (ms)     |         0.018 |       1.94% |      0.00% |          0.96% |
|                                 | Scale: 100 Clients: 1 Read Write (TPS)             |        836.58 |      -0.37% |     -0.41% |          0.07% |
|                                 | Scale: 100 Clients: 1 Read Write - Latency (ms)    |          1.20 |      -0.37% |     -0.40% |          0.06% |
|                                 | Scale: 100 Clients: 250 Read Only (TPS)            |    1773440.67 |      -1.61% |      1.67% |          1.34% |
|                                 | Scale: 100 Clients: 250 Read Only - Latency (ms)   |          0.14 |      -1.40% |      1.56% |          1.20% |
|                                 | Scale: 100 Clients: 250 Read Write (TPS)           |       5505.50 |      -0.17% |     -0.86% |         -1.66% |
|                                 | Scale: 100 Clients: 250 Read Write - Latency (ms)  |         45.42 |      -0.17% |     -0.85% |         -1.67% |
|                                 | Scale: 100 Clients: 1000 Read Only (TPS)           |    1393037.50 | (R) -10.31% |     -0.19% |          0.53% |
|                                 | Scale: 100 Clients: 1000 Read Only - Latency (ms)  |          0.72 | (R) -10.30% |     -0.17% |          0.53% |
|                                 | Scale: 100 Clients: 1000 Read Write (TPS)          |       5085.92 |       0.27% |      0.07% |         -0.79% |
|                                 | Scale: 100 Clients: 1000 Read Write - Latency (ms) |        196.79 |       0.23% |      0.05% |         -0.81% |
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
| mmtests/hackbench               | hackbench-process-pipes-1 (seconds)                |          0.14 |      -1.51% |     -1.05% |         -1.51% |
|                                 | hackbench-process-pipes-4 (seconds)                |          0.44 |   (I) 6.49% |  (I) 5.42% |      (I) 6.06% |
|                                 | hackbench-process-pipes-7 (seconds)                |          0.68 | (R) -18.36% |  (I) 3.40% |         -0.41% |
|                                 | hackbench-process-pipes-12 (seconds)               |          1.24 | (R) -19.89% |     -0.45% |     (R) -2.23% |
|                                 | hackbench-process-pipes-21 (seconds)               |          1.81 |  (R) -8.41% |     -1.22% |     (R) -2.46% |
|                                 | hackbench-process-pipes-30 (seconds)               |          2.39 |  (R) -9.06% | (R) -2.95% |         -1.62% |
|                                 | hackbench-process-pipes-48 (seconds)               |          3.18 | (R) -11.68% | (R) -4.10% |         -0.26% |
|                                 | hackbench-process-pipes-79 (seconds)               |          3.84 |  (R) -9.74% | (R) -3.25% |     (R) -2.45% |
|                                 | hackbench-process-pipes-110 (seconds)              |          4.68 |  (R) -6.57% | (R) -2.12% |     (R) -2.25% |
|                                 | hackbench-process-pipes-141 (seconds)              |          5.75 |  (R) -5.86% | (R) -3.44% |     (R) -2.89% |
|                                 | hackbench-process-pipes-172 (seconds)              |          6.80 |  (R) -4.28% | (R) -2.81% |     (R) -2.44% |
|                                 | hackbench-process-pipes-203 (seconds)              |          7.94 |  (R) -4.01% | (R) -3.00% |     (R) -2.17% |
|                                 | hackbench-process-pipes-234 (seconds)              |          9.02 |  (R) -3.52% | (R) -2.81% |     (R) -2.20% |
|                                 | hackbench-process-pipes-256 (seconds)              |          9.78 |  (R) -3.24% | (R) -2.81% |     (R) -2.74% |
|                                 | hackbench-process-sockets-1 (seconds)              |          0.29 |       0.50% |      0.26% |          0.03% |
|                                 | hackbench-process-sockets-4 (seconds)              |          0.76 |  (I) 17.44% | (I) 16.31% |     (I) 19.09% |
|                                 | hackbench-process-sockets-7 (seconds)              |          1.16 |  (I) 12.10% |  (I) 9.78% |     (I) 11.83% |
|                                 | hackbench-process-sockets-12 (seconds)             |          1.86 |  (I) 10.19% |  (I) 9.83% |     (I) 11.21% |
|                                 | hackbench-process-sockets-21 (seconds)             |          3.12 |   (I) 9.38% |  (I) 9.20% |     (I) 10.30% |
|                                 | hackbench-process-sockets-30 (seconds)             |          4.30 |   (I) 6.43% |  (I) 6.11% |      (I) 7.22% |
|                                 | hackbench-process-sockets-48 (seconds)             |          6.58 |   (I) 3.00% |  (I) 2.19% |      (I) 2.85% |
|                                 | hackbench-process-sockets-79 (seconds)             |         10.56 |   (I) 2.87% |  (I) 3.31% |          3.10% |
|                                 | hackbench-process-sockets-110 (seconds)            |         13.85 |      -1.15% |  (I) 2.33% |          0.22% |
|                                 | hackbench-process-sockets-141 (seconds)            |         19.23 |      -1.40% | (I) 14.53% |          2.64% |
|                                 | hackbench-process-sockets-172 (seconds)            |         26.33 |   (I) 3.52% | (I) 30.37% |      (I) 4.32% |
|                                 | hackbench-process-sockets-203 (seconds)            |         30.27 |       1.10% | (I) 27.20% |          0.32% |
|                                 | hackbench-process-sockets-234 (seconds)            |         35.12 |       1.60% | (I) 28.24% |          1.28% |
|                                 | hackbench-process-sockets-256 (seconds)            |         38.74 |       0.70% | (I) 28.74% |          0.53% |
|                                 | hackbench-thread-pipes-1 (seconds)                 |          0.17 |      -1.32% |     -0.76% |         -0.67% |
|                                 | hackbench-thread-pipes-4 (seconds)                 |          0.45 |   (I) 6.91% |  (I) 7.64% |      (I) 9.08% |
|                                 | hackbench-thread-pipes-7 (seconds)                 |          0.74 |  (R) -7.51% |  (I) 5.26% |      (I) 2.82% |
|                                 | hackbench-thread-pipes-12 (seconds)                |          1.32 |  (R) -8.40% |  (I) 2.32% |         -0.53% |
|                                 | hackbench-thread-pipes-21 (seconds)                |          1.95 |  (R) -2.95% |      0.91% |     (R) -2.00% |
|                                 | hackbench-thread-pipes-30 (seconds)                |          2.50 |  (R) -4.61% |      1.47% |         -1.63% |
|                                 | hackbench-thread-pipes-48 (seconds)                |          3.32 |  (R) -5.45% |  (I) 2.15% |          0.81% |
|                                 | hackbench-thread-pipes-79 (seconds)                |          4.04 |  (R) -5.53% |      1.85% |         -0.53% |
|                                 | hackbench-thread-pipes-110 (seconds)               |          4.94 |  (R) -2.33% |      1.51% |          0.59% |
|                                 | hackbench-thread-pipes-141 (seconds)               |          6.04 |  (R) -2.47% |      1.15% |          0.24% |
|                                 | hackbench-thread-pipes-172 (seconds)               |          7.15 |      -0.91% |      1.48% |          0.45% |
|                                 | hackbench-thread-pipes-203 (seconds)               |          8.31 |      -1.29% |      0.77% |          0.40% |
|                                 | hackbench-thread-pipes-234 (seconds)               |          9.49 |      -1.03% |      0.77% |          0.65% |
|                                 | hackbench-thread-pipes-256 (seconds)               |         10.30 |      -0.80% |      0.42% |          0.30% |
|                                 | hackbench-thread-sockets-1 (seconds)               |          0.31 |       0.05% |     -0.05% |         -0.43% |
|                                 | hackbench-thread-sockets-4 (seconds)               |          0.79 |  (I) 18.91% | (I) 16.82% |     (I) 19.79% |
|                                 | hackbench-thread-sockets-7 (seconds)               |          1.16 |  (I) 12.57% | (I) 10.63% |     (I) 12.95% |
|                                 | hackbench-thread-sockets-12 (seconds)              |          1.87 |  (I) 12.65% | (I) 12.26% |     (I) 13.90% |
|                                 | hackbench-thread-sockets-21 (seconds)              |          3.16 |  (I) 11.62% | (I) 12.74% |     (I) 13.89% |
|                                 | hackbench-thread-sockets-30 (seconds)              |          4.32 |   (I) 7.35% |  (I) 8.89% |      (I) 9.51% |
|                                 | hackbench-thread-sockets-48 (seconds)              |          6.45 |   (I) 2.69% |  (I) 3.06% |      (I) 3.74% |
|                                 | hackbench-thread-sockets-79 (seconds)              |         10.15 |   (I) 3.30% |      1.98% |      (I) 2.76% |
|                                 | hackbench-thread-sockets-110 (seconds)             |         13.45 |      -0.25% |  (I) 3.68% |          0.44% |
|                                 | hackbench-thread-sockets-141 (seconds)             |         17.87 |  (R) -2.18% |  (I) 8.46% |          1.51% |
|                                 | hackbench-thread-sockets-172 (seconds)             |         24.38 |       1.02% | (I) 24.33% |          1.38% |
|                                 | hackbench-thread-sockets-203 (seconds)             |         28.38 |      -0.99% | (I) 24.20% |          0.57% |
|                                 | hackbench-thread-sockets-234 (seconds)             |         32.75 |      -0.42% | (I) 24.35% |          0.72% |
|                                 | hackbench-thread-sockets-256 (seconds)             |         36.49 |      -1.30% | (I) 26.22% |          0.81% |
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
| pts/nginx                       | Connections: 200 (Requests Per Second)             |     252332.60 |  (I) 17.54% |     -0.53% |         -0.61% |
|                                 | Connections: 1000 (Requests Per Second)            |     248591.29 |  (I) 20.41% |      0.10% |          0.57% |
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+

Thanks,
Ryan
Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by Mel Gorman 3 weeks, 1 day ago
On Fri, Jan 09, 2026 at 10:15:46AM +0000, Ryan Roberts wrote:
> On 08/01/2026 13:15, Ryan Roberts wrote:
> > On 08/01/2026 08:50, Mel Gorman wrote:
> >> On Wed, Jan 07, 2026 at 04:30:09PM +0100, Dietmar Eggemann wrote:
> >>> On 05.01.26 12:45, Ryan Roberts wrote:
> >>>> On 02/01/2026 15:52, Dietmar Eggemann wrote:
> >>>>> On 02.01.26 13:38, Ryan Roberts wrote:
> >>>
> >>> [...]
> >>>
> >>
> >> Sorry for slow responses. I'm not fully back from holidays yet and unfortunately
> >> do not have access to test machines right now, so I cannot revalidate any of
> >> the results against 6.19-rc*.
> > 
> > No problem, thanks for getting back to me!
> > 
> >>
> >>>>>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean 
> >>>>>>> statistically significant regression/improvement, where "statistically 
> >>>>>>> significant" means the 95% confidence intervals do not overlap.
> >>>>>
> >>>>> You mentioned that you reverted this patch 'patch 2/2 'sched/fair:
> >>>>> Reimplement NEXT_BUDDY to align with EEVDF goals'.
> >>>>>
> >>>>> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
> >>>>> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?
> >>>>
> >>>> Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful?
> >>>
> >>> Well, I assume this would be more valuable.
> >>
> >> Agreed because we need to know if it's NEXT_BUDDY that is conceptually
> >> an issue with EEVDF in these cases or the specific implementation. The
> >> comparison between 
> >>
> >> 6.18A					(baseline)
> >> 6.19-rcN vanilla			(New NEXT_BUDDY implementation enabled)
> >> 6.19-rcN revert patches 1+2		(NEXT_BUDDY disabled)
> >> 6.19-rcN revert patch 2 only		(Old NEXT_BUDDY implementation enabled)
> > 
> > OK, I've already got 1, 2 and 4. Let me grab 3 and come back to you - hopefully
> > tomorrow. Then we can take it from there.
> 
> Hi Mel, Dietmar,
> 
> Here are the updated results, now including a column for "revert #1 & #2".
> 
> 6-18-0 (base)		(baseline)
> 6-19-0-rc1		(New NEXT_BUDDY implementation enabled)
> revert #1 & #2		(NEXT_BUDDY disabled)
> revert #2		(Old NEXT_BUDDY implementation enabled)
> 

Thanks.

> 
> The regressions that are fixed by "revert #2" (as originally reported) are still 
> fixed in "revert #1 & #2". Interestingly, performance actually improves further 
> for the latter in the multi-node mysql benchmark (which is our VIP workload). 

It suggests that NEXT_BUDDY in general is harmful to this workload. In an
ideal world, this would also be checked against the NEXT_BUDDY implementation
in CFS, but it would be a waste of time for many reasons. I find it particularly
interesting that it is only measurable with the 2-machine test as it
suggests, but does not prove, that the problem may be related to WF_SYNC
wakeups from the network layer.

> There are a couple of hackbench cases (sockets with high thread counts) that 
> showed an improvement with "revert #2" that is gone with "revert #1 & #2".
> 
> Let me know if I can usefully do anything else.
> 


> Multi-node SUT (workload running across 2 machines):
> 
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
> | Benchmark                       | Result Class                                       | 6-18-0 (base) |  6-19-0-rc1 |  revert #2 | revert #1 & #2 |
> +=================================+====================================================+===============+=============+============+================+
> | repro-collection/mysql-workload | db transaction rate (transactions/min)             |     646267.33 |  (R) -1.33% |  (I) 5.87% |      (I) 7.63% |
> |                                 | new order rate (orders/min)                        |     213256.50 |  (R) -1.32% |  (I) 5.87% |      (I) 7.64% |
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+

Ok, fairly clear there.

> Single-node SUT (workload running on single machine):
> 
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
> | Benchmark                       | Result Class                                       | 6-18-0 (base) |  6-19-0-rc1 |  revert #2 | revert #1 & #2 |
> +=================================+====================================================+===============+=============+============+================+
> | specjbb/composite               | critical-jOPS (jOPS)                               |      94700.00 |  (R) -5.10% |     -0.90% |         -0.37% |
> |                                 | max-jOPS (jOPS)                                    |     113984.50 |  (R) -3.90% |     -0.65% |          0.65% |
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+

I assume this is specjbb2015. I'm a little cautious of these results as
specjbb2015 focuses on peak performance. It starts with low CPU usage and
scales up to find the point where performance reaches a peak. This metric can
be gamed, and what works for specjbb, particularly as the machine approaches
being heavily utilised and transitions to being overloaded, can be problematic.

Can you look at the detailed results for specjbb2015 and determine if the
peak was picked from different load points?

> | repro-collection/mysql-workload | db transaction rate (transactions/min)             |     245438.25 |  (R) -3.88% |     -0.13% |          0.24% |
> |                                 | new order rate (orders/min)                        |      80985.75 |  (R) -3.78% |     -0.07% |          0.29% |
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
> | pts/pgbench                     | Scale: 1 Clients: 1 Read Only (TPS)                |      63124.00 |   (I) 2.90% |      0.74% |          0.85% |
> |                                 | Scale: 1 Clients: 1 Read Only - Latency (ms)       |         0.016 |   (I) 5.49% |      1.05% |          1.05% |
> |                                 | Scale: 1 Clients: 1 Read Write (TPS)               |        974.92 |       0.11% |     -0.08% |         -0.03% |
> |                                 | Scale: 1 Clients: 1 Read Write - Latency (ms)      |          1.03 |       0.12% |     -0.06% |         -0.06% |
> |                                 | Scale: 1 Clients: 250 Read Only (TPS)              |    1915931.58 |  (R) -2.25% |  (I) 2.12% |          1.62% |
> |                                 | Scale: 1 Clients: 250 Read Only - Latency (ms)     |          0.13 |  (R) -2.37% |  (I) 2.09% |          1.69% |
> |                                 | Scale: 1 Clients: 250 Read Write (TPS)             |        855.67 |      -1.36% |     -0.14% |         -0.12% |
> |                                 | Scale: 1 Clients: 250 Read Write - Latency (ms)    |        292.39 |      -1.31% |     -0.08% |         -0.08% |
> |                                 | Scale: 1 Clients: 1000 Read Only (TPS)             |    1534130.08 | (R) -11.37% |      0.08% |          0.48% |
> |                                 | Scale: 1 Clients: 1000 Read Only - Latency (ms)    |          0.65 | (R) -11.38% |      0.08% |          0.44% |
> |                                 | Scale: 1 Clients: 1000 Read Write (TPS)            |        578.75 |      -1.11% |      2.15% |         -0.96% |
> |                                 | Scale: 1 Clients: 1000 Read Write - Latency (ms)   |       1736.98 |      -1.26% |      2.47% |         -0.90% |
> |                                 | Scale: 100 Clients: 1 Read Only (TPS)              |      57170.33 |       1.68% |      0.10% |          0.22% |
> |                                 | Scale: 100 Clients: 1 Read Only - Latency (ms)     |         0.018 |       1.94% |      0.00% |          0.96% |
> |                                 | Scale: 100 Clients: 1 Read Write (TPS)             |        836.58 |      -0.37% |     -0.41% |          0.07% |
> |                                 | Scale: 100 Clients: 1 Read Write - Latency (ms)    |          1.20 |      -0.37% |     -0.40% |          0.06% |
> |                                 | Scale: 100 Clients: 250 Read Only (TPS)            |    1773440.67 |      -1.61% |      1.67% |          1.34% |
> |                                 | Scale: 100 Clients: 250 Read Only - Latency (ms)   |          0.14 |      -1.40% |      1.56% |          1.20% |
> |                                 | Scale: 100 Clients: 250 Read Write (TPS)           |       5505.50 |      -0.17% |     -0.86% |         -1.66% |
> |                                 | Scale: 100 Clients: 250 Read Write - Latency (ms)  |         45.42 |      -0.17% |     -0.85% |         -1.67% |
> |                                 | Scale: 100 Clients: 1000 Read Only (TPS)           |    1393037.50 | (R) -10.31% |     -0.19% |          0.53% |
> |                                 | Scale: 100 Clients: 1000 Read Only - Latency (ms)  |          0.72 | (R) -10.30% |     -0.17% |          0.53% |
> |                                 | Scale: 100 Clients: 1000 Read Write (TPS)          |       5085.92 |       0.27% |      0.07% |         -0.79% |
> |                                 | Scale: 100 Clients: 1000 Read Write - Latency (ms) |        196.79 |       0.23% |      0.05% |         -0.81% |

A few points of concern but nothing as severe as the mysql Multi-node
SUT. The worst regressions are when the number of clients exceeds the number
of CPUs and at that point any wakeup preemption is potentially harmful.
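
For context, the overloaded read-only case corresponds roughly to a stock
pgbench invocation like the one below; the exact flags pts/pgbench passes
are not visible in this thread, so treat it as an approximation.

# Select-only workload, 1000 clients on a 64-CPU machine
# (database initialised beforehand with: pgbench -i -s 1)
pgbench -S -c 1000 -j 64 -T 60 pgbench_db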

> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
> | mmtests/hackbench               | hackbench-process-pipes-1 (seconds)                |          0.14 |      -1.51% |     -1.05% |         -1.51% |
> |                                 | hackbench-process-pipes-4 (seconds)                |          0.44 |   (I) 6.49% |  (I) 5.42% |      (I) 6.06% |
> |                                 | hackbench-process-pipes-7 (seconds)                |          0.68 | (R) -18.36% |  (I) 3.40% |         -0.41% |

So hackbench is all over the place, with a mix of gains and losses and no
clear winner.

> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
> | pts/nginx                       | Connections: 200 (Requests Per Second)             |     252332.60 |  (I) 17.54% |     -0.53% |         -0.61% |
> |                                 | Connections: 1000 (Requests Per Second)            |     248591.29 |  (I) 20.41% |      0.10% |          0.57% |
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
> 

And this is the main winner. The results confirm that NEXT_BUDDY is not
a universal win, but the mysql results and Daytrader results from Madadi
are a concern.

I still don't have access to test machines to investigate this properly
and may not have access for 1-2 weeks. I think the best approach for now
is to disable NEXT_BUDDY by default again until it's determined exactly
why mysql multi-host and daytrader suffered.

Can you test this to be sure please?

--8<--
sched/fair: Disable scheduler feature NEXT_BUDDY

NEXT_BUDDY was disabled with the introduction of EEVDF and enabled again
after NEXT_BUDDY was rewritten for EEVDF by commit e837456fdca8 ("sched/fair:
Reimplement NEXT_BUDDY to align with EEVDF goals"). It was not expected
that this would be a universal win without a crystal ball instruction
but the reported regressions are a concern [1][2] even if gains were
also reported. Specifically:

o mysql with client/server running on different servers regresses
o specjbb reports lower peak metrics
o daytrader regresses

The mysql regression is realistic and a concern. It needs to be confirmed
whether specjbb is simply shifting the point where peak performance is
measured, but it is still a concern. daytrader is considered to be
representative of a real workload.

Access to test machines is currently problematic for verifying any fix to
this problem. Disable NEXT_BUDDY for now by default until the root causes
are addressed.

Link: https://lore.kernel.org/lkml/4b96909a-f1ac-49eb-b814-97b8adda6229@arm.com [1]
Link: https://lore.kernel.org/lkml/ec3ea66f-3a0d-4b5a-ab36-ce778f159b5b@linux.ibm.com [2]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/features.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 980d92bab8ab..136a6584be79 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.
  */
-SCHED_FEAT(NEXT_BUDDY, true)
+SCHED_FEAT(NEXT_BUDDY, false)
 
 /*
  * Allow completely ignoring cfs_rq->next; which can be set from various
Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by Peter Zijlstra 3 weeks, 5 days ago
On Fri, Jan 09, 2026 at 10:15:46AM +0000, Ryan Roberts wrote:

> Here are the updated results, now including a column for "revert #1 & #2".
> 
> 6-18-0 (base)		(baseline)
> 6-19-0-rc1		(New NEXT_BUDDY implementation enabled)
> revert #1 & #2		(NEXT_BUDDY disabled)
> revert #2		(Old NEXT_BUDDY implementation enabled)
> 
> 
> The regressions that are fixed by "revert #2" (as originally reported) are still 
> fixed in "revert #1 & #2". Interestingly, performance actually improves further 
> for the latter in the multi-node mysql benchmark (which is our VIP workload). 
> There are a couple of hackbench cases (sockets with high thread counts) that 
> showed an improvement with "revert #2" that is gone with "revert #1 & #2".
> 
> Let me know if I can usefully do anything else.

If it's not too much bother, could you run 6.19-rc with SCHED_BATCH? The
defining characteristic of BATCH is that it fully ignores wakeup
preemption.
Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by Ryan Roberts 3 weeks, 5 days ago
On 12/01/2026 07:47, Peter Zijlstra wrote:
> On Fri, Jan 09, 2026 at 10:15:46AM +0000, Ryan Roberts wrote:
> 
>> Here are the updated results, now including a column for "revert #1 & #2".
>>
>> 6-18-0 (base)		(baseline)
>> 6-19-0-rc1		(New NEXT_BUDDY implementation enabled)
>> revert #1 & #2		(NEXT_BUDDY disabled)
>> revert #2		(Old NEXT_BUDDY implementation enabled)
>>
>>
>> The regressions that are fixed by "revert #2" (as originally reported) are still 
>> fixed in "revert #1 & #2". Interestingly, performance actually improves further 
>> for the latter in the multi-node mysql benchmark (which is our VIP workload). 
>> There are a couple of hackbench cases (sockets with high thread counts) that 
>> showed an improvement with "revert #2" that is gone with "revert #1 & #2".
>>
>> Let me know if I can usefully do anything else.
> 
> If it's not too much bother, could you run 6.19-rc with SCHED_BATCH? The
> defining characteristic of BATCH is that it fully ignores wakeup
> preemption.

Is there a way I can force all future tasks to use SCHED_BATCH at the system
level? (a Kconfig, cmdline arg or sysfs toggle?) If so that would be simple for
me to do. But if I need to invoke the top level command with chrt -b and hope
that nothing in the workload explicitly changes the scheduling policy, that would
be both trickier for me to do and (I guess) higher risk that it ends up not
doing what I expected. Happy to give whatever you recommend a try...
Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by K Prateek Nayak 3 weeks, 4 days ago
Hello Ryan,

On 1/12/2026 2:22 PM, Ryan Roberts wrote:
> On 12/01/2026 07:47, Peter Zijlstra wrote:
>> On Fri, Jan 09, 2026 at 10:15:46AM +0000, Ryan Roberts wrote:
>>
>>> Here are the updated results, now including a column for "revert #1 & #2".
>>>
>>> 6-18-0 (base)		(baseline)
>>> 6-19-0-rc1		(New NEXT_BUDDY implementation enabled)
>>> revert #1 & #2		(NEXT_BUDDY disabled)
>>> revert #2		(Old NEXT_BUDDY implementation enabled)
>>>
>>>
>>> The regressions that are fixed by "revert #2" (as originally reported) are still 
>>> fixed in "revert #1 & #2". Interestingly, performance actually improves further 
>>> for the latter in the multi-node mysql benchmark (which is our VIP workload). 
>>> There are a couple of hackbench cases (sockets with high thread counts) that 
>>> showed an improvement with "revert #2" that is gone with "revert #1 & #2".
>>>
>>> Let me know if I can usefully do anything else.
>>
>> If it's not too much bother, could you run 6.19-rc with SCHED_BATCH? The
>> defining characteristic of BATCH is that it fully ignores wakeup
>> preemption.
> 
> Is there a way I can force all future tasks to use SCHED_BATCH at the system
> level?

One shortcut is to echo "NO_WAKEUP_PREEMPTION" into
/sys/kernel/debug/sched/features but note it'll disable wakeup preemption
for all tasks, including kthreads, which might adversely affect
performance and is not an exact equivalent to only running the workload
under SCHED_BATCH.

For repro-collection/mysql-workload, (which I presume is [1]), there is
a "WORKLOAD_SCHED_POLICY" environment variable that can be overridden [2]
which controls the "CPUSchedulingPolicy" of the mysqld service.

[1] https://github.com/aws/repro-collection/tree/main/workloads/mysql
[2] https://github.com/aws/repro-collection/blob/a2cdf0455bd3422c9c1fc689ceac32971223b984/repros/repro-mysql-EEVDF-regression/main.sh#L102
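
A sketch of both options ("batch" as the policy value is an assumption based
on systemd's CPUSchedulingPolicy= settings; check the repro scripts for the
accepted values):

# Option 1: disable wakeup preemption globally (affects kthreads too)
echo NO_WAKEUP_PREEMPTION > /sys/kernel/debug/sched/features

# Option 2: run only the mysqld service under SCHED_BATCH
WORKLOAD_SCHED_POLICY=batch ./main.sh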

-- 
Thanks and Regards,
Prateek
Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by Peter Zijlstra 3 weeks, 5 days ago
On Mon, Jan 12, 2026 at 08:52:17AM +0000, Ryan Roberts wrote:
> On 12/01/2026 07:47, Peter Zijlstra wrote:
> > On Fri, Jan 09, 2026 at 10:15:46AM +0000, Ryan Roberts wrote:
> > 
> >> Here are the updated results, now including a column for "revert #1 & #2".
> >>
> >> 6-18-0 (base)		(baseline)
> >> 6-19-0-rc1		(New NEXT_BUDDY implementation enabled)
> >> revert #1 & #2		(NEXT_BUDDY disabled)
> >> revert #2		(Old NEXT_BUDDY implementation enabled)
> >>
> >>
> >> The regressions that are fixed by "revert #2" (as originally reported) are still 
> >> fixed in "revert #1 & #2". Interestingly, performance actually improves further 
> >> for the latter in the multi-node mysql benchmark (which is our VIP workload). 
> >> There are a couple of hackbench cases (sockets with high thread counts) that 
> >> showed an improvement with "revert #2" that is gone with "revert #1 & #2".
> >>
> >> Let me know if I can usefully do anything else.
> > 
> > If it's not too much bother, could you run 6.19-rc with SCHED_BATCH? The
> > defining characteristic of BATCH is that it fully ignores wakeup
> > preemption.
> 
> Is there a way I can force all future tasks to use SCHED_BATCH at the system
> level? (a Kconfig, cmdline arg or sysfs toggle?) If so that would be simple for
> me to do. But if I need to invoke the top level command with chrt -b and hope
> that nothing in the workload explicitly changes the scheduling policy, that would
> be both trickier for me to do and (I guess) higher risk that it ends up not
> doing what I expected. Happy to give whatever you recommend a try...

No fancy things here, chrt/schedtool are it.
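
For reference, a minimal way to do that with chrt (the wrapper script name is
illustrative):

# Launch the whole harness under SCHED_BATCH; batch takes priority 0
chrt -b 0 ./run-benchmarks.sh

# Verify the policy of an already-running task
chrt -p "$(pidof mysqld)"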
Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by Ryan Roberts 3 weeks, 4 days ago
On 12/01/2026 09:57, Peter Zijlstra wrote:
> On Mon, Jan 12, 2026 at 08:52:17AM +0000, Ryan Roberts wrote:
>> On 12/01/2026 07:47, Peter Zijlstra wrote:
>>> On Fri, Jan 09, 2026 at 10:15:46AM +0000, Ryan Roberts wrote:
>>>
>>>> Here are the updated results, now including a column for "revert #1 & #2".
>>>>
>>>> 6-18-0 (base)		(baseline)
>>>> 6-19-0-rc1		(New NEXT_BUDDY implementation enabled)
>>>> revert #1 & #2		(NEXT_BUDDY disabled)
>>>> revert #2		(Old NEXT_BUDDY implementation enabled)
>>>>
>>>>
>>>> The regressions that are fixed by "revert #2" (as originally reported) are still 
>>>> fixed in "revert #1 & #2". Interestingly, performance actually improves further 
>>>> for the latter in the multi-node mysql benchmark (which is our VIP workload). 
>>>> There are a couple of hackbench cases (sockets with high thread counts) that 
>>>> showed an improvement with "revert #2" that is gone with "revert #1 & #2".
>>>>
>>>> Let me know if I can usefully do anything else.
>>>
>>> If it's not too much bother, could you run 6.19-rc with SCHED_BATCH? The
>>> defining characteristic of BATCH is that it fully ignores wakeup
>>> preemption.
>>
>> Is there a way I can force all future tasks to use SCHED_BATCH at the system
>> level? (a Kconfig, cmdline arg or sysfs toggle?) If so that would be simple for
>> me to do. But if I need to invoke the top level command with chrt -b and hope
>> that nothing in the workload explicitly changes the scheduling policy, that would
>> be both trickier for me to do and (I guess) higher risk that it ends up not
>> doing what I expected. Happy to give whatever you recommend a try...
> 
> No fancy things here, chrt/schedtool are it.

OK I'll figure out how to butcher this into my workflow and get back to you with
results. It probably won't be until Wednesday though.

Thanks,
Ryan
Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by Shrikanth Hegde 1 month ago
Hi Ryan,

>> node distances:
>> node   0   1
>>    0:  10  20
>>    1:  20  10
>>
>> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC
>> domain? I guess topology has influence in benchmark numbers here as well.
> 
> I can't easily enable scheduler debugging right now (which I think is needed to
> get this info directly?). But that's what I'd expect, yes. lscpu confirms there
> is a single NUMA node and topology for cpu0 gives this if it helps:

If you dump /proc/schedstat it should give you topology info as well.

(you will need to parse it depending on which CPU you are looking at it from)
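
For example, something like this should print the domain hierarchy as seen
from each CPU:

# cpuN lines are followed by their domainN lines
grep -E '^(cpu|domain)' /proc/schedstat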
Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Posted by Ryan Roberts 1 month ago
On 05/01/2026 14:38, Shrikanth Hegde wrote:
> 
> Hi Ryan,
> 
>>> node distances:
>>> node   0   1
>>>    0:  10  20
>>>    1:  20  10
>>>
>>> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC
>>> domain? I guess topology has influence in benchmark numbers here as well.
>>
>> I can't easily enable scheduler debugging right now (which I think is needed to
>> get this info directly?). But that's what I'd expect, yes. lscpu confirms there
>> is a single NUMA node and topology for cpu0 gives this if it helps:
> 
> If you dump /proc/schedstat it should give you topology info as well.
> 
> (you will need to parse it depending on which CPU you are looking at it from)

Ahh yes, thanks!

Every cpu is reported as being in "domain0 MC ffffffff,ffffffff". So I guess
that means there is a single MC domain as Dietmar suggests.

Thanks,
Ryan