Date: Tue, 11 Nov 2025 11:37:20 -0000
From: "tip-bot2 for Peter Zijlstra"
Sender: tip-bot2@linutronix.de
Reply-to: linux-kernel@vger.kernel.org
To: linux-tip-commits@vger.kernel.org
Cc: "Peter Zijlstra (Intel)", x86@kernel.org, linux-kernel@vger.kernel.org
Subject: [tip: sched/core] sched/core: Add comment explaining force-idle vruntime snapshots
In-Reply-To: <20251106111603.GB4068168@noisy.programming.kicks-ass.net>
References: <20251106111603.GB4068168@noisy.programming.kicks-ass.net>
Message-ID: <176286104030.498.2731665810575423837.tip-bot2@tip-bot2>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     9359d9785d85bb53f1ff1738a59aeeec4b878906
Gitweb:        https://git.kernel.org/tip/9359d9785d85bb53f1ff1738a59aeeec4b878906
Author:        Peter Zijlstra
AuthorDate:    Thu, 06 Nov 2025 10:50:49 +01:00
Committer:     Peter Zijlstra
CommitterDate: Tue, 11 Nov 2025 12:33:37 +01:00

sched/core: Add comment explaining force-idle vruntime snapshots

I always end up having to re-read these emails every time I look at
this code. And a future patch is going to change this story a little.

This means it is past time to stick them in a comment so it can be
modified and stay current.

Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20200506143506.GH5298@hirez.programming.kicks-ass.net
Link: https://lkml.kernel.org/r/20200515103844.GG2978@hirez.programming.kicks-ass.net
Link: https://patch.msgid.link/20251106111603.GB4068168@noisy.programming.kicks-ass.net
---
 kernel/sched/fair.c | 181 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 181 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f1d8eb3..4a11a83 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13015,6 +13015,187 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
 }
 
 /*
+ * Consider any infeasible weight scenario. Take for instance two tasks,
+ * each bound to their respective sibling, one with weight 1 and one with
+ * weight 2. Then the lower weight task will run ahead of the higher
+ * weight task without bound.
+ *
+ * This utterly destroys the concept of a shared time base.
+ *
+ * Remember: all this is about proportionally fair scheduling, where each
+ * task receives:
+ *
+ *                w_i
+ *   dt_i = ---------- dt                                              (1)
+ *          \Sum_j w_j
+ *
+ * which we do by tracking a virtual time, s_i:
+ *
+ *          1
+ *   s_i = --- d[t]_i                                                  (2)
+ *         w_i
+ *
+ * Where d[t] is a delta of discrete time, while dt is an infinitesimal.
+ * The immediate corollary is that the ideal schedule S, which is (2)
+ * taken with an infinitesimal delta, is:
+ *
+ *            1
+ *   S = ---------- dt                                                 (3)
+ *       \Sum_i w_i
+ *
+ * From which we can define the lag, or deviation from the ideal, as:
+ *
+ *   lag(i) = S - s_i                                                  (4)
+ *
+ * And since the one and only purpose is to approximate S, we get that:
+ *
+ *   \Sum_i w_i lag(i) := 0                                            (5)
+ *
+ * If this were not so, we no longer converge to S, and we can no longer
+ * claim our scheduler has any of the properties we derive from S. This
+ * is exactly what the infeasible weight scenario above breaks!
+ *
+ *
+ * Let's continue for a while though, to see if there is anything useful
+ * to be learned. We can combine (1)-(3) or (4)-(5) and express S in s_i:
+ *
+ *       \Sum_i w_i s_i
+ *   S = --------------                                                (6)
+ *         \Sum_i w_i
+ *
+ * Which gives us a way to compute S, given our s_i. Now, if you've read
+ * our code, you know that we do not in fact do this; the reason for
+ * this is two-fold.
+ * Firstly, computing S in that way requires a 64-bit division every
+ * time we'd use it (see (12)); and secondly, this only describes the
+ * steady state, it doesn't handle dynamics.
+ *
+ * Anyway, in (6) substitute: s_i -> x + (s_i - x), to get:
+ *
+ *           \Sum_i w_i (s_i - x)
+ *   S - x = --------------------                                      (7)
+ *                \Sum_i w_i
+ *
+ * Which shows that S and s_i transform alike (which makes perfect sense
+ * given that S is basically the (weighted) average of s_i).
+ *
+ * Then:
+ *
+ *   x -> s_min := min{s_i}                                            (8)
+ *
+ * to obtain:
+ *
+ *               \Sum_i w_i (s_i - s_min)
+ *   S = s_min + ------------------------                              (9)
+ *                      \Sum_i w_i
+ *
+ * Which already looks familiar, and is the basis for our current
+ * approximation:
+ *
+ *   S ~= s_min                                                        (10)
+ *
+ * Now, obviously, (10) is absolute crap :-), but it sorta works.
+ *
+ * So the thing to remember is that the above is strictly UP. It is
+ * possible to generalize to multiple runqueues -- however it gets
+ * really yucky when you have to add affinity support, as illustrated by
+ * our very first counter-example.
+ *
+ * Luckily I think we can avoid needing a full multi-queue variant for
+ * core-scheduling (or load-balancing). The crucial observation is that
+ * we only actually need this comparison in the presence of forced-idle;
+ * only then do we need to tell whether the stalled rq has a higher
+ * priority than the other.
+ *
+ * [XXX assumes SMT2; better consider the more general case, I suspect
+ * it'll work out because our comparison is always between 2 rqs and the
+ * answer is only interesting if one of them is forced-idle]
+ *
+ * And (under the assumption of SMT2) when there is forced-idle, there
+ * is only a single active queue, so everything works like normal.
+ *
+ * Let, for our runqueue 'k':
+ *
+ *   T_k = \Sum_i w_i s_i
+ *   W_k = \Sum_i w_i    ; for all i of k                              (11)
+ *
+ * Then we can write (6) like:
+ *
+ *         T_k
+ *   S_k = ---                                                         (12)
+ *         W_k
+ *
+ * From which immediately follows that:
+ *
+ *           T_k + T_l
+ *   S_k+l = ---------                                                 (13)
+ *           W_k + W_l
+ *
+ * On which we can define a combined lag:
+ *
+ *   lag_k+l(i) := S_k+l - s_i                                         (14)
+ *
+ * And that gives us the tools to compare tasks across a combined
+ * runqueue.
+ *
+ *
+ * Combined this gives the following:
+ *
+ *  a) when a runqueue enters force-idle, sync it against its sibling
+ *     rq(s) using (7); this only requires storing single 'time'-stamps.
+ *
+ *  b) when comparing tasks between 2 runqueues of which one is
+ *     forced-idle, compare the combined lag, per (14).
+ *
+ * Now, of course cgroups (I so hate them) make this more interesting in
+ * that a) seems to suggest we need to iterate all cgroups on a CPU at
+ * such boundaries, but I think we can avoid that. The force-idle is for
+ * the whole CPU, all its rqs. So we can mark it in the root and lazily
+ * propagate downward on demand.
+ */
+
+/*
+ * So this sync is basically a relative reset of S to 0.
+ *
+ * So with 2 queues, when one goes idle, we drop them both to 0 and one
+ * then increases due to not being idle, and the idle one builds up lag
+ * to get re-elected. So far so simple, right?
+ *
+ * When there are 3, we can have the situation where 2 run and one is
+ * idle; we sync to 0 and let the idle one build up lag to get
+ * re-elected. Now suppose another one also goes idle. At this point
+ * dropping all to 0 again would destroy the built-up lag from the queue
+ * that was already idle, not good.
+ *
+ * So instead of syncing everything, we can observe that:
+ *
+ *   less := !((s64)(s_a - s_b) <= 0)
+ *
+ *   (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b
+ *                             == v_a - (v_b - S_a + S_b)
+ *
+ * IOW, we can recast the (lag) comparison as a one-sided difference.
+ * So then, instead of syncing the whole queue, we sync the idle queue
+ * against the active queue with S_a + S_b at the point where we sync.
+ *
+ * (XXX consider the implication of living in a cyclic group: N / 2^n N)
+ *
+ * This gives us a means of syncing single queues against the active
+ * queue, and lets already idle queues preserve their built-up lag.
+ *
+ * Of course, then we get the situation where there are 2 active and one
+ * going idle: which do we pick to sync against? Theory would have us
+ * sync against the combined S, but as we've already demonstrated, there
+ * is no such thing in infeasible weight scenarios.
+ *
+ * One thing I've considered -- and this is where that core_active
+ * rudiment came from -- is having active queues sync up between
+ * themselves after every tick. This limits the observed divergence due
+ * to work conservation.
+ *
+ * On top of that, we can improve upon things by moving away from our
+ * horrible (10) hack and moving to (9), employing (13) here.
+ */
+
+/*
  * se_fi_update - Update the cfs_rq->min_vruntime_fi in a CFS hierarchy if needed.
  */
 static void se_fi_update(const struct sched_entity *se, unsigned int fi_seq,
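
For illustration only -- this is not part of the patch -- here is a small
userspace C sketch of the bookkeeping described by (1)-(2), the weighted
average (6) and the s_min approximation (10), applied to the weight-1 vs
weight-2 counter-example the comment opens with. The task struct, the
SCALE fixed-point factor and the tick loop are invented for the example;
only the arithmetic follows the comment.

#include <stdio.h>
#include <stdint.h>

#define SCALE 1024                      /* fixed point so 1/w_i stays integral */

struct task {
        uint64_t weight;                /* w_i */
        uint64_t s;                     /* s_i, virtual time (scaled by SCALE) */
};

/* (2): advance a task's virtual time by a discrete slice d[t]_i */
static void account(struct task *t, uint64_t delta)
{
        t->s += delta * SCALE / t->weight;
}

/* (6): S = \Sum_i w_i s_i / \Sum_i w_i, the weighted average */
static uint64_t avg_S(const struct task *t, int n)
{
        uint64_t T = 0, W = 0;

        for (int i = 0; i < n; i++) {
                T += t[i].weight * t[i].s;
                W += t[i].weight;
        }
        return T / W;
}

/* (10): the cheap approximation S ~= s_min */
static uint64_t min_S(const struct task *t, int n)
{
        uint64_t m = t[0].s;

        for (int i = 1; i < n; i++)
                if (t[i].s < m)
                        m = t[i].s;
        return m;
}

int main(void)
{
        /* each task pinned to its own sibling, so both run all the time */
        struct task t[2] = { { .weight = 1 }, { .weight = 2 } };

        for (int tick = 0; tick < 5; tick++) {
                account(&t[0], 1000);   /* same wall time for both */
                account(&t[1], 1000);
                printf("tick %d: s_0=%llu s_1=%llu S=%llu s_min=%llu\n", tick,
                       (unsigned long long)t[0].s, (unsigned long long)t[1].s,
                       (unsigned long long)avg_S(t, 2),
                       (unsigned long long)min_S(t, 2));
        }
        return 0;
}

The weight-1 task's s_i pulls another 512000 units ahead of the weight-2
task's every tick, without bound: exactly the "no shared time base"
problem the comment starts from.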
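
Similarly, a sketch (again not part of the patch, with made-up numbers) of
the per-runqueue sums (11), the combined average (12)-(13) and the combined
lag (14) used to compare tasks across two sibling runqueues:

#include <stdio.h>
#include <stdint.h>

struct rq_sums {
        uint64_t T;                     /* T_k = \Sum_i w_i s_i */
        uint64_t W;                     /* W_k = \Sum_i w_i */
};

/* (13): S_k+l = (T_k + T_l) / (W_k + W_l) */
static uint64_t combined_S(struct rq_sums k, struct rq_sums l)
{
        return (k.T + l.T) / (k.W + l.W);
}

/* (14): lag_k+l(i) = S_k+l - s_i; positive lag means the task is owed time */
static int64_t combined_lag(uint64_t S_kl, uint64_t s_i)
{
        return (int64_t)(S_kl - s_i);
}

int main(void)
{
        struct rq_sums k = { .T = 3000, .W = 3 };       /* S_k = 1000, forced idle */
        struct rq_sums l = { .T = 9000, .W = 3 };       /* S_l = 3000, kept running */
        uint64_t S_kl = combined_S(k, l);               /* (13): 2000 */

        /* a task on the forced-idle rq with s_i = 1500 has positive
         * combined lag, so it should win the cross-rq comparison */
        printf("S_k+l = %llu, lag(1500) = %lld, lag(3200) = %lld\n",
               (unsigned long long)S_kl,
               (long long)combined_lag(S_kl, 1500),
               (long long)combined_lag(S_kl, 3200));
        return 0;
}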
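
Finally, the one-sided recast from the second comment block -- comparing
(v_a - S_a) against (v_b - S_b) with wrap-safe signed deltas -- can be
sketched as below. This only mirrors the idea; the in-kernel comparison
builds on the cfs_rq->min_vruntime_fi snapshots that se_fi_update()
(visible in the context above) maintains, and the snapshot values here
are invented.

#include <stdio.h>
#include <stdint.h>

typedef uint64_t u64;
typedef int64_t s64;

/*
 * Compare task a (vruntime va, queue snapshot Sa) against task b
 * (vruntime vb, queue snapshot Sb). Returns non-zero when a has made
 * less virtual progress since its snapshot, i.e. should run first.
 * The subtractions are done in u64 and only the final difference is
 * cast to s64, so the result stays correct across counter wrap-around
 * (the "cyclic group" remark above).
 */
static int task_less(u64 va, u64 Sa, u64 vb, u64 Sb)
{
        s64 delta = (s64)((va - Sa) - (vb - Sb));

        return delta < 0;
}

int main(void)
{
        /* snapshots taken when one queue was forced idle */
        u64 S_idle = 50000, S_active = 10000;

        /* the idle queue's task wins despite its larger raw vruntime,
         * because it has advanced less relative to its own snapshot */
        printf("%d\n", task_less(51000, S_idle, 12000, S_active));     /* 1 */
        printf("%d\n", task_less(52500, S_idle, 12000, S_active));     /* 0 */
        return 0;
}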