Date: Sun, 18 Aug 2024 06:23:07 -0000
From: "tip-bot2 for Peter Zijlstra"
Sender: tip-bot2@linutronix.de
Reply-to: linux-kernel@vger.kernel.org
To: linux-tip-commits@vger.kernel.org
Subject: [tip: sched/core] sched/fair: Implement delayed dequeue
Cc: "Peter Zijlstra (Intel)", Valentin Schneider, x86@kernel.org,
    linux-kernel@vger.kernel.org
In-Reply-To: <20240727105030.226163742@infradead.org>
References: <20240727105030.226163742@infradead.org>
Message-ID: <172396218770.2215.10414003605653657609.tip-bot2@tip-bot2>

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     152e11f6df293e816a6a37c69757033cdc72667d
Gitweb:        https://git.kernel.org/tip/152e11f6df293e816a6a37c69757033cdc72667d
Author:        Peter Zijlstra
AuthorDate:    Thu, 23 May 2024 12:25:32 +02:00
Committer:     Peter Zijlstra
CommitterDate: Sat, 17 Aug 2024 11:06:44 +02:00

sched/fair: Implement delayed dequeue

Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
noting that lag is fundamentally a temporal measure. It should not be
carried around indefinitely.

On the other hand, it should also not be instantly discarded; doing so
would allow a task to game the system by purposefully (micro) sleeping
at the end of its time quantum.

Since lag is intimately tied to the virtual time base, a wall-time
based decay is also insufficient; notably, competition is required for
any of this to make sense.

Instead, delay the dequeue and keep the 'tasks' on the runqueue,
competing until they are eligible. Strictly speaking, we only care
about keeping them until the 0-lag point, but that is a difficult
proposition; instead, carry them around until they get picked again
and dequeue them at that point.

Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Valentin Schneider
Tested-by: Valentin Schneider
Link: https://lkml.kernel.org/r/20240727105030.226163742@infradead.org
---
 kernel/sched/deadline.c |  1 +-
 kernel/sched/fair.c     | 80 +++++++++++++++++++++++++++++++++++-----
 kernel/sched/features.h |  9 +++++-
 3 files changed, 79 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index bbaeace..0f2df67 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2428,7 +2428,6 @@ again:
 	else
 		p = dl_se->server_pick_next(dl_se);
 	if (!p) {
-		WARN_ON_ONCE(1);
 		dl_se->dl_yielded = 1;
 		update_curr_dl_se(rq, dl_se, 0);
 		goto again;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25b14df..da5065a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5379,20 +5379,39 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)

 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);

-static void
+static bool
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-	int action = UPDATE_TG;
+	update_curr(cfs_rq);
+
+	if (flags & DEQUEUE_DELAYED) {
+		SCHED_WARN_ON(!se->sched_delayed);
+	} else {
+		bool sleep = flags & DEQUEUE_SLEEP;

+		/*
+		 * DELAY_DEQUEUE relies on spurious wakeups; special task
+		 * states must not suffer spurious wakeups, exempt them.
+		 */
+		if (flags & DEQUEUE_SPECIAL)
+			sleep = false;
+
+		SCHED_WARN_ON(sleep && se->sched_delayed);
+
+		if (sched_feat(DELAY_DEQUEUE) && sleep &&
+		    !entity_eligible(cfs_rq, se)) {
+			if (cfs_rq->next == se)
+				cfs_rq->next = NULL;
+			se->sched_delayed = 1;
+			return false;
+		}
+	}
+
+	int action = UPDATE_TG;
 	if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
 		action |= DO_DETACH;

 	/*
-	 * Update run-time statistics of the 'current'.
-	 */
-	update_curr(cfs_rq);
-
-	/*
 	 * When dequeuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
 	 *   - For group_entity, update its runnable_weight to reflect the new
@@ -5428,8 +5447,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
 		update_min_vruntime(cfs_rq);

+	if (flags & DEQUEUE_DELAYED)
+		se->sched_delayed = 0;
+
 	if (cfs_rq->nr_running == 0)
 		update_idle_cfs_rq_clock_pelt(cfs_rq);
+
+	return true;
 }

 static void
@@ -5828,11 +5852,21 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	idle_task_delta = cfs_rq->idle_h_nr_running;
 	for_each_sched_entity(se) {
 		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+		int flags;
+
 		/* throttled entity or throttle-on-deactivate */
 		if (!se->on_rq)
 			goto done;

-		dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
+		/*
+		 * Abuse SPECIAL to avoid delayed dequeue in this instance.
+		 * This avoids teaching dequeue_entities() about throttled
+		 * entities and keeps things relatively simple.
+		 */
+		flags = DEQUEUE_SLEEP | DEQUEUE_SPECIAL;
+		if (se->sched_delayed)
+			flags |= DEQUEUE_DELAYED;
+		dequeue_entity(qcfs_rq, se, flags);

 		if (cfs_rq_is_idle(group_cfs_rq(se)))
 			idle_task_delta = cfs_rq->h_nr_running;
@@ -6918,6 +6952,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	bool was_sched_idle = sched_idle_rq(rq);
 	int rq_h_nr_running = rq->cfs.h_nr_running;
 	bool task_sleep = flags & DEQUEUE_SLEEP;
+	bool task_delayed = flags & DEQUEUE_DELAYED;
 	struct task_struct *p = NULL;
 	int idle_h_nr_running = 0;
 	int h_nr_running = 0;
@@ -6931,7 +6966,13 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)

 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
-		dequeue_entity(cfs_rq, se, flags);
+
+		if (!dequeue_entity(cfs_rq, se, flags)) {
+			if (p && &p->se == se)
+				return -1;
+
+			break;
+		}

 		cfs_rq->h_nr_running -= h_nr_running;
 		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
@@ -6956,6 +6997,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 			break;
 		}
 		flags |= DEQUEUE_SLEEP;
+		flags &= ~(DEQUEUE_DELAYED | DEQUEUE_SPECIAL);
 	}

 	for_each_sched_entity(se) {
@@ -6985,6 +7027,17 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
 		rq->next_balance = jiffies;

+	if (p && task_delayed) {
+		SCHED_WARN_ON(!task_sleep);
+		SCHED_WARN_ON(p->on_rq != 1);
+
+		/* Fix-up what dequeue_task_fair() skipped */
+		hrtick_update(rq);
+
+		/* Fix-up what block_task() skipped. */
+		__block_task(rq, p);
+	}
+
 	return 1;
 }

@@ -6997,8 +7050,10 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	util_est_dequeue(&rq->cfs, p);

-	if (dequeue_entities(rq, &p->se, flags) < 0)
+	if (dequeue_entities(rq, &p->se, flags) < 0) {
+		util_est_update(&rq->cfs, p, DEQUEUE_SLEEP);
 		return false;
+	}

 	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
 	hrtick_update(rq);
@@ -12971,6 +13026,11 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 		/* ensure bandwidth has been allocated on our new cfs_rq */
 		account_cfs_rq_runtime(cfs_rq, 0);
 	}
+
+	if (!first)
+		return;
+
+	SCHED_WARN_ON(se->sched_delayed);
 }

 void init_cfs_rq(struct cfs_rq *cfs_rq)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 97fb2d4..1feaa7b 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -29,6 +29,15 @@ SCHED_FEAT(NEXT_BUDDY, false)
 SCHED_FEAT(CACHE_HOT_BUDDY, true)

 /*
+ * Delay dequeueing tasks until they get selected or woken.
+ *
+ * By delaying the dequeue for non-eligible tasks, they remain in the
+ * competition and can burn off their negative lag. When they get selected
+ * they'll have positive lag by definition.
+ */
+SCHED_FEAT(DELAY_DEQUEUE, true)
+
+/*
  * Allow wakeup-time preemption of the current task:
  */
 SCHED_FEAT(WAKEUP_PREEMPTION, true)
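
To make the DELAY_DEQUEUE mechanism concrete, here is a minimal user-space
sketch of the decision the patch adds. It is illustrative only, not kernel
code: struct entity, entity_is_eligible(), dequeue_entity_model() and
pick_entity_model() are invented stand-ins for sched_entity,
entity_eligible(), dequeue_entity() and the pick path, and eligibility is
reduced to "lag is non-negative".

/*
 * Toy model of delayed dequeue (illustrative stand-ins, not kernel code).
 *
 * A sleeping entity with negative lag is not removed from the runqueue;
 * it is only marked sched_delayed so it keeps competing. The real
 * dequeue happens when it is picked again, at which point it is eligible.
 */
#include <stdbool.h>
#include <stdio.h>

struct entity {
	long vlag;		/* stand-in for the entity's lag */
	bool on_rq;
	bool sched_delayed;
};

/* Stand-in for entity_eligible(): eligible once lag is non-negative. */
static bool entity_is_eligible(const struct entity *se)
{
	return se->vlag >= 0;
}

/*
 * Mirrors the new dequeue_entity() contract: return true when the entity
 * was really dequeued, false when the dequeue was delayed instead.
 */
static bool dequeue_entity_model(struct entity *se, bool sleep)
{
	if (sleep && !entity_is_eligible(se)) {
		se->sched_delayed = true;	/* stay in the competition */
		return false;
	}
	se->on_rq = false;			/* normal, immediate dequeue */
	return true;
}

/* Stand-in for the pick path: a delayed entity is finally dequeued here. */
static void pick_entity_model(struct entity *se)
{
	if (se->sched_delayed) {
		se->sched_delayed = false;
		se->on_rq = false;		/* the deferred dequeue */
	}
}

int main(void)
{
	struct entity se = { .vlag = -5, .on_rq = true, .sched_delayed = false };

	/* The task sleeps while ineligible: the dequeue is deferred. */
	printf("really dequeued: %d\n", dequeue_entity_model(&se, true));

	/* It keeps competing until its negative lag is burnt off ... */
	se.vlag = 0;

	/* ... and is dequeued for real when picked again. */
	pick_entity_model(&se);
	printf("on_rq: %d, sched_delayed: %d\n", se.on_rq, se.sched_delayed);
	return 0;
}

The sketch compresses into pick_entity_model() what the patch spreads
across dequeue_entities() and the pick path, but the invariant is the
same: a sched_delayed entity stays on the runqueue until it is picked
again, so its negative lag decays through competition rather than being
discarded instantly or carried around forever.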