From nobody Wed Oct 8 04:11:07 2025
Message-ID: <20250702121158.350561696@infradead.org>
Date: Wed, 02 Jul 2025 13:49:25 +0200
From: Peter Zijlstra
To: mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org,
 dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
 mgorman@suse.de, vschneid@redhat.com, clm@meta.com
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org, Johannes Weiner
Subject: [PATCH v2 01/12] sched/psi: Optimize psi_group_change() cpu_clock() usage
References: <20250702114924.091581796@infradead.org>

Dietmar reported that commit 3840cbe24cf0 ("sched: psi: fix bogus
pressure spikes from aggregation race") caused a regression for him on a
high context switch rate benchmark (schbench) due to the now repeating
cpu_clock() calls.

In particular the problem is that get_recent_times() will extrapolate
the current state to 'now'. But if an update uses a timestamp from
before the start of the update, it is possible to get two reads with
inconsistent results.
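As an aside, the seqcount read/retry pattern this all relies on can be
sketched in plain userspace C (hypothetical names; the kernel's real
primitives are seqcount_t, read_seqcount_begin() and
read_seqcount_retry()):

```c
#include <assert.h>
#include <stdatomic.h>

/* Minimal userspace sketch of a seqcount. An odd sequence number means
 * a writer is in progress; readers retry if they started in, or
 * overlapped with, a write section. */
typedef struct { atomic_uint seq; } sketch_seqcount_t;

static void sketch_write_begin(sketch_seqcount_t *s)
{
        atomic_fetch_add_explicit(&s->seq, 1, memory_order_release); /* -> odd */
}

static void sketch_write_end(sketch_seqcount_t *s)
{
        atomic_fetch_add_explicit(&s->seq, 1, memory_order_release); /* -> even */
}

static unsigned sketch_read_begin(sketch_seqcount_t *s)
{
        return atomic_load_explicit(&s->seq, memory_order_acquire);
}

static int sketch_read_retry(sketch_seqcount_t *s, unsigned seq)
{
        /* retry if a write was in flight at read_begin(), or happened since */
        return (seq & 1) ||
               atomic_load_explicit(&s->seq, memory_order_acquire) != seq;
}
```

An even sequence number means no writer is active; everything read
between a successful begin/retry pair forms one coherent snapshot.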
It is effectively back-dating an update. (Note that this all hard-relies
on the clock being synchronized across CPUs -- if this is not the case,
all bets are off.)

Combined with the fact that there are per-group-per-cpu seqcounts, the
commit in question pushed the clock read into the group iteration,
causing tree-depth cpu_clock() calls per update. On architectures where
cpu_clock() has appreciable overhead, this hurts.

Instead move to a per-cpu seqcount, which allows us to have a single
clock read for all group updates, increasing internal consistency and
lowering update overhead. This comes at the cost of a longer update side
(proportional to the tree depth) which can cause the read side to retry
more often.

Fixes: 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race")
Reported-by: Dietmar Eggemann
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Johannes Weiner
Tested-by: Dietmar Eggemann
Link: https://lkml.kernel.org/20250522084844.GC31726@noisy.programming.kicks-ass.net
Reported-by: Chris Mason
Suggested-by: Beata Michalska
Tested-by: K Prateek Nayak
Tested-by: Srikanth Aithal
---
 include/linux/psi_types.h |    6 --
 kernel/sched/psi.c        |  121 +++++++++++++++++++++++++----------------
 2 files changed, 68 insertions(+), 59 deletions(-)

--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -84,11 +84,9 @@ enum psi_aggregators {
 struct psi_group_cpu {
 	/* 1st cacheline updated by the scheduler */
 
-	/* Aggregator needs to know of concurrent changes */
-	seqcount_t seq ____cacheline_aligned_in_smp;
-
 	/* States of the tasks belonging to this group */
-	unsigned int tasks[NR_PSI_TASK_COUNTS];
+	unsigned int tasks[NR_PSI_TASK_COUNTS]
+			____cacheline_aligned_in_smp;
 
 	/* Aggregate pressure state derived from the tasks */
 	u32 state_mask;
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -176,6 +176,28 @@ struct psi_group psi_system = {
 	.pcpu = &system_group_pcpu,
 };
 
+static DEFINE_PER_CPU(seqcount_t, psi_seq);
+
+static inline void psi_write_begin(int cpu)
+{
+	write_seqcount_begin(per_cpu_ptr(&psi_seq, cpu));
+}
+
+static inline void psi_write_end(int cpu)
+{
+	write_seqcount_end(per_cpu_ptr(&psi_seq, cpu));
+}
+
+static inline u32 psi_read_begin(int cpu)
+{
+	return read_seqcount_begin(per_cpu_ptr(&psi_seq, cpu));
+}
+
+static inline bool psi_read_retry(int cpu, u32 seq)
+{
+	return read_seqcount_retry(per_cpu_ptr(&psi_seq, cpu), seq);
+}
+
 static void psi_avgs_work(struct work_struct *work);
 
 static void poll_timer_fn(struct timer_list *t);
@@ -186,7 +208,7 @@ static void group_init(struct psi_group
 
 	group->enabled = true;
 	for_each_possible_cpu(cpu)
-		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
+		seqcount_init(per_cpu_ptr(&psi_seq, cpu));
 	group->avg_last_update = sched_clock();
 	group->avg_next_update = group->avg_last_update + psi_period;
 	mutex_init(&group->avgs_lock);
@@ -266,14 +288,14 @@ static void get_recent_times(struct psi_
 
 	/* Snapshot a coherent view of the CPU state */
 	do {
-		seq = read_seqcount_begin(&groupc->seq);
+		seq = psi_read_begin(cpu);
 		now = cpu_clock(cpu);
 		memcpy(times, groupc->times, sizeof(groupc->times));
 		state_mask = groupc->state_mask;
 		state_start = groupc->state_start;
 		if (cpu == current_cpu)
 			memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
-	} while (read_seqcount_retry(&groupc->seq, seq));
+	} while (psi_read_retry(cpu, seq));
 
 	/* Calculate state time deltas against the previous snapshot */
 	for (s = 0; s < NR_PSI_STATES; s++) {
@@ -772,31 +794,21 @@ static void record_times(struct psi_grou
 	groupc->times[PSI_NONIDLE] += delta;
 }
 
+#define for_each_group(iter, group) \
+	for (typeof(group) iter = group; iter; iter = iter->parent)
+
 static void psi_group_change(struct psi_group *group, int cpu,
 			     unsigned int clear, unsigned int set,
-			     bool wake_clock)
+			     u64 now, bool wake_clock)
 {
 	struct psi_group_cpu *groupc;
 	unsigned int t, m;
 	u32 state_mask;
-	u64 now;
 
 	lockdep_assert_rq_held(cpu_rq(cpu));
 	groupc = per_cpu_ptr(group->pcpu, cpu);
 
 	/*
-	 * First we update the task counts according to the state
-	 * change requested through the @clear and @set bits.
-	 *
-	 * Then if the cgroup PSI stats accounting enabled, we
-	 * assess the aggregate resource states this CPU's tasks
-	 * have been in since the last change, and account any
-	 * SOME and FULL time these may have resulted in.
-	 */
-	write_seqcount_begin(&groupc->seq);
-	now = cpu_clock(cpu);
-
-	/*
 	 * Start with TSK_ONCPU, which doesn't have a corresponding
 	 * task count - it's just a boolean flag directly encoded in
 	 * the state mask. Clear, set, or carry the current state if
@@ -847,7 +859,6 @@ static void psi_group_change(struct psi_
 
 		groupc->state_mask = state_mask;
 
-		write_seqcount_end(&groupc->seq);
 		return;
 	}
 
@@ -868,8 +879,6 @@ static void psi_group_change(struct psi_
 
 	groupc->state_mask = state_mask;
 
-	write_seqcount_end(&groupc->seq);
-
 	if (state_mask & group->rtpoll_states)
 		psi_schedule_rtpoll_work(group, 1, false);
 
@@ -904,24 +913,29 @@ static void psi_flags_change(struct task
 void psi_task_change(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
-	struct psi_group *group;
+	u64 now;
 
 	if (!task->pid)
 		return;
 
 	psi_flags_change(task, clear, set);
 
-	group = task_psi_group(task);
-	do {
-		psi_group_change(group, cpu, clear, set, true);
-	} while ((group = group->parent));
+	psi_write_begin(cpu);
+	now = cpu_clock(cpu);
+	for_each_group(group, task_psi_group(task))
+		psi_group_change(group, cpu, clear, set, now, true);
+	psi_write_end(cpu);
 }
 
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep)
 {
-	struct psi_group *group, *common = NULL;
+	struct psi_group *common = NULL;
 	int cpu = task_cpu(prev);
+	u64 now;
+
+	psi_write_begin(cpu);
+	now = cpu_clock(cpu);
 
 	if (next->pid) {
 		psi_flags_change(next, 0, TSK_ONCPU);
@@ -930,16 +944,15 @@ void psi_task_switch(struct task_struct
 		 * ancestors with @prev, those will already have @prev's
 		 * TSK_ONCPU bit set, and we can stop the iteration there.
 		 */
-		group = task_psi_group(next);
-		do {
-			if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
-			    PSI_ONCPU) {
+		for_each_group(group, task_psi_group(next)) {
+			struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
+
+			if (groupc->state_mask & PSI_ONCPU) {
 				common = group;
 				break;
 			}
-
-			psi_group_change(group, cpu, 0, TSK_ONCPU, true);
-		} while ((group = group->parent));
+			psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
+		}
 	}
 
 	if (prev->pid) {
@@ -972,12 +985,11 @@ void psi_task_switch(struct task_struct
 
 		psi_flags_change(prev, clear, set);
 
-		group = task_psi_group(prev);
-		do {
+		for_each_group(group, task_psi_group(prev)) {
 			if (group == common)
 				break;
-			psi_group_change(group, cpu, clear, set, wake_clock);
-		} while ((group = group->parent));
+			psi_group_change(group, cpu, clear, set, now, wake_clock);
+		}
 
 		/*
		 * TSK_ONCPU is handled up to the common ancestor. If there are
@@ -987,20 +999,21 @@ void psi_task_switch(struct task_struct
 		 */
 		if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
 			clear &= ~TSK_ONCPU;
-			for (; group; group = group->parent)
-				psi_group_change(group, cpu, clear, set, wake_clock);
+			for_each_group(group, common)
+				psi_group_change(group, cpu, clear, set, now, wake_clock);
 		}
 	}
+	psi_write_end(cpu);
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev)
 {
 	int cpu = task_cpu(curr);
-	struct psi_group *group;
 	struct psi_group_cpu *groupc;
 	s64 delta;
 	u64 irq;
+	u64 now;
 
 	if (static_branch_likely(&psi_disabled) || !irqtime_enabled())
 		return;
@@ -1009,8 +1022,7 @@ void psi_account_irqtime(struct rq *rq,
 		return;
 
 	lockdep_assert_rq_held(rq);
-	group = task_psi_group(curr);
-	if (prev && task_psi_group(prev) == group)
+	if (prev && task_psi_group(prev) == task_psi_group(curr))
 		return;
 
 	irq = irq_time_read(cpu);
@@ -1019,25 +1031,22 @@ void psi_account_irqtime(struct rq *rq,
 		return;
 	rq->psi_irq_time = irq;
 
-	do {
-		u64 now;
+	psi_write_begin(cpu);
+	now = cpu_clock(cpu);
 
+	for_each_group(group, task_psi_group(curr)) {
 		if (!group->enabled)
 			continue;
 
 		groupc = per_cpu_ptr(group->pcpu, cpu);
 
-		write_seqcount_begin(&groupc->seq);
-		now = cpu_clock(cpu);
-
 		record_times(groupc, now);
 		groupc->times[PSI_IRQ_FULL] += delta;
 
-		write_seqcount_end(&groupc->seq);
-
 		if (group->rtpoll_states & (1 << PSI_IRQ_FULL))
 			psi_schedule_rtpoll_work(group, 1, false);
-	} while ((group = group->parent));
+	}
+	psi_write_end(cpu);
 }
 #endif /* CONFIG_IRQ_TIME_ACCOUNTING */
 
@@ -1225,12 +1234,14 @@ void psi_cgroup_restart(struct psi_group
 		return;
 
 	for_each_possible_cpu(cpu) {
-		struct rq *rq = cpu_rq(cpu);
-		struct rq_flags rf;
+		u64 now;
 
-		rq_lock_irq(rq, &rf);
-		psi_group_change(group, cpu, 0, 0, true);
-		rq_unlock_irq(rq, &rf);
+		guard(rq_lock_irq)(cpu_rq(cpu));
+
+		psi_write_begin(cpu);
+		now = cpu_clock(cpu);
+		psi_group_change(group, cpu, 0, 0, now, true);
+		psi_write_end(cpu);
 	}
 }
 #endif /* CONFIG_CGROUPS */

From nobody Wed Oct 8 04:11:07 2025
Message-ID: <20250702121158.465086194@infradead.org>
Date: Wed, 02 Jul 2025 13:49:26 +0200
From: Peter Zijlstra
To: mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org,
 dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
 mgorman@suse.de, vschneid@redhat.com, clm@meta.com
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org
Subject: [PATCH v2 02/12] sched/deadline: Less aggressive dl_server handling
References: <20250702114924.091581796@infradead.org>

Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default
bandwidth control") caused a significant dip in his favourite benchmark
of the day. Simply disabling dl_server cured things.

His workload hammers the 0->1, 1->0 transitions, and the
dl_server_{start,stop}() overhead kills it -- fairly obviously a bad
idea in hindsight and all that.
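The fix below stops the server only once a whole period has passed
without a fair task. That hysteresis can be sketched as follows
(simplified, hypothetical struct; the kernel operates on
sched_dl_entity and its dl_server_idle bit):

```c
#include <assert.h>

/* Two-state hysteresis: the periodic timer calls server_tick_idle()
 * when it finds no tasks; any fair-task activity calls
 * server_mark_active(). The server is only stopped when a full period
 * elapsed with no activity at all. */
struct sketch_server {
        int active; /* mirrors dl_server_active */
        int idle;   /* mirrors dl_server_idle */
};

static void server_mark_active(struct sketch_server *s)
{
        s->idle = 0; /* activity this period: don't stop */
}

/* returns 1 if the server was stopped */
static int server_tick_idle(struct sketch_server *s)
{
        if (!s->active)
                return 0;
        if (s->idle) {          /* second consecutive idle period: stop */
                s->active = 0;
                return 1;
        }
        s->idle = 1;            /* arm: stop next period unless cleared */
        return 0;
}
```

A workload that toggles 0->1 queued tasks faster than the period keeps
clearing the idle flag, so start/stop never fires on the hot path.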
Change things around to only disable the dl_server when there has not
been a fair task around for a whole period. Since the default period is
1 second, this ensures the benchmark never trips this, overhead gone.

Fixes: 557a6bfc662c ("sched/fair: Add trivial fair server")
Reported-by: Chris Mason
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20250520101727.507378961@infradead.org
Acked-by: Juri Lelli
Acked-by: Mel Gorman
Reviewed-by: Juri Lelli
---
 include/linux/sched.h   |  1 +
 kernel/sched/deadline.c | 25 ++++++++++++++++++++++---
 kernel/sched/fair.c     |  9 ---------
 3 files changed, 23 insertions(+), 12 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -701,6 +701,7 @@ struct sched_dl_entity {
 	unsigned int			dl_defer	  : 1;
 	unsigned int			dl_defer_armed	  : 1;
 	unsigned int			dl_defer_running  : 1;
+	unsigned int			dl_server_idle    : 1;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1215,6 +1215,8 @@ static void __push_dl_task(struct rq *rq
 /* a defer timer will not be reset if the runtime consumed was < dl_server_min_res */
 static const u64 dl_server_min_res = 1 * NSEC_PER_MSEC;
 
+static bool dl_server_stopped(struct sched_dl_entity *dl_se);
+
 static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_dl_entity *dl_se)
 {
 	struct rq *rq = rq_of_dl_se(dl_se);
@@ -1234,6 +1236,7 @@ static enum hrtimer_restart dl_server_ti
 
 	if (!dl_se->server_has_tasks(dl_se)) {
 		replenish_dl_entity(dl_se);
+		dl_server_stopped(dl_se);
 		return HRTIMER_NORESTART;
 	}
 
@@ -1639,8 +1642,10 @@ void dl_server_update_idle_time(struct r
 void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
 {
 	/* 0 runtime = fair server disabled */
-	if (dl_se->dl_runtime)
+	if (dl_se->dl_runtime) {
+		dl_se->dl_server_idle = 0;
 		update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
+	}
 }
 
 void dl_server_start(struct sched_dl_entity *dl_se)
@@ -1663,7 +1668,7 @@ void dl_server_start(struct sched_dl_ent
 		setup_new_dl_entity(dl_se);
 	}
 
-	if (!dl_se->dl_runtime)
+	if (!dl_se->dl_runtime || dl_se->dl_server_active)
 		return;
 
 	dl_se->dl_server_active = 1;
@@ -1684,6 +1689,20 @@ void dl_server_stop(struct sched_dl_enti
 	dl_se->dl_server_active = 0;
 }
 
+static bool dl_server_stopped(struct sched_dl_entity *dl_se)
+{
+	if (!dl_se->dl_server_active)
+		return false;
+
+	if (dl_se->dl_server_idle) {
+		dl_server_stop(dl_se);
+		return true;
+	}
+
+	dl_se->dl_server_idle = 1;
+	return false;
+}
+
 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick_task)
@@ -2435,7 +2454,7 @@ static struct task_struct *__pick_task_d
 		if (dl_server(dl_se)) {
 			p = dl_se->server_pick_task(dl_se);
 			if (!p) {
-				if (dl_server_active(dl_se)) {
+				if (!dl_server_stopped(dl_se)) {
 					dl_se->dl_yielded = 1;
 					update_curr_dl_se(rq, dl_se, 0);
 				}
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5879,7 +5879,6 @@ static bool throttle_cfs_rq(struct cfs_r
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
 	long queued_delta, runnable_delta, idle_delta, dequeue = 1;
-	long rq_h_nr_queued = rq->cfs.h_nr_queued;
 
 	raw_spin_lock(&cfs_b->lock);
 	/* This will start the period timer if necessary */
@@ -5963,10 +5962,6 @@ static bool throttle_cfs_rq(struct cfs_r
 
 	/* At this point se is NULL and we are at root level*/
 	sub_nr_running(rq, queued_delta);
-
-	/* Stop the fair server if throttling resulted in no runnable tasks */
-	if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
-		dl_server_stop(&rq->fair_server);
 done:
 	/*
 	 * Note: distribution will already see us throttled via the
@@ -7060,7 +7055,6 @@ static void set_next_buddy(struct sched_
 static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 {
 	bool was_sched_idle = sched_idle_rq(rq);
-	int rq_h_nr_queued = rq->cfs.h_nr_queued;
 	bool task_sleep = flags & DEQUEUE_SLEEP;
 	bool task_delayed = flags & DEQUEUE_DELAYED;
 	struct task_struct *p = NULL;
@@ -7144,9 +7138,6 @@ static int dequeue_entities(struct rq *r
 
 	sub_nr_running(rq, h_nr_queued);
 
-	if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
-		dl_server_stop(&rq->fair_server);
-
 	/* balance early to pull high priority tasks */
 	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
 		rq->next_balance = jiffies;

From nobody Wed Oct 8 04:11:07 2025
Message-ID: <20250702121158.582321755@infradead.org>
Date: Wed, 02 Jul 2025 13:49:27 +0200
From: Peter Zijlstra
To: mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org,
 dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
 mgorman@suse.de, vschneid@redhat.com, clm@meta.com
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org
Subject: [PATCH v2 03/12] sched: Optimize ttwu() / select_task_rq()
References: <20250702114924.091581796@infradead.org>

Optimize ttwu() by pushing select_idle_siblings() up above waiting for
on_cpu().
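A rough userspace illustration of the reordering (illustrative
stand-ins, not the kernel API):

```c
#include <assert.h>
#include <stdatomic.h>

/* placeholder for select_task_rq() / select_idle_sibling() */
static int select_cpu(int hint)
{
        return hint;
}

/* before: wait for p->on_cpu to drop, then search -- the search time
 * is fully serialized behind the wait */
static int wake_old(atomic_int *on_cpu, int hint)
{
        while (atomic_load(on_cpu))
                ; /* spin, doing nothing useful */
        return select_cpu(hint);
}

/* after: search first, so the (possibly expensive) idle-CPU search
 * overlaps the time otherwise spent spinning on on_cpu */
static int wake_new(atomic_int *on_cpu, int hint)
{
        int cpu = select_cpu(hint);
        while (atomic_load(on_cpu))
                ;
        return cpu;
}
```

Both orderings pick the same CPU; the point is purely to overlap the
search with the wait.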
This allows making use of the cycles otherwise spent waiting to search
for an idle CPU.

One little detail is that since the task we're looking for an idle CPU
for might still be on the CPU, that CPU won't report as running the
idle task, and thus won't find its own CPU idle, even when it is.

To compensate, remove the 'rq->curr == rq->idle' condition from
idle_cpu() -- it doesn't really make sense anyway.

Additionally, Chris found (concurrently) that perf-c2c reported that
test as being a cache-miss monster.

Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20250520101727.620602459@infradead.org
Acked-by: Mel Gorman
Reviewed-by: Vincent Guittot
---
 kernel/sched/core.c     | 5 +++--
 kernel/sched/syscalls.c | 3 ---
 2 files changed, 3 insertions(+), 5 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3593,7 +3593,7 @@ int select_task_rq(struct task_struct *p
 		cpu = p->sched_class->select_task_rq(p, cpu, *wake_flags);
 		*wake_flags |= WF_RQ_SELECTED;
 	} else {
-		cpu = cpumask_any(p->cpus_ptr);
+		cpu = task_cpu(p);
 	}
 
 	/*
@@ -4309,6 +4309,8 @@ int try_to_wake_up(struct task_struct *p
 		    ttwu_queue_wakelist(p, task_cpu(p), wake_flags))
 			break;
 
+		cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
+
 		/*
 		 * If the owning (remote) CPU is still in the middle of schedule() with
 		 * this task as prev, wait until it's done referencing the task.
@@ -4320,7 +4322,6 @@ int try_to_wake_up(struct task_struct *p
 		 */
 		smp_cond_load_acquire(&p->on_cpu, !VAL);
 
-		cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
 		if (task_cpu(p) != cpu) {
 			if (p->in_iowait) {
 				delayacct_blkio_end(p);
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -203,9 +203,6 @@ int idle_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
-	if (rq->curr != rq->idle)
-		return 0;
-
 	if (rq->nr_running)
 		return 0;

From nobody Wed Oct 8 04:11:07 2025
Message-ID: <20250702121158.703344062@infradead.org>
Date: Wed, 02 Jul 2025 13:49:28 +0200
From: Peter Zijlstra
To: mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org,
 dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
 mgorman@suse.de, vschneid@redhat.com, clm@meta.com
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org
Subject: [PATCH v2 04/12] sched: Use lock guard in ttwu_runnable()
References: <20250702114924.091581796@infradead.org>

Reflow and get rid of the 'ret' variable.
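The scope-based guard the patch uses can be approximated in userspace
C; the kernel's CLASS()/DEFINE_LOCK_GUARD_1 machinery ultimately rests
on a variable with a cleanup handler, so every return path releases the
lock automatically (names and the plain-int "lock" below are
illustrative, not kernel API):

```c
#include <assert.h>

static int lock_held;

struct guard { int *lock; };

static void guard_release(struct guard *g)
{
        *g->lock = 0;   /* runs at scope exit, on every return path */
}

/* GCC/Clang cleanup attribute: the release handler fires when the
 * guard variable goes out of scope */
#define GUARD(name, l)                                                  \
        struct guard name __attribute__((cleanup(guard_release))) = { l }; \
        *(name).lock = 1

/* shaped like the reflowed ttwu_runnable(): early return, no explicit
 * unlock needed */
static int guarded_op(int queued)
{
        GUARD(g, &lock_held);
        assert(lock_held);      /* held inside the scope */
        if (!queued)
                return 0;       /* guard drops the lock here */
        return 1;               /* and here */
}
```

This is why the kernel version can simply `return 0` from the middle of
the function without leaking the runqueue lock.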
Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20250520101727.732703833@infradead.org Acked-by: Mel Gorman Reviewed-by: Vincent Guittot --- kernel/sched/core.c | 36 ++++++++++++++++-------------------- kernel/sched/sched.h | 5 +++++ 2 files changed, 21 insertions(+), 20 deletions(-) --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3754,28 +3754,24 @@ ttwu_do_activate(struct rq *rq, struct t */ static int ttwu_runnable(struct task_struct *p, int wake_flags) { - struct rq_flags rf; - struct rq *rq; - int ret =3D 0; + CLASS(__task_rq_lock, guard)(p); + struct rq *rq =3D guard.rq; =20 - rq =3D __task_rq_lock(p, &rf); - if (task_on_rq_queued(p)) { - update_rq_clock(rq); - if (p->se.sched_delayed) - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED); - if (!task_on_cpu(rq, p)) { - /* - * When on_rq && !on_cpu the task is preempted, see if - * it should preempt the task that is current now. - */ - wakeup_preempt(rq, p, wake_flags); - } - ttwu_do_wakeup(p); - ret =3D 1; - } - __task_rq_unlock(rq, &rf); + if (!task_on_rq_queued(p)) + return 0; =20 - return ret; + update_rq_clock(rq); + if (p->se.sched_delayed) + enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED); + if (!task_on_cpu(rq, p)) { + /* + * When on_rq && !on_cpu the task is preempted, see if + * it should preempt the task that is current now. 
+ */ + wakeup_preempt(rq, p, wake_flags); + } + ttwu_do_wakeup(p); + return 1; } =20 void sched_ttwu_pending(void *arg) --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1806,6 +1806,11 @@ task_rq_unlock(struct rq *rq, struct tas raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags); } =20 +DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct, + _T->rq =3D __task_rq_lock(_T->lock, &_T->rf), + __task_rq_unlock(_T->rq, &_T->rf), + struct rq *rq; struct rq_flags rf) + DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct, _T->rq =3D task_rq_lock(_T->lock, &_T->rf), task_rq_unlock(_T->rq, _T->lock, &_T->rf), From nobody Wed Oct 8 04:11:07 2025 Received: from casper.infradead.org (casper.infradead.org [90.155.50.34]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 52A922741DA for ; Wed, 2 Jul 2025 12:13:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=90.155.50.34 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1751458382; cv=none; b=G2h+Tj+p3R4tZ39J3RZ5whrxvqRHUBTxoLqLm4hy+v0XwDsSbTQegNpSPUn3YGlhahXwOtFrNwr9LPkl0AEMHgVTlEztTMlreLThXmsAwcztRqrRiKNlQzPJwLB3+razjOwUVnTONA+krP5XZBdB3w5w1VPmt/tnZy5pYgiPVgo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1751458382; c=relaxed/simple; bh=p7DxqYTTk43dYWRzOSrkfoujbipH4msjgVgZ8LgoCWc=; h=Message-ID:Date:From:To:Cc:Subject:References:MIME-Version: Content-Type; b=e6btn4nTtH5/sBz+7z+iHMDsZn4zKrNfP/xafBoDga6Yiu+ay1xZvGmQZNX/NmWdKptJ7fJkf4bQPHZEI37kvEKGZXx7oJsICZIEoOeAAWVQqobOI+1JjzzvgBqWUePsO/roxfdQiQmYE8MEGMCRSXcdMhZrMnMEuHOydWUhqlg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org; spf=none smtp.mailfrom=infradead.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b=EL706pbK; arc=none smtp.client-ip=90.155.50.34 
Message-ID: <20250702121158.817814031@infradead.org>
Date: Wed, 02 Jul 2025 13:49:29 +0200
From: Peter Zijlstra
To: mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org,
 dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
 mgorman@suse.de, vschneid@redhat.com, clm@meta.com
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org
Subject: [PATCH v2 05/12] sched: Add ttwu_queue controls
References: <20250702114924.091581796@infradead.org>
There are two (soon to be three) callers of ttwu_queue_wakelist();
distinguish them with their own WF_ flags and add some control knobs.

Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20250520101727.874587738@infradead.org
---
 kernel/sched/core.c     | 22 ++++++++++++----------
 kernel/sched/features.h |  2 ++
 kernel/sched/sched.h    |  2 ++
 3 files changed, 16 insertions(+), 10 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3888,7 +3888,7 @@ bool cpus_share_resources(int this_cpu,
 	return per_cpu(sd_share_id, this_cpu) == per_cpu(sd_share_id, that_cpu);
 }
 
-static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
+static inline bool ttwu_queue_cond(struct task_struct *p, int cpu, bool def)
 {
 	/* See SCX_OPS_ALLOW_QUEUED_WAKEUP. */
 	if (!scx_allow_ttwu_queue(p))
@@ -3929,18 +3929,19 @@ static inline bool ttwu_queue_cond(struc
 	if (!cpu_rq(cpu)->nr_running)
 		return true;
 
-	return false;
+	return def;
 }
 
 static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
-	if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(p, cpu)) {
-		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-		__ttwu_queue_wakelist(p, cpu, wake_flags);
-		return true;
-	}
+	bool def = sched_feat(TTWU_QUEUE_DEFAULT);
+
+	if (!ttwu_queue_cond(p, cpu, def))
+		return false;
 
-	return false;
+	sched_clock_cpu(cpu); /* Sync clocks across CPUs */
+	__ttwu_queue_wakelist(p, cpu, wake_flags);
+	return true;
 }
 
 static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
@@ -3948,7 +3949,7 @@ static void ttwu_queue(struct task_struc
 	struct rq *rq = cpu_rq(cpu);
 	struct rq_flags rf;
 
-	if (ttwu_queue_wakelist(p, cpu, wake_flags))
+	if (sched_feat(TTWU_QUEUE) && ttwu_queue_wakelist(p, cpu, wake_flags))
 		return;
 
 	rq_lock(rq, &rf);
@@ -4251,7 +4252,8 @@ int try_to_wake_up(struct task_struct *p
 	 * scheduling.
 	 */
 	if (smp_load_acquire(&p->on_cpu) &&
-	    ttwu_queue_wakelist(p, task_cpu(p), wake_flags))
+	    sched_feat(TTWU_QUEUE_ON_CPU) &&
+	    ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
 		break;
 
 	cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -81,6 +81,8 @@ SCHED_FEAT(TTWU_QUEUE, false)
  */
 SCHED_FEAT(TTWU_QUEUE, true)
 #endif
+SCHED_FEAT(TTWU_QUEUE_ON_CPU, true)
+SCHED_FEAT(TTWU_QUEUE_DEFAULT, false)
 
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2279,6 +2279,8 @@ static inline int task_on_rq_migrating(s
 #define WF_CURRENT_CPU	0x40 /* Prefer to move the wakee to the current CPU. */
 #define WF_RQ_SELECTED	0x80 /* ->select_task_rq() was called */
 
+#define WF_ON_CPU	0x0100
+
 static_assert(WF_EXEC == SD_BALANCE_EXEC);
 static_assert(WF_FORK == SD_BALANCE_FORK);
 static_assert(WF_TTWU == SD_BALANCE_WAKE);

From nobody Wed Oct 8 04:11:07 2025
Message-ID: <20250702121158.932926181@infradead.org>
Date: Wed, 02 Jul 2025 13:49:30 +0200
From: Peter Zijlstra
To: mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org,
 dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
 mgorman@suse.de, vschneid@redhat.com, clm@meta.com
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org
Subject: [PATCH v2 06/12] sched: Introduce ttwu_do_migrate()
References: <20250702114924.091581796@infradead.org>

Split out the migration-related bits into their own function for later
re-use.

Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Vincent Guittot
---
 kernel/sched/core.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3774,6 +3774,21 @@ static int ttwu_runnable(struct task_str
 	return 1;
 }
 
+static inline bool ttwu_do_migrate(struct task_struct *p, int cpu)
+{
+	if (task_cpu(p) == cpu)
+		return false;
+
+	if (p->in_iowait) {
+		delayacct_blkio_end(p);
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
+	psi_ttwu_dequeue(p);
+	set_task_cpu(p, cpu);
+	return true;
+}
+
 void sched_ttwu_pending(void *arg)
 {
 	struct llist_node *llist = arg;
@@ -4268,17 +4283,8 @@ int try_to_wake_up(struct task_struct *p
 	 * their previous state and preserve Program Order.
 	 */
 	smp_cond_load_acquire(&p->on_cpu, !VAL);
-
-	if (task_cpu(p) != cpu) {
-		if (p->in_iowait) {
-			delayacct_blkio_end(p);
-			atomic_dec(&task_rq(p)->nr_iowait);
-		}
-
+	if (ttwu_do_migrate(p, cpu))
 		wake_flags |= WF_MIGRATED;
-		psi_ttwu_dequeue(p);
-		set_task_cpu(p, cpu);
-	}
 
 	ttwu_queue(p, cpu, wake_flags);
 }

From nobody Wed Oct 8 04:11:07 2025
Message-ID: <20250702121159.050144163@infradead.org>
Date: Wed, 02 Jul 2025 13:49:31 +0200
From: Peter Zijlstra
To: mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org,
 dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
 mgorman@suse.de, vschneid@redhat.com, clm@meta.com
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org
Subject: [PATCH v2 07/12] psi: Split psi_ttwu_dequeue()
References: <20250702114924.091581796@infradead.org>

Currently psi_ttwu_dequeue() is called while holding p->pi_lock and takes
rq->lock. Split the function in preparation for calling ttwu_do_migrate()
while already holding rq->lock.
Signed-off-by: Peter Zijlstra (Intel)
---
 kernel/sched/core.c  | 18 ++++++++++++++----
 kernel/sched/stats.h | 24 +++++++++++++-----------
 2 files changed, 27 insertions(+), 15 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3774,17 +3774,27 @@ static int ttwu_runnable(struct task_str
 	return 1;
 }
 
-static inline bool ttwu_do_migrate(struct task_struct *p, int cpu)
+static inline bool ttwu_do_migrate(struct rq *rq, struct task_struct *p, int cpu)
 {
+	struct rq *p_rq = rq ? : task_rq(p);
+
 	if (task_cpu(p) == cpu)
 		return false;
 
 	if (p->in_iowait) {
 		delayacct_blkio_end(p);
-		atomic_dec(&task_rq(p)->nr_iowait);
+		atomic_dec(&p_rq->nr_iowait);
 	}
 
-	psi_ttwu_dequeue(p);
+	if (psi_ttwu_need_dequeue(p)) {
+		if (rq) {
+			lockdep_assert(task_rq(p) == rq);
+			__psi_ttwu_dequeue(p);
+		} else {
+			guard(__task_rq_lock)(p);
+			__psi_ttwu_dequeue(p);
+		}
+	}
 	set_task_cpu(p, cpu);
 	return true;
 }
@@ -4283,7 +4293,7 @@ int try_to_wake_up(struct task_struct *p
 	 * their previous state and preserve Program Order.
 	 */
 	smp_cond_load_acquire(&p->on_cpu, !VAL);
-	if (ttwu_do_migrate(p, cpu))
+	if (ttwu_do_migrate(NULL, p, cpu))
 		wake_flags |= WF_MIGRATED;
 
 	ttwu_queue(p, cpu, wake_flags);
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -191,23 +191,24 @@ static inline void psi_dequeue(struct ta
 	psi_task_change(p, p->psi_flags, 0);
 }
 
-static inline void psi_ttwu_dequeue(struct task_struct *p)
+static inline bool psi_ttwu_need_dequeue(struct task_struct *p)
 {
 	if (static_branch_likely(&psi_disabled))
-		return;
+		return false;
 	/*
 	 * Is the task being migrated during a wakeup? Make sure to
 	 * deregister its sleep-persistent psi states from the old
 	 * queue, and let psi_enqueue() know it has to requeue.
 	 */
-	if (unlikely(p->psi_flags)) {
-		struct rq_flags rf;
-		struct rq *rq;
-
-		rq = __task_rq_lock(p, &rf);
-		psi_task_change(p, p->psi_flags, 0);
-		__task_rq_unlock(rq, &rf);
-	}
+	if (likely(!p->psi_flags))
+		return false;
+
+	return true;
+}
+
+static inline void __psi_ttwu_dequeue(struct task_struct *p)
+{
+	psi_task_change(p, p->psi_flags, 0);
 }
 
 static inline void psi_sched_switch(struct task_struct *prev,
@@ -223,7 +224,8 @@ static inline void psi_sched_switch(stru
 #else /* !CONFIG_PSI: */
 static inline void psi_enqueue(struct task_struct *p, bool migrate) {}
 static inline void psi_dequeue(struct task_struct *p, bool migrate) {}
-static inline void psi_ttwu_dequeue(struct task_struct *p) {}
+static inline bool psi_ttwu_need_dequeue(struct task_struct *p) { return false; }
+static inline void __psi_ttwu_dequeue(struct task_struct *p) {}
 static inline void psi_sched_switch(struct task_struct *prev,
 				    struct task_struct *next,
 				    bool sleep) {}

From nobody Wed Oct 8 04:11:07 2025
Message-ID: <20250702121159.172688305@infradead.org>
Date: Wed, 02 Jul 2025 13:49:32 +0200
From: Peter Zijlstra
To: mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org,
 dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
 mgorman@suse.de, vschneid@redhat.com, clm@meta.com
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org
Subject: [PATCH v2 08/12] sched: Re-arrange __ttwu_queue_wakelist()
References: <20250702114924.091581796@infradead.org>

The relation between ttwu_queue_wakelist() and __ttwu_queue_wakelist()
is ill-defined -- probably because the former is the only caller of the
latter and it grew into an arbitrary subfunction.

Clean things up a little such that __ttwu_queue_wakelist() no longer
takes the wake_flags argument, making for a more sensible separation.

Signed-off-by: Peter Zijlstra (Intel)
---
 kernel/sched/core.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3848,11 +3848,11 @@ bool call_function_single_prep_ipi(int c
  * via sched_ttwu_wakeup() for activation so the wakee incurs the cost
  * of the wakeup instead of the waker.
  */
-static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
+static void __ttwu_queue_wakelist(struct task_struct *p, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
-	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
+	sched_clock_cpu(cpu); /* Sync clocks across CPUs */
 
 	WRITE_ONCE(rq->ttwu_pending, 1);
 #ifdef CONFIG_SMP
@@ -3954,8 +3954,9 @@ static bool ttwu_queue_wakelist(struct t
 	if (!ttwu_queue_cond(p, cpu, def))
 		return false;
 
-	sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-	__ttwu_queue_wakelist(p, cpu, wake_flags);
+	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
+
+	__ttwu_queue_wakelist(p, cpu);
 	return true;
 }

From nobody Wed Oct 8 04:11:07 2025
Message-ID: <20250702121159.287358119@infradead.org>
Date: Wed, 02 Jul 2025 13:49:33 +0200
From: Peter Zijlstra
To: mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org,
 dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
 mgorman@suse.de, vschneid@redhat.com, clm@meta.com
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org
Subject: [PATCH v2 09/12] sched: Clean up ttwu comments
References: <20250702114924.091581796@infradead.org>
Various changes have rendered these comments slightly out-of-date.

Signed-off-by: Peter Zijlstra (Intel)
---
 kernel/sched/core.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4276,8 +4276,8 @@ int try_to_wake_up(struct task_struct *p
 	 * __schedule(). See the comment for smp_mb__after_spinlock().
 	 *
 	 * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
-	 * schedule()'s deactivate_task() has 'happened' and p will no longer
-	 * care about it's own p->state. See the comment in __schedule().
+	 * schedule()'s try_to_block_task() has 'happened' and p will no longer
+	 * care about its own p->state. See the comment in try_to_block_task().
 	 */
 	smp_acquire__after_ctrl_dep();
 
@@ -6708,8 +6708,8 @@ static void __sched notrace __schedule(i
 	preempt = sched_mode == SM_PREEMPT;
 
 	/*
-	 * We must load prev->state once (task_struct::state is volatile), such
-	 * that we form a control dependency vs deactivate_task() below.
+	 * We must load prev->state once, such that we form a control
+	 * dependency vs try_to_block_task() below.
 	 */
 	prev_state = READ_ONCE(prev->__state);
 	if (sched_mode == SM_IDLE) {

From nobody Wed Oct 8 04:11:07 2025
Message-ID: <20250702121159.418420130@infradead.org>
Date: Wed, 02 Jul 2025 13:49:34 +0200
From: Peter Zijlstra
To: mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org,
 dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
 mgorman@suse.de, vschneid@redhat.com, clm@meta.com
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org
Subject: [PATCH v2 10/12] sched: Use lock guard in sched_ttwu_pending()
References: <20250702114924.091581796@infradead.org>

Signed-off-by: Peter Zijlstra (Intel)
---
 kernel/sched/core.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3827,22 +3827,26 @@ void sched_ttwu_pending(void *arg)
 	struct llist_node *llist = arg;
 	struct rq *rq = this_rq();
 	struct task_struct *p, *t;
-	struct rq_flags rf;
 
 	if (!llist)
 		return;
 
-	rq_lock_irqsave(rq, &rf);
+	CLASS(rq_lock_irqsave, guard)(rq);
 	update_rq_clock(rq);
 
 	llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
+		int wake_flags = WF_TTWU;
+
 		if (WARN_ON_ONCE(p->on_cpu))
 			smp_cond_load_acquire(&p->on_cpu, !VAL);
 
 		if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
 			set_task_cpu(p, cpu_of(rq));
 
-		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
+		if (p->sched_remote_wakeup)
+			wake_flags |= WF_MIGRATED;
+
+		ttwu_do_activate(rq, p, wake_flags, &guard.rf);
 	}
 
 	/*
@@ -3856,7 +3860,6 @@ void sched_ttwu_pending(void *arg)
 	 * Since now nr_running > 0, idle_cpu() will always get correct result.
 	 */
 	WRITE_ONCE(rq->ttwu_pending, 0);
-	rq_unlock_irqrestore(rq, &rf);
 }
 
 /*

From nobody Wed Oct 8 04:11:07 2025
Message-ID: <20250702121159.535226098@infradead.org>
Date: Wed, 02 Jul 2025 13:49:35 +0200
From: Peter Zijlstra
To: mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org,
 dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
 mgorman@suse.de, vschneid@redhat.com, clm@meta.com
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org
Subject: [PATCH v2 11/12] sched: Change ttwu_runnable() vs sched_delayed
References: <20250702114924.091581796@infradead.org>
Change how TTWU handles sched_delayed tasks.

Currently sched_delayed tasks are seen as on_rq and will hit
ttwu_runnable(), which treats sched_delayed tasks the same as other
on_rq tasks: it makes them runnable on the runqueue they are already
on. However, tasks that were dequeued (and not delayed) get a
different wake-up path; notably, they pass through wakeup balancing.

Change ttwu_runnable() to dequeue delayed tasks and report that the
task is not on_rq after all, ensuring the task continues down the
regular wakeup path.

Signed-off-by: Peter Zijlstra (Intel)
---
 kernel/sched/core.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3793,8 +3793,10 @@ static int ttwu_runnable(struct task_str
 		return 0;
 
 	update_rq_clock(rq);
-	if (p->se.sched_delayed)
-		enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+	if (p->se.sched_delayed) {
+		dequeue_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_DELAYED | DEQUEUE_SLEEP);
+		return 0;
+	}
 	if (!task_on_cpu(rq, p)) {
 		/*
 		 * When on_rq && !on_cpu the task is preempted, see if
From nobody Wed Oct 8 04:11:07 2025
Message-ID: <20250702121159.652969404@infradead.org>
User-Agent: quilt/0.68
Date: Wed, 02 Jul 2025 13:49:36 +0200
From: Peter Zijlstra
To: mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org,
 dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
 mgorman@suse.de, vschneid@redhat.com, clm@meta.com
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org
Subject: [PATCH v2 12/12] sched: Add ttwu_queue support for delayed tasks
References: <20250702114924.091581796@infradead.org>

One of the more expensive things to do is take a remote runqueue lock,
which is exactly what ttwu_runnable() ends up doing. However, in the
case of sched_delayed tasks it is possible to queue up an IPI instead.

Reported-by: Chris Mason
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20250520101727.984171377@infradead.org
---
 include/linux/sched.h   |    1 
 kernel/sched/core.c     |   96 +++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/fair.c     |   17 ++++++++
 kernel/sched/features.h |    1 
 kernel/sched/sched.h    |    1 
 5 files changed, 110 insertions(+), 6 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -984,6 +984,7 @@ struct task_struct {
 	 * ->sched_remote_wakeup gets used, so it can be in this word.
 	 */
 	unsigned			sched_remote_wakeup:1;
+	unsigned			sched_remote_delayed:1;
 #ifdef CONFIG_RT_MUTEXES
 	unsigned			sched_rt_mutex:1;
 #endif
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -675,7 +675,12 @@ struct rq *__task_rq_lock(struct task_st
 {
 	struct rq *rq;
 
-	lockdep_assert_held(&p->pi_lock);
+	/*
+	 * TASK_WAKING is used to serialize the remote end of wakeup, rather
+	 * than p->pi_lock.
+	 */
+	lockdep_assert(p->__state == TASK_WAKING ||
+		       lockdep_is_held(&p->pi_lock) != LOCK_STATE_NOT_HELD);
 
 	for (;;) {
 		rq = task_rq(p);
@@ -3727,6 +3732,8 @@ ttwu_do_activate(struct rq *rq, struct t
 	}
 }
 
+static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags);
+
 /*
  * Consider @p being inside a wait loop:
  *
@@ -3754,6 +3761,35 @@ ttwu_do_activate(struct rq *rq, struct t
  */
 static int ttwu_runnable(struct task_struct *p, int wake_flags)
 {
+	if (sched_feat(TTWU_QUEUE_DELAYED) && READ_ONCE(p->se.sched_delayed)) {
+		/*
+		 * Similar to try_to_block_task():
+		 *
+		 *   __schedule()			ttwu()
+		 *     prev_state = prev->state		  if (p->sched_delayed)
+		 *     if (prev_state)			    smp_acquire__after_ctrl_dep()
+		 *       try_to_block_task()		    p->state = TASK_WAKING
+		 *         ... set_delayed()
+		 *           RELEASE p->sched_delayed = 1
+		 *
+		 * __schedule() and ttwu() have matching control dependencies.
+		 *
+		 * Notably, once we observe sched_delayed we know the task has
+		 * passed try_to_block_task() and p->state is ours to modify.
+		 *
+		 * TASK_WAKING controls ttwu() concurrency.
+		 */
+		smp_acquire__after_ctrl_dep();
+		WRITE_ONCE(p->__state, TASK_WAKING);
+		/*
+		 * Bit of a hack, see select_task_rq_fair()'s WF_DELAYED case.
+		 */
+		p->wake_cpu = smp_processor_id();
+
+		if (ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_DELAYED))
+			return 1;
+	}
+
 	CLASS(__task_rq_lock, guard)(p);
 	struct rq *rq = guard.rq;
 
@@ -3776,6 +3812,8 @@ static int ttwu_runnable(struct task_str
 	return 1;
 }
 
+static void __ttwu_queue_wakelist(struct task_struct *p, int cpu);
+
 static inline bool ttwu_do_migrate(struct rq *rq, struct task_struct *p, int cpu)
 {
 	struct rq *p_rq = rq ? : task_rq(p);
@@ -3801,6 +3839,52 @@ static inline bool ttwu_do_migrate(struc
 	return true;
 }
 
+static int ttwu_delayed(struct rq *rq, struct task_struct *p, int wake_flags,
+			struct rq_flags *rf)
+{
+	struct rq *p_rq = task_rq(p);
+	int cpu;
+
+	/*
+	 * Notably it is possible for on-rq entities to get migrated -- even
+	 * sched_delayed ones. This should be rare though, so flip the locks
+	 * rather than IPI chase after it.
+	 */
+	if (unlikely(rq != p_rq)) {
+		rq_unlock(rq, rf);
+		p_rq = __task_rq_lock(p, rf);
+		update_rq_clock(p_rq);
+	}
+
+	if (task_on_rq_queued(p))
+		dequeue_task(p_rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+
+	/*
+	 * NOTE: unlike the regular try_to_wake_up() path, this runs both
+	 * select_task_rq() and ttwu_do_migrate() while holding rq->lock
+	 * rather than p->pi_lock.
+	 */
+	cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
+	if (ttwu_do_migrate(rq, p, cpu))
+		wake_flags |= WF_MIGRATED;
+
+	if (unlikely(rq != p_rq)) {
+		__task_rq_unlock(p_rq, rf);
+		rq_lock(rq, rf);
+	}
+
+	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
+	p->sched_remote_delayed = 0;
+
+	/* it wants to run here */
+	if (cpu_of(rq) == cpu)
+		return 0;
+
+	/* shoot it to the CPU it wants to run on */
+	__ttwu_queue_wakelist(p, cpu);
+	return 1;
+}
+
 void sched_ttwu_pending(void *arg)
 {
 	struct llist_node *llist = arg;
@@ -3819,12 +3903,13 @@ void sched_ttwu_pending(void *arg)
 		if (WARN_ON_ONCE(p->on_cpu))
 			smp_cond_load_acquire(&p->on_cpu, !VAL);
 
-		if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
-			set_task_cpu(p, cpu_of(rq));
-
 		if (p->sched_remote_wakeup)
 			wake_flags |= WF_MIGRATED;
 
+		if (p->sched_remote_delayed &&
+		    ttwu_delayed(rq, p, wake_flags | WF_DELAYED, &guard.rf))
+			continue;
+
 		ttwu_do_activate(rq, p, wake_flags, &guard.rf);
 	}
 
@@ -3964,12 +4049,13 @@ static inline bool ttwu_queue_cond(struc
 
 static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
-	bool def = sched_feat(TTWU_QUEUE_DEFAULT);
+	bool def = sched_feat(TTWU_QUEUE_DEFAULT) || (wake_flags & WF_DELAYED);
 
 	if (!ttwu_queue_cond(p, cpu, def))
 		return false;
 
 	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
+	p->sched_remote_delayed = !!(wake_flags & WF_DELAYED);
 
 	__ttwu_queue_wakelist(p, cpu);
 	return true;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5327,7 +5327,10 @@ static __always_inline void return_cfs_r
 
 static void set_delayed(struct sched_entity *se)
 {
-	se->sched_delayed = 1;
+	/*
+	 * See TTWU_QUEUE_DELAYED in ttwu_runnable().
+	 */
+	smp_store_release(&se->sched_delayed, 1);
 
 	/*
 	 * Delayed se of cfs_rq have no tasks queued on them.
@@ -8481,6 +8484,18 @@ select_task_rq_fair(struct task_struct *
 	/* SD_flags and WF_flags share the first nibble */
 	int sd_flag = wake_flags & 0xF;
 
+	if (wake_flags & WF_DELAYED) {
+		/*
+		 * This is the ttwu_delayed() case; where prev_cpu is in fact
+		 * the CPU that did the wakeup, while @p is running on the
+		 * current CPU.
+		 *
+		 * Make sure to flip them the right way around, otherwise
+		 * wake-affine is going to do the wrong thing.
+		 */
+		swap(cpu, new_cpu);
+	}
+
 	/*
 	 * required for stable ->cpus_allowed
 	 */
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -82,6 +82,7 @@ SCHED_FEAT(TTWU_QUEUE, false)
 SCHED_FEAT(TTWU_QUEUE, true)
 #endif
 SCHED_FEAT(TTWU_QUEUE_ON_CPU, true)
+SCHED_FEAT(TTWU_QUEUE_DELAYED, true)
 SCHED_FEAT(TTWU_QUEUE_DEFAULT, false)
 
 /*
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2280,6 +2280,7 @@ static inline int task_on_rq_migrating(s
 #define WF_RQ_SELECTED		0x80 /* ->select_task_rq() was called */
 
 #define WF_ON_CPU		0x0100
+#define WF_DELAYED		0x0200
 
 static_assert(WF_EXEC == SD_BALANCE_EXEC);
 static_assert(WF_FORK == SD_BALANCE_FORK);