From nobody Sat Feb 7 08:44:09 2026
From: Valentin Schneider
To: linux-kernel@vger.kernel.org
Cc: Benjamin Segall, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
 Dietmar Eggemann, Steven Rostedt, Mel Gorman, Daniel Bristot de Oliveira,
 Phil Auld, Clark Williams, Tomas Glozar
Subject: [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace
Date: Fri, 2 Feb 2024 09:09:16 +0100
Message-ID: <20240202080920.3337862-2-vschneid@redhat.com>
In-Reply-To:
 <20240202080920.3337862-1-vschneid@redhat.com>
References: <20240202080920.3337862-1-vschneid@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Benjamin Segall

The basic idea of this implementation is to maintain duplicate runqueues
in each cfs_rq that contain duplicate pointers to sched_entitys which
should bypass throttling. Then we can skip throttling cfs_rqs that have
any such children, and when we pick inside any not-actually-throttled
cfs_rq, we only look at this duplicated list.

"Which tasks should bypass throttling" here is "all schedule() calls
that don't set a special flag", but could instead involve the lockdep
markers (except for the problem of percpu-rwsem and similar) or explicit
flags around syscalls and faults, or something else.

This approach avoids any O(tasks) loops, but leaves partially-throttled
cfs_rqs still contributing their full h_nr_running to their parents,
which might result in worse balancing. Also it adds more (generally
still small) overhead to the common enqueue/dequeue/pick paths.

The very basic debug test added is to run a cpusoaker and
"cat /sys/kernel/debug/sched_locked_spin" pinned to the same cpu in the
same cgroup with a quota < 1 cpu.

Not-signed-off-by: Benjamin Segall
[Slight comment / naming changes]
Signed-off-by: Valentin Schneider
---
 include/linux/sched.h |   7 ++
 kernel/entry/common.c |   2 +-
 kernel/entry/kvm.c    |   2 +-
 kernel/sched/core.c   |  20 ++++
 kernel/sched/debug.c  |  28 +++++
 kernel/sched/fair.c   | 232 ++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h  |   3 +
 7 files changed, 281 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 03bfe9ab29511..4a0105d1eaa21 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -303,6 +303,7 @@ extern long schedule_timeout_killable(long timeout);
 extern long schedule_timeout_uninterruptible(long timeout);
 extern long schedule_timeout_idle(long timeout);
 asmlinkage void schedule(void);
+asmlinkage void schedule_usermode(void);
 extern void schedule_preempt_disabled(void);
 asmlinkage void preempt_schedule_irq(void);
 #ifdef CONFIG_PREEMPT_RT
@@ -553,6 +554,9 @@ struct sched_entity {
 	struct cfs_rq		*my_q;
 	/* cached value of my_q->h_nr_running */
 	unsigned long		runnable_weight;
+#ifdef CONFIG_CFS_BANDWIDTH
+	struct list_head	kernel_node;
+#endif
 #endif

 #ifdef CONFIG_SMP
@@ -1539,6 +1543,9 @@ struct task_struct {
 	struct user_event_mm		*user_event_mm;
 #endif

+#ifdef CONFIG_CFS_BANDWIDTH
+	atomic_t			in_return_to_user;
+#endif
 	/*
	 * New fields for task_struct should be added above here, so that
	 * they are included in the randomized portion of task_struct.
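
To make the "duplicate list" idea above more concrete, here is a minimal
user-space sketch (not kernel code) of the idiom the patch relies on: an
entity's kernel_node stays empty (pointing at itself) while the entity is not
bypassing throttling, so the emptiness test doubles as the membership flag,
and a cfs_rq that has run out of runtime only picks from its kernel_children
list. The toy node type and the simplified pick() standing in for pick_eevdf()
are assumptions made for brevity.

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy stand-in for the kernel's list_head: an "empty" node points at itself. */
struct node { struct node *prev, *next; };

static void node_init(struct node *n)        { n->prev = n->next = n; }
static bool node_empty(const struct node *n) { return n->next == n; }

static void node_add(struct node *n, struct node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Simplified sched_entity / cfs_rq with only the fields this series adds. */
struct entity {
	const char *name;
	struct node kernel_node;     /* empty <=> not bypassing throttling */
};

struct toy_cfs_rq {
	struct node kernel_children; /* entities allowed to run past the quota */
	int h_kernel_running;
	bool out_of_runtime;         /* "would have been throttled" */
};

/* Rough analogue of enqueue_kernel(): list emptiness is the membership test. */
static void mark_in_kernel(struct toy_cfs_rq *rq, struct entity *e)
{
	if (node_empty(&e->kernel_node))
		node_add(&e->kernel_node, &rq->kernel_children);
	rq->h_kernel_running++;
}

/* Rough analogue of the modified pick: out of runtime => kernel-side only. */
static struct entity *pick(struct toy_cfs_rq *rq, struct entity *fair_choice)
{
	if (rq->out_of_runtime && !node_empty(&rq->kernel_children)) {
		/* container_of-style recovery of the first bypass entity */
		char *first = (char *)rq->kernel_children.next;
		return (struct entity *)(first - offsetof(struct entity, kernel_node));
	}
	return fair_choice; /* whatever the fair pick would normally return */
}

int main(void)
{
	struct toy_cfs_rq rq = { .h_kernel_running = 0, .out_of_runtime = false };
	struct entity user = { .name = "user-soaker" }, kern = { .name = "in-kernel" };

	node_init(&rq.kernel_children);
	node_init(&user.kernel_node);
	node_init(&kern.kernel_node);

	mark_in_kernel(&rq, &kern);

	printf("runtime left:   pick %s\n", pick(&rq, &user)->name); /* user-soaker */
	rq.out_of_runtime = true;
	printf("quota exceeded: pick %s\n", pick(&rq, &user)->name); /* in-kernel */
	assert(rq.h_kernel_running == 1);
	return 0;
}

The real patch clears the flag with list_del_init() and reaches the normal
path through pick_eevdf(); the sketch only shows the data-structure trick,
not the fairness logic.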
diff --git a/kernel/entry/common.c b/kernel/entry/common.c index d7ee4bc3f2ba3..16b5432a62c6f 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -156,7 +156,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_r= egs *regs, local_irq_enable_exit_to_user(ti_work); =20 if (ti_work & _TIF_NEED_RESCHED) - schedule(); + schedule_usermode(); /* TODO: also all of the arch/ loops that don't us= e this yet */ =20 if (ti_work & _TIF_UPROBE) uprobe_notify_resume(regs); diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c index 2e0f75bcb7fd1..fc4b73de07539 100644 --- a/kernel/entry/kvm.c +++ b/kernel/entry/kvm.c @@ -14,7 +14,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu,= unsigned long ti_work) } =20 if (ti_work & _TIF_NEED_RESCHED) - schedule(); + schedule_usermode(); =20 if (ti_work & _TIF_NOTIFY_RESUME) resume_user_mode_work(NULL); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index db4be4921e7f0..a7c028fad5a89 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4529,6 +4529,10 @@ static void __sched_fork(unsigned long clone_flags, = struct task_struct *p) #ifdef CONFIG_FAIR_GROUP_SCHED p->se.cfs_rq =3D NULL; #endif +#ifdef CONFIG_CFS_BANDWIDTH + INIT_LIST_HEAD(&p->se.kernel_node); + atomic_set(&p->in_return_to_user, 0); +#endif =20 #ifdef CONFIG_SCHEDSTATS /* Even if schedstat is disabled, there should not be garbage */ @@ -6818,6 +6822,22 @@ asmlinkage __visible void __sched schedule(void) } EXPORT_SYMBOL(schedule); =20 +asmlinkage __visible void __sched schedule_usermode(void) +{ +#ifdef CONFIG_CFS_BANDWIDTH + /* + * This is only atomic because of this simple implementation. We could + * do something with an SM_USER to avoid other-cpu scheduler operations + * racing against these writes. + */ + atomic_set(¤t->in_return_to_user, true); + schedule(); + atomic_set(¤t->in_return_to_user, false); +#else + schedule(); +#endif +} + /* * synchronize_rcu_tasks() makes sure that no task is stuck in preempted * state (have scheduled out non-voluntarily) by making sure that all diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 8d5d98a5834df..4a89dbc3ddfcd 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -319,6 +319,32 @@ static const struct file_operations sched_verbose_fops= =3D { .llseek =3D default_llseek, }; =20 +static DEFINE_MUTEX(sched_debug_spin_mutex); +static int sched_debug_spin_show(struct seq_file *m, void *v) { + int count; + mutex_lock(&sched_debug_spin_mutex); + for (count =3D 0; count < 1000; count++) { + u64 start2; + start2 =3D jiffies; + while (jiffies =3D=3D start2) + cpu_relax(); + schedule(); + } + mutex_unlock(&sched_debug_spin_mutex); + return 0; +} +static int sched_debug_spin_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, sched_debug_spin_show, NULL); +} + +static const struct file_operations sched_debug_spin_fops =3D { + .open =3D sched_debug_spin_open, + .read =3D seq_read, + .llseek =3D seq_lseek, + .release =3D single_release, +}; + static const struct seq_operations sched_debug_sops; =20 static int sched_debug_open(struct inode *inode, struct file *filp) @@ -374,6 +400,8 @@ static __init int sched_init_debug(void) =20 debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops= ); =20 + debugfs_create_file("sched_locked_spin", 0444, NULL, NULL, + &sched_debug_spin_fops); return 0; } late_initcall(sched_init_debug); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b803030c3a037..a1808459a5acc 100644 --- a/kernel/sched/fair.c +++ 
b/kernel/sched/fair.c @@ -128,6 +128,7 @@ int __weak arch_asym_cpu_priority(int cpu) * (default: 5 msec, units: microseconds) */ static unsigned int sysctl_sched_cfs_bandwidth_slice =3D 5000UL; +static unsigned int sysctl_sched_cfs_bandwidth_kernel_bypass =3D 1; #endif =20 #ifdef CONFIG_NUMA_BALANCING @@ -146,6 +147,15 @@ static struct ctl_table sched_fair_sysctls[] =3D { .proc_handler =3D proc_dointvec_minmax, .extra1 =3D SYSCTL_ONE, }, + { + .procname =3D "sched_cfs_bandwidth_kernel_bypass", + .data =3D &sysctl_sched_cfs_bandwidth_kernel_bypass, + .maxlen =3D sizeof(unsigned int), + .mode =3D 0644, + .proc_handler =3D proc_dointvec_minmax, + .extra1 =3D SYSCTL_ZERO, + .extra2 =3D SYSCTL_ONE, + }, #endif #ifdef CONFIG_NUMA_BALANCING { @@ -5445,14 +5455,34 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched= _entity *se) =20 /* * Pick the next process, keeping these things in mind, in this order: - * 1) keep things fair between processes/task groups - * 2) pick the "next" process, since someone really wants that to run - * 3) pick the "last" process, for cache locality - * 4) do not run the "skip" process, if something else is available + * 1) If we're inside a throttled cfs_rq, only pick threads in the kernel + * 2) keep things fair between processes/task groups + * 3) pick the "next" process, since someone really wants that to run + * 4) pick the "last" process, for cache locality + * 5) do not run the "skip" process, if something else is available */ static struct sched_entity * -pick_next_entity(struct cfs_rq *cfs_rq) +pick_next_entity(struct cfs_rq *cfs_rq, bool throttled) { +#ifdef CONFIG_CFS_BANDWIDTH + /* + * TODO: This might trigger, I'm not sure/don't remember. Regardless, + * while we do not explicitly handle the case where h_kernel_running + * goes to 0, we will call account/check_cfs_rq_runtime at worst in + * entity_tick and notice that we can now properly do the full + * throttle_cfs_rq. + */ + WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children)); + if (throttled && !list_empty(&cfs_rq->kernel_children)) { + /* + * TODO: you'd want to factor out pick_eevdf to just take + * tasks_timeline, and replace this list with a second rbtree + * and a call to pick_eevdf. + */ + return list_first_entry(&cfs_rq->kernel_children, + struct sched_entity, kernel_node); + } +#endif /* * Enabling NEXT_BUDDY will affect latency but not fairness. */ @@ -5651,8 +5681,14 @@ static void __account_cfs_rq_runtime(struct cfs_rq *= cfs_rq, u64 delta_exec) /* * if we're unable to extend our runtime we resched so that the active * hierarchy can be throttled + * + * Don't resched_curr() if curr is in the kernel. We won't throttle the + * cfs_rq if any task is in the kernel, and if curr in particular is we + * don't need to preempt it in favor of whatever other task is in the + * kernel. 
*/ - if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr)) + if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr) && + list_empty(&rq_of(cfs_rq)->curr->se.kernel_node)) resched_curr(rq_of(cfs_rq)); } =20 @@ -5741,12 +5777,22 @@ static int tg_throttle_down(struct task_group *tg, = void *data) return 0; } =20 +static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se,= int count); +static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se,= int count); + static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) { struct rq *rq =3D rq_of(cfs_rq); struct cfs_bandwidth *cfs_b =3D tg_cfs_bandwidth(cfs_rq->tg); struct sched_entity *se; - long task_delta, idle_task_delta, dequeue =3D 1; + long task_delta, idle_task_delta, kernel_delta, dequeue =3D 1; + + /* + * We don't actually throttle, though account() will have made sure to + * resched us so that we pick into a kernel task. + */ + if (cfs_rq->h_kernel_running) + return false; =20 raw_spin_lock(&cfs_b->lock); /* This will start the period timer if necessary */ @@ -5778,6 +5824,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) =20 task_delta =3D cfs_rq->h_nr_running; idle_task_delta =3D cfs_rq->idle_h_nr_running; + kernel_delta =3D cfs_rq->h_kernel_running; for_each_sched_entity(se) { struct cfs_rq *qcfs_rq =3D cfs_rq_of(se); /* throttled entity or throttle-on-deactivate */ @@ -5791,6 +5838,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) =20 qcfs_rq->h_nr_running -=3D task_delta; qcfs_rq->idle_h_nr_running -=3D idle_task_delta; + dequeue_kernel(qcfs_rq, se, kernel_delta); =20 if (qcfs_rq->load.weight) { /* Avoid re-evaluating load for this entity: */ @@ -5813,6 +5861,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) =20 qcfs_rq->h_nr_running -=3D task_delta; qcfs_rq->idle_h_nr_running -=3D idle_task_delta; + dequeue_kernel(qcfs_rq, se, kernel_delta); } =20 /* At this point se is NULL and we are at root level*/ @@ -5835,7 +5884,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) struct rq *rq =3D rq_of(cfs_rq); struct cfs_bandwidth *cfs_b =3D tg_cfs_bandwidth(cfs_rq->tg); struct sched_entity *se; - long task_delta, idle_task_delta; + long task_delta, idle_task_delta, kernel_delta; =20 se =3D cfs_rq->tg->se[cpu_of(rq)]; =20 @@ -5870,6 +5919,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) =20 task_delta =3D cfs_rq->h_nr_running; idle_task_delta =3D cfs_rq->idle_h_nr_running; + kernel_delta =3D cfs_rq->h_kernel_running; for_each_sched_entity(se) { struct cfs_rq *qcfs_rq =3D cfs_rq_of(se); =20 @@ -5882,6 +5932,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) =20 qcfs_rq->h_nr_running +=3D task_delta; qcfs_rq->idle_h_nr_running +=3D idle_task_delta; + enqueue_kernel(qcfs_rq, se, kernel_delta); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(qcfs_rq)) @@ -5899,6 +5950,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) =20 qcfs_rq->h_nr_running +=3D task_delta; qcfs_rq->idle_h_nr_running +=3D idle_task_delta; + enqueue_kernel(qcfs_rq, se, kernel_delta); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(qcfs_rq)) @@ -6557,6 +6609,86 @@ static void sched_fair_update_stop_tick(struct rq *r= q, struct task_struct *p) } #endif =20 +/* + * We keep track of all children that are runnable in the kernel with a co= unt of + * all descendants. The state is checked on enqueue and put_prev (and hard + * cleared on dequeue), and is stored just as the filled/empty state of the + * kernel_node list entry. 
+ * + * These are simple helpers that do both parts, and should be called botto= m-up + * until hitting a throttled cfs_rq whenever a task changes state (or a cf= s_rq + * is (un)throttled). + */ +static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se,= int count) +{ + if (count =3D=3D 0) + return; + + if (list_empty(&se->kernel_node)) + list_add(&se->kernel_node, &cfs_rq->kernel_children); + cfs_rq->h_kernel_running +=3D count; +} + +static bool is_kernel_task(struct task_struct *p) +{ + return sysctl_sched_cfs_bandwidth_kernel_bypass && !atomic_read(&p->in_re= turn_to_user); +} + +/* + * When called on a task this always transitions it to a !kernel state. + * + * When called on a group it is just synchronizing the state with the new + * h_kernel_waiters, unless this it has been throttled and is !on_rq + */ +static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se,= int count) +{ + if (count =3D=3D 0) + return; + + if (!se->on_rq || entity_is_task(se) || + !group_cfs_rq(se)->h_kernel_running) + list_del_init(&se->kernel_node); + cfs_rq->h_kernel_running -=3D count; +} + +/* + * Returns if the cfs_rq "should" be throttled but might not be because of + * kernel threads bypassing throttle. + */ +static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq) +{ + if (!cfs_bandwidth_used()) + return false; + + if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0)) + return false; + return true; +} + +static void unthrottle_on_enqueue(struct task_struct *p) +{ + struct sched_entity *se =3D &p->se; + + if (!cfs_bandwidth_used() || !sysctl_sched_cfs_bandwidth_kernel_bypass) + return; + if (!cfs_rq_of(&p->se)->throttle_count) + return; + + /* + * MAYBE TODO: doing it this simple way is O(throttle_count * + * cgroup_depth). We could optimize that into a single pass, but making + * a mostly-copy of unthrottle_cfs_rq that does that is a pain and easy + * to get wrong. 
(And even without unthrottle_on_enqueue it's O(nm), + * just not while holding rq->lock the whole time) + */ + + for_each_sched_entity(se) { + struct cfs_rq *cfs_rq =3D cfs_rq_of(se); + if (cfs_rq->throttled) + unthrottle_cfs_rq(cfs_rq); + } +} + #else /* CONFIG_CFS_BANDWIDTH */ =20 static inline bool cfs_bandwidth_used(void) @@ -6604,6 +6736,16 @@ bool cfs_task_bw_constrained(struct task_struct *p) return false; } #endif +static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se,= int count) {} +static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se,= int count) {} +static inline bool is_kernel_task(struct task_struct *p) +{ + return false; +} +static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq) +{ + return false; +} #endif /* CONFIG_CFS_BANDWIDTH */ =20 #if !defined(CONFIG_CFS_BANDWIDTH) || !defined(CONFIG_NO_HZ_FULL) @@ -6707,6 +6849,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *= p, int flags) struct sched_entity *se =3D &p->se; int idle_h_nr_running =3D task_has_idle_policy(p); int task_new =3D !(flags & ENQUEUE_WAKEUP); + bool kernel_task =3D is_kernel_task(p); =20 /* * The code below (indirectly) updates schedutil which looks at @@ -6735,6 +6878,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *= p, int flags) =20 if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running =3D 1; + if (kernel_task) + enqueue_kernel(cfs_rq, se, 1); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -6755,6 +6900,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *= p, int flags) =20 if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running =3D 1; + if (kernel_task) + enqueue_kernel(cfs_rq, se, 1); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -6785,6 +6932,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *= p, int flags) assert_list_leaf_cfs_rq(rq); =20 hrtick_update(rq); + + if (kernel_task) + unthrottle_on_enqueue(p); } =20 static void set_next_buddy(struct sched_entity *se); @@ -6801,6 +6951,7 @@ static void dequeue_task_fair(struct rq *rq, struct t= ask_struct *p, int flags) int task_sleep =3D flags & DEQUEUE_SLEEP; int idle_h_nr_running =3D task_has_idle_policy(p); bool was_sched_idle =3D sched_idle_rq(rq); + bool kernel_task =3D !list_empty(&p->se.kernel_node); =20 util_est_dequeue(&rq->cfs, p); =20 @@ -6813,6 +6964,8 @@ static void dequeue_task_fair(struct rq *rq, struct t= ask_struct *p, int flags) =20 if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running =3D 1; + if (kernel_task) + dequeue_kernel(cfs_rq, se, 1); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -6845,6 +6998,8 @@ static void dequeue_task_fair(struct rq *rq, struct t= ask_struct *p, int flags) =20 if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running =3D 1; + if (kernel_task) + dequeue_kernel(cfs_rq, se, 1); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -8343,11 +8498,40 @@ static void check_preempt_wakeup_fair(struct rq *rq= , struct task_struct *p, int resched_curr(rq); } =20 +static void handle_kernel_task_prev(struct task_struct *prev) +{ +#ifdef CONFIG_CFS_BANDWIDTH + struct sched_entity *se =3D &prev->se; + bool p_in_kernel =3D is_kernel_task(prev); + bool p_in_kernel_tree =3D !list_empty(&se->kernel_node); + /* + * These extra loops are bad and against the whole point of the merged + * PNT, but it's a pain to merge, particularly since we want it to occur + * before check_cfs_runtime(). 
+ */ + if (p_in_kernel_tree && !p_in_kernel) { + WARN_ON_ONCE(!se->on_rq); /* dequeue should have removed us */ + for_each_sched_entity(se) { + dequeue_kernel(cfs_rq_of(se), se, 1); + if (cfs_rq_throttled(cfs_rq_of(se))) + break; + } + } else if (!p_in_kernel_tree && p_in_kernel && se->on_rq) { + for_each_sched_entity(se) { + enqueue_kernel(cfs_rq_of(se), se, 1); + if (cfs_rq_throttled(cfs_rq_of(se))) + break; + } + } +#endif +} + #ifdef CONFIG_SMP static struct task_struct *pick_task_fair(struct rq *rq) { struct sched_entity *se; struct cfs_rq *cfs_rq; + bool throttled =3D false; =20 again: cfs_rq =3D &rq->cfs; @@ -8368,7 +8552,10 @@ static struct task_struct *pick_task_fair(struct rq = *rq) goto again; } =20 - se =3D pick_next_entity(cfs_rq); + if (cfs_rq_throttled_loose(cfs_rq)) + throttled =3D true; + + se =3D pick_next_entity(cfs_rq, throttled); cfs_rq =3D group_cfs_rq(se); } while (cfs_rq); =20 @@ -8383,6 +8570,14 @@ pick_next_task_fair(struct rq *rq, struct task_struc= t *prev, struct rq_flags *rf struct sched_entity *se; struct task_struct *p; int new_tasks; + bool throttled; + + /* + * We want to handle this before check_cfs_runtime(prev). We'll + * duplicate a little work in the goto simple case, but that's fine + */ + if (prev) + handle_kernel_task_prev(prev); =20 again: if (!sched_fair_runnable(rq)) @@ -8400,6 +8595,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct= *prev, struct rq_flags *rf * hierarchy, only change the part that actually changes. */ =20 + throttled =3D false; do { struct sched_entity *curr =3D cfs_rq->curr; =20 @@ -8431,7 +8627,10 @@ pick_next_task_fair(struct rq *rq, struct task_struc= t *prev, struct rq_flags *rf } } =20 - se =3D pick_next_entity(cfs_rq); + if (cfs_rq_throttled_loose(cfs_rq)) + throttled =3D true; + + se =3D pick_next_entity(cfs_rq, throttled); cfs_rq =3D group_cfs_rq(se); } while (cfs_rq); =20 @@ -8469,8 +8668,11 @@ pick_next_task_fair(struct rq *rq, struct task_struc= t *prev, struct rq_flags *rf if (prev) put_prev_task(rq, prev); =20 + throttled =3D false; do { - se =3D pick_next_entity(cfs_rq); + if (cfs_rq_throttled_loose(cfs_rq)) + throttled =3D true; + se =3D pick_next_entity(cfs_rq, throttled); set_next_entity(cfs_rq, se); cfs_rq =3D group_cfs_rq(se); } while (cfs_rq); @@ -8534,6 +8736,8 @@ static void put_prev_task_fair(struct rq *rq, struct = task_struct *prev) struct sched_entity *se =3D &prev->se; struct cfs_rq *cfs_rq; =20 + handle_kernel_task_prev(prev); + for_each_sched_entity(se) { cfs_rq =3D cfs_rq_of(se); put_prev_entity(cfs_rq, se); @@ -12818,6 +13022,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq) #ifdef CONFIG_SMP raw_spin_lock_init(&cfs_rq->removed.lock); #endif +#ifdef CONFIG_CFS_BANDWIDTH + INIT_LIST_HEAD(&cfs_rq->kernel_children); +#endif } =20 #ifdef CONFIG_FAIR_GROUP_SCHED @@ -12970,6 +13177,9 @@ void init_tg_cfs_entry(struct task_group *tg, struc= t cfs_rq *cfs_rq, /* guarantee group entities always have weight */ update_load_set(&se->load, NICE_0_LOAD); se->parent =3D parent; +#ifdef CONFIG_CFS_BANDWIDTH + INIT_LIST_HEAD(&se->kernel_node); +#endif } =20 static DEFINE_MUTEX(shares_mutex); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index e58a54bda77de..0b33ce2e60555 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -580,6 +580,7 @@ struct cfs_rq { =20 struct rb_root_cached tasks_timeline; =20 + /* * 'curr' points to currently running entity on this cfs_rq. * It is set to NULL otherwise (i.e when none are currently running). 
@@ -658,8 +659,10 @@ struct cfs_rq {
 	u64			throttled_clock_self_time;
 	int			throttled;
 	int			throttle_count;
+	int			h_kernel_running;
 	struct list_head	throttled_list;
 	struct list_head	throttled_csd_list;
+	struct list_head	kernel_children;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 };
-- 
2.43.0

From nobody Sat Feb 7 08:44:09 2026
From: Valentin Schneider
To: linux-kernel@vger.kernel.org
Cc: Benjamin Segall , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar
Eggemann , Steven Rostedt , Mel Gorman , Daniel Bristot de Oliveira , Phil Auld , Clark Williams , Tomas Glozar Subject: [RFC PATCH v2 2/5] sched: Note schedule() invocations at return-to-user with SM_USER Date: Fri, 2 Feb 2024 09:09:17 +0100 Message-ID: <20240202080920.3337862-3-vschneid@redhat.com> In-Reply-To: <20240202080920.3337862-1-vschneid@redhat.com> References: <20240202080920.3337862-1-vschneid@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.8 Content-Type: text/plain; charset="utf-8" task_struct.in_return_to_user is currently updated via atomic operations in schedule_usermode(). However, one can note: o .in_return_to_user is only updated for the current task o There are no remote (smp_processor_id() !=3D task_cpu(p)) accesses to .in_return_to_user Add schedule_with_mode() to factorize schedule() with different flags to pass down to __schedule_loop(). Add SM_USER to denote schedule() calls from return-to-userspace points. Update .in_return_to_user from within the preemption-disabled, rq_lock-held part of __schedule(). Suggested-by: Benjamin Segall Signed-off-by: Valentin Schneider --- include/linux/sched.h | 2 +- kernel/sched/core.c | 43 ++++++++++++++++++++++++++++++++----------- kernel/sched/fair.c | 17 ++++++++++++++++- 3 files changed, 49 insertions(+), 13 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 4a0105d1eaa21..1b6f17b2150a6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1544,7 +1544,7 @@ struct task_struct { #endif =20 #ifdef CONFIG_CFS_BANDWIDTH - atomic_t in_return_to_user; + int in_return_to_user; #endif /* * New fields for task_struct should be added above here, so that diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a7c028fad5a89..54e6690626b13 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4531,7 +4531,7 @@ static void __sched_fork(unsigned long clone_flags, s= truct task_struct *p) #endif #ifdef CONFIG_CFS_BANDWIDTH INIT_LIST_HEAD(&p->se.kernel_node); - atomic_set(&p->in_return_to_user, 0); + p->in_return_to_user =3D false; #endif =20 #ifdef CONFIG_SCHEDSTATS @@ -5147,6 +5147,9 @@ prepare_lock_switch(struct rq *rq, struct task_struct= *next, struct rq_flags *rf =20 static inline void finish_lock_switch(struct rq *rq) { +#ifdef CONFIG_CFS_BANDWIDTH + current->in_return_to_user =3D false; +#endif /* * If we are tracking spinlock dependencies then we have to * fix up the runqueue lock - which gets 'carried over' from @@ -6562,6 +6565,18 @@ pick_next_task(struct rq *rq, struct task_struct *pr= ev, struct rq_flags *rf) #define SM_PREEMPT 0x1 #define SM_RTLOCK_WAIT 0x2 =20 +/* + * Special case for CFS_BANDWIDTH where we need to know if the call to + * __schedule() is directely preceding an entry into userspace. + * It is removed from the mode argument as soon as it is used to not go ag= ainst + * the SM_MASK_PREEMPT optimisation below. 
+ */ +#ifdef CONFIG_CFS_BANDWIDTH +# define SM_USER 0x4 +#else +# define SM_USER SM_NONE +#endif + #ifndef CONFIG_PREEMPT_RT # define SM_MASK_PREEMPT (~0U) #else @@ -6646,6 +6661,14 @@ static void __sched notrace __schedule(unsigned int = sched_mode) rq_lock(rq, &rf); smp_mb__after_spinlock(); =20 +#ifdef CONFIG_CFS_BANDWIDTH + if (sched_mode & SM_USER) { + prev->in_return_to_user =3D true; + sched_mode &=3D ~SM_USER; + } +#endif + SCHED_WARN_ON(sched_mode & SM_USER); + /* Promote REQ to ACT */ rq->clock_update_flags <<=3D 1; update_rq_clock(rq); @@ -6807,7 +6830,7 @@ static __always_inline void __schedule_loop(unsigned = int sched_mode) } while (need_resched()); } =20 -asmlinkage __visible void __sched schedule(void) +static __always_inline void schedule_with_mode(unsigned int sched_mode) { struct task_struct *tsk =3D current; =20 @@ -6817,22 +6840,20 @@ asmlinkage __visible void __sched schedule(void) =20 if (!task_is_running(tsk)) sched_submit_work(tsk); - __schedule_loop(SM_NONE); + __schedule_loop(sched_mode); sched_update_worker(tsk); } + +asmlinkage __visible void __sched schedule(void) +{ + schedule_with_mode(SM_NONE); +} EXPORT_SYMBOL(schedule); =20 asmlinkage __visible void __sched schedule_usermode(void) { #ifdef CONFIG_CFS_BANDWIDTH - /* - * This is only atomic because of this simple implementation. We could - * do something with an SM_USER to avoid other-cpu scheduler operations - * racing against these writes. - */ - atomic_set(¤t->in_return_to_user, true); - schedule(); - atomic_set(¤t->in_return_to_user, false); + schedule_with_mode(SM_USER); #else schedule(); #endif diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a1808459a5acc..96504be6ee14a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6631,7 +6631,22 @@ static void enqueue_kernel(struct cfs_rq *cfs_rq, st= ruct sched_entity *se, int c =20 static bool is_kernel_task(struct task_struct *p) { - return sysctl_sched_cfs_bandwidth_kernel_bypass && !atomic_read(&p->in_re= turn_to_user); + /* + * The flag is updated within __schedule() with preemption disabled, + * under the rq lock, and only when the task is current. + * + * Holding the rq lock for that task's CPU is thus sufficient for the + * value to be stable, if the task is enqueued. + * + * If the task is dequeued, then task_cpu(p) *can* change, but this + * so far only happens in enqueue_task_fair() which means either: + * - the task is being activated, its CPU has been set previously in ttwu= () + * - the task is going through a "change" cycle (e.g. sched_move_task()), + * the pi_lock is also held so the CPU is stable. 
+ */ + lockdep_assert_rq_held(cpu_rq(task_cpu(p))); + + return sysctl_sched_cfs_bandwidth_kernel_bypass && !p->in_return_to_user; } =20 /* --=20 2.43.0 From nobody Sat Feb 7 08:44:09 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9086E19477 for ; Fri, 2 Feb 2024 08:10:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706861453; cv=none; b=Xbv6Hkh1+4Wa87hU/AfTGrhyqTHBe34FkbhB9QXeDEfiPdaJ+AWajiQnB1juVlFTC88SgLf7aKNeARRAjWEi1dVlVwo3EDzklkXWhpqkNVqpKIwBdvyAtcamC3f3bL57x8pF+325eEj44k03JVOD+VWCnAhEI6IJUe/OXvTCG0k= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706861453; c=relaxed/simple; bh=dWuR9UQw2rwTVshyZQG5PCMcxyEdawkP54MOYUetJAM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=iB3w/RGRzNdwRlPgTiGybUFW0vFZ+O41iw8Xq1ZsTFiQt/f/SRsPDAVIGmoKbMkcJU/TQ9n9+rl2ijwXf7GNO0dNdkaYQaxJffse7s1LcIksibw2oq4yP32ZT5Z0CVizexecHKd6SDVsTlzcst5tWGC9cqnvIAetMldLypjIbSc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=N3ZpuSiS; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="N3ZpuSiS" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1706861450; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=HxaIkJskuFw38ApXZhFFaPGSaq/2ThOojlsU5lw5wRE=; b=N3ZpuSiSogq4Fzc3MLaKHNggOKGsr76/9qDKxi11ss+02s+Wy4SPjACFD4rjrS/gq/3HMy Mx2izwBJoIhvuqJPjmer9qYtFByOYBYG1TRGX5jG3upGzeVvRxdONUtihwoeSvkrkUF8VG y7QyhvAXiGX7MkJdMmAFQ1Gnz2S6y5k= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-319--CXo1PhyNSiCukS3YGEorQ-1; Fri, 02 Feb 2024 03:10:46 -0500 X-MC-Unique: -CXo1PhyNSiCukS3YGEorQ-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.rdu2.redhat.com [10.11.54.8]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id E676D1013663; Fri, 2 Feb 2024 08:10:45 +0000 (UTC) Received: from vschneid-thinkpadt14sgen2i.remote.csb (unknown [10.39.193.2]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 59EE0C2590E; Fri, 2 Feb 2024 08:10:43 +0000 (UTC) From: Valentin Schneider To: linux-kernel@vger.kernel.org Cc: Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Daniel Bristot de Oliveira , Phil Auld , Clark Williams , Tomas Glozar Subject: [RFC PATCH v2 3/5] sched/fair: Delete 
cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead Date: Fri, 2 Feb 2024 09:09:18 +0100 Message-ID: <20240202080920.3337862-4-vschneid@redhat.com> In-Reply-To: <20240202080920.3337862-1-vschneid@redhat.com> References: <20240202080920.3337862-1-vschneid@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.8 Content-Type: text/plain; charset="utf-8" cfs_rq_throttled_loose() does not check if there is runtime remaining in the cfs_b, and thus relies on check_cfs_rq_runtime() being ran previously for that to be checked. Cache the throttle attempt in throttle_cfs_rq and reuse that where needed. Signed-off-by: Valentin Schneider --- kernel/sched/fair.c | 44 ++++++++++---------------------------------- 1 file changed, 10 insertions(+), 34 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 96504be6ee14a..60778afbff207 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5462,7 +5462,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_e= ntity *se) * 5) do not run the "skip" process, if something else is available */ static struct sched_entity * -pick_next_entity(struct cfs_rq *cfs_rq, bool throttled) +pick_next_entity(struct cfs_rq *cfs_rq) { #ifdef CONFIG_CFS_BANDWIDTH /* @@ -5473,7 +5473,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, bool throttle= d) * throttle_cfs_rq. */ WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children)); - if (throttled && !list_empty(&cfs_rq->kernel_children)) { + if (cfs_rq->throttle_pending && !list_empty(&cfs_rq->kernel_children)) { /* * TODO: you'd want to factor out pick_eevdf to just take * tasks_timeline, and replace this list with a second rbtree @@ -5791,8 +5791,12 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) * We don't actually throttle, though account() will have made sure to * resched us so that we pick into a kernel task. */ - if (cfs_rq->h_kernel_running) + if (cfs_rq->h_kernel_running) { + cfs_rq->throttle_pending =3D true; return false; + } + + cfs_rq->throttle_pending =3D false; =20 raw_spin_lock(&cfs_b->lock); /* This will start the period timer if necessary */ @@ -6666,20 +6670,6 @@ static void dequeue_kernel(struct cfs_rq *cfs_rq, st= ruct sched_entity *se, int c cfs_rq->h_kernel_running -=3D count; } =20 -/* - * Returns if the cfs_rq "should" be throttled but might not be because of - * kernel threads bypassing throttle. 
- */ -static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq) -{ - if (!cfs_bandwidth_used()) - return false; - - if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0)) - return false; - return true; -} - static void unthrottle_on_enqueue(struct task_struct *p) { struct sched_entity *se =3D &p->se; @@ -8546,7 +8536,6 @@ static struct task_struct *pick_task_fair(struct rq *= rq) { struct sched_entity *se; struct cfs_rq *cfs_rq; - bool throttled =3D false; =20 again: cfs_rq =3D &rq->cfs; @@ -8567,10 +8556,7 @@ static struct task_struct *pick_task_fair(struct rq = *rq) goto again; } =20 - if (cfs_rq_throttled_loose(cfs_rq)) - throttled =3D true; - - se =3D pick_next_entity(cfs_rq, throttled); + se =3D pick_next_entity(cfs_rq); cfs_rq =3D group_cfs_rq(se); } while (cfs_rq); =20 @@ -8585,7 +8571,6 @@ pick_next_task_fair(struct rq *rq, struct task_struct= *prev, struct rq_flags *rf struct sched_entity *se; struct task_struct *p; int new_tasks; - bool throttled; =20 /* * We want to handle this before check_cfs_runtime(prev). We'll @@ -8609,8 +8594,6 @@ pick_next_task_fair(struct rq *rq, struct task_struct= *prev, struct rq_flags *rf * Therefore attempt to avoid putting and setting the entire cgroup * hierarchy, only change the part that actually changes. */ - - throttled =3D false; do { struct sched_entity *curr =3D cfs_rq->curr; =20 @@ -8641,11 +8624,7 @@ pick_next_task_fair(struct rq *rq, struct task_struc= t *prev, struct rq_flags *rf goto simple; } } - - if (cfs_rq_throttled_loose(cfs_rq)) - throttled =3D true; - - se =3D pick_next_entity(cfs_rq, throttled); + se =3D pick_next_entity(cfs_rq); cfs_rq =3D group_cfs_rq(se); } while (cfs_rq); =20 @@ -8683,11 +8662,8 @@ pick_next_task_fair(struct rq *rq, struct task_struc= t *prev, struct rq_flags *rf if (prev) put_prev_task(rq, prev); =20 - throttled =3D false; do { - if (cfs_rq_throttled_loose(cfs_rq)) - throttled =3D true; - se =3D pick_next_entity(cfs_rq, throttled); + se =3D pick_next_entity(cfs_rq); set_next_entity(cfs_rq, se); cfs_rq =3D group_cfs_rq(se); } while (cfs_rq); --=20 2.43.0 From nobody Sat Feb 7 08:44:09 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 536BF1AADD for ; Fri, 2 Feb 2024 08:10:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706861456; cv=none; b=rEqZ4EZKraYr+wtHlPrm1b/XTdZ2ICgUrQ2iRkWLmLFRPyAEll/WOm8I4yKhca9fbeh2HLuRNdnVwxPD1QfGanFvRQ/o/VKuznsz8qxqTrFFnfk1DNlHl4Qw/tQTVsvCAHmdACyI2Z+CefS0sQtBdGmePSwzekcejEtj0MtL25E= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706861456; c=relaxed/simple; bh=XSUm7ZVkUeJgzY+9JV/lu1DGc96b7lL1jb2zM9BjCJw=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=EYM2x8UDmGABzZaMedauhb3ttg72aB1JOPC0PIDNqjNE2ZS5CAde05rHAsECldeDT5BlobcKmg7d5pU8O++xrl9T8K1wJCodxjhZadJV180U9zHDo2GZD2w4Z+FcByS+fRLAsTtCmG7RMJETPaeEnCa6FZc2MENzkeygfVMk81w= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=UPlur54y; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none 
dis=none) header.from=redhat.com
From: Valentin Schneider
To: linux-kernel@vger.kernel.org
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
 Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Phil Auld,
 Clark Williams, Tomas Glozar
Subject: [RFC PATCH v2 4/5] sched/fair: Track count of tasks running in userspace
Date: Fri, 2 Feb 2024 09:09:19 +0100
Message-ID: <20240202080920.3337862-5-vschneid@redhat.com>
In-Reply-To: <20240202080920.3337862-1-vschneid@redhat.com>
References: <20240202080920.3337862-1-vschneid@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

While having a second tree to pick from solves the throttling aspect of
things, it also requires modification of the task count at the cfs_rq level.

.h_nr_running is used throughout load_balance(), and it needs to accurately
reflect the amount of pickable tasks: a cfs_rq with .throttle_pending=1 may
have many tasks in userspace (thus effectively throttled), and this "excess"
of tasks shouldn't cause find_busiest_group() / find_busiest_queue() to pick
that cfs_rq's CPU to pull load from when there are other CPUs with more
pickable tasks to pull.

The approach taken here is to track both the count of tasks in kernelspace and
the count of tasks in userspace (technically tasks-just-about-to-enter-userspace).
When a cfs_rq runs out of runtime, it gets marked as .throttle_pending=1. From
this point on, only tasks executing in kernelspace are pickable, and this is
reflected up the hierarchy by removing that cfs_rq.h_user_running from its
parents' .h_nr_running.
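
As a sanity check on the counting rule above, the following user-space sketch
recomputes the hierarchical counters from scratch over a three-level A/B/C
chain and verifies them against the invariants listed next. The toy struct,
the single-child chain and the recompute-from-scratch pass are simplifications
made for illustration; the patch itself updates these counters incrementally
on enqueue/dequeue and on throttle/unthrottle.

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_cfs_rq {
	const char *name;
	struct toy_cfs_rq *child;   /* single-child chain: A -> B -> C */
	int kernel, user;           /* tasks enqueued directly on this cfs_rq */
	bool throttle_pending;

	int h_kernel_running, h_user_running, h_nr_running;
};

static void recompute(struct toy_cfs_rq *rq)
{
	rq->h_kernel_running = rq->kernel;
	rq->h_user_running   = rq->user;
	/* A throttle_pending cfs_rq only counts its kernel-side tasks as pickable. */
	rq->h_nr_running     = rq->kernel + (rq->throttle_pending ? 0 : rq->user);

	if (rq->child) {
		struct toy_cfs_rq *c = rq->child;

		recompute(c);
		/* Kernel-side tasks always propagate up the hierarchy. */
		rq->h_kernel_running += c->h_kernel_running;
		rq->h_nr_running     += c->h_kernel_running;
		/* User-side tasks stop propagating at a throttle_pending boundary. */
		if (!c->throttle_pending) {
			rq->h_user_running += c->h_user_running;
			rq->h_nr_running   += c->h_user_running;
		}
	}
}

static void check_invariants(const struct toy_cfs_rq *rq)
{
	if (rq->throttle_pending)
		assert(rq->h_kernel_running == rq->h_nr_running);
	else
		assert(rq->h_kernel_running + rq->h_user_running == rq->h_nr_running);
}

int main(void)
{
	struct toy_cfs_rq C = { .name = "C", .kernel = 1, .user = 3 };
	struct toy_cfs_rq B = { .name = "B", .kernel = 2, .user = 1, .child = &C };
	struct toy_cfs_rq A = { .name = "A", .kernel = 0, .user = 2, .child = &B };

	recompute(&A);
	check_invariants(&A); check_invariants(&B); check_invariants(&C);
	printf("no throttling:      A.h_nr_running = %d\n", A.h_nr_running); /* 9 */

	C.throttle_pending = true; /* C ran out of runtime but still has kernel tasks */
	recompute(&A);
	check_invariants(&A); check_invariants(&B); check_invariants(&C);
	printf("C throttle_pending: A.h_nr_running = %d\n", A.h_nr_running); /* 6: C.user discounted */

	return 0;
}

With C.throttle_pending set, A's h_nr_running drops by exactly C.user while
the kernel-side counts are untouched, which is the behaviour the diagrams
below illustrate for the real hierarchy.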
To aid in validating the proper behaviour of the implementation, we assert
the following invariants:

  o For any cfs_rq with .throttle_pending == 0:
      .h_kernel_running + .h_user_running == .h_nr_running
  o For any cfs_rq with .throttle_pending == 1:
      .h_kernel_running == .h_nr_running

This means the .h_user_running also needs to be updated as cfs_rq's become
.throttle_pending=1. When a cfs_rq becomes .throttle_pending=1, its
.h_user_running remains untouched, but it is subtracted from its parents'
.h_user_running.

Another way to look at it is that the .h_user_running is "stored" at the
level of the .throttle_pending cfs_rq, and restored to the upper part of the
hierarchy at unthrottle.

An overview of the count logic is:

Consider:
  cfs_rq.kernel := count of kernel *tasks* enqueued on this cfs_rq
  cfs_rq.user   := count of user *tasks* enqueued on this cfs_rq

Then, the following logic is implemented:
  cfs_rq.h_kernel_running = Sum(child.kernel) for all child cfs_rq
  cfs_rq.h_user_running   = Sum(child.user) for all child cfs_rq with !child.throttle_pending
  cfs_rq.h_nr_running     = Sum(child.kernel) for all child cfs_rq
                          + Sum(child.user) for all child cfs_rq with !child.throttle_pending

An application of that logic to an A/B/C cgroup hierarchy:

Initial condition, no throttling

    +------+  .h_kernel_running = C.kernel + B.kernel + A.kernel
  A |cfs_rq|  .h_user_running   = C.user + B.user + A.user
    +------+  .h_nr_running     = C.{kernel+user} + B.{kernel+user} + A.{kernel+user}
       ^      .throttle_pending = 0
       |
       | parent
       |
    +------+  .h_kernel_running = C.kernel + B.kernel
  B |cfs_rq|  .h_user_running   = C.user + B.user
    +------+  .h_nr_running     = C.{kernel+user} + B.{kernel+user}
       ^      .throttle_pending = 0
       |
       | parent
       |
    +------+  .h_kernel_running = C.kernel
  C |cfs_rq|  .h_user_running   = C.user
    +------+  .h_nr_running     = C.{kernel+user}
              .throttle_pending = 0

C becomes .throttle_pending

    +------+  .h_kernel_running = C.kernel + B.kernel + A.kernel               <- Untouched
  A |cfs_rq|  .h_user_running   = B.user + A.user                              <- Decremented by C.user
    +------+  .h_nr_running     = C.kernel + B.{kernel+user} + A.{kernel+user} <- Decremented by C.user
       ^      .throttle_pending = 0
       |
       | parent
       |
    +------+  .h_kernel_running = C.kernel + B.kernel                          <- Untouched
  B |cfs_rq|  .h_user_running   = B.user                                       <- Decremented by C.user
    +------+  .h_nr_running     = C.kernel + B.{kernel+user} + A.{kernel+user} <- Decremented by C.user
       ^      .throttle_pending = 0
       |
       | parent
       |
    +------+  .h_kernel_running = C.kernel
  C |cfs_rq|  .h_user_running   = C.user   <- Untouched, the count is "stored" at this level
    +------+  .h_nr_running     = C.kernel <- Decremented by C.user
              .throttle_pending = 1

C becomes throttled

    +------+  .h_kernel_running = B.kernel + A.kernel                          <- Decremented by C.kernel
  A |cfs_rq|  .h_user_running   = B.user + A.user
    +------+  .h_nr_running     = B.{kernel+user} + A.{kernel+user}            <- Decremented by C.kernel
       ^      .throttle_pending = 0
       |
       | parent
       |
    +------+  .h_kernel_running = B.kernel                                     <- Decremented by C.kernel
  B |cfs_rq|  .h_user_running   = B.user
    +------+  .h_nr_running     = B.{kernel+user} + A.{kernel+user}            <- Decremented by C.kernel
       ^      .throttle_pending = 0
       |
       | parent
       |
    +------+  .h_kernel_running = C.kernel
  C |cfs_rq|  .h_user_running   = C.user
    +------+  .h_nr_running     = C.{kernel+user} <- Incremented by C.user
              .throttle_pending = 0

Could we get away with just one count, e.g. the user count and not the kernel
count?
Technically yes, we could follow this scheme: if (throttle_pending) =3D> kernel count :=3D h_nr_running - h_user_running else =3D> kernel count :=3D h_nr_running this however prevents any sort of assertion or sanity checking on the count= s, which I am not the biggest fan on - CFS group scheduling is enough of a hea= dache as it is. Signed-off-by: Valentin Schneider --- kernel/sched/fair.c | 174 ++++++++++++++++++++++++++++++++++++------- kernel/sched/sched.h | 2 + 2 files changed, 151 insertions(+), 25 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 60778afbff207..2b54d3813d18d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5785,17 +5785,48 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) struct rq *rq =3D rq_of(cfs_rq); struct cfs_bandwidth *cfs_b =3D tg_cfs_bandwidth(cfs_rq->tg); struct sched_entity *se; - long task_delta, idle_task_delta, kernel_delta, dequeue =3D 1; + long task_delta, idle_task_delta, kernel_delta, user_delta, dequeue =3D 1; + bool was_pending; =20 /* - * We don't actually throttle, though account() will have made sure to - * resched us so that we pick into a kernel task. + * We don't actually throttle just yet, though account_cfs_rq_runtime() + * will have made sure to resched us so that we pick into a kernel task. */ if (cfs_rq->h_kernel_running) { + if (cfs_rq->throttle_pending) + return false; + + /* + * From now on we're only going to pick tasks that are in the + * second tree. Reflect this by discounting tasks that aren't going + * to be pickable from the ->h_nr_running counts. + */ cfs_rq->throttle_pending =3D true; + + se =3D cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))]; + user_delta =3D cfs_rq->h_user_running; + cfs_rq->h_nr_running -=3D user_delta; + + for_each_sched_entity(se) { + struct cfs_rq *qcfs_rq =3D cfs_rq_of(se); + + if (!se->on_rq) + goto done; + + qcfs_rq->h_nr_running -=3D user_delta; + qcfs_rq->h_user_running -=3D user_delta; + + assert_cfs_rq_counts(qcfs_rq); + } return false; } =20 + /* + * Unlikely as it may be, we may only have user tasks as we hit the + * throttle, in which case we won't have discount them from the + * h_nr_running, and we need to be aware of that. + */ + was_pending =3D cfs_rq->throttle_pending; cfs_rq->throttle_pending =3D false; =20 raw_spin_lock(&cfs_b->lock); @@ -5826,9 +5857,27 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq); rcu_read_unlock(); =20 - task_delta =3D cfs_rq->h_nr_running; + /* + * At this point, h_nr_running =3D=3D h_kernel_running. We add back the + * h_user_running to the throttled cfs_rq, and only remove the difference + * to the upper cfs_rq's. + */ + if (was_pending) { + WARN_ON_ONCE(cfs_rq->h_nr_running !=3D cfs_rq->h_kernel_running); + cfs_rq->h_nr_running +=3D cfs_rq->h_user_running; + } else { + WARN_ON_ONCE(cfs_rq->h_nr_running !=3D cfs_rq->h_user_running); + } + + /* + * We always discount user tasks from h_nr_running when throttle_pending + * so only h_kernel_running remains to be removed + */ + task_delta =3D was_pending ? cfs_rq->h_kernel_running : cfs_rq->h_nr_runn= ing; idle_task_delta =3D cfs_rq->idle_h_nr_running; kernel_delta =3D cfs_rq->h_kernel_running; + user_delta =3D was_pending ? 
0 : cfs_rq->h_user_running; + for_each_sched_entity(se) { struct cfs_rq *qcfs_rq =3D cfs_rq_of(se); /* throttled entity or throttle-on-deactivate */ @@ -5843,6 +5892,8 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) qcfs_rq->h_nr_running -=3D task_delta; qcfs_rq->idle_h_nr_running -=3D idle_task_delta; dequeue_kernel(qcfs_rq, se, kernel_delta); + qcfs_rq->h_user_running -=3D user_delta; + =20 if (qcfs_rq->load.weight) { /* Avoid re-evaluating load for this entity: */ @@ -5866,6 +5917,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) qcfs_rq->h_nr_running -=3D task_delta; qcfs_rq->idle_h_nr_running -=3D idle_task_delta; dequeue_kernel(qcfs_rq, se, kernel_delta); + qcfs_rq->h_user_running -=3D user_delta; } =20 /* At this point se is NULL and we are at root level*/ @@ -5888,7 +5940,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) struct rq *rq =3D rq_of(cfs_rq); struct cfs_bandwidth *cfs_b =3D tg_cfs_bandwidth(cfs_rq->tg); struct sched_entity *se; - long task_delta, idle_task_delta, kernel_delta; + long task_delta, idle_task_delta, kernel_delta, user_delta; =20 se =3D cfs_rq->tg->se[cpu_of(rq)]; =20 @@ -5924,6 +5976,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) task_delta =3D cfs_rq->h_nr_running; idle_task_delta =3D cfs_rq->idle_h_nr_running; kernel_delta =3D cfs_rq->h_kernel_running; + user_delta =3D cfs_rq->h_user_running; for_each_sched_entity(se) { struct cfs_rq *qcfs_rq =3D cfs_rq_of(se); =20 @@ -5937,6 +5990,9 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) qcfs_rq->h_nr_running +=3D task_delta; qcfs_rq->idle_h_nr_running +=3D idle_task_delta; enqueue_kernel(qcfs_rq, se, kernel_delta); + qcfs_rq->h_user_running +=3D user_delta; + + assert_cfs_rq_counts(qcfs_rq); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(qcfs_rq)) @@ -5955,6 +6011,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) qcfs_rq->h_nr_running +=3D task_delta; qcfs_rq->idle_h_nr_running +=3D idle_task_delta; enqueue_kernel(qcfs_rq, se, kernel_delta); + qcfs_rq->h_user_running +=3D user_delta; =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(qcfs_rq)) @@ -6855,6 +6912,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *= p, int flags) int idle_h_nr_running =3D task_has_idle_policy(p); int task_new =3D !(flags & ENQUEUE_WAKEUP); bool kernel_task =3D is_kernel_task(p); + bool throttle_pending =3D false; =20 /* * The code below (indirectly) updates schedutil which looks at @@ -6878,13 +6936,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct= *p, int flags) cfs_rq =3D cfs_rq_of(se); enqueue_entity(cfs_rq, se, flags); =20 - cfs_rq->h_nr_running++; - cfs_rq->idle_h_nr_running +=3D idle_h_nr_running; =20 - if (cfs_rq_is_idle(cfs_rq)) - idle_h_nr_running =3D 1; + if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending)) + cfs_rq->h_nr_running++; if (kernel_task) enqueue_kernel(cfs_rq, se, 1); + else if (!throttle_pending) + cfs_rq->h_user_running++; + + throttle_pending |=3D cfs_rq->throttle_pending; + + cfs_rq->idle_h_nr_running +=3D idle_h_nr_running; + if (cfs_rq_is_idle(cfs_rq)) + idle_h_nr_running =3D 1; + =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -6900,13 +6965,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct= *p, int flags) se_update_runnable(se); update_cfs_group(se); =20 - cfs_rq->h_nr_running++; - cfs_rq->idle_h_nr_running +=3D idle_h_nr_running; =20 - if (cfs_rq_is_idle(cfs_rq)) - idle_h_nr_running =3D 1; + if (kernel_task || (!throttle_pending && 
!cfs_rq->throttle_pending)) + cfs_rq->h_nr_running++; if (kernel_task) enqueue_kernel(cfs_rq, se, 1); + else if (!throttle_pending) + cfs_rq->h_user_running++; + + throttle_pending |=3D cfs_rq->throttle_pending; + + cfs_rq->idle_h_nr_running +=3D idle_h_nr_running; + if (cfs_rq_is_idle(cfs_rq)) + idle_h_nr_running =3D 1; + =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -6957,6 +7029,7 @@ static void dequeue_task_fair(struct rq *rq, struct t= ask_struct *p, int flags) int idle_h_nr_running =3D task_has_idle_policy(p); bool was_sched_idle =3D sched_idle_rq(rq); bool kernel_task =3D !list_empty(&p->se.kernel_node); + bool throttle_pending =3D false; =20 util_est_dequeue(&rq->cfs, p); =20 @@ -6964,13 +7037,20 @@ static void dequeue_task_fair(struct rq *rq, struct= task_struct *p, int flags) cfs_rq =3D cfs_rq_of(se); dequeue_entity(cfs_rq, se, flags); =20 - cfs_rq->h_nr_running--; - cfs_rq->idle_h_nr_running -=3D idle_h_nr_running; =20 - if (cfs_rq_is_idle(cfs_rq)) - idle_h_nr_running =3D 1; + if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending)) + cfs_rq->h_nr_running--; if (kernel_task) dequeue_kernel(cfs_rq, se, 1); + else if (!throttle_pending) + cfs_rq->h_user_running--; + + throttle_pending |=3D cfs_rq->throttle_pending; + + cfs_rq->idle_h_nr_running -=3D idle_h_nr_running; + if (cfs_rq_is_idle(cfs_rq)) + idle_h_nr_running =3D 1; + =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -6998,13 +7078,20 @@ static void dequeue_task_fair(struct rq *rq, struct= task_struct *p, int flags) se_update_runnable(se); update_cfs_group(se); =20 - cfs_rq->h_nr_running--; - cfs_rq->idle_h_nr_running -=3D idle_h_nr_running; =20 - if (cfs_rq_is_idle(cfs_rq)) - idle_h_nr_running =3D 1; + if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending)) + cfs_rq->h_nr_running--; if (kernel_task) dequeue_kernel(cfs_rq, se, 1); + else if (!throttle_pending) + cfs_rq->h_user_running--; + + throttle_pending |=3D cfs_rq->throttle_pending; + + cfs_rq->idle_h_nr_running -=3D idle_h_nr_running; + if (cfs_rq_is_idle(cfs_rq)) + idle_h_nr_running =3D 1; + =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -8503,28 +8590,65 @@ static void check_preempt_wakeup_fair(struct rq *rq= , struct task_struct *p, int resched_curr(rq); } =20 +/* + * Consider: + * cfs_rq.kernel :=3D count of kernel *tasks* enqueued on this cfs_rq + * cfs_rq.user :=3D count of user *tasks* enqueued on this cfs_rq + * + * Then, the following logic is implemented: + * cfs_rq.h_kernel_running =3D Sum(child.kernel) for all child cfs_rq + * cfs_rq.h_user_running =3D Sum(child.user) for all child cfs_rq wi= th !child.throttle_pending + * cfs_rq.h_nr_running =3D Sum(child.kernel) for all child cfs_rq + * + Sum(child.user) for all child cfs_rq with !child.throttle_pe= nding + * + * IOW, count of kernel tasks is always propagated up the hierarchy, and c= ount + * of user tasks is only propagated up if the cfs_rq isn't .throttle_pendi= ng. + */ static void handle_kernel_task_prev(struct task_struct *prev) { #ifdef CONFIG_CFS_BANDWIDTH struct sched_entity *se =3D &prev->se; bool p_in_kernel =3D is_kernel_task(prev); bool p_in_kernel_tree =3D !list_empty(&se->kernel_node); + bool throttle_pending =3D false; /* * These extra loops are bad and against the whole point of the merged * PNT, but it's a pain to merge, particularly since we want it to occur * before check_cfs_runtime(). 
*/ if (p_in_kernel_tree && !p_in_kernel) { + /* Switch from KERNEL -> USER */ WARN_ON_ONCE(!se->on_rq); /* dequeue should have removed us */ + for_each_sched_entity(se) { - dequeue_kernel(cfs_rq_of(se), se, 1); - if (cfs_rq_throttled(cfs_rq_of(se))) + struct cfs_rq *cfs_rq =3D cfs_rq_of(se); + + if (throttle_pending || cfs_rq->throttle_pending) + cfs_rq->h_nr_running--; + dequeue_kernel(cfs_rq, se, 1); + if (!throttle_pending) + cfs_rq->h_user_running++; + + throttle_pending |=3D cfs_rq->throttle_pending; + + if (cfs_rq_throttled(cfs_rq)) break; } } else if (!p_in_kernel_tree && p_in_kernel && se->on_rq) { + /* Switch from USER -> KERNEL */ + for_each_sched_entity(se) { - enqueue_kernel(cfs_rq_of(se), se, 1); - if (cfs_rq_throttled(cfs_rq_of(se))) + struct cfs_rq *cfs_rq =3D cfs_rq_of(se); + + if (throttle_pending || cfs_rq->throttle_pending) + cfs_rq->h_nr_running++; + enqueue_kernel(cfs_rq, se, 1); + if (!throttle_pending) + cfs_rq->h_user_running--; + + throttle_pending |=3D cfs_rq->throttle_pending; + + if (cfs_rq_throttled(cfs_rq)) break; } } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 0b33ce2e60555..e8860e0d6fbc7 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -660,6 +660,8 @@ struct cfs_rq { int throttled; int throttle_count; int h_kernel_running; + int h_user_running; + int throttle_pending; struct list_head throttled_list; struct list_head throttled_csd_list; struct list_head kernel_children; --=20 2.43.0 From nobody Sat Feb 7 08:44:09 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8F5A3182C3 for ; Fri, 2 Feb 2024 08:10:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706861458; cv=none; b=gNgWZWtvJuCrDK5FspCJYCtc2B9OKoZsJ/l+iyzK+OlwoM158scm34rPouvSWmPZa0DiAkwBXlkVR1+SUUSn+HHioHlagQAZNmvBFoZUGtK6sT4zRCVOma4JE0ARLbatLVqozdpTt3ZB5ZtKLvZxa779T8GEbdDGjHTZPi5nOkQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706861458; c=relaxed/simple; bh=GKjGZ0nEYiWo/e0g4SCEgfAGOP3qsyIKHYvTEjh/1rg=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=RdB/Db2q69k17I6sSbW4eKgJuj6Nw5IFEJHBlO1Rjg2ScyrffzlkBP5+i+Y41j3hKur82idrr+2YCNx1ymdwkIDTzlM+MTeuHxkRN8havlKBCGhkz5mCut5qrMtCqYHCV0Emlvm9x7jsxIRzj27nsQrbvJBlixWWLv3i9timpNY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=W18EPO9K; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="W18EPO9K" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1706861455; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=kzFtBHR5zk59g+CzbXsW0xRKrMzCG41VI7+6LhlJ3kY=; 
b=W18EPO9Kjyaqv4DAb3oPrHqA6oxzwkQe2K3jGyhYIMP4/8YhfMdsGMjWvkOFjraKRKJyQd ugL34cqJy/ZMYsF1vdRtoDjhYSmhbVhM+ecAOyLLcwP4xpcrd0YUD2cK+d1OV/0oF9t+sp F5CV32/Fn89Q8YDj8tqukjNva3w85ks= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-135-VKghfcTmM0WgeKid_SqZiw-1; Fri, 02 Feb 2024 03:10:51 -0500 X-MC-Unique: VKghfcTmM0WgeKid_SqZiw-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.rdu2.redhat.com [10.11.54.8]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 6132A3C0ED5E; Fri, 2 Feb 2024 08:10:51 +0000 (UTC) Received: from vschneid-thinkpadt14sgen2i.remote.csb (unknown [10.39.193.2]) by smtp.corp.redhat.com (Postfix) with ESMTPS id C2AB6C2590D; Fri, 2 Feb 2024 08:10:48 +0000 (UTC) From: Valentin Schneider To: linux-kernel@vger.kernel.org Cc: Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Daniel Bristot de Oliveira , Phil Auld , Clark Williams , Tomas Glozar Subject: [RFC PATCH v2 5/5] sched/fair: Assert user/kernel/total nr invariants Date: Fri, 2 Feb 2024 09:09:20 +0100 Message-ID: <20240202080920.3337862-6-vschneid@redhat.com> In-Reply-To: <20240202080920.3337862-1-vschneid@redhat.com> References: <20240202080920.3337862-1-vschneid@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.8 Content-Type: text/plain; charset="utf-8" Previous commits have added .h_kernel_running and .h_user_running to struct cfs_rq, and are using them to play games with the hierarchical .h_nr_running. Assert some count invariants under SCHED_DEBUG to improve debugging. 
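For reference, the asserted relation can be restated as a single boolean. This is only a sketch: cfs_rq_counts_ok() is a made-up helper, the patch itself encodes the same two conditions as SCHED_WARN_ON()s in assert_cfs_rq_counts() below.

  static inline bool cfs_rq_counts_ok(struct cfs_rq *cfs_rq)
  {
  	/* throttle_pending: only kernel tasks are still pickable */
  	if (cfs_rq->throttle_pending)
  		return cfs_rq->h_kernel_running == cfs_rq->h_nr_running;

  	/* normal operation: every queued task is pickable */
  	return cfs_rq->h_kernel_running + cfs_rq->h_user_running ==
  	       cfs_rq->h_nr_running;
  }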
Signed-off-by: Valentin Schneider --- kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2b54d3813d18d..52d0ee0e4d47c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5780,6 +5780,30 @@ static int tg_throttle_down(struct task_group *tg, v= oid *data) static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se,= int count); static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se,= int count); =20 +#ifdef CONFIG_CFS_BANDWIDTH +static inline void assert_cfs_rq_counts(struct cfs_rq *cfs_rq) +{ + lockdep_assert_rq_held(rq_of(cfs_rq)); + + /* + * When !throttle_pending, this is the normal operating mode, all tasks + * are pickable, so: + * nr_kernel_tasks + nr_user_tasks =3D=3D nr_pickable_tasks + */ + SCHED_WARN_ON(!cfs_rq->throttle_pending && + (cfs_rq->h_kernel_running + cfs_rq->h_user_running !=3D + cfs_rq->h_nr_running)); + /* + * When throttle_pending, only kernel tasks are pickable, so: + * nr_kernel_tasks =3D=3D nr_pickable_tasks + */ + SCHED_WARN_ON(cfs_rq->throttle_pending && + (cfs_rq->h_kernel_running !=3D cfs_rq->h_nr_running)); +} +#else +static inline void assert_cfs_rq_counts(struct cfs_rq *cfs_rq) { } +#endif + static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) { struct rq *rq =3D rq_of(cfs_rq); @@ -5894,6 +5918,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) dequeue_kernel(qcfs_rq, se, kernel_delta); qcfs_rq->h_user_running -=3D user_delta; =20 + assert_cfs_rq_counts(qcfs_rq); =20 if (qcfs_rq->load.weight) { /* Avoid re-evaluating load for this entity: */ @@ -5918,6 +5943,8 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) qcfs_rq->idle_h_nr_running -=3D idle_task_delta; dequeue_kernel(qcfs_rq, se, kernel_delta); qcfs_rq->h_user_running -=3D user_delta; + + assert_cfs_rq_counts(qcfs_rq); } =20 /* At this point se is NULL and we are at root level*/ @@ -6013,6 +6040,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) enqueue_kernel(qcfs_rq, se, kernel_delta); qcfs_rq->h_user_running +=3D user_delta; =20 + assert_cfs_rq_counts(qcfs_rq); + /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(qcfs_rq)) goto unthrottle_throttle; @@ -6950,6 +6979,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *= p, int flags) if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running =3D 1; =20 + assert_cfs_rq_counts(cfs_rq); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -6965,6 +6995,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *= p, int flags) se_update_runnable(se); update_cfs_group(se); =20 + assert_cfs_rq_counts(cfs_rq); =20 if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending)) cfs_rq->h_nr_running++; @@ -6979,6 +7010,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *= p, int flags) if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running =3D 1; =20 + assert_cfs_rq_counts(cfs_rq); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -7051,6 +7083,7 @@ static void dequeue_task_fair(struct rq *rq, struct t= ask_struct *p, int flags) if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running =3D 1; =20 + assert_cfs_rq_counts(cfs_rq); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -7092,6 +7125,7 @@ static void dequeue_task_fair(struct rq *rq, struct t= ask_struct *p, int flags) if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running =3D 1; =20 + assert_cfs_rq_counts(cfs_rq); =20 /* end 
evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -8631,6 +8665,8 @@ static void handle_kernel_task_prev(struct task_struc= t *prev) =20 throttle_pending |=3D cfs_rq->throttle_pending; =20 + assert_cfs_rq_counts(cfs_rq); + if (cfs_rq_throttled(cfs_rq)) break; } @@ -8648,6 +8684,8 @@ static void handle_kernel_task_prev(struct task_struc= t *prev) =20 throttle_pending |=3D cfs_rq->throttle_pending; =20 + assert_cfs_rq_counts(cfs_rq); + if (cfs_rq_throttled(cfs_rq)) break; } --=20 2.43.0
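As a closing illustration of the counting rule used throughout this series (kernel tasks are always propagated up the hierarchy, user tasks stop being counted as pickable at and above a throttle_pending level), here is a small stand-alone C model. It is only a sketch: toy_cfs_rq and toy_enqueue are made-up names, and the code mirrors the enqueue-side accounting rather than the actual kernel implementation.

  #include <stdbool.h>
  #include <stdio.h>

  struct toy_cfs_rq {
  	int h_nr_running;	/* tasks pickable from this level */
  	int h_kernel_running;	/* tasks currently in kernel space */
  	int h_user_running;	/* user tasks counted below any pending level */
  	bool throttle_pending;	/* quota exhausted, waiting for return to userspace */
  };

  /* Enqueue one task along a bottom-up hierarchy path[0..depth-1]. */
  static void toy_enqueue(struct toy_cfs_rq **path, int depth, bool kernel_task)
  {
  	bool throttle_pending = false;

  	for (int i = 0; i < depth; i++) {
  		struct toy_cfs_rq *cfs_rq = path[i];

  		/* user tasks stop counting as pickable once a pending level is crossed */
  		if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
  			cfs_rq->h_nr_running++;
  		if (kernel_task)
  			cfs_rq->h_kernel_running++;
  		else if (!throttle_pending)
  			cfs_rq->h_user_running++;

  		throttle_pending |= cfs_rq->throttle_pending;
  	}
  }

  int main(void)
  {
  	struct toy_cfs_rq child = { 0 }, root = { 0 };
  	struct toy_cfs_rq *path[] = { &child, &root };

  	child.throttle_pending = true;	/* child group ran out of quota */

  	toy_enqueue(path, 2, true);	/* kernel task: counted everywhere */
  	toy_enqueue(path, 2, false);	/* user task: not pickable, not propagated */

  	printf("child: nr=%d kernel=%d user=%d\n",
  	       child.h_nr_running, child.h_kernel_running, child.h_user_running);
  	printf("root:  nr=%d kernel=%d user=%d\n",
  	       root.h_nr_running, root.h_kernel_running, root.h_user_running);
  	return 0;
  }

Running this prints child: nr=1 kernel=1 user=1 and root: nr=1 kernel=1 user=0, which satisfies the invariants asserted in patch 5/5: kernel == nr on the throttle_pending child, kernel + user == nr on the root.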