From nobody Sat Feb 7 08:44:09 2026
From: Valentin Schneider
To: linux-kernel@vger.kernel.org
Cc: Benjamin Segall, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
 Dietmar Eggemann, Steven Rostedt, Mel Gorman, Daniel Bristot de Oliveira,
 Phil Auld, Clark Williams, Tomas Glozar
Subject: [RFC PATCH v2 1/5] sched/fair: Only throttle CFS tasks on return to userspace
Date: Fri, 2 Feb 2024 09:09:16 +0100
Message-ID: <20240202080920.3337862-2-vschneid@redhat.com>
In-Reply-To:
 <20240202080920.3337862-1-vschneid@redhat.com>
References: <20240202080920.3337862-1-vschneid@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Benjamin Segall

The basic idea of this implementation is to maintain duplicate runqueues
in each cfs_rq that contain duplicate pointers to sched_entitys which
should bypass throttling. Then we can skip throttling cfs_rqs that have
any such children, and when we pick inside any not-actually-throttled
cfs_rq, we only look at this duplicated list.

"Which tasks should bypass throttling" here is "all schedule() calls
that don't set a special flag", but could instead involve the lockdep
markers (except for the problem of percpu-rwsem and similar) or explicit
flags around syscalls and faults, or something else.

This approach avoids any O(tasks) loops, but leaves partially-throttled
cfs_rqs still contributing their full h_nr_running to their parents,
which might result in worse balancing. Also it adds more (generally
still small) overhead to the common enqueue/dequeue/pick paths.

The very basic debug test added is to run a cpusoaker and
"cat /sys/kernel/debug/sched_locked_spin" pinned to the same cpu in the
same cgroup with a quota < 1 cpu.

Not-signed-off-by: Benjamin Segall
[Slight comment / naming changes]
Signed-off-by: Valentin Schneider
---
 include/linux/sched.h |   7 ++
 kernel/entry/common.c |   2 +-
 kernel/entry/kvm.c    |   2 +-
 kernel/sched/core.c   |  20 ++++
 kernel/sched/debug.c  |  28 +++++
 kernel/sched/fair.c   | 232 ++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h  |   3 +
 7 files changed, 281 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 03bfe9ab29511..4a0105d1eaa21 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -303,6 +303,7 @@ extern long schedule_timeout_killable(long timeout);
 extern long schedule_timeout_uninterruptible(long timeout);
 extern long schedule_timeout_idle(long timeout);
 asmlinkage void schedule(void);
+asmlinkage void schedule_usermode(void);
 extern void schedule_preempt_disabled(void);
 asmlinkage void preempt_schedule_irq(void);
 #ifdef CONFIG_PREEMPT_RT
@@ -553,6 +554,9 @@ struct sched_entity {
 	struct cfs_rq		*my_q;
 	/* cached value of my_q->h_nr_running */
 	unsigned long		runnable_weight;
+#ifdef CONFIG_CFS_BANDWIDTH
+	struct list_head	kernel_node;
+#endif
 #endif

 #ifdef CONFIG_SMP
@@ -1539,6 +1543,9 @@ struct task_struct {
 	struct user_event_mm		*user_event_mm;
 #endif

+#ifdef CONFIG_CFS_BANDWIDTH
+	atomic_t			in_return_to_user;
+#endif
 	/*
	 * New fields for task_struct should be added above here, so that
	 * they are included in the randomized portion of task_struct.
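
To make the "duplicate list" idea above more concrete, here is a minimal
user-space sketch (not kernel code) of the idiom the patch relies on: an
entity's kernel_node stays empty (pointing at itself) while the entity is not
bypassing throttling, so the emptiness test doubles as the membership flag,
and a cfs_rq that has run out of runtime only picks from its kernel_children
list. The toy node type and the simplified pick() standing in for pick_eevdf()
are assumptions made for brevity.

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy stand-in for the kernel's list_head: an "empty" node points at itself. */
struct node { struct node *prev, *next; };

static void node_init(struct node *n)        { n->prev = n->next = n; }
static bool node_empty(const struct node *n) { return n->next == n; }

static void node_add(struct node *n, struct node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Simplified sched_entity / cfs_rq with only the fields this series adds. */
struct entity {
	const char *name;
	struct node kernel_node;     /* empty <=> not bypassing throttling */
};

struct toy_cfs_rq {
	struct node kernel_children; /* entities allowed to run past the quota */
	int h_kernel_running;
	bool out_of_runtime;         /* "would have been throttled" */
};

/* Rough analogue of enqueue_kernel(): list emptiness is the membership test. */
static void mark_in_kernel(struct toy_cfs_rq *rq, struct entity *e)
{
	if (node_empty(&e->kernel_node))
		node_add(&e->kernel_node, &rq->kernel_children);
	rq->h_kernel_running++;
}

/* Rough analogue of the modified pick: out of runtime => kernel-side only. */
static struct entity *pick(struct toy_cfs_rq *rq, struct entity *fair_choice)
{
	if (rq->out_of_runtime && !node_empty(&rq->kernel_children)) {
		/* container_of-style recovery of the first bypass entity */
		char *first = (char *)rq->kernel_children.next;
		return (struct entity *)(first - offsetof(struct entity, kernel_node));
	}
	return fair_choice; /* whatever the fair pick would normally return */
}

int main(void)
{
	struct toy_cfs_rq rq = { .h_kernel_running = 0, .out_of_runtime = false };
	struct entity user = { .name = "user-soaker" }, kern = { .name = "in-kernel" };

	node_init(&rq.kernel_children);
	node_init(&user.kernel_node);
	node_init(&kern.kernel_node);

	mark_in_kernel(&rq, &kern);

	printf("runtime left:   pick %s\n", pick(&rq, &user)->name); /* user-soaker */
	rq.out_of_runtime = true;
	printf("quota exceeded: pick %s\n", pick(&rq, &user)->name); /* in-kernel */
	assert(rq.h_kernel_running == 1);
	return 0;
}

The real patch clears the flag with list_del_init() and reaches the normal
path through pick_eevdf(); the sketch only shows the data-structure trick,
not the fairness logic.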
diff --git a/kernel/entry/common.c b/kernel/entry/common.c index d7ee4bc3f2ba3..16b5432a62c6f 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -156,7 +156,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_r= egs *regs, local_irq_enable_exit_to_user(ti_work); =20 if (ti_work & _TIF_NEED_RESCHED) - schedule(); + schedule_usermode(); /* TODO: also all of the arch/ loops that don't us= e this yet */ =20 if (ti_work & _TIF_UPROBE) uprobe_notify_resume(regs); diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c index 2e0f75bcb7fd1..fc4b73de07539 100644 --- a/kernel/entry/kvm.c +++ b/kernel/entry/kvm.c @@ -14,7 +14,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu,= unsigned long ti_work) } =20 if (ti_work & _TIF_NEED_RESCHED) - schedule(); + schedule_usermode(); =20 if (ti_work & _TIF_NOTIFY_RESUME) resume_user_mode_work(NULL); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index db4be4921e7f0..a7c028fad5a89 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4529,6 +4529,10 @@ static void __sched_fork(unsigned long clone_flags, = struct task_struct *p) #ifdef CONFIG_FAIR_GROUP_SCHED p->se.cfs_rq =3D NULL; #endif +#ifdef CONFIG_CFS_BANDWIDTH + INIT_LIST_HEAD(&p->se.kernel_node); + atomic_set(&p->in_return_to_user, 0); +#endif =20 #ifdef CONFIG_SCHEDSTATS /* Even if schedstat is disabled, there should not be garbage */ @@ -6818,6 +6822,22 @@ asmlinkage __visible void __sched schedule(void) } EXPORT_SYMBOL(schedule); =20 +asmlinkage __visible void __sched schedule_usermode(void) +{ +#ifdef CONFIG_CFS_BANDWIDTH + /* + * This is only atomic because of this simple implementation. We could + * do something with an SM_USER to avoid other-cpu scheduler operations + * racing against these writes. + */ + atomic_set(¤t->in_return_to_user, true); + schedule(); + atomic_set(¤t->in_return_to_user, false); +#else + schedule(); +#endif +} + /* * synchronize_rcu_tasks() makes sure that no task is stuck in preempted * state (have scheduled out non-voluntarily) by making sure that all diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 8d5d98a5834df..4a89dbc3ddfcd 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -319,6 +319,32 @@ static const struct file_operations sched_verbose_fops= =3D { .llseek =3D default_llseek, }; =20 +static DEFINE_MUTEX(sched_debug_spin_mutex); +static int sched_debug_spin_show(struct seq_file *m, void *v) { + int count; + mutex_lock(&sched_debug_spin_mutex); + for (count =3D 0; count < 1000; count++) { + u64 start2; + start2 =3D jiffies; + while (jiffies =3D=3D start2) + cpu_relax(); + schedule(); + } + mutex_unlock(&sched_debug_spin_mutex); + return 0; +} +static int sched_debug_spin_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, sched_debug_spin_show, NULL); +} + +static const struct file_operations sched_debug_spin_fops =3D { + .open =3D sched_debug_spin_open, + .read =3D seq_read, + .llseek =3D seq_lseek, + .release =3D single_release, +}; + static const struct seq_operations sched_debug_sops; =20 static int sched_debug_open(struct inode *inode, struct file *filp) @@ -374,6 +400,8 @@ static __init int sched_init_debug(void) =20 debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops= ); =20 + debugfs_create_file("sched_locked_spin", 0444, NULL, NULL, + &sched_debug_spin_fops); return 0; } late_initcall(sched_init_debug); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b803030c3a037..a1808459a5acc 100644 --- a/kernel/sched/fair.c +++ 
b/kernel/sched/fair.c @@ -128,6 +128,7 @@ int __weak arch_asym_cpu_priority(int cpu) * (default: 5 msec, units: microseconds) */ static unsigned int sysctl_sched_cfs_bandwidth_slice =3D 5000UL; +static unsigned int sysctl_sched_cfs_bandwidth_kernel_bypass =3D 1; #endif =20 #ifdef CONFIG_NUMA_BALANCING @@ -146,6 +147,15 @@ static struct ctl_table sched_fair_sysctls[] =3D { .proc_handler =3D proc_dointvec_minmax, .extra1 =3D SYSCTL_ONE, }, + { + .procname =3D "sched_cfs_bandwidth_kernel_bypass", + .data =3D &sysctl_sched_cfs_bandwidth_kernel_bypass, + .maxlen =3D sizeof(unsigned int), + .mode =3D 0644, + .proc_handler =3D proc_dointvec_minmax, + .extra1 =3D SYSCTL_ZERO, + .extra2 =3D SYSCTL_ONE, + }, #endif #ifdef CONFIG_NUMA_BALANCING { @@ -5445,14 +5455,34 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched= _entity *se) =20 /* * Pick the next process, keeping these things in mind, in this order: - * 1) keep things fair between processes/task groups - * 2) pick the "next" process, since someone really wants that to run - * 3) pick the "last" process, for cache locality - * 4) do not run the "skip" process, if something else is available + * 1) If we're inside a throttled cfs_rq, only pick threads in the kernel + * 2) keep things fair between processes/task groups + * 3) pick the "next" process, since someone really wants that to run + * 4) pick the "last" process, for cache locality + * 5) do not run the "skip" process, if something else is available */ static struct sched_entity * -pick_next_entity(struct cfs_rq *cfs_rq) +pick_next_entity(struct cfs_rq *cfs_rq, bool throttled) { +#ifdef CONFIG_CFS_BANDWIDTH + /* + * TODO: This might trigger, I'm not sure/don't remember. Regardless, + * while we do not explicitly handle the case where h_kernel_running + * goes to 0, we will call account/check_cfs_rq_runtime at worst in + * entity_tick and notice that we can now properly do the full + * throttle_cfs_rq. + */ + WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children)); + if (throttled && !list_empty(&cfs_rq->kernel_children)) { + /* + * TODO: you'd want to factor out pick_eevdf to just take + * tasks_timeline, and replace this list with a second rbtree + * and a call to pick_eevdf. + */ + return list_first_entry(&cfs_rq->kernel_children, + struct sched_entity, kernel_node); + } +#endif /* * Enabling NEXT_BUDDY will affect latency but not fairness. */ @@ -5651,8 +5681,14 @@ static void __account_cfs_rq_runtime(struct cfs_rq *= cfs_rq, u64 delta_exec) /* * if we're unable to extend our runtime we resched so that the active * hierarchy can be throttled + * + * Don't resched_curr() if curr is in the kernel. We won't throttle the + * cfs_rq if any task is in the kernel, and if curr in particular is we + * don't need to preempt it in favor of whatever other task is in the + * kernel. 
*/ - if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr)) + if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr) && + list_empty(&rq_of(cfs_rq)->curr->se.kernel_node)) resched_curr(rq_of(cfs_rq)); } =20 @@ -5741,12 +5777,22 @@ static int tg_throttle_down(struct task_group *tg, = void *data) return 0; } =20 +static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se,= int count); +static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se,= int count); + static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) { struct rq *rq =3D rq_of(cfs_rq); struct cfs_bandwidth *cfs_b =3D tg_cfs_bandwidth(cfs_rq->tg); struct sched_entity *se; - long task_delta, idle_task_delta, dequeue =3D 1; + long task_delta, idle_task_delta, kernel_delta, dequeue =3D 1; + + /* + * We don't actually throttle, though account() will have made sure to + * resched us so that we pick into a kernel task. + */ + if (cfs_rq->h_kernel_running) + return false; =20 raw_spin_lock(&cfs_b->lock); /* This will start the period timer if necessary */ @@ -5778,6 +5824,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) =20 task_delta =3D cfs_rq->h_nr_running; idle_task_delta =3D cfs_rq->idle_h_nr_running; + kernel_delta =3D cfs_rq->h_kernel_running; for_each_sched_entity(se) { struct cfs_rq *qcfs_rq =3D cfs_rq_of(se); /* throttled entity or throttle-on-deactivate */ @@ -5791,6 +5838,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) =20 qcfs_rq->h_nr_running -=3D task_delta; qcfs_rq->idle_h_nr_running -=3D idle_task_delta; + dequeue_kernel(qcfs_rq, se, kernel_delta); =20 if (qcfs_rq->load.weight) { /* Avoid re-evaluating load for this entity: */ @@ -5813,6 +5861,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) =20 qcfs_rq->h_nr_running -=3D task_delta; qcfs_rq->idle_h_nr_running -=3D idle_task_delta; + dequeue_kernel(qcfs_rq, se, kernel_delta); } =20 /* At this point se is NULL and we are at root level*/ @@ -5835,7 +5884,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) struct rq *rq =3D rq_of(cfs_rq); struct cfs_bandwidth *cfs_b =3D tg_cfs_bandwidth(cfs_rq->tg); struct sched_entity *se; - long task_delta, idle_task_delta; + long task_delta, idle_task_delta, kernel_delta; =20 se =3D cfs_rq->tg->se[cpu_of(rq)]; =20 @@ -5870,6 +5919,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) =20 task_delta =3D cfs_rq->h_nr_running; idle_task_delta =3D cfs_rq->idle_h_nr_running; + kernel_delta =3D cfs_rq->h_kernel_running; for_each_sched_entity(se) { struct cfs_rq *qcfs_rq =3D cfs_rq_of(se); =20 @@ -5882,6 +5932,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) =20 qcfs_rq->h_nr_running +=3D task_delta; qcfs_rq->idle_h_nr_running +=3D idle_task_delta; + enqueue_kernel(qcfs_rq, se, kernel_delta); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(qcfs_rq)) @@ -5899,6 +5950,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) =20 qcfs_rq->h_nr_running +=3D task_delta; qcfs_rq->idle_h_nr_running +=3D idle_task_delta; + enqueue_kernel(qcfs_rq, se, kernel_delta); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(qcfs_rq)) @@ -6557,6 +6609,86 @@ static void sched_fair_update_stop_tick(struct rq *r= q, struct task_struct *p) } #endif =20 +/* + * We keep track of all children that are runnable in the kernel with a co= unt of + * all descendants. The state is checked on enqueue and put_prev (and hard + * cleared on dequeue), and is stored just as the filled/empty state of the + * kernel_node list entry. 
+ * + * These are simple helpers that do both parts, and should be called botto= m-up + * until hitting a throttled cfs_rq whenever a task changes state (or a cf= s_rq + * is (un)throttled). + */ +static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se,= int count) +{ + if (count =3D=3D 0) + return; + + if (list_empty(&se->kernel_node)) + list_add(&se->kernel_node, &cfs_rq->kernel_children); + cfs_rq->h_kernel_running +=3D count; +} + +static bool is_kernel_task(struct task_struct *p) +{ + return sysctl_sched_cfs_bandwidth_kernel_bypass && !atomic_read(&p->in_re= turn_to_user); +} + +/* + * When called on a task this always transitions it to a !kernel state. + * + * When called on a group it is just synchronizing the state with the new + * h_kernel_waiters, unless this it has been throttled and is !on_rq + */ +static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se,= int count) +{ + if (count =3D=3D 0) + return; + + if (!se->on_rq || entity_is_task(se) || + !group_cfs_rq(se)->h_kernel_running) + list_del_init(&se->kernel_node); + cfs_rq->h_kernel_running -=3D count; +} + +/* + * Returns if the cfs_rq "should" be throttled but might not be because of + * kernel threads bypassing throttle. + */ +static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq) +{ + if (!cfs_bandwidth_used()) + return false; + + if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0)) + return false; + return true; +} + +static void unthrottle_on_enqueue(struct task_struct *p) +{ + struct sched_entity *se =3D &p->se; + + if (!cfs_bandwidth_used() || !sysctl_sched_cfs_bandwidth_kernel_bypass) + return; + if (!cfs_rq_of(&p->se)->throttle_count) + return; + + /* + * MAYBE TODO: doing it this simple way is O(throttle_count * + * cgroup_depth). We could optimize that into a single pass, but making + * a mostly-copy of unthrottle_cfs_rq that does that is a pain and easy + * to get wrong. 
(And even without unthrottle_on_enqueue it's O(nm), + * just not while holding rq->lock the whole time) + */ + + for_each_sched_entity(se) { + struct cfs_rq *cfs_rq =3D cfs_rq_of(se); + if (cfs_rq->throttled) + unthrottle_cfs_rq(cfs_rq); + } +} + #else /* CONFIG_CFS_BANDWIDTH */ =20 static inline bool cfs_bandwidth_used(void) @@ -6604,6 +6736,16 @@ bool cfs_task_bw_constrained(struct task_struct *p) return false; } #endif +static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se,= int count) {} +static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se,= int count) {} +static inline bool is_kernel_task(struct task_struct *p) +{ + return false; +} +static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq) +{ + return false; +} #endif /* CONFIG_CFS_BANDWIDTH */ =20 #if !defined(CONFIG_CFS_BANDWIDTH) || !defined(CONFIG_NO_HZ_FULL) @@ -6707,6 +6849,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *= p, int flags) struct sched_entity *se =3D &p->se; int idle_h_nr_running =3D task_has_idle_policy(p); int task_new =3D !(flags & ENQUEUE_WAKEUP); + bool kernel_task =3D is_kernel_task(p); =20 /* * The code below (indirectly) updates schedutil which looks at @@ -6735,6 +6878,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *= p, int flags) =20 if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running =3D 1; + if (kernel_task) + enqueue_kernel(cfs_rq, se, 1); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -6755,6 +6900,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *= p, int flags) =20 if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running =3D 1; + if (kernel_task) + enqueue_kernel(cfs_rq, se, 1); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -6785,6 +6932,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *= p, int flags) assert_list_leaf_cfs_rq(rq); =20 hrtick_update(rq); + + if (kernel_task) + unthrottle_on_enqueue(p); } =20 static void set_next_buddy(struct sched_entity *se); @@ -6801,6 +6951,7 @@ static void dequeue_task_fair(struct rq *rq, struct t= ask_struct *p, int flags) int task_sleep =3D flags & DEQUEUE_SLEEP; int idle_h_nr_running =3D task_has_idle_policy(p); bool was_sched_idle =3D sched_idle_rq(rq); + bool kernel_task =3D !list_empty(&p->se.kernel_node); =20 util_est_dequeue(&rq->cfs, p); =20 @@ -6813,6 +6964,8 @@ static void dequeue_task_fair(struct rq *rq, struct t= ask_struct *p, int flags) =20 if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running =3D 1; + if (kernel_task) + dequeue_kernel(cfs_rq, se, 1); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -6845,6 +6998,8 @@ static void dequeue_task_fair(struct rq *rq, struct t= ask_struct *p, int flags) =20 if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running =3D 1; + if (kernel_task) + dequeue_kernel(cfs_rq, se, 1); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -8343,11 +8498,40 @@ static void check_preempt_wakeup_fair(struct rq *rq= , struct task_struct *p, int resched_curr(rq); } =20 +static void handle_kernel_task_prev(struct task_struct *prev) +{ +#ifdef CONFIG_CFS_BANDWIDTH + struct sched_entity *se =3D &prev->se; + bool p_in_kernel =3D is_kernel_task(prev); + bool p_in_kernel_tree =3D !list_empty(&se->kernel_node); + /* + * These extra loops are bad and against the whole point of the merged + * PNT, but it's a pain to merge, particularly since we want it to occur + * before check_cfs_runtime(). 
+ */ + if (p_in_kernel_tree && !p_in_kernel) { + WARN_ON_ONCE(!se->on_rq); /* dequeue should have removed us */ + for_each_sched_entity(se) { + dequeue_kernel(cfs_rq_of(se), se, 1); + if (cfs_rq_throttled(cfs_rq_of(se))) + break; + } + } else if (!p_in_kernel_tree && p_in_kernel && se->on_rq) { + for_each_sched_entity(se) { + enqueue_kernel(cfs_rq_of(se), se, 1); + if (cfs_rq_throttled(cfs_rq_of(se))) + break; + } + } +#endif +} + #ifdef CONFIG_SMP static struct task_struct *pick_task_fair(struct rq *rq) { struct sched_entity *se; struct cfs_rq *cfs_rq; + bool throttled =3D false; =20 again: cfs_rq =3D &rq->cfs; @@ -8368,7 +8552,10 @@ static struct task_struct *pick_task_fair(struct rq = *rq) goto again; } =20 - se =3D pick_next_entity(cfs_rq); + if (cfs_rq_throttled_loose(cfs_rq)) + throttled =3D true; + + se =3D pick_next_entity(cfs_rq, throttled); cfs_rq =3D group_cfs_rq(se); } while (cfs_rq); =20 @@ -8383,6 +8570,14 @@ pick_next_task_fair(struct rq *rq, struct task_struc= t *prev, struct rq_flags *rf struct sched_entity *se; struct task_struct *p; int new_tasks; + bool throttled; + + /* + * We want to handle this before check_cfs_runtime(prev). We'll + * duplicate a little work in the goto simple case, but that's fine + */ + if (prev) + handle_kernel_task_prev(prev); =20 again: if (!sched_fair_runnable(rq)) @@ -8400,6 +8595,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct= *prev, struct rq_flags *rf * hierarchy, only change the part that actually changes. */ =20 + throttled =3D false; do { struct sched_entity *curr =3D cfs_rq->curr; =20 @@ -8431,7 +8627,10 @@ pick_next_task_fair(struct rq *rq, struct task_struc= t *prev, struct rq_flags *rf } } =20 - se =3D pick_next_entity(cfs_rq); + if (cfs_rq_throttled_loose(cfs_rq)) + throttled =3D true; + + se =3D pick_next_entity(cfs_rq, throttled); cfs_rq =3D group_cfs_rq(se); } while (cfs_rq); =20 @@ -8469,8 +8668,11 @@ pick_next_task_fair(struct rq *rq, struct task_struc= t *prev, struct rq_flags *rf if (prev) put_prev_task(rq, prev); =20 + throttled =3D false; do { - se =3D pick_next_entity(cfs_rq); + if (cfs_rq_throttled_loose(cfs_rq)) + throttled =3D true; + se =3D pick_next_entity(cfs_rq, throttled); set_next_entity(cfs_rq, se); cfs_rq =3D group_cfs_rq(se); } while (cfs_rq); @@ -8534,6 +8736,8 @@ static void put_prev_task_fair(struct rq *rq, struct = task_struct *prev) struct sched_entity *se =3D &prev->se; struct cfs_rq *cfs_rq; =20 + handle_kernel_task_prev(prev); + for_each_sched_entity(se) { cfs_rq =3D cfs_rq_of(se); put_prev_entity(cfs_rq, se); @@ -12818,6 +13022,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq) #ifdef CONFIG_SMP raw_spin_lock_init(&cfs_rq->removed.lock); #endif +#ifdef CONFIG_CFS_BANDWIDTH + INIT_LIST_HEAD(&cfs_rq->kernel_children); +#endif } =20 #ifdef CONFIG_FAIR_GROUP_SCHED @@ -12970,6 +13177,9 @@ void init_tg_cfs_entry(struct task_group *tg, struc= t cfs_rq *cfs_rq, /* guarantee group entities always have weight */ update_load_set(&se->load, NICE_0_LOAD); se->parent =3D parent; +#ifdef CONFIG_CFS_BANDWIDTH + INIT_LIST_HEAD(&se->kernel_node); +#endif } =20 static DEFINE_MUTEX(shares_mutex); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index e58a54bda77de..0b33ce2e60555 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -580,6 +580,7 @@ struct cfs_rq { =20 struct rb_root_cached tasks_timeline; =20 + /* * 'curr' points to currently running entity on this cfs_rq. * It is set to NULL otherwise (i.e when none are currently running). 
@@ -658,8 +659,10 @@ struct cfs_rq {
 	u64			throttled_clock_self_time;
 	int			throttled;
 	int			throttle_count;
+	int			h_kernel_running;
 	struct list_head	throttled_list;
 	struct list_head	throttled_csd_list;
+	struct list_head	kernel_children;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 };
-- 
2.43.0

From nobody Sat Feb 7 08:44:09 2026
From: Valentin Schneider
To: linux-kernel@vger.kernel.org
Cc: Benjamin Segall , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar
Eggemann , Steven Rostedt , Mel Gorman , Daniel Bristot de Oliveira , Phil Auld , Clark Williams , Tomas Glozar Subject: [RFC PATCH v2 2/5] sched: Note schedule() invocations at return-to-user with SM_USER Date: Fri, 2 Feb 2024 09:09:17 +0100 Message-ID: <20240202080920.3337862-3-vschneid@redhat.com> In-Reply-To: <20240202080920.3337862-1-vschneid@redhat.com> References: <20240202080920.3337862-1-vschneid@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.8 Content-Type: text/plain; charset="utf-8" task_struct.in_return_to_user is currently updated via atomic operations in schedule_usermode(). However, one can note: o .in_return_to_user is only updated for the current task o There are no remote (smp_processor_id() !=3D task_cpu(p)) accesses to .in_return_to_user Add schedule_with_mode() to factorize schedule() with different flags to pass down to __schedule_loop(). Add SM_USER to denote schedule() calls from return-to-userspace points. Update .in_return_to_user from within the preemption-disabled, rq_lock-held part of __schedule(). Suggested-by: Benjamin Segall Signed-off-by: Valentin Schneider --- include/linux/sched.h | 2 +- kernel/sched/core.c | 43 ++++++++++++++++++++++++++++++++----------- kernel/sched/fair.c | 17 ++++++++++++++++- 3 files changed, 49 insertions(+), 13 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 4a0105d1eaa21..1b6f17b2150a6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1544,7 +1544,7 @@ struct task_struct { #endif =20 #ifdef CONFIG_CFS_BANDWIDTH - atomic_t in_return_to_user; + int in_return_to_user; #endif /* * New fields for task_struct should be added above here, so that diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a7c028fad5a89..54e6690626b13 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4531,7 +4531,7 @@ static void __sched_fork(unsigned long clone_flags, s= truct task_struct *p) #endif #ifdef CONFIG_CFS_BANDWIDTH INIT_LIST_HEAD(&p->se.kernel_node); - atomic_set(&p->in_return_to_user, 0); + p->in_return_to_user =3D false; #endif =20 #ifdef CONFIG_SCHEDSTATS @@ -5147,6 +5147,9 @@ prepare_lock_switch(struct rq *rq, struct task_struct= *next, struct rq_flags *rf =20 static inline void finish_lock_switch(struct rq *rq) { +#ifdef CONFIG_CFS_BANDWIDTH + current->in_return_to_user =3D false; +#endif /* * If we are tracking spinlock dependencies then we have to * fix up the runqueue lock - which gets 'carried over' from @@ -6562,6 +6565,18 @@ pick_next_task(struct rq *rq, struct task_struct *pr= ev, struct rq_flags *rf) #define SM_PREEMPT 0x1 #define SM_RTLOCK_WAIT 0x2 =20 +/* + * Special case for CFS_BANDWIDTH where we need to know if the call to + * __schedule() is directely preceding an entry into userspace. + * It is removed from the mode argument as soon as it is used to not go ag= ainst + * the SM_MASK_PREEMPT optimisation below. 
+ */ +#ifdef CONFIG_CFS_BANDWIDTH +# define SM_USER 0x4 +#else +# define SM_USER SM_NONE +#endif + #ifndef CONFIG_PREEMPT_RT # define SM_MASK_PREEMPT (~0U) #else @@ -6646,6 +6661,14 @@ static void __sched notrace __schedule(unsigned int = sched_mode) rq_lock(rq, &rf); smp_mb__after_spinlock(); =20 +#ifdef CONFIG_CFS_BANDWIDTH + if (sched_mode & SM_USER) { + prev->in_return_to_user =3D true; + sched_mode &=3D ~SM_USER; + } +#endif + SCHED_WARN_ON(sched_mode & SM_USER); + /* Promote REQ to ACT */ rq->clock_update_flags <<=3D 1; update_rq_clock(rq); @@ -6807,7 +6830,7 @@ static __always_inline void __schedule_loop(unsigned = int sched_mode) } while (need_resched()); } =20 -asmlinkage __visible void __sched schedule(void) +static __always_inline void schedule_with_mode(unsigned int sched_mode) { struct task_struct *tsk =3D current; =20 @@ -6817,22 +6840,20 @@ asmlinkage __visible void __sched schedule(void) =20 if (!task_is_running(tsk)) sched_submit_work(tsk); - __schedule_loop(SM_NONE); + __schedule_loop(sched_mode); sched_update_worker(tsk); } + +asmlinkage __visible void __sched schedule(void) +{ + schedule_with_mode(SM_NONE); +} EXPORT_SYMBOL(schedule); =20 asmlinkage __visible void __sched schedule_usermode(void) { #ifdef CONFIG_CFS_BANDWIDTH - /* - * This is only atomic because of this simple implementation. We could - * do something with an SM_USER to avoid other-cpu scheduler operations - * racing against these writes. - */ - atomic_set(¤t->in_return_to_user, true); - schedule(); - atomic_set(¤t->in_return_to_user, false); + schedule_with_mode(SM_USER); #else schedule(); #endif diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a1808459a5acc..96504be6ee14a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6631,7 +6631,22 @@ static void enqueue_kernel(struct cfs_rq *cfs_rq, st= ruct sched_entity *se, int c =20 static bool is_kernel_task(struct task_struct *p) { - return sysctl_sched_cfs_bandwidth_kernel_bypass && !atomic_read(&p->in_re= turn_to_user); + /* + * The flag is updated within __schedule() with preemption disabled, + * under the rq lock, and only when the task is current. + * + * Holding the rq lock for that task's CPU is thus sufficient for the + * value to be stable, if the task is enqueued. + * + * If the task is dequeued, then task_cpu(p) *can* change, but this + * so far only happens in enqueue_task_fair() which means either: + * - the task is being activated, its CPU has been set previously in ttwu= () + * - the task is going through a "change" cycle (e.g. sched_move_task()), + * the pi_lock is also held so the CPU is stable. 
+ */ + lockdep_assert_rq_held(cpu_rq(task_cpu(p))); + + return sysctl_sched_cfs_bandwidth_kernel_bypass && !p->in_return_to_user; } =20 /* --=20 2.43.0 From nobody Sat Feb 7 08:44:09 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9086E19477 for ; Fri, 2 Feb 2024 08:10:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706861453; cv=none; b=Xbv6Hkh1+4Wa87hU/AfTGrhyqTHBe34FkbhB9QXeDEfiPdaJ+AWajiQnB1juVlFTC88SgLf7aKNeARRAjWEi1dVlVwo3EDzklkXWhpqkNVqpKIwBdvyAtcamC3f3bL57x8pF+325eEj44k03JVOD+VWCnAhEI6IJUe/OXvTCG0k= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706861453; c=relaxed/simple; bh=dWuR9UQw2rwTVshyZQG5PCMcxyEdawkP54MOYUetJAM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=iB3w/RGRzNdwRlPgTiGybUFW0vFZ+O41iw8Xq1ZsTFiQt/f/SRsPDAVIGmoKbMkcJU/TQ9n9+rl2ijwXf7GNO0dNdkaYQaxJffse7s1LcIksibw2oq4yP32ZT5Z0CVizexecHKd6SDVsTlzcst5tWGC9cqnvIAetMldLypjIbSc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=N3ZpuSiS; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="N3ZpuSiS" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1706861450; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=HxaIkJskuFw38ApXZhFFaPGSaq/2ThOojlsU5lw5wRE=; b=N3ZpuSiSogq4Fzc3MLaKHNggOKGsr76/9qDKxi11ss+02s+Wy4SPjACFD4rjrS/gq/3HMy Mx2izwBJoIhvuqJPjmer9qYtFByOYBYG1TRGX5jG3upGzeVvRxdONUtihwoeSvkrkUF8VG y7QyhvAXiGX7MkJdMmAFQ1Gnz2S6y5k= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-319--CXo1PhyNSiCukS3YGEorQ-1; Fri, 02 Feb 2024 03:10:46 -0500 X-MC-Unique: -CXo1PhyNSiCukS3YGEorQ-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.rdu2.redhat.com [10.11.54.8]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id E676D1013663; Fri, 2 Feb 2024 08:10:45 +0000 (UTC) Received: from vschneid-thinkpadt14sgen2i.remote.csb (unknown [10.39.193.2]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 59EE0C2590E; Fri, 2 Feb 2024 08:10:43 +0000 (UTC) From: Valentin Schneider To: linux-kernel@vger.kernel.org Cc: Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Daniel Bristot de Oliveira , Phil Auld , Clark Williams , Tomas Glozar Subject: [RFC PATCH v2 3/5] sched/fair: Delete 
cfs_rq_throttled_loose(), use cfs_rq->throttle_pending instead Date: Fri, 2 Feb 2024 09:09:18 +0100 Message-ID: <20240202080920.3337862-4-vschneid@redhat.com> In-Reply-To: <20240202080920.3337862-1-vschneid@redhat.com> References: <20240202080920.3337862-1-vschneid@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.8 Content-Type: text/plain; charset="utf-8" cfs_rq_throttled_loose() does not check if there is runtime remaining in the cfs_b, and thus relies on check_cfs_rq_runtime() being ran previously for that to be checked. Cache the throttle attempt in throttle_cfs_rq and reuse that where needed. Signed-off-by: Valentin Schneider --- kernel/sched/fair.c | 44 ++++++++++---------------------------------- 1 file changed, 10 insertions(+), 34 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 96504be6ee14a..60778afbff207 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5462,7 +5462,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_e= ntity *se) * 5) do not run the "skip" process, if something else is available */ static struct sched_entity * -pick_next_entity(struct cfs_rq *cfs_rq, bool throttled) +pick_next_entity(struct cfs_rq *cfs_rq) { #ifdef CONFIG_CFS_BANDWIDTH /* @@ -5473,7 +5473,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, bool throttle= d) * throttle_cfs_rq. */ WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children)); - if (throttled && !list_empty(&cfs_rq->kernel_children)) { + if (cfs_rq->throttle_pending && !list_empty(&cfs_rq->kernel_children)) { /* * TODO: you'd want to factor out pick_eevdf to just take * tasks_timeline, and replace this list with a second rbtree @@ -5791,8 +5791,12 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) * We don't actually throttle, though account() will have made sure to * resched us so that we pick into a kernel task. */ - if (cfs_rq->h_kernel_running) + if (cfs_rq->h_kernel_running) { + cfs_rq->throttle_pending =3D true; return false; + } + + cfs_rq->throttle_pending =3D false; =20 raw_spin_lock(&cfs_b->lock); /* This will start the period timer if necessary */ @@ -6666,20 +6670,6 @@ static void dequeue_kernel(struct cfs_rq *cfs_rq, st= ruct sched_entity *se, int c cfs_rq->h_kernel_running -=3D count; } =20 -/* - * Returns if the cfs_rq "should" be throttled but might not be because of - * kernel threads bypassing throttle. 
- */ -static bool cfs_rq_throttled_loose(struct cfs_rq *cfs_rq) -{ - if (!cfs_bandwidth_used()) - return false; - - if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0)) - return false; - return true; -} - static void unthrottle_on_enqueue(struct task_struct *p) { struct sched_entity *se =3D &p->se; @@ -8546,7 +8536,6 @@ static struct task_struct *pick_task_fair(struct rq *= rq) { struct sched_entity *se; struct cfs_rq *cfs_rq; - bool throttled =3D false; =20 again: cfs_rq =3D &rq->cfs; @@ -8567,10 +8556,7 @@ static struct task_struct *pick_task_fair(struct rq = *rq) goto again; } =20 - if (cfs_rq_throttled_loose(cfs_rq)) - throttled =3D true; - - se =3D pick_next_entity(cfs_rq, throttled); + se =3D pick_next_entity(cfs_rq); cfs_rq =3D group_cfs_rq(se); } while (cfs_rq); =20 @@ -8585,7 +8571,6 @@ pick_next_task_fair(struct rq *rq, struct task_struct= *prev, struct rq_flags *rf struct sched_entity *se; struct task_struct *p; int new_tasks; - bool throttled; =20 /* * We want to handle this before check_cfs_runtime(prev). We'll @@ -8609,8 +8594,6 @@ pick_next_task_fair(struct rq *rq, struct task_struct= *prev, struct rq_flags *rf * Therefore attempt to avoid putting and setting the entire cgroup * hierarchy, only change the part that actually changes. */ - - throttled =3D false; do { struct sched_entity *curr =3D cfs_rq->curr; =20 @@ -8641,11 +8624,7 @@ pick_next_task_fair(struct rq *rq, struct task_struc= t *prev, struct rq_flags *rf goto simple; } } - - if (cfs_rq_throttled_loose(cfs_rq)) - throttled =3D true; - - se =3D pick_next_entity(cfs_rq, throttled); + se =3D pick_next_entity(cfs_rq); cfs_rq =3D group_cfs_rq(se); } while (cfs_rq); =20 @@ -8683,11 +8662,8 @@ pick_next_task_fair(struct rq *rq, struct task_struc= t *prev, struct rq_flags *rf if (prev) put_prev_task(rq, prev); =20 - throttled =3D false; do { - if (cfs_rq_throttled_loose(cfs_rq)) - throttled =3D true; - se =3D pick_next_entity(cfs_rq, throttled); + se =3D pick_next_entity(cfs_rq); set_next_entity(cfs_rq, se); cfs_rq =3D group_cfs_rq(se); } while (cfs_rq); --=20 2.43.0 From nobody Sat Feb 7 08:44:09 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 536BF1AADD for ; Fri, 2 Feb 2024 08:10:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706861456; cv=none; b=rEqZ4EZKraYr+wtHlPrm1b/XTdZ2ICgUrQ2iRkWLmLFRPyAEll/WOm8I4yKhca9fbeh2HLuRNdnVwxPD1QfGanFvRQ/o/VKuznsz8qxqTrFFnfk1DNlHl4Qw/tQTVsvCAHmdACyI2Z+CefS0sQtBdGmePSwzekcejEtj0MtL25E= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706861456; c=relaxed/simple; bh=XSUm7ZVkUeJgzY+9JV/lu1DGc96b7lL1jb2zM9BjCJw=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=EYM2x8UDmGABzZaMedauhb3ttg72aB1JOPC0PIDNqjNE2ZS5CAde05rHAsECldeDT5BlobcKmg7d5pU8O++xrl9T8K1wJCodxjhZadJV180U9zHDo2GZD2w4Z+FcByS+fRLAsTtCmG7RMJETPaeEnCa6FZc2MENzkeygfVMk81w= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=UPlur54y; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none 
dis=none) header.from=redhat.com
From: Valentin Schneider
To: linux-kernel@vger.kernel.org
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
 Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Phil Auld,
 Clark Williams, Tomas Glozar
Subject: [RFC PATCH v2 4/5] sched/fair: Track count of tasks running in userspace
Date: Fri, 2 Feb 2024 09:09:19 +0100
Message-ID: <20240202080920.3337862-5-vschneid@redhat.com>
In-Reply-To: <20240202080920.3337862-1-vschneid@redhat.com>
References: <20240202080920.3337862-1-vschneid@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

While having a second tree to pick from solves the throttling aspect of
things, it also requires modification of the task count at the cfs_rq level.

.h_nr_running is used throughout load_balance(), and it needs to accurately
reflect the amount of pickable tasks: a cfs_rq with .throttle_pending=1 may
have many tasks in userspace (thus effectively throttled), and this "excess"
of tasks shouldn't cause find_busiest_group() / find_busiest_queue() to pick
that cfs_rq's CPU to pull load from when there are other CPUs with more
pickable tasks to pull.

The approach taken here is to track both the count of tasks in kernelspace and
the count of tasks in userspace (technically tasks-just-about-to-enter-userspace).
When a cfs_rq runs out of runtime, it gets marked as .throttle_pending=1. From
this point on, only tasks executing in kernelspace are pickable, and this is
reflected up the hierarchy by removing that cfs_rq.h_user_running from its
parents' .h_nr_running.
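
As a sanity check on the counting rule above, the following user-space sketch
recomputes the hierarchical counters from scratch over a three-level A/B/C
chain and verifies them against the invariants listed next. The toy struct,
the single-child chain and the recompute-from-scratch pass are simplifications
made for illustration; the patch itself updates these counters incrementally
on enqueue/dequeue and on throttle/unthrottle.

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_cfs_rq {
	const char *name;
	struct toy_cfs_rq *child;   /* single-child chain: A -> B -> C */
	int kernel, user;           /* tasks enqueued directly on this cfs_rq */
	bool throttle_pending;

	int h_kernel_running, h_user_running, h_nr_running;
};

static void recompute(struct toy_cfs_rq *rq)
{
	rq->h_kernel_running = rq->kernel;
	rq->h_user_running   = rq->user;
	/* A throttle_pending cfs_rq only counts its kernel-side tasks as pickable. */
	rq->h_nr_running     = rq->kernel + (rq->throttle_pending ? 0 : rq->user);

	if (rq->child) {
		struct toy_cfs_rq *c = rq->child;

		recompute(c);
		/* Kernel-side tasks always propagate up the hierarchy. */
		rq->h_kernel_running += c->h_kernel_running;
		rq->h_nr_running     += c->h_kernel_running;
		/* User-side tasks stop propagating at a throttle_pending boundary. */
		if (!c->throttle_pending) {
			rq->h_user_running += c->h_user_running;
			rq->h_nr_running   += c->h_user_running;
		}
	}
}

static void check_invariants(const struct toy_cfs_rq *rq)
{
	if (rq->throttle_pending)
		assert(rq->h_kernel_running == rq->h_nr_running);
	else
		assert(rq->h_kernel_running + rq->h_user_running == rq->h_nr_running);
}

int main(void)
{
	struct toy_cfs_rq C = { .name = "C", .kernel = 1, .user = 3 };
	struct toy_cfs_rq B = { .name = "B", .kernel = 2, .user = 1, .child = &C };
	struct toy_cfs_rq A = { .name = "A", .kernel = 0, .user = 2, .child = &B };

	recompute(&A);
	check_invariants(&A); check_invariants(&B); check_invariants(&C);
	printf("no throttling:      A.h_nr_running = %d\n", A.h_nr_running); /* 9 */

	C.throttle_pending = true; /* C ran out of runtime but still has kernel tasks */
	recompute(&A);
	check_invariants(&A); check_invariants(&B); check_invariants(&C);
	printf("C throttle_pending: A.h_nr_running = %d\n", A.h_nr_running); /* 6: C.user discounted */

	return 0;
}

With C.throttle_pending set, A's h_nr_running drops by exactly C.user while
the kernel-side counts are untouched, which is the behaviour the diagrams
below illustrate for the real hierarchy.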
To aid in validating the proper behaviour of the implementation, we assert
the following invariants:

  o For any cfs_rq with .throttle_pending == 0:
      .h_kernel_running + .h_user_running == .h_nr_running
  o For any cfs_rq with .throttle_pending == 1:
      .h_kernel_running == .h_nr_running

This means the .h_user_running also needs to be updated as cfs_rq's become
.throttle_pending=1. When a cfs_rq becomes .throttle_pending=1, its
.h_user_running remains untouched, but it is subtracted from its parents'
.h_user_running.

Another way to look at it is that the .h_user_running is "stored" at the
level of the .throttle_pending cfs_rq, and restored to the upper part of the
hierarchy at unthrottle.

An overview of the count logic is:

Consider:
  cfs_rq.kernel := count of kernel *tasks* enqueued on this cfs_rq
  cfs_rq.user   := count of user *tasks* enqueued on this cfs_rq

Then, the following logic is implemented:
  cfs_rq.h_kernel_running = Sum(child.kernel) for all child cfs_rq
  cfs_rq.h_user_running   = Sum(child.user) for all child cfs_rq with !child.throttle_pending
  cfs_rq.h_nr_running     = Sum(child.kernel) for all child cfs_rq
                          + Sum(child.user) for all child cfs_rq with !child.throttle_pending

An application of that logic to an A/B/C cgroup hierarchy:

Initial condition, no throttling

    +------+  .h_kernel_running = C.kernel + B.kernel + A.kernel
  A |cfs_rq|  .h_user_running   = C.user + B.user + A.user
    +------+  .h_nr_running     = C.{kernel+user} + B.{kernel+user} + A.{kernel+user}
       ^      .throttle_pending = 0
       |
       | parent
       |
    +------+  .h_kernel_running = C.kernel + B.kernel
  B |cfs_rq|  .h_user_running   = C.user + B.user
    +------+  .h_nr_running     = C.{kernel+user} + B.{kernel+user}
       ^      .throttle_pending = 0
       |
       | parent
       |
    +------+  .h_kernel_running = C.kernel
  C |cfs_rq|  .h_user_running   = C.user
    +------+  .h_nr_running     = C.{kernel+user}
              .throttle_pending = 0

C becomes .throttle_pending

    +------+  .h_kernel_running = C.kernel + B.kernel + A.kernel               <- Untouched
  A |cfs_rq|  .h_user_running   = B.user + A.user                              <- Decremented by C.user
    +------+  .h_nr_running     = C.kernel + B.{kernel+user} + A.{kernel+user} <- Decremented by C.user
       ^      .throttle_pending = 0
       |
       | parent
       |
    +------+  .h_kernel_running = C.kernel + B.kernel                          <- Untouched
  B |cfs_rq|  .h_user_running   = B.user                                       <- Decremented by C.user
    +------+  .h_nr_running     = C.kernel + B.{kernel+user} + A.{kernel+user} <- Decremented by C.user
       ^      .throttle_pending = 0
       |
       | parent
       |
    +------+  .h_kernel_running = C.kernel
  C |cfs_rq|  .h_user_running   = C.user   <- Untouched, the count is "stored" at this level
    +------+  .h_nr_running     = C.kernel <- Decremented by C.user
              .throttle_pending = 1

C becomes throttled

    +------+  .h_kernel_running = B.kernel + A.kernel                          <- Decremented by C.kernel
  A |cfs_rq|  .h_user_running   = B.user + A.user
    +------+  .h_nr_running     = B.{kernel+user} + A.{kernel+user}            <- Decremented by C.kernel
       ^      .throttle_pending = 0
       |
       | parent
       |
    +------+  .h_kernel_running = B.kernel                                     <- Decremented by C.kernel
  B |cfs_rq|  .h_user_running   = B.user
    +------+  .h_nr_running     = B.{kernel+user} + A.{kernel+user}            <- Decremented by C.kernel
       ^      .throttle_pending = 0
       |
       | parent
       |
    +------+  .h_kernel_running = C.kernel
  C |cfs_rq|  .h_user_running   = C.user
    +------+  .h_nr_running     = C.{kernel+user} <- Incremented by C.user
              .throttle_pending = 0

Could we get away with just one count, e.g. the user count and not the kernel
count?
Technically yes, we could follow this scheme: if (throttle_pending) =3D> kernel count :=3D h_nr_running - h_user_running else =3D> kernel count :=3D h_nr_running this however prevents any sort of assertion or sanity checking on the count= s, which I am not the biggest fan on - CFS group scheduling is enough of a hea= dache as it is. Signed-off-by: Valentin Schneider --- kernel/sched/fair.c | 174 ++++++++++++++++++++++++++++++++++++------- kernel/sched/sched.h | 2 + 2 files changed, 151 insertions(+), 25 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 60778afbff207..2b54d3813d18d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5785,17 +5785,48 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) struct rq *rq =3D rq_of(cfs_rq); struct cfs_bandwidth *cfs_b =3D tg_cfs_bandwidth(cfs_rq->tg); struct sched_entity *se; - long task_delta, idle_task_delta, kernel_delta, dequeue =3D 1; + long task_delta, idle_task_delta, kernel_delta, user_delta, dequeue =3D 1; + bool was_pending; =20 /* - * We don't actually throttle, though account() will have made sure to - * resched us so that we pick into a kernel task. + * We don't actually throttle just yet, though account_cfs_rq_runtime() + * will have made sure to resched us so that we pick into a kernel task. */ if (cfs_rq->h_kernel_running) { + if (cfs_rq->throttle_pending) + return false; + + /* + * From now on we're only going to pick tasks that are in the + * second tree. Reflect this by discounting tasks that aren't going + * to be pickable from the ->h_nr_running counts. + */ cfs_rq->throttle_pending =3D true; + + se =3D cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))]; + user_delta =3D cfs_rq->h_user_running; + cfs_rq->h_nr_running -=3D user_delta; + + for_each_sched_entity(se) { + struct cfs_rq *qcfs_rq =3D cfs_rq_of(se); + + if (!se->on_rq) + goto done; + + qcfs_rq->h_nr_running -=3D user_delta; + qcfs_rq->h_user_running -=3D user_delta; + + assert_cfs_rq_counts(qcfs_rq); + } return false; } =20 + /* + * Unlikely as it may be, we may only have user tasks as we hit the + * throttle, in which case we won't have discount them from the + * h_nr_running, and we need to be aware of that. + */ + was_pending =3D cfs_rq->throttle_pending; cfs_rq->throttle_pending =3D false; =20 raw_spin_lock(&cfs_b->lock); @@ -5826,9 +5857,27 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq); rcu_read_unlock(); =20 - task_delta =3D cfs_rq->h_nr_running; + /* + * At this point, h_nr_running =3D=3D h_kernel_running. We add back the + * h_user_running to the throttled cfs_rq, and only remove the difference + * to the upper cfs_rq's. + */ + if (was_pending) { + WARN_ON_ONCE(cfs_rq->h_nr_running !=3D cfs_rq->h_kernel_running); + cfs_rq->h_nr_running +=3D cfs_rq->h_user_running; + } else { + WARN_ON_ONCE(cfs_rq->h_nr_running !=3D cfs_rq->h_user_running); + } + + /* + * We always discount user tasks from h_nr_running when throttle_pending + * so only h_kernel_running remains to be removed + */ + task_delta =3D was_pending ? cfs_rq->h_kernel_running : cfs_rq->h_nr_runn= ing; idle_task_delta =3D cfs_rq->idle_h_nr_running; kernel_delta =3D cfs_rq->h_kernel_running; + user_delta =3D was_pending ? 
0 : cfs_rq->h_user_running; + for_each_sched_entity(se) { struct cfs_rq *qcfs_rq =3D cfs_rq_of(se); /* throttled entity or throttle-on-deactivate */ @@ -5843,6 +5892,8 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) qcfs_rq->h_nr_running -=3D task_delta; qcfs_rq->idle_h_nr_running -=3D idle_task_delta; dequeue_kernel(qcfs_rq, se, kernel_delta); + qcfs_rq->h_user_running -=3D user_delta; + =20 if (qcfs_rq->load.weight) { /* Avoid re-evaluating load for this entity: */ @@ -5866,6 +5917,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) qcfs_rq->h_nr_running -=3D task_delta; qcfs_rq->idle_h_nr_running -=3D idle_task_delta; dequeue_kernel(qcfs_rq, se, kernel_delta); + qcfs_rq->h_user_running -=3D user_delta; } =20 /* At this point se is NULL and we are at root level*/ @@ -5888,7 +5940,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) struct rq *rq =3D rq_of(cfs_rq); struct cfs_bandwidth *cfs_b =3D tg_cfs_bandwidth(cfs_rq->tg); struct sched_entity *se; - long task_delta, idle_task_delta, kernel_delta; + long task_delta, idle_task_delta, kernel_delta, user_delta; =20 se =3D cfs_rq->tg->se[cpu_of(rq)]; =20 @@ -5924,6 +5976,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) task_delta =3D cfs_rq->h_nr_running; idle_task_delta =3D cfs_rq->idle_h_nr_running; kernel_delta =3D cfs_rq->h_kernel_running; + user_delta =3D cfs_rq->h_user_running; for_each_sched_entity(se) { struct cfs_rq *qcfs_rq =3D cfs_rq_of(se); =20 @@ -5937,6 +5990,9 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) qcfs_rq->h_nr_running +=3D task_delta; qcfs_rq->idle_h_nr_running +=3D idle_task_delta; enqueue_kernel(qcfs_rq, se, kernel_delta); + qcfs_rq->h_user_running +=3D user_delta; + + assert_cfs_rq_counts(qcfs_rq); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(qcfs_rq)) @@ -5955,6 +6011,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) qcfs_rq->h_nr_running +=3D task_delta; qcfs_rq->idle_h_nr_running +=3D idle_task_delta; enqueue_kernel(qcfs_rq, se, kernel_delta); + qcfs_rq->h_user_running +=3D user_delta; =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(qcfs_rq)) @@ -6855,6 +6912,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *= p, int flags) int idle_h_nr_running =3D task_has_idle_policy(p); int task_new =3D !(flags & ENQUEUE_WAKEUP); bool kernel_task =3D is_kernel_task(p); + bool throttle_pending =3D false; =20 /* * The code below (indirectly) updates schedutil which looks at @@ -6878,13 +6936,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct= *p, int flags) cfs_rq =3D cfs_rq_of(se); enqueue_entity(cfs_rq, se, flags); =20 - cfs_rq->h_nr_running++; - cfs_rq->idle_h_nr_running +=3D idle_h_nr_running; =20 - if (cfs_rq_is_idle(cfs_rq)) - idle_h_nr_running =3D 1; + if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending)) + cfs_rq->h_nr_running++; if (kernel_task) enqueue_kernel(cfs_rq, se, 1); + else if (!throttle_pending) + cfs_rq->h_user_running++; + + throttle_pending |=3D cfs_rq->throttle_pending; + + cfs_rq->idle_h_nr_running +=3D idle_h_nr_running; + if (cfs_rq_is_idle(cfs_rq)) + idle_h_nr_running =3D 1; + =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -6900,13 +6965,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct= *p, int flags) se_update_runnable(se); update_cfs_group(se); =20 - cfs_rq->h_nr_running++; - cfs_rq->idle_h_nr_running +=3D idle_h_nr_running; =20 - if (cfs_rq_is_idle(cfs_rq)) - idle_h_nr_running =3D 1; + if (kernel_task || (!throttle_pending && 
!cfs_rq->throttle_pending)) + cfs_rq->h_nr_running++; if (kernel_task) enqueue_kernel(cfs_rq, se, 1); + else if (!throttle_pending) + cfs_rq->h_user_running++; + + throttle_pending |=3D cfs_rq->throttle_pending; + + cfs_rq->idle_h_nr_running +=3D idle_h_nr_running; + if (cfs_rq_is_idle(cfs_rq)) + idle_h_nr_running =3D 1; + =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -6957,6 +7029,7 @@ static void dequeue_task_fair(struct rq *rq, struct t= ask_struct *p, int flags) int idle_h_nr_running =3D task_has_idle_policy(p); bool was_sched_idle =3D sched_idle_rq(rq); bool kernel_task =3D !list_empty(&p->se.kernel_node); + bool throttle_pending =3D false; =20 util_est_dequeue(&rq->cfs, p); =20 @@ -6964,13 +7037,20 @@ static void dequeue_task_fair(struct rq *rq, struct= task_struct *p, int flags) cfs_rq =3D cfs_rq_of(se); dequeue_entity(cfs_rq, se, flags); =20 - cfs_rq->h_nr_running--; - cfs_rq->idle_h_nr_running -=3D idle_h_nr_running; =20 - if (cfs_rq_is_idle(cfs_rq)) - idle_h_nr_running =3D 1; + if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending)) + cfs_rq->h_nr_running--; if (kernel_task) dequeue_kernel(cfs_rq, se, 1); + else if (!throttle_pending) + cfs_rq->h_user_running--; + + throttle_pending |=3D cfs_rq->throttle_pending; + + cfs_rq->idle_h_nr_running -=3D idle_h_nr_running; + if (cfs_rq_is_idle(cfs_rq)) + idle_h_nr_running =3D 1; + =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -6998,13 +7078,20 @@ static void dequeue_task_fair(struct rq *rq, struct= task_struct *p, int flags) se_update_runnable(se); update_cfs_group(se); =20 - cfs_rq->h_nr_running--; - cfs_rq->idle_h_nr_running -=3D idle_h_nr_running; =20 - if (cfs_rq_is_idle(cfs_rq)) - idle_h_nr_running =3D 1; + if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending)) + cfs_rq->h_nr_running--; if (kernel_task) dequeue_kernel(cfs_rq, se, 1); + else if (!throttle_pending) + cfs_rq->h_user_running--; + + throttle_pending |=3D cfs_rq->throttle_pending; + + cfs_rq->idle_h_nr_running -=3D idle_h_nr_running; + if (cfs_rq_is_idle(cfs_rq)) + idle_h_nr_running =3D 1; + =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -8503,28 +8590,65 @@ static void check_preempt_wakeup_fair(struct rq *rq= , struct task_struct *p, int resched_curr(rq); } =20 +/* + * Consider: + * cfs_rq.kernel :=3D count of kernel *tasks* enqueued on this cfs_rq + * cfs_rq.user :=3D count of user *tasks* enqueued on this cfs_rq + * + * Then, the following logic is implemented: + * cfs_rq.h_kernel_running =3D Sum(child.kernel) for all child cfs_rq + * cfs_rq.h_user_running =3D Sum(child.user) for all child cfs_rq wi= th !child.throttle_pending + * cfs_rq.h_nr_running =3D Sum(child.kernel) for all child cfs_rq + * + Sum(child.user) for all child cfs_rq with !child.throttle_pe= nding + * + * IOW, count of kernel tasks is always propagated up the hierarchy, and c= ount + * of user tasks is only propagated up if the cfs_rq isn't .throttle_pendi= ng. + */ static void handle_kernel_task_prev(struct task_struct *prev) { #ifdef CONFIG_CFS_BANDWIDTH struct sched_entity *se =3D &prev->se; bool p_in_kernel =3D is_kernel_task(prev); bool p_in_kernel_tree =3D !list_empty(&se->kernel_node); + bool throttle_pending =3D false; /* * These extra loops are bad and against the whole point of the merged * PNT, but it's a pain to merge, particularly since we want it to occur * before check_cfs_runtime(). 
*/ if (p_in_kernel_tree && !p_in_kernel) { + /* Switch from KERNEL -> USER */ WARN_ON_ONCE(!se->on_rq); /* dequeue should have removed us */ + for_each_sched_entity(se) { - dequeue_kernel(cfs_rq_of(se), se, 1); - if (cfs_rq_throttled(cfs_rq_of(se))) + struct cfs_rq *cfs_rq =3D cfs_rq_of(se); + + if (throttle_pending || cfs_rq->throttle_pending) + cfs_rq->h_nr_running--; + dequeue_kernel(cfs_rq, se, 1); + if (!throttle_pending) + cfs_rq->h_user_running++; + + throttle_pending |=3D cfs_rq->throttle_pending; + + if (cfs_rq_throttled(cfs_rq)) break; } } else if (!p_in_kernel_tree && p_in_kernel && se->on_rq) { + /* Switch from USER -> KERNEL */ + for_each_sched_entity(se) { - enqueue_kernel(cfs_rq_of(se), se, 1); - if (cfs_rq_throttled(cfs_rq_of(se))) + struct cfs_rq *cfs_rq =3D cfs_rq_of(se); + + if (throttle_pending || cfs_rq->throttle_pending) + cfs_rq->h_nr_running++; + enqueue_kernel(cfs_rq, se, 1); + if (!throttle_pending) + cfs_rq->h_user_running--; + + throttle_pending |=3D cfs_rq->throttle_pending; + + if (cfs_rq_throttled(cfs_rq)) break; } } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 0b33ce2e60555..e8860e0d6fbc7 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -660,6 +660,8 @@ struct cfs_rq { int throttled; int throttle_count; int h_kernel_running; + int h_user_running; + int throttle_pending; struct list_head throttled_list; struct list_head throttled_csd_list; struct list_head kernel_children; --=20 2.43.0 From nobody Sat Feb 7 08:44:09 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8F5A3182C3 for ; Fri, 2 Feb 2024 08:10:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706861458; cv=none; b=gNgWZWtvJuCrDK5FspCJYCtc2B9OKoZsJ/l+iyzK+OlwoM158scm34rPouvSWmPZa0DiAkwBXlkVR1+SUUSn+HHioHlagQAZNmvBFoZUGtK6sT4zRCVOma4JE0ARLbatLVqozdpTt3ZB5ZtKLvZxa779T8GEbdDGjHTZPi5nOkQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706861458; c=relaxed/simple; bh=GKjGZ0nEYiWo/e0g4SCEgfAGOP3qsyIKHYvTEjh/1rg=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=RdB/Db2q69k17I6sSbW4eKgJuj6Nw5IFEJHBlO1Rjg2ScyrffzlkBP5+i+Y41j3hKur82idrr+2YCNx1ymdwkIDTzlM+MTeuHxkRN8havlKBCGhkz5mCut5qrMtCqYHCV0Emlvm9x7jsxIRzj27nsQrbvJBlixWWLv3i9timpNY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=W18EPO9K; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="W18EPO9K" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1706861455; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=kzFtBHR5zk59g+CzbXsW0xRKrMzCG41VI7+6LhlJ3kY=; 
b=W18EPO9Kjyaqv4DAb3oPrHqA6oxzwkQe2K3jGyhYIMP4/8YhfMdsGMjWvkOFjraKRKJyQd ugL34cqJy/ZMYsF1vdRtoDjhYSmhbVhM+ecAOyLLcwP4xpcrd0YUD2cK+d1OV/0oF9t+sp F5CV32/Fn89Q8YDj8tqukjNva3w85ks= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-135-VKghfcTmM0WgeKid_SqZiw-1; Fri, 02 Feb 2024 03:10:51 -0500 X-MC-Unique: VKghfcTmM0WgeKid_SqZiw-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.rdu2.redhat.com [10.11.54.8]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 6132A3C0ED5E; Fri, 2 Feb 2024 08:10:51 +0000 (UTC) Received: from vschneid-thinkpadt14sgen2i.remote.csb (unknown [10.39.193.2]) by smtp.corp.redhat.com (Postfix) with ESMTPS id C2AB6C2590D; Fri, 2 Feb 2024 08:10:48 +0000 (UTC) From: Valentin Schneider To: linux-kernel@vger.kernel.org Cc: Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Daniel Bristot de Oliveira , Phil Auld , Clark Williams , Tomas Glozar Subject: [RFC PATCH v2 5/5] sched/fair: Assert user/kernel/total nr invariants Date: Fri, 2 Feb 2024 09:09:20 +0100 Message-ID: <20240202080920.3337862-6-vschneid@redhat.com> In-Reply-To: <20240202080920.3337862-1-vschneid@redhat.com> References: <20240202080920.3337862-1-vschneid@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.8 Content-Type: text/plain; charset="utf-8" Previous commits have added .h_kernel_running and .h_user_running to struct cfs_rq, and are using them to play games with the hierarchical .h_nr_running. Assert some count invariants under SCHED_DEBUG to improve debugging. 
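For reference, the asserted relation can be restated as a single boolean. This is only a sketch: cfs_rq_counts_ok() is a made-up helper, the patch itself encodes the same two conditions as SCHED_WARN_ON()s in assert_cfs_rq_counts() below.

  static inline bool cfs_rq_counts_ok(struct cfs_rq *cfs_rq)
  {
  	/* throttle_pending: only kernel tasks are still pickable */
  	if (cfs_rq->throttle_pending)
  		return cfs_rq->h_kernel_running == cfs_rq->h_nr_running;

  	/* normal operation: every queued task is pickable */
  	return cfs_rq->h_kernel_running + cfs_rq->h_user_running ==
  	       cfs_rq->h_nr_running;
  }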
Signed-off-by: Valentin Schneider --- kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2b54d3813d18d..52d0ee0e4d47c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5780,6 +5780,30 @@ static int tg_throttle_down(struct task_group *tg, v= oid *data) static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se,= int count); static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se,= int count); =20 +#ifdef CONFIG_CFS_BANDWIDTH +static inline void assert_cfs_rq_counts(struct cfs_rq *cfs_rq) +{ + lockdep_assert_rq_held(rq_of(cfs_rq)); + + /* + * When !throttle_pending, this is the normal operating mode, all tasks + * are pickable, so: + * nr_kernel_tasks + nr_user_tasks =3D=3D nr_pickable_tasks + */ + SCHED_WARN_ON(!cfs_rq->throttle_pending && + (cfs_rq->h_kernel_running + cfs_rq->h_user_running !=3D + cfs_rq->h_nr_running)); + /* + * When throttle_pending, only kernel tasks are pickable, so: + * nr_kernel_tasks =3D=3D nr_pickable_tasks + */ + SCHED_WARN_ON(cfs_rq->throttle_pending && + (cfs_rq->h_kernel_running !=3D cfs_rq->h_nr_running)); +} +#else +static inline void assert_cfs_rq_counts(struct cfs_rq *cfs_rq) { } +#endif + static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) { struct rq *rq =3D rq_of(cfs_rq); @@ -5894,6 +5918,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) dequeue_kernel(qcfs_rq, se, kernel_delta); qcfs_rq->h_user_running -=3D user_delta; =20 + assert_cfs_rq_counts(qcfs_rq); =20 if (qcfs_rq->load.weight) { /* Avoid re-evaluating load for this entity: */ @@ -5918,6 +5943,8 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq) qcfs_rq->idle_h_nr_running -=3D idle_task_delta; dequeue_kernel(qcfs_rq, se, kernel_delta); qcfs_rq->h_user_running -=3D user_delta; + + assert_cfs_rq_counts(qcfs_rq); } =20 /* At this point se is NULL and we are at root level*/ @@ -6013,6 +6040,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) enqueue_kernel(qcfs_rq, se, kernel_delta); qcfs_rq->h_user_running +=3D user_delta; =20 + assert_cfs_rq_counts(qcfs_rq); + /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(qcfs_rq)) goto unthrottle_throttle; @@ -6950,6 +6979,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *= p, int flags) if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running =3D 1; =20 + assert_cfs_rq_counts(cfs_rq); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -6965,6 +6995,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *= p, int flags) se_update_runnable(se); update_cfs_group(se); =20 + assert_cfs_rq_counts(cfs_rq); =20 if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending)) cfs_rq->h_nr_running++; @@ -6979,6 +7010,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *= p, int flags) if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running =3D 1; =20 + assert_cfs_rq_counts(cfs_rq); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -7051,6 +7083,7 @@ static void dequeue_task_fair(struct rq *rq, struct t= ask_struct *p, int flags) if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running =3D 1; =20 + assert_cfs_rq_counts(cfs_rq); =20 /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -7092,6 +7125,7 @@ static void dequeue_task_fair(struct rq *rq, struct t= ask_struct *p, int flags) if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running =3D 1; =20 + assert_cfs_rq_counts(cfs_rq); =20 /* end 
evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) @@ -8631,6 +8665,8 @@ static void handle_kernel_task_prev(struct task_struc= t *prev) =20 throttle_pending |=3D cfs_rq->throttle_pending; =20 + assert_cfs_rq_counts(cfs_rq); + if (cfs_rq_throttled(cfs_rq)) break; } @@ -8648,6 +8684,8 @@ static void handle_kernel_task_prev(struct task_struc= t *prev) =20 throttle_pending |=3D cfs_rq->throttle_pending; =20 + assert_cfs_rq_counts(cfs_rq); + if (cfs_rq_throttled(cfs_rq)) break; } --=20 2.43.0
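As a closing illustration of the counting rule used throughout this series (kernel tasks are always propagated up the hierarchy, user tasks stop being counted as pickable at and above a throttle_pending level), here is a small stand-alone C model. It is only a sketch: toy_cfs_rq and toy_enqueue are made-up names, and the code mirrors the enqueue-side accounting rather than the actual kernel implementation.

  #include <stdbool.h>
  #include <stdio.h>

  struct toy_cfs_rq {
  	int h_nr_running;	/* tasks pickable from this level */
  	int h_kernel_running;	/* tasks currently in kernel space */
  	int h_user_running;	/* user tasks counted below any pending level */
  	bool throttle_pending;	/* quota exhausted, waiting for return to userspace */
  };

  /* Enqueue one task along a bottom-up hierarchy path[0..depth-1]. */
  static void toy_enqueue(struct toy_cfs_rq **path, int depth, bool kernel_task)
  {
  	bool throttle_pending = false;

  	for (int i = 0; i < depth; i++) {
  		struct toy_cfs_rq *cfs_rq = path[i];

  		/* user tasks stop counting as pickable once a pending level is crossed */
  		if (kernel_task || (!throttle_pending && !cfs_rq->throttle_pending))
  			cfs_rq->h_nr_running++;
  		if (kernel_task)
  			cfs_rq->h_kernel_running++;
  		else if (!throttle_pending)
  			cfs_rq->h_user_running++;

  		throttle_pending |= cfs_rq->throttle_pending;
  	}
  }

  int main(void)
  {
  	struct toy_cfs_rq child = { 0 }, root = { 0 };
  	struct toy_cfs_rq *path[] = { &child, &root };

  	child.throttle_pending = true;	/* child group ran out of quota */

  	toy_enqueue(path, 2, true);	/* kernel task: counted everywhere */
  	toy_enqueue(path, 2, false);	/* user task: not pickable, not propagated */

  	printf("child: nr=%d kernel=%d user=%d\n",
  	       child.h_nr_running, child.h_kernel_running, child.h_user_running);
  	printf("root:  nr=%d kernel=%d user=%d\n",
  	       root.h_nr_running, root.h_kernel_running, root.h_user_running);
  	return 0;
  }

Running this prints child: nr=1 kernel=1 user=1 and root: nr=1 kernel=1 user=0, which satisfies the invariants asserted in patch 5/5: kernel == nr on the throttle_pending child, kernel + user == nr on the root.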