[PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining

Posted by Aaron Lu 1 month, 2 weeks ago
When a cfs_rq is to be throttled, its limbo list should be empty, which
is why there is a warning in tg_throttle_down() for a non-empty
cfs_rq->throttled_limbo_list.

When running a test with the following hierarchy:

          root
        /      \
        A*     ...
     /  |  \   ...
        B
       /  \
      C*

where both A and C have quota settings, that warning on a non-empty
limbo list is triggered for a cfs_rq of C; call it cfs_rq_c (and ignore
the CPU part of the cfs_rq for the sake of simpler representation).

Debugging showed it happened like this:
Task group C is created and quota is set, so in tg_set_cfs_bandwidth(),
cfs_rq_c is initialized with runtime_enabled set, runtime_remaining
equal to 0 and *unthrottled*. Before any tasks are enqueued to cfs_rq_c,
*multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task
group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is
called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled),
these throttled tasks are directly placed into cfs_rq_c's limbo list by
enqueue_throttled_task().

Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these
tasks. The first enqueue triggers check_enqueue_throttle() and, with
zero runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq()
if it can't get more runtime; it then enters tg_throttle_down(), where
the warning is hit due to the remaining tasks in the limbo list.

Triggering a throttle on the unthrottle path is chaotic: the cfs_rq
being unthrottled can end up in a mixed state. Fix this by granting the
cfs_rq 1ns of runtime in tg_set_cfs_bandwidth(). This ensures cfs_rq_c
has a positive runtime_remaining when initialized as unthrottled and
cannot enter tg_unthrottle_up() with zero runtime_remaining.

Also, update the outdated comment in unthrottle_cfs_rq() since it is no
longer called with zero runtime_remaining after a quota change. While at
it, remove a redundant assignment to se in unthrottle_cfs_rq().

Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
Suggested-by: Benjamin Segall <bsegall@google.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
v3: grant cfs_rq 1ns runtime on quota set as suggested by Ben, thanks!

 kernel/sched/core.c |  2 +-
 kernel/sched/fair.c | 15 ++++++---------
 2 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f1ebf67b48e21..f754a60de8484 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9606,7 +9606,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg,
 
 		guard(rq_lock_irq)(rq);
 		cfs_rq->runtime_enabled = runtime_enabled;
-		cfs_rq->runtime_remaining = 0;
+		cfs_rq->runtime_remaining = 1;
 
 		if (cfs_rq->throttled)
 			unthrottle_cfs_rq(cfs_rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25970dbbb2795..5b752324270b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6024,20 +6024,17 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
 
 	/*
-	 * It's possible we are called with !runtime_remaining due to things
-	 * like user changed quota setting(see tg_set_cfs_bandwidth()) or async
-	 * unthrottled us with a positive runtime_remaining but other still
-	 * running entities consumed those runtime before we reached here.
+	 * It's possible we are called with runtime_remaining < 0 due to things
+	 * like async unthrottled us with a positive runtime_remaining but other
+	 * still running entities consumed those runtime before we reached here.
 	 *
-	 * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
-	 * because any enqueue in tg_unthrottle_up() will immediately trigger a
-	 * throttle, which is not supposed to happen on unthrottle path.
+	 * We can't unthrottle this cfs_rq without any runtime remaining because
+	 * any enqueue in tg_unthrottle_up() will immediately trigger a throttle,
+	 * which is not supposed to happen on unthrottle path.
 	 */
 	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
 		return;
 
-	se = cfs_rq->tg->se[cpu_of(rq)];
-
 	cfs_rq->throttled = 0;
 
 	update_rq_clock(rq);
-- 
2.39.5
Re: [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
Posted by Benjamin Segall 1 month, 1 week ago
Aaron Lu <ziqianlu@bytedance.com> writes:

> ... ...

Reviewed-By: Benjamin Segall <bsegall@google.com>
Re: [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
Posted by Aaron Lu 1 month, 1 week ago
On Wed, Nov 05, 2025 at 01:37:28PM -0800, Benjamin Segall wrote:
... ...
> Reviewed-By: Benjamin Segall <bsegall@google.com>

Thank you Ben.

Hi Peter,

Do you have any other comments about this patch? Or do I need to send an
updated version with all the tags collected? Thanks.
Re: [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
Posted by Peter Zijlstra 1 month, 1 week ago
On Thu, Nov 06, 2025 at 07:25:00PM +0800, Aaron Lu wrote:
> On Wed, Nov 05, 2025 at 01:37:28PM -0800, Benjamin Segall wrote:
> ... ...
> > Reviewed-By: Benjamin Segall <bsegall@google.com>
> 
> Thank you Ben.
> 
> Hi Peter,
> 
> Do you have any other comments about this patch? Or do I need to send an
> updated version with all the tags collected? Thanks.

I can tag the whole thread (esc-t) and pipe the whole lot through b4 to
easily do so -- in fact I just did :-)

This was meant to go in sched/urgent, right?
Re: [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
Posted by Aaron Lu 1 month, 1 week ago
On Thu, Nov 06, 2025 at 12:27:21PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 06, 2025 at 07:25:00PM +0800, Aaron Lu wrote:
> > On Wed, Nov 05, 2025 at 01:37:28PM -0800, Benjamin Segall wrote:
> > ... ...
> > > Reviewed-By: Benjamin Segall <bsegall@google.com>
> > 
> > Thank you Ben.
> > 
> > Hi Peter,
> > 
> > Do you have any other comments about this patch? Or do I need to send an
> > updated version with all the tags collected? Thanks.
> 
> I can tag the whole thread (esc-t) and pipe the whole lot through b4 to
> easily do so -- in fact I just did :-)

Cool, thanks for taking care of this.

> 
> This was meant to go in sched/urgent, right?

Right, it's a fix for the current kernel.
Re: [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
Posted by Hao Jia 1 month, 2 weeks ago

On 2025/10/30 11:27, Aaron Lu wrote:
> ... ...


It worked well in my test cases, and the non-empty throttled_limbo_list 
warning no longer appeared.

Tested-by: Hao Jia <jiahao1@lixiang.com>


Re: [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
Posted by Aaron Lu 1 month, 2 weeks ago
On Thu, Oct 30, 2025 at 02:51:02PM +0800, Hao Jia wrote:
> It worked well in my test cases, and the non-empty throttled_limbo_list
> warning no longer appeared.
> 
> Tested-by: Hao Jia <jiahao1@lixiang.com>

Thanks Hao, it's great to get your confirmation that this also works
for your case.
Re: [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
Posted by K Prateek Nayak 1 month, 2 weeks ago
Hello Aaron,

On 10/30/2025 8:57 AM, Aaron Lu wrote:
> ... ...

Have been stress testing this on my system and I haven't seen
any splats yet. Feel free to include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

-- 
Thanks and Regards,
Prateek
Re: [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
Posted by Aaron Lu 1 month, 2 weeks ago
On Thu, Oct 30, 2025 at 11:16:36AM +0530, K Prateek Nayak wrote:
> Hello Aaron,
> 
> On 10/30/2025 8:57 AM, Aaron Lu wrote:
> > ... ...
> 
> Have been stress testing this on my system and I haven't seen
> any splats yet. Feel free to include:
> 
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

Thank you Prateek!
[tip: sched/urgent] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
Posted by tip-bot2 for Aaron Lu 1 month, 1 week ago
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     956dfda6a70885f18c0f8236a461aa2bc4f556ad
Gitweb:        https://git.kernel.org/tip/956dfda6a70885f18c0f8236a461aa2bc4f556ad
Author:        Aaron Lu <ziqianlu@bytedance.com>
AuthorDate:    Thu, 30 Oct 2025 11:27:55 +08:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 06 Nov 2025 12:30:52 +01:00

sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining

[ changelog as in the patch above ]

Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
Reviewed-By: Benjamin Segall <bsegall@google.com>
Suggested-by: Benjamin Segall <bsegall@google.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Hao Jia <jiahao1@lixiang.com>
Link: https://patch.msgid.link/20251030032755.560-1-ziqianlu@bytedance.com
---
 kernel/sched/core.c |  2 +-
 kernel/sched/fair.c | 15 ++++++---------
 2 files changed, 7 insertions(+), 10 deletions(-)

[ diff as in the patch above ]