Tasks throttled on the exit-to-user path are scheduled by cond_resched() in
task_work_run(), but that is a preempt schedule and doesn't mark a task
RCU quiescent state.
Fix this by directly calling schedule() in throttle_cfs_rq_work().
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
kernel/sched/fair.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f26d53ac143fe..be96f7d32998c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5847,6 +5847,7 @@ static void throttle_cfs_rq_work(struct callback_head *work)
 	struct cfs_rq *cfs_rq;
 	struct rq *rq;
 	struct rq_flags rf;
+	bool sched = false;
 
 	WARN_ON_ONCE(p != current);
 	p->sched_throttle_work.next = &p->sched_throttle_work;
@@ -5879,9 +5880,13 @@ static void throttle_cfs_rq_work(struct callback_head *work)
 	dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
 	list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
 	resched_curr(rq);
+	sched = true;
 
 out_unlock:
 	task_rq_unlock(rq, p, &rf);
+
+	if (sched)
+		schedule();
 }
 
 void init_cfs_throttle_work(struct task_struct *p)
--
2.39.5
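
For reference, with the patch applied the tail of throttle_cfs_rq_work()
reads roughly like below (reconstructed from the diff above; the elided
parts are unchanged context):

	static void throttle_cfs_rq_work(struct callback_head *work)
	{
		...
		bool sched = false;
		...
		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
		resched_curr(rq);
		sched = true;

	out_unlock:
		task_rq_unlock(rq, p, &rf);

		/* A direct, voluntary schedule() here marks the task-RCU
		 * quiescent state that the preempt schedule done by the
		 * cond_resched() in task_work_run() would not. */
		if (sched)
			schedule();
	}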
Hello Aaron,

On 3/13/2025 12:52 PM, Aaron Lu wrote:
> Tasks throttled on the exit-to-user path are scheduled by cond_resched() in
> task_work_run(), but that is a preempt schedule and doesn't mark a task
> RCU quiescent state.
>
> Fix this by directly calling schedule() in throttle_cfs_rq_work().

Perhaps that can be gotten around by just using set_ti_thread_flag():
resched_curr() will also call set_preempt_need_resched(), which allows
cond_resched() to resched the task.

Since exit_to_user_mode_loop() will run once again on seeing that
TIF_NEED_RESCHED is set, schedule() should follow soon. Thoughts?

[..snip the quoted diff..]

--
Thanks and Regards,
Prateek
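
For reference, the flow described above is roughly the following
(a simplified sketch of resched_curr() from kernel/sched/core.c, not
verbatim; check your tree for the exact code):

	void resched_curr(struct rq *rq)
	{
		struct task_struct *curr = rq->curr;

		lockdep_assert_rq_held(rq);

		if (test_tsk_need_resched(curr))
			return;

		if (cpu_of(rq) == smp_processor_id()) {
			/* curr runs on this CPU: set TIF_NEED_RESCHED and
			 * fold it into the preempt count so that both
			 * cond_resched() and the next pass of
			 * exit_to_user_mode_loop() notice it. */
			set_tsk_need_resched(curr);
			set_preempt_need_resched();
			return;
		}
		/* remote CPU case (resched IPI) elided */
	}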
Hi Prateek,

On Fri, Mar 14, 2025 at 09:44:45AM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> On 3/13/2025 12:52 PM, Aaron Lu wrote:
> > Tasks throttled on the exit-to-user path are scheduled by cond_resched() in
> > task_work_run(), but that is a preempt schedule and doesn't mark a task
> > RCU quiescent state.
> >
> > Fix this by directly calling schedule() in throttle_cfs_rq_work().
>
> Perhaps that can be gotten around by just using set_ti_thread_flag():
> resched_curr() will also call set_preempt_need_resched(), which allows
> cond_resched() to resched the task.
>
> Since exit_to_user_mode_loop() will run once again on seeing that
> TIF_NEED_RESCHED is set, schedule() should follow soon. Thoughts?

I tried this and noticed an unexpected consequence: get_signal() will
also run task works, and if the signal is a kill, then this can happen:
exit_to_user_mode_loop() -> get_signal() -> throttle_task_work() ->
do_exit() -> exit_signals() -> percpu_rwsem_wait() -> schedule() ->
try_to_block_task() -> dequeue_task_fair().

I would like to avoid this path, at least for now. I want to make sure
that for throttled tasks, only events like task group change, affinity
change etc. can cause them to be dequeued from core, which is why I
added SCHED_WARN_ON(flags & DEQUEUE_SLEEP) in dequeue_task_fair(). I
think this can help me capture any unexpected events like this.

Besides, I think explicitly calling schedule() has the advantage of not
relying on other code, i.e. it doesn't matter if someone removes
cond_resched() from task_work_run() some time later, or changes the
logic in exit_to_user_mode_loop().

So I hope you don't mind that I keep schedule() in
throttle_cfs_rq_work(), but if you see anything wrong with doing this,
feel free to let me know. Thanks.

Best regards,
Aaron

[..snip the quoted diff..]
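
The guard Aaron mentions would look something like the sketch below
(illustrative only; task_is_throttled() is a placeholder name for
however the series detects a limbo-listed task):

	/* at the top of dequeue_task_fair(): a throttled task must only
	 * be dequeued for group change, affinity change etc., never for
	 * sleep */
	if (task_is_throttled(p))
		SCHED_WARN_ON(flags & DEQUEUE_SLEEP);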
Hello Aaron,

On 3/31/2025 11:49 AM, Aaron Lu wrote:
>>> Tasks throttled on the exit-to-user path are scheduled by cond_resched() in
>>> task_work_run(), but that is a preempt schedule and doesn't mark a task
>>> RCU quiescent state.

So browsing through kernel/rcu/tasks.h, I found
cond_resched_tasks_rcu_qs(), which basically clears the task's holdout
state before calling schedule(). The question is, is it safe to use in
the task_work_run() context? I have no idea, but you can do a
resched_curr() followed by a cond_resched_tasks_rcu_qs() in
throttle_cfs_rq_work(), and that should give you the same effect as
doing a schedule().

Thoughts?

>>> Fix this by directly calling schedule() in throttle_cfs_rq_work().
>>
>> Perhaps that can be gotten around by just using set_ti_thread_flag():
>> resched_curr() will also call set_preempt_need_resched(), which allows
>> cond_resched() to resched the task.
>>
>> Since exit_to_user_mode_loop() will run once again on seeing that
>> TIF_NEED_RESCHED is set, schedule() should follow soon. Thoughts?
>>
> I tried this and noticed an unexpected consequence: get_signal() will
> also run task works, and if the signal is a kill, then this can happen:
> exit_to_user_mode_loop() -> get_signal() -> throttle_task_work() ->
> do_exit() -> exit_signals() -> percpu_rwsem_wait() -> schedule() ->
> try_to_block_task() -> dequeue_task_fair().
>
> I would like to avoid this path, at least for now. I want to make sure
> that for throttled tasks, only events like task group change, affinity
> change etc. can cause them to be dequeued from core, which is why I
> added SCHED_WARN_ON(flags & DEQUEUE_SLEEP) in dequeue_task_fair(). I
> think this can help me capture any unexpected events like this.
>
> Besides, I think explicitly calling schedule() has the advantage of not
> relying on other code, i.e. it doesn't matter if someone removes
> cond_resched() from task_work_run() some time later, or changes the
> logic in exit_to_user_mode_loop().
>
> So I hope you don't mind that I keep schedule() in
> throttle_cfs_rq_work(), but if you see anything wrong with doing this,
> feel free to let me know. Thanks.

I don't have any strong feelings. Just that the open-coded schedule()
stuck out like a sore thumb, and since you mention future changes, the
"local_irq_enable_exit_to_user(ti_work)" could perhaps one day be
extended to not disable IRQs in exit_to_user_mode_loop(), in which case
a direct call to schedule() can cause scheduling-while-atomic warnings.

IMO using cond_resched_tasks_rcu_qs() is cleaner, and it is obvious that
the throttle work wants to call schedule() asap while also clearing the
RCU holdout, but I'll wait for others to comment on whether it is safe
to do so or if we are missing something.

--
Thanks and Regards,
Prateek
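
For reference, cond_resched_tasks_rcu_qs() is defined in
include/linux/rcupdate.h along these lines (quoted from memory, so
double-check against your tree):

	#define cond_resched_tasks_rcu_qs()			\
	do {							\
		rcu_tasks_qs(current, false);			\
		cond_resched();					\
	} while (0)

so the suggested replacement for the open-coded schedule() would look
roughly like:

	out_unlock:
		task_rq_unlock(rq, p, &rf);

		if (sched) {
			/* resched_curr() already set TIF_NEED_RESCHED under
			 * the rq lock; report the task-RCU quiescent state
			 * and let cond_resched() do the actual switch. */
			cond_resched_tasks_rcu_qs();
		}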
On Tue, Apr 01, 2025 at 08:47:02AM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> On 3/31/2025 11:49 AM, Aaron Lu wrote:
> >>> Tasks throttled on the exit-to-user path are scheduled by cond_resched() in
> >>> task_work_run(), but that is a preempt schedule and doesn't mark a task
> >>> RCU quiescent state.
>
> So browsing through kernel/rcu/tasks.h, I found
> cond_resched_tasks_rcu_qs(), which basically clears the task's holdout
> state before calling schedule(). The question is, is it safe to use in
> the task_work_run() context? I have no idea, but you can do a
> resched_curr() followed by a cond_resched_tasks_rcu_qs() in
> throttle_cfs_rq_work(), and that should give you the same effect as
> doing a schedule().
>
> Thoughts?

Looks good to me.

[..snip..]

> I don't have any strong feelings. Just that the open-coded schedule()
> stuck out like a sore thumb, and since you mention future changes, the
> "local_irq_enable_exit_to_user(ti_work)" could perhaps one day be
> extended to not disable IRQs in exit_to_user_mode_loop(), in which case
> a direct call to schedule() can cause scheduling-while-atomic warnings.
>
> IMO using cond_resched_tasks_rcu_qs() is cleaner, and it is obvious that
> the throttle work wants to call schedule() asap while also clearing the
> RCU holdout, but I'll wait for others to comment on whether it is safe
> to do so or if we are missing something.

I'm running some tests and I'll follow your suggestion if no problem is
found. Thanks for the suggestion.
On Fri, Mar 14, 2025 at 09:44:45AM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> On 3/13/2025 12:52 PM, Aaron Lu wrote:
> > Tasks throttled on the exit-to-user path are scheduled by cond_resched() in
> > task_work_run(), but that is a preempt schedule and doesn't mark a task
> > RCU quiescent state.
> >
> > Fix this by directly calling schedule() in throttle_cfs_rq_work().
>
> Perhaps that can be gotten around by just using set_ti_thread_flag():
> resched_curr() will also call set_preempt_need_resched(), which allows
> cond_resched() to resched the task.
>
> Since exit_to_user_mode_loop() will run once again on seeing that
> TIF_NEED_RESCHED is set, schedule() should follow soon. Thoughts?

Might also work, will give it a shot, thanks for the suggestion.

Best regards,
Aaron

[..snip the quoted diff..]