Tasks throttled on the exit-to-user path are scheduled by cond_resched() in
task_work_run(), but that is a preempt schedule and doesn't mark a task
RCU quiescent state.
Fix this by directly calling schedule() in throttle_cfs_rq_work().
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
kernel/sched/fair.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f26d53ac143fe..be96f7d32998c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5847,6 +5847,7 @@ static void throttle_cfs_rq_work(struct callback_head *work)
 	struct cfs_rq *cfs_rq;
 	struct rq *rq;
 	struct rq_flags rf;
+	bool sched = false;
 
 	WARN_ON_ONCE(p != current);
 	p->sched_throttle_work.next = &p->sched_throttle_work;
@@ -5879,9 +5880,13 @@ static void throttle_cfs_rq_work(struct callback_head *work)
 	dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
 	list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
 	resched_curr(rq);
+	sched = true;
 
 out_unlock:
 	task_rq_unlock(rq, p, &rf);
+
+	if (sched)
+		schedule();
 }
 
 void init_cfs_throttle_work(struct task_struct *p)
--
2.39.5
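
For reference, with the patch applied the tail of throttle_cfs_rq_work()
reads roughly like below (reconstructed from the diff above; the elided
parts are unchanged context):

	static void throttle_cfs_rq_work(struct callback_head *work)
	{
		...
		bool sched = false;
		...
		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
		resched_curr(rq);
		sched = true;

	out_unlock:
		task_rq_unlock(rq, p, &rf);

		/* A direct, voluntary schedule() here marks the task-RCU
		 * quiescent state that the preempt schedule done by the
		 * cond_resched() in task_work_run() would not. */
		if (sched)
			schedule();
	}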
Hello Aaron,

On 3/13/2025 12:52 PM, Aaron Lu wrote:
> Tasks throttled on the exit-to-user path are scheduled by cond_resched() in
> task_work_run(), but that is a preempt schedule and doesn't mark a task
> RCU quiescent state.
>
> Fix this by directly calling schedule() in throttle_cfs_rq_work().

Perhaps that can be gotten around by just using set_ti_thread_flag():
resched_curr() will also call set_preempt_need_resched(), which allows
cond_resched() to resched the task.

Since exit_to_user_mode_loop() will run once again on seeing that
TIF_NEED_RESCHED is set, schedule() should follow soon. Thoughts?

[..snip the quoted diff..]

--
Thanks and Regards,
Prateek
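
For reference, the flow described above is roughly the following
(a simplified sketch of resched_curr() from kernel/sched/core.c, not
verbatim; check your tree for the exact code):

	void resched_curr(struct rq *rq)
	{
		struct task_struct *curr = rq->curr;

		lockdep_assert_rq_held(rq);

		if (test_tsk_need_resched(curr))
			return;

		if (cpu_of(rq) == smp_processor_id()) {
			/* curr runs on this CPU: set TIF_NEED_RESCHED and
			 * fold it into the preempt count so that both
			 * cond_resched() and the next pass of
			 * exit_to_user_mode_loop() notice it. */
			set_tsk_need_resched(curr);
			set_preempt_need_resched();
			return;
		}
		/* remote CPU case (resched IPI) elided */
	}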
Hi Prateek,

On Fri, Mar 14, 2025 at 09:44:45AM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> On 3/13/2025 12:52 PM, Aaron Lu wrote:
> > Tasks throttled on the exit-to-user path are scheduled by cond_resched() in
> > task_work_run(), but that is a preempt schedule and doesn't mark a task
> > RCU quiescent state.
> >
> > Fix this by directly calling schedule() in throttle_cfs_rq_work().
>
> Perhaps that can be gotten around by just using set_ti_thread_flag():
> resched_curr() will also call set_preempt_need_resched(), which allows
> cond_resched() to resched the task.
>
> Since exit_to_user_mode_loop() will run once again on seeing that
> TIF_NEED_RESCHED is set, schedule() should follow soon. Thoughts?

I tried this and noticed an unexpected consequence: get_signal() will
also run task works, and if the signal is a kill, then this can happen:
exit_to_user_mode_loop() -> get_signal() -> throttle_task_work() ->
do_exit() -> exit_signals() -> percpu_rwsem_wait() -> schedule() ->
try_to_block_task() -> dequeue_task_fair().

I would like to avoid this path, at least for now. I want to make sure
that for throttled tasks, only events like task group change, affinity
change etc. can cause them to be dequeued from core, which is why I
added SCHED_WARN_ON(flags & DEQUEUE_SLEEP) in dequeue_task_fair(). I
think this can help me capture any unexpected events like this.

Besides, I think explicitly calling schedule() has the advantage of not
relying on other code, i.e. it doesn't matter if someone removes
cond_resched() from task_work_run() some time later, or changes the
logic in exit_to_user_mode_loop().

So I hope you don't mind that I keep schedule() in
throttle_cfs_rq_work(), but if you see anything wrong with doing this,
feel free to let me know. Thanks.

Best regards,
Aaron

[..snip the quoted diff..]
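
The guard Aaron mentions would look something like the sketch below
(illustrative only; task_is_throttled() is a placeholder name for
however the series detects a limbo-listed task):

	/* at the top of dequeue_task_fair(): a throttled task must only
	 * be dequeued for group change, affinity change etc., never for
	 * sleep */
	if (task_is_throttled(p))
		SCHED_WARN_ON(flags & DEQUEUE_SLEEP);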
Hello Aaron,

On 3/31/2025 11:49 AM, Aaron Lu wrote:
>>> Tasks throttled on the exit-to-user path are scheduled by cond_resched() in
>>> task_work_run(), but that is a preempt schedule and doesn't mark a task
>>> RCU quiescent state.

So browsing through kernel/rcu/tasks.h, I found
cond_resched_tasks_rcu_qs(), which basically clears the task's holdout
state before calling schedule(). The question is, is it safe to use in
the task_work_run() context? I have no idea, but you can do a
resched_curr() followed by a cond_resched_tasks_rcu_qs() in
throttle_cfs_rq_work(), and that should give you the same effect as
doing a schedule().

Thoughts?

>>> Fix this by directly calling schedule() in throttle_cfs_rq_work().
>>
>> Perhaps that can be gotten around by just using set_ti_thread_flag():
>> resched_curr() will also call set_preempt_need_resched(), which allows
>> cond_resched() to resched the task.
>>
>> Since exit_to_user_mode_loop() will run once again on seeing that
>> TIF_NEED_RESCHED is set, schedule() should follow soon. Thoughts?
>>
> I tried this and noticed an unexpected consequence: get_signal() will
> also run task works, and if the signal is a kill, then this can happen:
> exit_to_user_mode_loop() -> get_signal() -> throttle_task_work() ->
> do_exit() -> exit_signals() -> percpu_rwsem_wait() -> schedule() ->
> try_to_block_task() -> dequeue_task_fair().
>
> I would like to avoid this path, at least for now. I want to make sure
> that for throttled tasks, only events like task group change, affinity
> change etc. can cause them to be dequeued from core, which is why I
> added SCHED_WARN_ON(flags & DEQUEUE_SLEEP) in dequeue_task_fair(). I
> think this can help me capture any unexpected events like this.
>
> Besides, I think explicitly calling schedule() has the advantage of not
> relying on other code, i.e. it doesn't matter if someone removes
> cond_resched() from task_work_run() some time later, or changes the
> logic in exit_to_user_mode_loop().
>
> So I hope you don't mind that I keep schedule() in
> throttle_cfs_rq_work(), but if you see anything wrong with doing this,
> feel free to let me know. Thanks.

I don't have any strong feelings. Just that the open-coded schedule()
stuck out like a sore thumb, and since you mention future changes, the
"local_irq_enable_exit_to_user(ti_work)" could perhaps one day be
extended to not disable IRQs in exit_to_user_mode_loop(), in which case
a direct call to schedule() can cause scheduling-while-atomic warnings.

IMO using cond_resched_tasks_rcu_qs() is cleaner, and it is obvious that
the throttle work wants to call schedule() asap while also clearing the
RCU holdout, but I'll wait for others to comment on whether it is safe
to do so or if we are missing something.

--
Thanks and Regards,
Prateek
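
For reference, cond_resched_tasks_rcu_qs() is defined in
include/linux/rcupdate.h along these lines (quoted from memory, so
double-check against your tree):

	#define cond_resched_tasks_rcu_qs()			\
	do {							\
		rcu_tasks_qs(current, false);			\
		cond_resched();					\
	} while (0)

so the suggested replacement for the open-coded schedule() would look
roughly like:

	out_unlock:
		task_rq_unlock(rq, p, &rf);

		if (sched) {
			/* resched_curr() already set TIF_NEED_RESCHED under
			 * the rq lock; report the task-RCU quiescent state
			 * and let cond_resched() do the actual switch. */
			cond_resched_tasks_rcu_qs();
		}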
On Tue, Apr 01, 2025 at 08:47:02AM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> On 3/31/2025 11:49 AM, Aaron Lu wrote:
> >>> Tasks throttled on the exit-to-user path are scheduled by cond_resched() in
> >>> task_work_run(), but that is a preempt schedule and doesn't mark a task
> >>> RCU quiescent state.
>
> So browsing through kernel/rcu/tasks.h, I found
> cond_resched_tasks_rcu_qs(), which basically clears the task's holdout
> state before calling schedule(). The question is, is it safe to use in
> the task_work_run() context? I have no idea, but you can do a
> resched_curr() followed by a cond_resched_tasks_rcu_qs() in
> throttle_cfs_rq_work(), and that should give you the same effect as
> doing a schedule().
>
> Thoughts?

Looks good to me.

[..snip..]

> I don't have any strong feelings. Just that the open-coded schedule()
> stuck out like a sore thumb, and since you mention future changes, the
> "local_irq_enable_exit_to_user(ti_work)" could perhaps one day be
> extended to not disable IRQs in exit_to_user_mode_loop(), in which case
> a direct call to schedule() can cause scheduling-while-atomic warnings.
>
> IMO using cond_resched_tasks_rcu_qs() is cleaner, and it is obvious that
> the throttle work wants to call schedule() asap while also clearing the
> RCU holdout, but I'll wait for others to comment on whether it is safe
> to do so or if we are missing something.

I'm running some tests and I'll follow your suggestion if no problem is
found. Thanks for the suggestion.
On Fri, Mar 14, 2025 at 09:44:45AM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> On 3/13/2025 12:52 PM, Aaron Lu wrote:
> > Tasks throttled on the exit-to-user path are scheduled by cond_resched() in
> > task_work_run(), but that is a preempt schedule and doesn't mark a task
> > RCU quiescent state.
> >
> > Fix this by directly calling schedule() in throttle_cfs_rq_work().
>
> Perhaps that can be gotten around by just using set_ti_thread_flag():
> resched_curr() will also call set_preempt_need_resched(), which allows
> cond_resched() to resched the task.
>
> Since exit_to_user_mode_loop() will run once again on seeing that
> TIF_NEED_RESCHED is set, schedule() should follow soon. Thoughts?

Might also work, will give it a shot, thanks for the suggestion.

Best regards,
Aaron

[..snip the quoted diff..]