[RFC PATCH v2 1/7] sched/fair: Add related data structure for task based throttle

Aaron Lu posted 7 patches 10 months ago
There is a newer version of this series
[RFC PATCH v2 1/7] sched/fair: Add related data structure for task based throttle
Posted by Aaron Lu 10 months ago
From: Valentin Schneider <vschneid@redhat.com>

Add related data structures for this new throttle functionality.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
 include/linux/sched.h |  4 ++++
 kernel/sched/core.c   |  3 +++
 kernel/sched/fair.c   | 12 ++++++++++++
 kernel/sched/sched.h  |  2 ++
 4 files changed, 21 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f96ac19828934..0b55c79fee209 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -880,6 +880,10 @@ struct task_struct {
 
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
+#ifdef CONFIG_CFS_BANDWIDTH
+	struct callback_head		sched_throttle_work;
+	struct list_head		throttle_node;
+#endif
 #endif
 
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 79692f85643fe..3b8735bc527da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4492,6 +4492,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	p->se.cfs_rq			= NULL;
+#ifdef CONFIG_CFS_BANDWIDTH
+	init_cfs_throttle_work(p);
+#endif
 #endif
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c19459c80422..894202d232efd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5823,6 +5823,18 @@ static inline int throttled_lb_pair(struct task_group *tg,
 	       throttled_hierarchy(dest_cfs_rq);
 }
 
+static void throttle_cfs_rq_work(struct callback_head *work)
+{
+}
+
+void init_cfs_throttle_work(struct task_struct *p)
+{
+	init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
+	/* Protect against double add, see throttle_cfs_rq() and throttle_cfs_rq_work() */
+	p->sched_throttle_work.next = &p->sched_throttle_work;
+	INIT_LIST_HEAD(&p->throttle_node);
+}
+
 static int tg_unthrottle_up(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c5a6a503eb6de..921527327f107 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2703,6 +2703,8 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 
 extern void init_dl_entity(struct sched_dl_entity *dl_se);
 
+extern void init_cfs_throttle_work(struct task_struct *p);
+
 #define BW_SHIFT		20
 #define BW_UNIT			(1 << BW_SHIFT)
 #define RATIO_SHIFT		8
-- 
2.39.5
Re: [RFC PATCH v2 1/7] sched/fair: Add related data structure for task based throttle
Posted by K Prateek Nayak 9 months, 4 weeks ago
Hello Aaron,

On 4/9/2025 5:37 PM, Aaron Lu wrote:
> From: Valentin Schneider <vschneid@redhat.com>
> 
> Add related data structures for this new throttle functionality.
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> ---
>   include/linux/sched.h |  4 ++++
>   kernel/sched/core.c   |  3 +++
>   kernel/sched/fair.c   | 12 ++++++++++++
>   kernel/sched/sched.h  |  2 ++
>   4 files changed, 21 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f96ac19828934..0b55c79fee209 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -880,6 +880,10 @@ struct task_struct {
>   
>   #ifdef CONFIG_CGROUP_SCHED
>   	struct task_group		*sched_task_group;
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	struct callback_head		sched_throttle_work;
> +	struct list_head		throttle_node;

Since throttled tasks are fully dequeued before placing on the
"throttled_limbo_list", is it possible to reuse "p->se.group_node"?

Currently, it is used to track the task on "rq->cfs_tasks" and during
load-balancing when moving a bunch of tasks between CPUs but since a
fully throttled task is not tracked by either, it should be safe to
reuse this bit (CONFIG_DEBUG_LIST will scream if I'm wrong) and save
up on some space in the  task_struct.

Thoughts?

-- 
Thanks and Regards,
Prateek

> +#endif
>   #endif
>
Re: [RFC PATCH v2 1/7] sched/fair: Add related data structure for task based throttle
Posted by Aaron Lu 9 months, 4 weeks ago
On Mon, Apr 14, 2025 at 09:28:36AM +0530, K Prateek Nayak wrote:
> Hello Aaron,
> 
> On 4/9/2025 5:37 PM, Aaron Lu wrote:
> > From: Valentin Schneider <vschneid@redhat.com>
> > 
> > Add related data structures for this new throttle functionality.
> > 
> > Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> > ---
> >   include/linux/sched.h |  4 ++++
> >   kernel/sched/core.c   |  3 +++
> >   kernel/sched/fair.c   | 12 ++++++++++++
> >   kernel/sched/sched.h  |  2 ++
> >   4 files changed, 21 insertions(+)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index f96ac19828934..0b55c79fee209 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -880,6 +880,10 @@ struct task_struct {
> >   #ifdef CONFIG_CGROUP_SCHED
> >   	struct task_group		*sched_task_group;
> > +#ifdef CONFIG_CFS_BANDWIDTH
> > +	struct callback_head		sched_throttle_work;
> > +	struct list_head		throttle_node;
> 
> Since throttled tasks are fully dequeued before placing on the
> "throttled_limbo_list", is it possible to reuse "p->se.group_node"?

I think it might be possible.

> Currently, it is used to track the task on "rq->cfs_tasks" and during
> load-balancing when moving a bunch of tasks between CPUs but since a
> fully throttled task is not tracked by either, it should be safe to
> reuse this bit (CONFIG_DEBUG_LIST will scream if I'm wrong) and save
> up on some space in the  task_struct.
> 
> Thoughts?

Is it that adding throttle_node would cause task_struct to just cross a
cacheline boundary? :-)

Or it's mainly a concern that system could have many tasks and any saving
in task_struct is worth to try?

I can see reusing another field would cause task_is_throttled() more
obscure to digest and implement, but I think it is doable.

Thanks,
Aaron
Re: [RFC PATCH v2 1/7] sched/fair: Add related data structure for task based throttle
Posted by K Prateek Nayak 9 months, 4 weeks ago
Hello Aaron,

On 4/14/2025 5:25 PM, Aaron Lu wrote:
> On Mon, Apr 14, 2025 at 09:28:36AM +0530, K Prateek Nayak wrote:
>> Hello Aaron,
>>
>> On 4/9/2025 5:37 PM, Aaron Lu wrote:
>>> From: Valentin Schneider <vschneid@redhat.com>
>>>
>>> Add related data structures for this new throttle functionality.
>>>
>>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>>> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
>>> ---
>>>    include/linux/sched.h |  4 ++++
>>>    kernel/sched/core.c   |  3 +++
>>>    kernel/sched/fair.c   | 12 ++++++++++++
>>>    kernel/sched/sched.h  |  2 ++
>>>    4 files changed, 21 insertions(+)
>>>
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index f96ac19828934..0b55c79fee209 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -880,6 +880,10 @@ struct task_struct {
>>>    #ifdef CONFIG_CGROUP_SCHED
>>>    	struct task_group		*sched_task_group;
>>> +#ifdef CONFIG_CFS_BANDWIDTH
>>> +	struct callback_head		sched_throttle_work;
>>> +	struct list_head		throttle_node;
>>
>> Since throttled tasks are fully dequeued before placing on the
>> "throttled_limbo_list", is it possible to reuse "p->se.group_node"?
> 
> I think it might be possible.
> 
>> Currently, it is used to track the task on "rq->cfs_tasks" and during
>> load-balancing when moving a bunch of tasks between CPUs but since a
>> fully throttled task is not tracked by either, it should be safe to
>> reuse this bit (CONFIG_DEBUG_LIST will scream if I'm wrong) and save
>> up on some space in the  task_struct.
>>
>> Thoughts?
> 
> Is it that adding throttle_node would cause task_struct to just cross a
> cacheline boundary? :-)
> 
> Or it's mainly a concern that system could have many tasks and any saving
> in task_struct is worth to try?

Mostly this :)

> 
> I can see reusing another field would cause task_is_throttled() more
> obscure to digest and implement, but I think it is doable.

I completely overlooked task_is_throttled() use-case. I think the
current implementation is much cleaner in that aspect; no need to
overload "p->se.group_node" and over-complicate this.

If we really want some space saving , declaring a "unsigned char
sched_throttled" in the hole next to "sched_delayed" would be cleaner
but I'd wait on Valentin and Peter's comments before going down that
path.

> 
> Thanks,
> Aaron

-- 
Thanks and Regards,
Prateek