Hi all,

The first patch optimizes the enqueue and dequeue of rt_se; the strategy
employs a bottom-up removal approach.

The second patch provides validation for the efficiency improvements made
by patch 1. The test case counts the number of infinite-loop executions
across all threads.

        original       optimized

     10242794134     10659512784
     13650210798     13555924695
     12953159254     13733609646
     11888973428     11742656925
     12791797633     13447598015
     11451270205     11704847480
     13335320346     13858155642
     10682907328     10513565749
     10173249704     10254224697
      8309259793      8893668653

avg  11547894262     11836376429

Two QEMU emulators were run simultaneously, one with the original kernel
and the other with the optimized kernel, and the averages of the results
over 10 runs were compared. After the optimization, the number of
iterations of the infinite loop increased by approximately 2.5%.

Kindly review.

Xavier (2):
  RT SCHED: Optimize the enqueue and dequeue operations for rt_se
  RT test: Adding test cases for RT group scheduling

 MAINTAINERS                                  |   7 +
 kernel/sched/debug.c                         |  50 ++++
 kernel/sched/rt.c                            | 277 +++++++++++++++---
 kernel/sched/sched.h                         |   1 +
 tools/testing/selftests/sched/Makefile       |   4 +-
 tools/testing/selftests/sched/deadloop.c     | 192 ++++++++++++
 .../selftests/sched/rt_group_sched_test.sh   | 119 ++++++++
 7 files changed, 606 insertions(+), 44 deletions(-)
 create mode 100644 tools/testing/selftests/sched/deadloop.c
 create mode 100755 tools/testing/selftests/sched/rt_group_sched_test.sh

--
2.45.2
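[The deadloop.c selftest itself is not reproduced in this thread. As a rough
illustration of the counting approach the cover letter describes, a minimal
user-space sketch might look like the following; all names, the thread
count, and the measurement window are hypothetical, not taken from the
actual selftest.]

/* Hedged sketch: each thread spins and counts its loop iterations. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define NR_THREADS 4

static volatile int stop;
static uint64_t counts[NR_THREADS];

static void *loop_fn(void *arg)
{
	int i = (int)(intptr_t)arg;

	while (!stop)		/* the "infinite loop" being counted */
		counts[i]++;
	return NULL;
}

int main(void)
{
	pthread_t th[NR_THREADS];
	uint64_t total = 0;
	int i;

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&th[i], NULL, loop_fn, (void *)(intptr_t)i);

	sleep(10);		/* hypothetical measurement window */
	stop = 1;

	for (i = 0; i < NR_THREADS; i++) {
		pthread_join(th[i], NULL);
		total += counts[i];
	}
	printf("total iterations: %llu\n", (unsigned long long)total);
	return 0;
}

[The per-thread totals are summed after a fixed window; that sum corresponds
to the "number of infinite loop executions" reported in the table above.]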
On Fri, Jun 28, 2024 at 01:21:54AM GMT, Xavier <xavier_qy@163.com> wrote:
> The first patch optimizes the enqueue and dequeue of rt_se; the strategy
> employs a bottom-up removal approach.

I haven't read the patches, I only have a remark on the numbers.

> The second patch provides validation for the efficiency improvements made
> by patch 1. The test case counts the number of infinite-loop executions
> across all threads.
>
>         original       optimized
>
>      10242794134     10659512784
>      13650210798     13555924695
>      12953159254     13733609646
>      11888973428     11742656925
>      12791797633     13447598015
>      11451270205     11704847480
>      13335320346     13858155642
>      10682907328     10513565749
>      10173249704     10254224697
>       8309259793      8893668653

^^^ This is fine, that's what you measured.

> avg  11547894262     11836376429

But providing averages with that many significant digits is nonsensical
(most of them are noise).

If I put your columns into D (Octave) and estimate some errors:

    (std(D)/sqrt(10)) ./ mean(D)
    ans =

       0.046626   0.046755

the error itself would be rounded to ~5%, so the measured averages should
be rounded accordingly:

    avg  11500000000   11800000000

or even more conservatively:

    avg  12000000000   12000000000

> Two QEMU emulators were run simultaneously, one with the original kernel
> and the other with the optimized kernel, and the averages of the results
> over 10 runs were compared. After the optimization, the number of
> iterations of the infinite loop increased by approximately 2.5%.

Notice that the measured change is on par with the noise in the data
(i.e. it may be accidental). You may need more iterations to get a
cleaner result (more convincing data).

HTH,
Michal
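[Michal's Octave estimate can be reproduced with a short C program; this is
a sketch, with the samples copied from the table above and the standard
deviation using the same N-1 normalization as Octave's std().]

#include <math.h>
#include <stdio.h>

static const double orig[10] = {
	10242794134, 13650210798, 12953159254, 11888973428, 12791797633,
	11451270205, 13335320346, 10682907328, 10173249704,  8309259793,
};
static const double opt[10] = {
	10659512784, 13555924695, 13733609646, 11742656925, 13447598015,
	11704847480, 13858155642, 10513565749, 10254224697,  8893668653,
};

static void report(const char *name, const double *d, int n)
{
	double mean = 0.0, var = 0.0;
	int i;

	for (i = 0; i < n; i++)
		mean += d[i];
	mean /= n;
	for (i = 0; i < n; i++)
		var += (d[i] - mean) * (d[i] - mean);
	var /= n - 1;	/* sample variance, N-1 normalization */
	/* relative standard error of the mean: (std/sqrt(n))/mean */
	printf("%s: mean=%.0f rel_sem=%.6f\n", name, mean, sqrt(var / n) / mean);
}

int main(void)
{
	report("orig", orig, 10);
	report("opt ", opt, 10);
	return 0;
}

[A relative standard error of about 4.7% puts roughly 5.4e8 iterations of
uncertainty on each mean, while the difference between the two means is only
about 2.9e8, which is why the 2.5% improvement sits within the noise.]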
Hi Michal,

That is a good point; however, I currently don't have a stable hardware
environment in which to run this test case. Running it on QEMU indeed
subjects it to significant random interference. I attempted to make the
test cases run for longer periods, but I found that the results still
varied significantly each time. So the previous test data was obtained by
running two QEMU instances simultaneously, one running the unoptimized
kernel and the other running the optimized kernel, which makes the results
more convincing. Nevertheless, from the code logic it is evident that the
optimization does result in fewer rt_se insert and delete operations,
which theoretically should improve efficiency.

Thanks.

--
Best Regards,
Xavier

At 2024-07-29 17:32:37, "Michal Koutný" <mkoutny@suse.com> wrote:
>Notice that the measured change is on par with the noise in the data
>(i.e. it may be accidental). You may need more iterations to get a
>cleaner result (more convincing data).
>
>HTH,
>Michal
Hi all,

Patch v2 fixes the issues arising from disabling the CONFIG_RT_GROUP_SCHED
macro during compilation.

>The first patch optimizes the enqueue and dequeue of rt_se; the strategy
>employs a bottom-up removal approach.
>
>The second patch provides validation for the efficiency improvements made
>by patch 1. The test case counts the number of infinite-loop executions
>across all threads.
>
>Two QEMU emulators were run simultaneously, one with the original kernel
>and the other with the optimized kernel, and the averages of the results
>over 10 runs were compared. After the optimization, the number of
>iterations of the infinite loop increased by approximately 2.5%.

Kindly review.

Xavier (2):
  RT SCHED: Optimize the enqueue and dequeue operations for rt_se
  RT test: Adding test cases for RT group scheduling

 MAINTAINERS                                  |   7 +
 kernel/sched/debug.c                         |  50 ++++
 kernel/sched/rt.c                            | 278 +++++++++++++++---
 kernel/sched/sched.h                         |   1 +
 tools/testing/selftests/sched/Makefile       |   4 +-
 tools/testing/selftests/sched/deadloop.c     | 192 ++++++++++++
 .../selftests/sched/rt_group_sched_test.sh   | 119 ++++++++
 7 files changed, 609 insertions(+), 42 deletions(-)
 create mode 100644 tools/testing/selftests/sched/deadloop.c
 create mode 100755 tools/testing/selftests/sched/rt_group_sched_test.sh

--
2.45.2
Hi all,

Patch v3 fixes the issue with handling tasks whose prio is set to 0,
observed during the execution of blktests.

Kindly review.

Best Regards,
Xavier

Xavier (2):
  RT SCHED: Optimize the enqueue and dequeue operations for rt_se
  RT test: Adding test cases for RT group scheduling

 MAINTAINERS                                  |   7 +
 kernel/sched/debug.c                         |  48 +++
 kernel/sched/rt.c                            | 287 +++++++++++++++---
 kernel/sched/sched.h                         |   1 +
 tools/testing/selftests/sched/Makefile       |   4 +-
 tools/testing/selftests/sched/deadloop.c     | 192 ++++++++++++
 .../selftests/sched/rt_group_sched_test.sh   | 119 ++++++++
 7 files changed, 618 insertions(+), 40 deletions(-)
 create mode 100644 tools/testing/selftests/sched/deadloop.c
 create mode 100755 tools/testing/selftests/sched/rt_group_sched_test.sh

--
2.45.2
This patch optimizes the enqueue and dequeue of rt_se; the strategy employs
a bottom-up removal approach. Specifically, when removing an rt_se at a
certain level, if it is determined that the highest priority of the rq
associated with that rt_se has not changed, there is no need to continue
removing rt_se at higher levels. At this point, only the total number
of removed rt_se needs to be recorded, and the rt_nr_running count of
the higher-level rqs reduced accordingly.
Signed-off-by: Xavier <xavier_qy@163.com>
---
kernel/sched/debug.c | 48 ++++++++
kernel/sched/rt.c | 287 +++++++++++++++++++++++++++++++++++++------
kernel/sched/sched.h | 1 +
3 files changed, 298 insertions(+), 38 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index c1eb9a1afd13..bf9edba5e87b 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -712,6 +712,54 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
#endif
}
+void print_rt_se(struct seq_file *m, struct sched_rt_entity *rt_se)
+{
+ struct task_struct *task;
+
+#ifdef CONFIG_RT_GROUP_SCHED
+ if (rt_se->my_q) {
+ SEQ_printf_task_group_path(m, rt_se->my_q->tg, "%s\n");
+ return;
+ }
+#endif
+ task = container_of(rt_se, struct task_struct, rt);
+ SEQ_printf(m, " prio-%d, pid-%d, %s\n", task->prio, task->pid, task->comm);
+}
+
+/* must be called with the rq lock held */
+void print_rt_rq_task(struct seq_file *m, struct rt_rq *rt_rq)
+{
+ struct rt_prio_array *array = &rt_rq->active;
+ struct sched_rt_entity *rt_se;
+ struct list_head *queue, *head;
+ unsigned long bitmap[2];
+ int idx;
+ int count = 0;
+
+ if (!rt_rq->rt_nr_running)
+ return;
+
+ memcpy(bitmap, array->bitmap, sizeof(unsigned long) * 2);
+ idx = sched_find_first_bit(bitmap);
+ WARN_ON_ONCE(idx >= MAX_RT_PRIO);
+
+ while (1) {
+ clear_bit(idx, bitmap);
+ queue = array->queue + idx;
+ head = queue;
+ queue = queue->next;
+ do {
+ rt_se = list_entry(queue, struct sched_rt_entity, run_list);
+ print_rt_se(m, rt_se);
+ queue = queue->next;
+ count++;
+ } while (queue != head);
+ idx = sched_find_first_bit(bitmap);
+ if (idx >= MAX_RT_PRIO)
+ break;
+ }
+}
+
void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
{
#ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index aa4c1c874fa4..b18c424a50d2 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1113,7 +1113,7 @@ void dec_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio) {}
#endif /* CONFIG_SMP */
#if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
-static void
+static int
inc_rt_prio(struct rt_rq *rt_rq, int prio)
{
int prev_prio = rt_rq->highest_prio.curr;
@@ -1122,9 +1122,11 @@ inc_rt_prio(struct rt_rq *rt_rq, int prio)
rt_rq->highest_prio.curr = prio;
inc_rt_prio_smp(rt_rq, prio, prev_prio);
+
+ return prev_prio > prio;
}
-static void
+static int
dec_rt_prio(struct rt_rq *rt_rq, int prio)
{
int prev_prio = rt_rq->highest_prio.curr;
@@ -1149,12 +1151,22 @@ dec_rt_prio(struct rt_rq *rt_rq, int prio)
}
dec_rt_prio_smp(rt_rq, prio, prev_prio);
+ if (rt_rq->highest_prio.curr > prio)
+ return prio;
+ else
+ return MAX_RT_PRIO;
}
#else
-static inline void inc_rt_prio(struct rt_rq *rt_rq, int prio) {}
-static inline void dec_rt_prio(struct rt_rq *rt_rq, int prio) {}
+static inline int inc_rt_prio(struct rt_rq *rt_rq, int prio)
+{
+ return 0;
+}
+static inline int dec_rt_prio(struct rt_rq *rt_rq, int prio)
+{
+ return 0;
+}
#endif /* CONFIG_SMP || CONFIG_RT_GROUP_SCHED */
@@ -1218,28 +1230,31 @@ unsigned int rt_se_rr_nr_running(struct sched_rt_entity *rt_se)
}
static inline
-void inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
+int inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
{
int prio = rt_se_prio(rt_se);
+ int prio_change;
WARN_ON(!rt_prio(prio));
rt_rq->rt_nr_running += rt_se_nr_running(rt_se);
rt_rq->rr_nr_running += rt_se_rr_nr_running(rt_se);
- inc_rt_prio(rt_rq, prio);
+ prio_change = inc_rt_prio(rt_rq, prio);
inc_rt_group(rt_se, rt_rq);
+ return prio_change;
}
static inline
-void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
+int dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq, int prio)
{
+ int prio_changed;
WARN_ON(!rt_prio(rt_se_prio(rt_se)));
- WARN_ON(!rt_rq->rt_nr_running);
rt_rq->rt_nr_running -= rt_se_nr_running(rt_se);
rt_rq->rr_nr_running -= rt_se_rr_nr_running(rt_se);
- dec_rt_prio(rt_rq, rt_se_prio(rt_se));
+ prio_changed = dec_rt_prio(rt_rq, prio);
dec_rt_group(rt_se, rt_rq);
+ return prio_changed;
}
/*
@@ -1255,12 +1270,13 @@ static inline bool move_entity(unsigned int flags)
return true;
}
-static void __delist_rt_entity(struct sched_rt_entity *rt_se, struct rt_prio_array *array)
+static void __delist_rt_entity(struct sched_rt_entity *rt_se,
+ struct rt_prio_array *array, int last_prio)
{
list_del_init(&rt_se->run_list);
- if (list_empty(array->queue + rt_se_prio(rt_se)))
- __clear_bit(rt_se_prio(rt_se), array->bitmap);
+ if (list_empty(array->queue + last_prio))
+ __clear_bit(last_prio, array->bitmap);
rt_se->on_list = 0;
}
@@ -1371,7 +1387,12 @@ update_stats_dequeue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
}
}
-static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
+/*
+ * Returns: -1 indicates that rt_se was not enqueued, 0 indicates that the highest
+ * priority of the rq did not change after enqueue, and 1 indicates that the highest
+ * priority of the rq changed after enqueue.
+ */
+static int __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
struct rt_prio_array *array = &rt_rq->active;
@@ -1386,8 +1407,8 @@ static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flag
*/
if (group_rq && (rt_rq_throttled(group_rq) || !group_rq->rt_nr_running)) {
if (rt_se->on_list)
- __delist_rt_entity(rt_se, array);
- return;
+ __delist_rt_entity(rt_se, array, rt_se_prio(rt_se));
+ return -1;
}
if (move_entity(flags)) {
@@ -1402,73 +1423,263 @@ static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flag
}
rt_se->on_rq = 1;
- inc_rt_tasks(rt_se, rt_rq);
+ return inc_rt_tasks(rt_se, rt_rq);
}
-static void __dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
+/**
+ * __dequeue_rt_entity - delete rt_se from its rt_rq
+ *
+ * @rt_se: the node to be deleted
+ * @last_prio: the highest priority of this rt_se before the previous
+ *             round of deletion
+ * @flags: operation flags
+ *
+ * Returns: 0 if the highest priority of the current rq did not change
+ * during this deletion; >0 if it changed, in which case the previous
+ * highest priority is returned for use in the next round of deletion.
+ */
+static int __dequeue_rt_entity(struct sched_rt_entity *rt_se, int last_prio,
+ unsigned int flags)
{
struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
struct rt_prio_array *array = &rt_rq->active;
if (move_entity(flags)) {
WARN_ON_ONCE(!rt_se->on_list);
- __delist_rt_entity(rt_se, array);
+ __delist_rt_entity(rt_se, array, last_prio);
}
rt_se->on_rq = 0;
- dec_rt_tasks(rt_se, rt_rq);
+ return dec_rt_tasks(rt_se, rt_rq, last_prio);
+}
+
+static inline void dec_rq_nr_running(struct sched_rt_entity *rt_se,
+ unsigned int rt, unsigned int rr)
+{
+ struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
+
+ rt_rq->rt_nr_running -= rt;
+ rt_rq->rr_nr_running -= rr;
+}
+
+static inline void add_rq_nr_running(struct sched_rt_entity *rt_se,
+ unsigned int rt, unsigned int rr)
+{
+ struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
+
+ rt_rq->rt_nr_running += rt;
+ rt_rq->rr_nr_running += rr;
+}
+
+static inline bool on_top_rt_rq(struct sched_rt_entity *rt_se)
+{
+#ifdef CONFIG_RT_GROUP_SCHED
+ if (rt_se->parent)
+ return false;
+#endif
+ return true;
}
/*
- * Because the prio of an upper entry depends on the lower
- * entries, we must remove entries top - down.
+ * To optimize the enqueue and dequeue of rt_se, this strategy employs a
+ * bottom-up removal approach. Specifically, when removing an rt_se at a
+ * certain level, if it is determined that the highest priority of the rq
+ * associated with that rt_se has not changed, there is no need to continue
+ * removing rt_se at higher levels. At this point, only the total number
+ * of removed rt_se needs to be recorded, and the rt_nr_running count of
+ * the higher-level rqs reduced accordingly.
+ *
+ * For enqueue operations, if an rt_se at a certain level is in the rq,
+ * it is still necessary to check the priority of the higher-level rq.
+ * If the priority of the higher-level rq is found to be lower than that
+ * of the rt_se to be added, it should be removed, as updating the highest
+ * priority of the rq during addition will cause the rq to be repositioned
+ * in the parent rq.
+ *
+ * Conversely, for dequeue operations, if an rt_se at a certain level is
+ * not in the rq, the operation can be exited immediately to reduce
+ * unnecessary checks and handling.
+ *
+ * For enqueue operations, the return value refers to the last rt_se
+ * that was removed. For dequeue operations, it refers to the last
+ * rt_se that was either removed or had its rt_nr_running updated.
*/
-static void dequeue_rt_stack(struct sched_rt_entity *rt_se, unsigned int flags)
+static struct sched_rt_entity *dequeue_rt_stack(struct sched_rt_entity *rt_se,
+ unsigned int flags, int for_enqueue)
{
- struct sched_rt_entity *back = NULL;
- unsigned int rt_nr_running;
+ struct sched_rt_entity *last = rt_se;
+ struct sched_rt_entity *origin = rt_se;
+ unsigned int del_rt_nr = 0;
+ unsigned int del_rr_nr = 0;
+ int prio_changed = rt_se_prio(rt_se);
+ int sub_on_rq = 1;
for_each_sched_rt_entity(rt_se) {
- rt_se->back = back;
- back = rt_se;
- }
+ if (on_rt_rq(rt_se)) {
+ if (sub_on_rq) {
+ /*
+ * The number of tasks removed from the sub-level rt_se also needs
+ * to be subtracted from the rq of the current rt_se, as the current
+ * rt_se's rq no longer includes the number of removed tasks.
+ */
+ dec_rq_nr_running(rt_se, del_rt_nr, del_rr_nr);
+ if ((prio_changed != MAX_RT_PRIO) ||
+ (rt_se_prio(rt_se) > rt_se_prio(origin))) {
+ /*
+ * If the removal of the lower-level rt_se causes the
+ * highest priority of the current rq to change, or if the
+ * priority of current rq is lower than the rt_se to be
+ * added, then the current rt_se also needs to be removed
+ * from its parent rq, and the number of deleted tasks
+ * should be accumulated.
+ */
+ if (prio_changed == MAX_RT_PRIO)
+ prio_changed = rt_se_prio(rt_se);
+ del_rt_nr += rt_se_nr_running(rt_se);
+ del_rr_nr += rt_se_rr_nr_running(rt_se);
+ prio_changed = __dequeue_rt_entity(rt_se,
+ prio_changed, flags);
+ last = rt_se;
+ } else if (!for_enqueue) {
+ /* For dequeue, the last rt_se may have had only its rt_nr_running modified. */
+ last = rt_se;
+ }
+ } else {
+ /*
+ * Entering this branch must be for enqueue, as dequeue would break
+ * if an rt_se is not online.
+ * If the sub-level node is not online, and the current rt_se's
+ * priority is lower than the one being added, the current rt_se needs
+ * to be removed.
+ */
+ prio_changed = rt_se_prio(rt_se);
+ if (prio_changed > rt_se_prio(origin)) {
+ del_rt_nr += rt_se_nr_running(rt_se);
+ del_rr_nr += rt_se_rr_nr_running(rt_se);
+ prio_changed = __dequeue_rt_entity(rt_se,
+ prio_changed, flags);
+ last = rt_se;
+ } else {
+ prio_changed = MAX_RT_PRIO;
+ }
+ }
- rt_nr_running = rt_rq_of_se(back)->rt_nr_running;
+ /*
+ * If the current rt_se is on the top rt_rq, then the already deleted
+ * nodes, plus the count of the rt_rq where the current rt_se is located,
+ * need to be removed from the top_rt_rq.
+ */
+ if (on_top_rt_rq(rt_se)) {
+ dequeue_top_rt_rq(rt_rq_of_se(rt_se),
+ del_rt_nr + rt_rq_of_se(rt_se)->rt_nr_running);
+ }
+ sub_on_rq = 1;
+ } else if (for_enqueue) {
+ struct rt_rq *group_rq = group_rt_rq(rt_se);
- for (rt_se = back; rt_se; rt_se = rt_se->back) {
- if (on_rt_rq(rt_se))
- __dequeue_rt_entity(rt_se, flags);
+ /*
+ * In the case of an enqueue operation, if a certain level is found to be
+ * not online, then the previous counts need to be reset to zero.
+ */
+ prio_changed = MAX_RT_PRIO;
+ sub_on_rq = 0;
+ del_rt_nr = 0;
+ del_rr_nr = 0;
+
+ /*
+ * If the current group is being throttled, then there is no need to check
+ * higher levels since enqueueing will not affect higher-level nodes.
+ */
+ if (group_rq && rt_rq_throttled(group_rq))
+ break;
+
+ if (on_top_rt_rq(rt_se))
+ dequeue_top_rt_rq(rt_rq_of_se(rt_se),
+ rt_rq_of_se(rt_se)->rt_nr_running);
+ } else {
+ last = rt_se;
+ break;
+ }
}
- dequeue_top_rt_rq(rt_rq_of_se(back), rt_nr_running);
+ return last;
}
static void enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
struct rq *rq = rq_of_rt_se(rt_se);
+ struct sched_rt_entity *last;
+ unsigned int add_rt_nr = 0;
+ unsigned int add_rr_nr = 0;
+ int enqueue = 1;
+ int prio_change = 1;
update_stats_enqueue_rt(rt_rq_of_se(rt_se), rt_se, flags);
- dequeue_rt_stack(rt_se, flags);
- for_each_sched_rt_entity(rt_se)
- __enqueue_rt_entity(rt_se, flags);
+ last = dequeue_rt_stack(rt_se, flags, 1);
+
+ for_each_sched_rt_entity(rt_se) {
+ if (enqueue || !on_rt_rq(rt_se) || (prio_change == 1)) {
+ prio_change = __enqueue_rt_entity(rt_se, flags);
+ if (prio_change >= 0) {
+ add_rt_nr = rt_se_nr_running(rt_se);
+ add_rr_nr = rt_se_rr_nr_running(rt_se);
+ } else {
+ add_rt_nr = add_rr_nr = 0;
+ }
+ } else {
+ add_rq_nr_running(rt_se, add_rt_nr, add_rr_nr);
+ }
+
+ if (rt_se == last)
+ enqueue = 0;
+ }
+
enqueue_top_rt_rq(&rq->rt);
}
static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
struct rq *rq = rq_of_rt_se(rt_se);
+ struct sched_rt_entity *last;
+ unsigned int add_rt_nr = 0;
+ unsigned int add_rr_nr = 0;
+ int prio_change = 1;
update_stats_dequeue_rt(rt_rq_of_se(rt_se), rt_se, flags);
- dequeue_rt_stack(rt_se, flags);
+ last = dequeue_rt_stack(rt_se, flags, 0);
for_each_sched_rt_entity(rt_se) {
struct rt_rq *rt_rq = group_rt_rq(rt_se);
+ if (rt_rq && rt_rq->rt_nr_running) {
+ if (on_rt_rq(rt_se)) {
+ add_rq_nr_running(rt_se, add_rt_nr, add_rr_nr);
+ } else {
+ prio_change = __enqueue_rt_entity(rt_se, flags);
+ if (prio_change == 0) {
+ /*
+ * If enqueue is successful and the priority of the rq has
+ * not changed, then the parent node only needs to add the
+ * count of the current rt_se. Otherwise, the parent node
+ * will also need to be enqueued.
+ */
+ add_rt_nr = rt_se_nr_running(rt_se);
+ add_rr_nr = rt_se_rr_nr_running(rt_se);
+ }
+ }
+ } else {
+ add_rt_nr = add_rr_nr = 0;
+ }
- if (rt_rq && rt_rq->rt_nr_running)
- __enqueue_rt_entity(rt_se, flags);
+ /*
+ * last is the last rt_se that was deleted or whose count was
+ * modified, so subsequent rt_se entries do not need to be updated.
+ */
+ if (rt_se == last)
+ break;
}
+
enqueue_top_rt_rq(&rq->rt);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a831af102070..b634153aacf0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2878,6 +2878,7 @@ extern void print_rt_stats(struct seq_file *m, int cpu);
extern void print_dl_stats(struct seq_file *m, int cpu);
extern void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq);
extern void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq);
+extern void print_rt_rq_task(struct seq_file *m, struct rt_rq *rt_rq);
extern void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq);
extern void resched_latency_warn(int cpu, u64 latency);
--
2.45.2
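[Outside the diff, the early-exit idea at the core of the patch can be
summarized with a toy model. This is a simplified sketch with hypothetical
types, not the kernel implementation: it assumes the group entity occupies
the same priority slot at every level, and it ignores throttling and the
re-enqueue of an entity at its new priority.]

#define NR_PRIO 100	/* stand-in for MAX_RT_PRIO */
#define DEPTH   3	/* depth of the group hierarchy */

struct level {
	int nr_at_prio[NR_PRIO];	/* queued entities per priority */
	int nr_running;			/* stand-in for rt_nr_running */
};

static int highest_prio(const struct level *l)
{
	int p;

	for (p = 0; p < NR_PRIO; p++)
		if (l->nr_at_prio[p])
			return p;
	return NR_PRIO;
}

/*
 * Walk bottom-up from the leaf.  Once removing the entity leaves a
 * level's highest priority unchanged, the ancestors keep their queue
 * positions and only their nr_running counters are fixed up.
 */
static void dequeue_bottom_up(struct level *lvl, int prio, int nr)
{
	int d;

	for (d = 0; d < DEPTH; d++) {
		int prev = highest_prio(&lvl[d]);

		lvl[d].nr_at_prio[prio]--;
		lvl[d].nr_running -= nr;
		if (highest_prio(&lvl[d]) == prev) {
			/* early exit: counters only from here on up */
			for (d++; d < DEPTH; d++)
				lvl[d].nr_running -= nr;
			return;
		}
	}
}

int main(void)
{
	static struct level lvl[DEPTH];	/* zero-initialized */
	int d;

	/* one entity at prio 5 and one at prio 7 on every level */
	for (d = 0; d < DEPTH; d++) {
		lvl[d].nr_at_prio[5] = 1;
		lvl[d].nr_at_prio[7] = 1;
		lvl[d].nr_running = 2;
	}
	/* removing the prio-7 entity exits early at the leaf level */
	dequeue_bottom_up(lvl, 7, 1);
	return 0;
}

[In the pre-patch top-down scheme, every level is dequeued and re-enqueued
regardless; here the levels above the first one whose highest priority is
unchanged are untouched apart from their counters, which is where the saved
insert/delete operations come from.]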
Hi Xavier,
kernel test robot noticed the following build warnings:
[auto build test WARNING on tip/sched/core]
[also build test WARNING on shuah-kselftest/next shuah-kselftest/fixes peterz-queue/sched/core linus/master v6.10]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Xavier/RT-SCHED-Optimize-the-enqueue-and-dequeue-operations-for-rt_se/20240716-140932
base: tip/sched/core
patch link: https://lore.kernel.org/r/20240716060514.304324-2-xavier_qy%40163.com
patch subject: [PATCH-RT sched v3 1/2] RT SCHED: Optimize the enqueue and dequeue operations for rt_se
config: x86_64-randconfig-121-20240716 (https://download.01.org/0day-ci/archive/20240717/202407170411.vRtOCOzx-lkp@intel.com/config)
compiler: gcc-8 (Ubuntu 8.4.0-3ubuntu2) 8.4.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240717/202407170411.vRtOCOzx-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407170411.vRtOCOzx-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
kernel/sched/build_utility.c: note: in included file:
kernel/sched/debug.c:469:17: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/debug.c:469:17: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/debug.c:469:17: sparse: got struct sched_domain [noderef] __rcu *parent
>> kernel/sched/debug.c:715:6: sparse: sparse: symbol 'print_rt_se' was not declared. Should it be static?
kernel/sched/debug.c:842:9: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *tsk @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/debug.c:842:9: sparse: expected struct task_struct *tsk
kernel/sched/debug.c:842:9: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/debug.c:842:9: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *tsk @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/debug.c:842:9: sparse: expected struct task_struct *tsk
kernel/sched/debug.c:842:9: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/build_utility.c: note: in included file:
kernel/sched/stats.c:148:17: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/stats.c:148:17: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/stats.c:148:17: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/build_utility.c: note: in included file:
kernel/sched/topology.c:107:56: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:107:56: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:107:56: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:126:60: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:126:60: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:126:60: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:149:20: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:149:20: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:149:20: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:454:13: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct perf_domain *[assigned] tmp @@ got struct perf_domain [noderef] __rcu *pd @@
kernel/sched/topology.c:454:13: sparse: expected struct perf_domain *[assigned] tmp
kernel/sched/topology.c:454:13: sparse: got struct perf_domain [noderef] __rcu *pd
kernel/sched/topology.c:463:13: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct perf_domain *[assigned] tmp @@ got struct perf_domain [noderef] __rcu *pd @@
kernel/sched/topology.c:463:13: sparse: expected struct perf_domain *[assigned] tmp
kernel/sched/topology.c:463:13: sparse: got struct perf_domain [noderef] __rcu *pd
kernel/sched/topology.c:484:19: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct perf_domain *[assigned] pd @@ got struct perf_domain [noderef] __rcu *pd @@
kernel/sched/topology.c:484:19: sparse: expected struct perf_domain *[assigned] pd
kernel/sched/topology.c:484:19: sparse: got struct perf_domain [noderef] __rcu *pd
kernel/sched/topology.c:646:49: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *parent @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:646:49: sparse: expected struct sched_domain *parent
kernel/sched/topology.c:646:49: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:731:50: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *parent @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:731:50: sparse: expected struct sched_domain *parent
kernel/sched/topology.c:731:50: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:739:55: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain [noderef] __rcu *[noderef] __rcu child @@ got struct sched_domain *[assigned] tmp @@
kernel/sched/topology.c:739:55: sparse: expected struct sched_domain [noderef] __rcu *[noderef] __rcu child
kernel/sched/topology.c:739:55: sparse: got struct sched_domain *[assigned] tmp
kernel/sched/topology.c:752:29: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] tmp @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:752:29: sparse: expected struct sched_domain *[assigned] tmp
kernel/sched/topology.c:752:29: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:757:20: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:757:20: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:757:20: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:778:13: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] tmp @@ got struct sched_domain [noderef] __rcu *sd @@
kernel/sched/topology.c:778:13: sparse: expected struct sched_domain *[assigned] tmp
kernel/sched/topology.c:778:13: sparse: got struct sched_domain [noderef] __rcu *sd
kernel/sched/topology.c:940:70: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:940:70: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:940:70: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:969:59: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:969:59: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:969:59: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:1015:57: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:1015:57: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:1015:57: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:1017:25: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *sibling @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:1017:25: sparse: expected struct sched_domain *sibling
kernel/sched/topology.c:1017:25: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:1025:55: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:1025:55: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:1025:55: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:1027:25: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *sibling @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:1027:25: sparse: expected struct sched_domain *sibling
kernel/sched/topology.c:1027:25: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:1097:62: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:1097:62: sparse: expected struct sched_domain *sd
kernel/sched/topology.c:1097:62: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:1201:40: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *child @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:1201:40: sparse: expected struct sched_domain *child
kernel/sched/topology.c:1201:40: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:1629:43: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain [noderef] __rcu *child @@ got struct sched_domain *child @@
kernel/sched/topology.c:1629:43: sparse: expected struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:1629:43: sparse: got struct sched_domain *child
kernel/sched/topology.c:2328:31: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain [noderef] __rcu *parent @@ got struct sched_domain *sd @@
kernel/sched/topology.c:2328:31: sparse: expected struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:2328:31: sparse: got struct sched_domain *sd
kernel/sched/topology.c:2430:57: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:2430:57: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/topology.c:2430:57: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:2451:56: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *child @@ got struct sched_domain [noderef] __rcu *child @@
kernel/sched/topology.c:2451:56: sparse: expected struct sched_domain *child
kernel/sched/topology.c:2451:56: sparse: got struct sched_domain [noderef] __rcu *child
kernel/sched/topology.c:2450:57: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:2450:57: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/topology.c:2450:57: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/topology.c:2505:57: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/topology.c:2505:57: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/topology.c:2505:57: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/build_utility.c: note: in included file:
kernel/sched/core_sched.c:276:37: sparse: sparse: incompatible types in conditional expression (different address spaces):
kernel/sched/core_sched.c:276:37: sparse: struct task_struct *
kernel/sched/core_sched.c:276:37: sparse: struct task_struct [noderef] __rcu *
kernel/sched/build_utility.c: note: in included file:
kernel/sched/build_utility.c: note: in included file (through include/linux/mmzone.h, include/linux/topology.h, include/linux/sched/topology.h, ...):
include/linux/page-flags.h:240:46: sparse: sparse: self-comparison always evaluates to false
include/linux/page-flags.h:240:46: sparse: sparse: self-comparison always evaluates to false
kernel/sched/build_utility.c: note: in included file:
kernel/sched/sched.h:2175:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
vim +/print_rt_se +715 kernel/sched/debug.c
714
> 715 void print_rt_se(struct seq_file *m, struct sched_rt_entity *rt_se)
716 {
717 struct task_struct *task;
718
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi all,

This version fixes the compilation warnings in debug.c.

To Peter: when is the new RT group implementation expected to be proposed?
Do you think my current patch is appropriate as an optimization until then?

Xavier (2):
  RT SCHED: Optimize the enqueue and dequeue operations for rt_se
  RT test: Adding test cases for RT group scheduling

 MAINTAINERS                                  |   7 +
 kernel/sched/debug.c                         |  48 +++
 kernel/sched/rt.c                            | 287 +++++++++++++++---
 kernel/sched/sched.h                         |   1 +
 tools/testing/selftests/sched/Makefile       |   4 +-
 tools/testing/selftests/sched/deadloop.c     | 192 ++++++++++++
 .../selftests/sched/rt_group_sched_test.sh   | 119 ++++++++
 7 files changed, 618 insertions(+), 40 deletions(-)
 create mode 100644 tools/testing/selftests/sched/deadloop.c
 create mode 100755 tools/testing/selftests/sched/rt_group_sched_test.sh

--
2.45.2
This patch optimizes the enqueue and dequeue of rt_se; the strategy employs
a bottom-up removal approach. Specifically, when removing an rt_se at a
certain level, if it is determined that the highest priority of the rq
associated with that rt_se has not changed, there is no need to continue
removing rt_se at higher levels. At this point, only the total number
of removed rt_se needs to be recorded, and the rt_nr_running count of
the higher-level rqs reduced accordingly.
Signed-off-by: Xavier <xavier_qy@163.com>
---
kernel/sched/debug.c | 48 ++++++++
kernel/sched/rt.c | 287 +++++++++++++++++++++++++++++++++++++------
kernel/sched/sched.h | 1 +
3 files changed, 298 insertions(+), 38 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index c1eb9a1afd13..352ee55da25e 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -712,6 +712,54 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
#endif
}
+static void print_rt_se(struct seq_file *m, struct sched_rt_entity *rt_se)
+{
+ struct task_struct *task;
+
+#ifdef CONFIG_RT_GROUP_SCHED
+ if (rt_se->my_q) {
+ SEQ_printf_task_group_path(m, rt_se->my_q->tg, "%s\n");
+ return;
+ }
+#endif
+ task = container_of(rt_se, struct task_struct, rt);
+ SEQ_printf(m, " prio-%d, pid-%d, %s\n", task->prio, task->pid, task->comm);
+}
+
+/* must be called with the rq lock held */
+void print_rt_rq_task(struct seq_file *m, struct rt_rq *rt_rq)
+{
+ struct rt_prio_array *array = &rt_rq->active;
+ struct sched_rt_entity *rt_se;
+ struct list_head *queue, *head;
+ unsigned long bitmap[2];
+ int idx;
+ int count = 0;
+
+ if (!rt_rq->rt_nr_running)
+ return;
+
+ memcpy(bitmap, array->bitmap, sizeof(unsigned long) * 2);
+ idx = sched_find_first_bit(bitmap);
+ WARN_ON_ONCE(idx >= MAX_RT_PRIO);
+
+ while (1) {
+ clear_bit(idx, bitmap);
+ queue = array->queue + idx;
+ head = queue;
+ queue = queue->next;
+ do {
+ rt_se = list_entry(queue, struct sched_rt_entity, run_list);
+ print_rt_se(m, rt_se);
+ queue = queue->next;
+ count++;
+ } while (queue != head);
+ idx = sched_find_first_bit(bitmap);
+ if (idx >= MAX_RT_PRIO)
+ break;
+ }
+}
+
void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
{
#ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index aa4c1c874fa4..b18c424a50d2 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1113,7 +1113,7 @@ void dec_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio) {}
#endif /* CONFIG_SMP */
#if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
-static void
+static int
inc_rt_prio(struct rt_rq *rt_rq, int prio)
{
int prev_prio = rt_rq->highest_prio.curr;
@@ -1122,9 +1122,11 @@ inc_rt_prio(struct rt_rq *rt_rq, int prio)
rt_rq->highest_prio.curr = prio;
inc_rt_prio_smp(rt_rq, prio, prev_prio);
+
+ return prev_prio > prio;
}
-static void
+static int
dec_rt_prio(struct rt_rq *rt_rq, int prio)
{
int prev_prio = rt_rq->highest_prio.curr;
@@ -1149,12 +1151,22 @@ dec_rt_prio(struct rt_rq *rt_rq, int prio)
}
dec_rt_prio_smp(rt_rq, prio, prev_prio);
+ if (rt_rq->highest_prio.curr > prio)
+ return prio;
+ else
+ return MAX_RT_PRIO;
}
#else
-static inline void inc_rt_prio(struct rt_rq *rt_rq, int prio) {}
-static inline void dec_rt_prio(struct rt_rq *rt_rq, int prio) {}
+static inline int inc_rt_prio(struct rt_rq *rt_rq, int prio)
+{
+ return 0;
+}
+static inline int dec_rt_prio(struct rt_rq *rt_rq, int prio)
+{
+ return 0;
+}
#endif /* CONFIG_SMP || CONFIG_RT_GROUP_SCHED */
@@ -1218,28 +1230,31 @@ unsigned int rt_se_rr_nr_running(struct sched_rt_entity *rt_se)
}
static inline
-void inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
+int inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
{
int prio = rt_se_prio(rt_se);
+ int prio_change;
WARN_ON(!rt_prio(prio));
rt_rq->rt_nr_running += rt_se_nr_running(rt_se);
rt_rq->rr_nr_running += rt_se_rr_nr_running(rt_se);
- inc_rt_prio(rt_rq, prio);
+ prio_change = inc_rt_prio(rt_rq, prio);
inc_rt_group(rt_se, rt_rq);
+ return prio_change;
}
static inline
-void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
+int dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq, int prio)
{
+ int prio_changed;
WARN_ON(!rt_prio(rt_se_prio(rt_se)));
- WARN_ON(!rt_rq->rt_nr_running);
rt_rq->rt_nr_running -= rt_se_nr_running(rt_se);
rt_rq->rr_nr_running -= rt_se_rr_nr_running(rt_se);
- dec_rt_prio(rt_rq, rt_se_prio(rt_se));
+ prio_changed = dec_rt_prio(rt_rq, prio);
dec_rt_group(rt_se, rt_rq);
+ return prio_changed;
}
/*
@@ -1255,12 +1270,13 @@ static inline bool move_entity(unsigned int flags)
return true;
}
-static void __delist_rt_entity(struct sched_rt_entity *rt_se, struct rt_prio_array *array)
+static void __delist_rt_entity(struct sched_rt_entity *rt_se,
+ struct rt_prio_array *array, int last_prio)
{
list_del_init(&rt_se->run_list);
- if (list_empty(array->queue + rt_se_prio(rt_se)))
- __clear_bit(rt_se_prio(rt_se), array->bitmap);
+ if (list_empty(array->queue + last_prio))
+ __clear_bit(last_prio, array->bitmap);
rt_se->on_list = 0;
}
@@ -1371,7 +1387,12 @@ update_stats_dequeue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
}
}
-static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
+/*
+ * Returns: -1 indicates that rt_se was not enqueued, 0 indicates that the highest
+ * priority of the rq did not change after enqueue, and 1 indicates that the highest
+ * priority of the rq changed after enqueue.
+ */
+static int __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
struct rt_prio_array *array = &rt_rq->active;
@@ -1386,8 +1407,8 @@ static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flag
*/
if (group_rq && (rt_rq_throttled(group_rq) || !group_rq->rt_nr_running)) {
if (rt_se->on_list)
- __delist_rt_entity(rt_se, array);
- return;
+ __delist_rt_entity(rt_se, array, rt_se_prio(rt_se));
+ return -1;
}
if (move_entity(flags)) {
@@ -1402,73 +1423,263 @@ static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flag
}
rt_se->on_rq = 1;
- inc_rt_tasks(rt_se, rt_rq);
+ return inc_rt_tasks(rt_se, rt_rq);
}
-static void __dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
+/**
+ * __dequeue_rt_entity - delete rt_se from its rt_rq
+ *
+ * @rt_se: the node to be deleted
+ * @last_prio: the highest priority of this rt_se before the previous
+ *             round of deletion
+ * @flags: operation flags
+ *
+ * Returns: 0 if the highest priority of the current rq did not change
+ * during this deletion; >0 if it changed, in which case the previous
+ * highest priority is returned for use in the next round of deletion.
+ */
+static int __dequeue_rt_entity(struct sched_rt_entity *rt_se, int last_prio,
+ unsigned int flags)
{
struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
struct rt_prio_array *array = &rt_rq->active;
if (move_entity(flags)) {
WARN_ON_ONCE(!rt_se->on_list);
- __delist_rt_entity(rt_se, array);
+ __delist_rt_entity(rt_se, array, last_prio);
}
rt_se->on_rq = 0;
- dec_rt_tasks(rt_se, rt_rq);
+ return dec_rt_tasks(rt_se, rt_rq, last_prio);
+}
+
+static inline void dec_rq_nr_running(struct sched_rt_entity *rt_se,
+ unsigned int rt, unsigned int rr)
+{
+ struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
+
+ rt_rq->rt_nr_running -= rt;
+ rt_rq->rr_nr_running -= rr;
+}
+
+static inline void add_rq_nr_running(struct sched_rt_entity *rt_se,
+ unsigned int rt, unsigned int rr)
+{
+ struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
+
+ rt_rq->rt_nr_running += rt;
+ rt_rq->rr_nr_running += rr;
+}
+
+static inline bool on_top_rt_rq(struct sched_rt_entity *rt_se)
+{
+#ifdef CONFIG_RT_GROUP_SCHED
+ if (rt_se->parent)
+ return false;
+#endif
+ return true;
}
/*
- * Because the prio of an upper entry depends on the lower
- * entries, we must remove entries top - down.
+ * To optimize the enqueue and dequeue of rt_se, this strategy employs a
+ * bottom-up removal approach. Specifically, when removing an rt_se at a
+ * certain level, if it is determined that the highest priority of the rq
+ * associated with that rt_se has not changed, there is no need to continue
+ * removing rt_se at higher levels. At this point, only the total number
+ * of removed rt_se needs to be recorded, and the rt_nr_running count of
+ * the higher-level rqs reduced accordingly.
+ *
+ * For enqueue operations, if an rt_se at a certain level is in the rq,
+ * it is still necessary to check the priority of the higher-level rq.
+ * If the priority of the higher-level rq is found to be lower than that
+ * of the rt_se to be added, it should be removed, as updating the highest
+ * priority of the rq during addition will cause the rq to be repositioned
+ * in the parent rq.
+ *
+ * Conversely, for dequeue operations, if an rt_se at a certain level is
+ * not in the rq, the operation can be exited immediately to reduce
+ * unnecessary checks and handling.
+ *
+ * For enqueue operations, the return value refers to the last rt_se
+ * that was removed. For dequeue operations, it refers to the last
+ * rt_se that was either removed or had its rt_nr_running updated.
*/
-static void dequeue_rt_stack(struct sched_rt_entity *rt_se, unsigned int flags)
+static struct sched_rt_entity *dequeue_rt_stack(struct sched_rt_entity *rt_se,
+ unsigned int flags, int for_enqueue)
{
- struct sched_rt_entity *back = NULL;
- unsigned int rt_nr_running;
+ struct sched_rt_entity *last = rt_se;
+ struct sched_rt_entity *origin = rt_se;
+ unsigned int del_rt_nr = 0;
+ unsigned int del_rr_nr = 0;
+ int prio_changed = rt_se_prio(rt_se);
+ int sub_on_rq = 1;
for_each_sched_rt_entity(rt_se) {
- rt_se->back = back;
- back = rt_se;
- }
+ if (on_rt_rq(rt_se)) {
+ if (sub_on_rq) {
+ /*
+ * The number of tasks removed from the sub-level rt_se also needs
+ * to be subtracted from the rq of the current rt_se, as the current
+ * rt_se's rq no longer includes the number of removed tasks.
+ */
+ dec_rq_nr_running(rt_se, del_rt_nr, del_rr_nr);
+ if ((prio_changed != MAX_RT_PRIO) ||
+ (rt_se_prio(rt_se) > rt_se_prio(origin))) {
+ /*
+ * If the removal of the lower-level rt_se causes the
+ * highest priority of the current rq to change, or if the
+ * priority of current rq is lower than the rt_se to be
+ * added, then the current rt_se also needs to be removed
+ * from its parent rq, and the number of deleted tasks
+ * should be accumulated.
+ */
+ if (prio_changed == MAX_RT_PRIO)
+ prio_changed = rt_se_prio(rt_se);
+ del_rt_nr += rt_se_nr_running(rt_se);
+ del_rr_nr += rt_se_rr_nr_running(rt_se);
+ prio_changed = __dequeue_rt_entity(rt_se,
+ prio_changed, flags);
+ last = rt_se;
+ } else if (!for_enqueue) {
+ /* For dequeue, the last rt_se may have had only its rt_nr_running modified. */
+ last = rt_se;
+ }
+ } else {
+ /*
+ * Entering this branch must be for enqueue, as dequeue would break
+ * if an rt_se is not online.
+ * If the sub-level node is not online, and the current rt_se's
+ * priority is lower than the one being added, the current rt_se needs
+ * to be removed.
+ */
+ prio_changed = rt_se_prio(rt_se);
+ if (prio_changed > rt_se_prio(origin)) {
+ del_rt_nr += rt_se_nr_running(rt_se);
+ del_rr_nr += rt_se_rr_nr_running(rt_se);
+ prio_changed = __dequeue_rt_entity(rt_se,
+ prio_changed, flags);
+ last = rt_se;
+ } else {
+ prio_changed = MAX_RT_PRIO;
+ }
+ }
- rt_nr_running = rt_rq_of_se(back)->rt_nr_running;
+ /*
+ * If the current rt_se is on the top rt_rq, then the already deleted
+ * nodes, plus the count of the rt_rq where the current rt_se is located,
+ * need to be removed from the top_rt_rq.
+ */
+ if (on_top_rt_rq(rt_se)) {
+ dequeue_top_rt_rq(rt_rq_of_se(rt_se),
+ del_rt_nr + rt_rq_of_se(rt_se)->rt_nr_running);
+ }
+ sub_on_rq = 1;
+ } else if (for_enqueue) {
+ struct rt_rq *group_rq = group_rt_rq(rt_se);
- for (rt_se = back; rt_se; rt_se = rt_se->back) {
- if (on_rt_rq(rt_se))
- __dequeue_rt_entity(rt_se, flags);
+ /*
+ * In the case of an enqueue operation, if a certain level is found to be
+ * not online, then the previous counts need to be reset to zero.
+ */
+ prio_changed = MAX_RT_PRIO;
+ sub_on_rq = 0;
+ del_rt_nr = 0;
+ del_rr_nr = 0;
+
+ /*
+ * If the current group is being throttled, then there is no need to check
+ * higher levels since enqueueing will not affect higher-level nodes.
+ */
+ if (group_rq && rt_rq_throttled(group_rq))
+ break;
+
+ if (on_top_rt_rq(rt_se))
+ dequeue_top_rt_rq(rt_rq_of_se(rt_se),
+ rt_rq_of_se(rt_se)->rt_nr_running);
+ } else {
+ last = rt_se;
+ break;
+ }
}
- dequeue_top_rt_rq(rt_rq_of_se(back), rt_nr_running);
+ return last;
}
static void enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
struct rq *rq = rq_of_rt_se(rt_se);
+ struct sched_rt_entity *last;
+ unsigned int add_rt_nr = 0;
+ unsigned int add_rr_nr = 0;
+ int enqueue = 1;
+ int prio_change = 1;
update_stats_enqueue_rt(rt_rq_of_se(rt_se), rt_se, flags);
- dequeue_rt_stack(rt_se, flags);
- for_each_sched_rt_entity(rt_se)
- __enqueue_rt_entity(rt_se, flags);
+ last = dequeue_rt_stack(rt_se, flags, 1);
+
+ for_each_sched_rt_entity(rt_se) {
+ if (enqueue || !on_rt_rq(rt_se) || (prio_change == 1)) {
+ prio_change = __enqueue_rt_entity(rt_se, flags);
+ if (prio_change >= 0) {
+ add_rt_nr = rt_se_nr_running(rt_se);
+ add_rr_nr = rt_se_rr_nr_running(rt_se);
+ } else {
+ add_rt_nr = add_rr_nr = 0;
+ }
+ } else {
+ add_rq_nr_running(rt_se, add_rt_nr, add_rr_nr);
+ }
+
+ if (rt_se == last)
+ enqueue = 0;
+ }
+
enqueue_top_rt_rq(&rq->rt);
}
static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
struct rq *rq = rq_of_rt_se(rt_se);
+ struct sched_rt_entity *last;
+ unsigned int add_rt_nr = 0;
+ unsigned int add_rr_nr = 0;
+ int prio_change = 1;
update_stats_dequeue_rt(rt_rq_of_se(rt_se), rt_se, flags);
- dequeue_rt_stack(rt_se, flags);
+ last = dequeue_rt_stack(rt_se, flags, 0);
for_each_sched_rt_entity(rt_se) {
struct rt_rq *rt_rq = group_rt_rq(rt_se);
+ if (rt_rq && rt_rq->rt_nr_running) {
+ if (on_rt_rq(rt_se)) {
+ add_rq_nr_running(rt_se, add_rt_nr, add_rr_nr);
+ } else {
+ prio_change = __enqueue_rt_entity(rt_se, flags);
+ if (prio_change == 0) {
+ /*
+ * If enqueue is successful and the priority of the rq has
+ * not changed, then the parent node only needs to add the
+ * count of the current rt_se. Otherwise, the parent node
+ * will also need to be enqueued.
+ */
+ add_rt_nr = rt_se_nr_running(rt_se);
+ add_rr_nr = rt_se_rr_nr_running(rt_se);
+ }
+ }
+ } else {
+ add_rt_nr = add_rr_nr = 0;
+ }
- if (rt_rq && rt_rq->rt_nr_running)
- __enqueue_rt_entity(rt_se, flags);
+ /*
+ * last is the last rt_se that was deleted or whose count was
+ * modified, so subsequent rt_se entries do not need to be updated.
+ */
+ if (rt_se == last)
+ break;
}
+
enqueue_top_rt_rq(&rq->rt);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ef20c61004eb..821d65106d13 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2879,6 +2879,7 @@ extern void print_rt_stats(struct seq_file *m, int cpu);
extern void print_dl_stats(struct seq_file *m, int cpu);
extern void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq);
extern void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq);
+extern void print_rt_rq_task(struct seq_file *m, struct rt_rq *rt_rq);
extern void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq);
extern void resched_latency_warn(int cpu, u64 latency);
--
2.45.2
Hi all,
I would like to ask everyone for opinions or thoughts on the RT group
scheduling optimization. At this stage, is it ready to be merged into the
corresponding branch?
Thanks.
--
Best Regards,
Xavier
At 2024-07-17 11:00:32, "Xavier" <xavier_qy@163.com> wrote:
>This patch optimizes the enqueue and dequeue of rt_se; the strategy employs
>a bottom-up removal approach. Specifically, when removing an rt_se at a
>certain level, if it is determined that the highest priority of the rq
>associated with that rt_se has not changed, there is no need to continue
>removing rt_se at higher levels. At this point, only the total number
>of removed rt_se needs to be recorded, and the rt_nr_running count of
>the higher-level rqs reduced accordingly.
>
>Signed-off-by: Xavier <xavier_qy@163.com>
>---
> kernel/sched/debug.c | 48 ++++++++
> kernel/sched/rt.c | 287 +++++++++++++++++++++++++++++++++++++------
> kernel/sched/sched.h | 1 +
> 3 files changed, 298 insertions(+), 38 deletions(-)
>
>diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>index c1eb9a1afd13..352ee55da25e 100644
>--- a/kernel/sched/debug.c
>+++ b/kernel/sched/debug.c
>@@ -712,6 +712,54 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
> #endif
> }
>
>+static void print_rt_se(struct seq_file *m, struct sched_rt_entity *rt_se)
>+{
>+ struct task_struct *task;
>+
>+#ifdef CONFIG_RT_GROUP_SCHED
>+ if (rt_se->my_q) {
>+ SEQ_printf_task_group_path(m, rt_se->my_q->tg, "%s\n");
>+ return;
>+ }
>+#endif
>+ task = container_of(rt_se, struct task_struct, rt);
>+ SEQ_printf(m, " prio-%d, pid-%d, %s\n", task->prio, task->pid, task->comm);
>+}
>+
>+/*shall be called in rq lock*/
>+void print_rt_rq_task(struct seq_file *m, struct rt_rq *rt_rq)
>+{
>+ struct rt_prio_array *array = &rt_rq->active;
>+ struct sched_rt_entity *rt_se;
>+ struct list_head *queue, *head;
>+ unsigned long bitmap[2];
>+ int idx;
>+ int count = 0;
>+
>+ if (!rt_rq->rt_nr_running)
>+ return;
>+
>+ memcpy(bitmap, array->bitmap, sizeof(unsigned long) * 2);
>+ idx = sched_find_first_bit(bitmap);
>+ WARN_ON_ONCE(idx >= MAX_RT_PRIO);
>+
>+ while (1) {
>+ clear_bit(idx, bitmap);
>+ queue = array->queue + idx;
>+ head = queue;
>+ queue = queue->next;
>+ do {
>+ rt_se = list_entry(queue, struct sched_rt_entity, run_list);
>+ print_rt_se(m, rt_se);
>+ queue = queue->next;
>+ count++;
>+ } while (queue != head);
>+ idx = sched_find_first_bit(bitmap);
>+ if (idx >= MAX_RT_PRIO)
>+ break;
>+ }
>+}
>+
> void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
> {
> #ifdef CONFIG_RT_GROUP_SCHED
>diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>index aa4c1c874fa4..b18c424a50d2 100644
>--- a/kernel/sched/rt.c
>+++ b/kernel/sched/rt.c
>@@ -1113,7 +1113,7 @@ void dec_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio) {}
> #endif /* CONFIG_SMP */
>
> #if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
>-static void
>+static int
> inc_rt_prio(struct rt_rq *rt_rq, int prio)
> {
> int prev_prio = rt_rq->highest_prio.curr;
>@@ -1122,9 +1122,11 @@ inc_rt_prio(struct rt_rq *rt_rq, int prio)
> rt_rq->highest_prio.curr = prio;
>
> inc_rt_prio_smp(rt_rq, prio, prev_prio);
>+
>+ return prev_prio > prio;
> }
>
>-static void
>+static int
> dec_rt_prio(struct rt_rq *rt_rq, int prio)
> {
> int prev_prio = rt_rq->highest_prio.curr;
>@@ -1149,12 +1151,22 @@ dec_rt_prio(struct rt_rq *rt_rq, int prio)
> }
>
> dec_rt_prio_smp(rt_rq, prio, prev_prio);
>+ if (rt_rq->highest_prio.curr > prio)
>+ return prio;
>+ else
>+ return MAX_RT_PRIO;
> }
>
> #else
>
>-static inline void inc_rt_prio(struct rt_rq *rt_rq, int prio) {}
>-static inline void dec_rt_prio(struct rt_rq *rt_rq, int prio) {}
>+static inline int inc_rt_prio(struct rt_rq *rt_rq, int prio)
>+{
>+ return 0;
>+}
>+static inline int dec_rt_prio(struct rt_rq *rt_rq, int prio)
>+{
>+ return 0;
>+}
>
> #endif /* CONFIG_SMP || CONFIG_RT_GROUP_SCHED */
>
>@@ -1218,28 +1230,31 @@ unsigned int rt_se_rr_nr_running(struct sched_rt_entity *rt_se)
> }
>
> static inline
>-void inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
>+int inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
> {
> int prio = rt_se_prio(rt_se);
>+ int prio_change;
>
> WARN_ON(!rt_prio(prio));
> rt_rq->rt_nr_running += rt_se_nr_running(rt_se);
> rt_rq->rr_nr_running += rt_se_rr_nr_running(rt_se);
>
>- inc_rt_prio(rt_rq, prio);
>+ prio_change = inc_rt_prio(rt_rq, prio);
> inc_rt_group(rt_se, rt_rq);
>+ return prio_change;
> }
>
> static inline
>-void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
>+int dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq, int prio)
> {
>+ int prio_changed;
> WARN_ON(!rt_prio(rt_se_prio(rt_se)));
>- WARN_ON(!rt_rq->rt_nr_running);
> rt_rq->rt_nr_running -= rt_se_nr_running(rt_se);
> rt_rq->rr_nr_running -= rt_se_rr_nr_running(rt_se);
>
>- dec_rt_prio(rt_rq, rt_se_prio(rt_se));
>+ prio_changed = dec_rt_prio(rt_rq, prio);
> dec_rt_group(rt_se, rt_rq);
>+ return prio_changed;
> }
>
> /*
>@@ -1255,12 +1270,13 @@ static inline bool move_entity(unsigned int flags)
> return true;
> }
>
>-static void __delist_rt_entity(struct sched_rt_entity *rt_se, struct rt_prio_array *array)
>+static void __delist_rt_entity(struct sched_rt_entity *rt_se,
>+ struct rt_prio_array *array, int last_prio)
> {
> list_del_init(&rt_se->run_list);
>
>- if (list_empty(array->queue + rt_se_prio(rt_se)))
>- __clear_bit(rt_se_prio(rt_se), array->bitmap);
>+ if (list_empty(array->queue + last_prio))
>+ __clear_bit(last_prio, array->bitmap);
>
> rt_se->on_list = 0;
> }
>@@ -1371,7 +1387,12 @@ update_stats_dequeue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
> }
> }
>
>-static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
>+/*
>+ * Returns: -1 indicates that rt_se was not enqueued, 0 indicates that the highest
>+ * priority of the rq did not change after enqueue, and 1 indicates that the highest
>+ * priority of the rq changed after enqueue.
>+ */
>+static int __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
> {
> struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
> struct rt_prio_array *array = &rt_rq->active;
>@@ -1386,8 +1407,8 @@ static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flag
> */
> if (group_rq && (rt_rq_throttled(group_rq) || !group_rq->rt_nr_running)) {
> if (rt_se->on_list)
>- __delist_rt_entity(rt_se, array);
>- return;
>+ __delist_rt_entity(rt_se, array, rt_se_prio(rt_se));
>+ return -1;
> }
>
> if (move_entity(flags)) {
>@@ -1402,73 +1423,263 @@ static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flag
> }
> rt_se->on_rq = 1;
>
>- inc_rt_tasks(rt_se, rt_rq);
>+ return inc_rt_tasks(rt_se, rt_rq);
> }
>
>-static void __dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
>+/**
>+ * __dequeue_rt_entity - delete rt_se from its rt_rq
>+ *
>+ * @rt_se: the entity to be deleted
>+ * @last_prio: the highest priority of this rt_se before the previous round
>+ * of deletion
>+ * @flags: operation flags
>+ *
>+ * Returns: MAX_RT_PRIO indicates that the highest priority of the current rq
>+ * did not change during this deletion; a smaller value indicates it changed,
>+ * and is the previous highest priority to use in the next round of deletion.
>+ */
>+static int __dequeue_rt_entity(struct sched_rt_entity *rt_se, int last_prio,
>+ unsigned int flags)
> {
> struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
> struct rt_prio_array *array = &rt_rq->active;
>
> if (move_entity(flags)) {
> WARN_ON_ONCE(!rt_se->on_list);
>- __delist_rt_entity(rt_se, array);
>+ __delist_rt_entity(rt_se, array, last_prio);
> }
> rt_se->on_rq = 0;
>
>- dec_rt_tasks(rt_se, rt_rq);
>+ return dec_rt_tasks(rt_se, rt_rq, last_prio);
>+}
>+
>+static inline void dec_rq_nr_running(struct sched_rt_entity *rt_se,
>+ unsigned int rt, unsigned int rr)
>+{
>+ struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
>+
>+ rt_rq->rt_nr_running -= rt;
>+ rt_rq->rr_nr_running -= rr;
>+}
>+
>+static inline void add_rq_nr_running(struct sched_rt_entity *rt_se,
>+ unsigned int rt, unsigned int rr)
>+{
>+ struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
>+
>+ rt_rq->rt_nr_running += rt;
>+ rt_rq->rr_nr_running += rr;
>+}
>+
>+static inline bool on_top_rt_rq(struct sched_rt_entity *rt_se)
>+{
>+#ifdef CONFIG_RT_GROUP_SCHED
>+ if (rt_se->parent)
>+ return false;
>+#endif
>+ return true;
> }
>
> /*
>- * Because the prio of an upper entry depends on the lower
>- * entries, we must remove entries top - down.
>+ * To optimize the enqueue and dequeue of rt_se, this strategy employs a
>+ * bottom-up removal approach. Specifically, when removing an rt_se at a
>+ * certain level, if it is determined that the highest priority of the rq
>+ * associated with that rt_se has not changed, there is no need to continue
>+ * removing rt_se at higher levels. At this point, only the total number
>+ * of removed rt_se needs to be recorded, and the rt_nr_running count of
>+ * higher-level rqs needs to be decremented accordingly.
>+ *
>+ * For enqueue operations, if an rt_se at a certain level is in the rq,
>+ * it is still necessary to check the priority of the higher-level rq.
>+ * If the priority of the higher-level rq is found to be lower than that
>+ * of the rt_se to be added, it should be removed, as updating the highest
>+ * priority of the rq during addition will cause the rq to be repositioned
>+ * in the parent rq.
>+ *
>+ * Conversely, for dequeue operations, if an rt_se at a certain level is
>+ * not in the rq, the operation can be exited immediately to reduce
>+ * unnecessary checks and handling.
>+ *
>+ * For enqueue operations, the return value refers to the last rt_se that
>+ * was removed; for dequeue operations, it refers to the last rt_se that
>+ * was either removed or had its rt_nr_running updated.
> */
>-static void dequeue_rt_stack(struct sched_rt_entity *rt_se, unsigned int flags)
>+static struct sched_rt_entity *dequeue_rt_stack(struct sched_rt_entity *rt_se,
>+ unsigned int flags, int for_enqueue)
> {
>- struct sched_rt_entity *back = NULL;
>- unsigned int rt_nr_running;
>+ struct sched_rt_entity *last = rt_se;
>+ struct sched_rt_entity *origin = rt_se;
>+ unsigned int del_rt_nr = 0;
>+ unsigned int del_rr_nr = 0;
>+ int prio_changed = rt_se_prio(rt_se);
>+ int sub_on_rq = 1;
>
> for_each_sched_rt_entity(rt_se) {
>- rt_se->back = back;
>- back = rt_se;
>- }
>+ if (on_rt_rq(rt_se)) {
>+ if (sub_on_rq) {
>+ /*
>+ * The number of tasks removed from the sub-level rt_se also needs
>+ * to be subtracted from the rq of the current rt_se, as the current
>+ * rt_se's rq no longer includes the number of removed tasks.
>+ */
>+ dec_rq_nr_running(rt_se, del_rt_nr, del_rr_nr);
>+ if ((prio_changed != MAX_RT_PRIO) ||
>+ (rt_se_prio(rt_se) > rt_se_prio(origin))) {
>+ /*
>+ * If the removal of the lower-level rt_se causes the
>+ * highest priority of the current rq to change, or if the
>+ * priority of current rq is lower than the rt_se to be
>+ * added, then the current rt_se also needs to be removed
>+ * from its parent rq, and the number of deleted tasks
>+ * should be accumulated.
>+ */
>+ if (prio_changed == MAX_RT_PRIO)
>+ prio_changed = rt_se_prio(rt_se);
>+ del_rt_nr += rt_se_nr_running(rt_se);
>+ del_rr_nr += rt_se_rr_nr_running(rt_se);
>+ prio_changed = __dequeue_rt_entity(rt_se,
>+ prio_changed, flags);
>+ last = rt_se;
>+ } else if (!for_enqueue) {
>+ /* For dequeue, last may be an rt_se where only rt_nr_running was modified. */
>+ last = rt_se;
>+ }
>+ } else {
>+ /*
>+ * Entering this branch must be for enqueue, as dequeue would have
>+ * broken out already at an rt_se that is not on the rq.
>+ * If the sub-level node is not on the rq, and the current rt_se's
>+ * priority is lower than the one being added, the current rt_se
>+ * needs to be removed.
>+ */
>+ prio_changed = rt_se_prio(rt_se);
>+ if (prio_changed > rt_se_prio(origin)) {
>+ del_rt_nr += rt_se_nr_running(rt_se);
>+ del_rr_nr += rt_se_rr_nr_running(rt_se);
>+ prio_changed = __dequeue_rt_entity(rt_se,
>+ prio_changed, flags);
>+ last = rt_se;
>+ } else {
>+ prio_changed = MAX_RT_PRIO;
>+ }
>+ }
>
>- rt_nr_running = rt_rq_of_se(back)->rt_nr_running;
>+ /*
>+ * If the current rt_se is on the top rt_rq, then the already deleted
>+ * nodes, plus the count of the rt_rq where the current rt_se is located,
>+ * need to be removed from the top_rt_rq.
>+ */
>+ if (on_top_rt_rq(rt_se)) {
>+ dequeue_top_rt_rq(rt_rq_of_se(rt_se),
>+ del_rt_nr + rt_rq_of_se(rt_se)->rt_nr_running);
>+ }
>+ sub_on_rq = 1;
>+ } else if (for_enqueue) {
>+ struct rt_rq *group_rq = group_rt_rq(rt_se);
>
>- for (rt_se = back; rt_se; rt_se = rt_se->back) {
>- if (on_rt_rq(rt_se))
>- __dequeue_rt_entity(rt_se, flags);
>+ /*
>+ * In the case of an enqueue operation, if a certain level is found
>+ * not to be on the rq, the previous counts need to be reset to zero.
>+ */
>+ prio_changed = MAX_RT_PRIO;
>+ sub_on_rq = 0;
>+ del_rt_nr = 0;
>+ del_rr_nr = 0;
>+
>+ /*
>+ * If the current group is being throttled, then there is no need to check
>+ * higher levels since enqueueing will not affect higher-level nodes.
>+ */
>+ if (group_rq && rt_rq_throttled(group_rq))
>+ break;
>+
>+ if (on_top_rt_rq(rt_se))
>+ dequeue_top_rt_rq(rt_rq_of_se(rt_se),
>+ rt_rq_of_se(rt_se)->rt_nr_running);
>+ } else {
>+ last = rt_se;
>+ break;
>+ }
> }
>
>- dequeue_top_rt_rq(rt_rq_of_se(back), rt_nr_running);
>+ return last;
> }
>
> static void enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
> {
> struct rq *rq = rq_of_rt_se(rt_se);
>+ struct sched_rt_entity *last;
>+ unsigned int add_rt_nr = 0;
>+ unsigned int add_rr_nr = 0;
>+ int enqueue = 1;
>+ int prio_change = 1;
>
> update_stats_enqueue_rt(rt_rq_of_se(rt_se), rt_se, flags);
>
>- dequeue_rt_stack(rt_se, flags);
>- for_each_sched_rt_entity(rt_se)
>- __enqueue_rt_entity(rt_se, flags);
>+ last = dequeue_rt_stack(rt_se, flags, 1);
>+
>+ for_each_sched_rt_entity(rt_se) {
>+ if (enqueue || !on_rt_rq(rt_se) || (prio_change == 1)) {
>+ prio_change = __enqueue_rt_entity(rt_se, flags);
>+ if (prio_change >= 0) {
>+ add_rt_nr = rt_se_nr_running(rt_se);
>+ add_rr_nr = rt_se_rr_nr_running(rt_se);
>+ } else {
>+ add_rt_nr = add_rr_nr = 0;
>+ }
>+ } else {
>+ add_rq_nr_running(rt_se, add_rt_nr, add_rr_nr);
>+ }
>+
>+ if (rt_se == last)
>+ enqueue = 0;
>+ }
>+
> enqueue_top_rt_rq(&rq->rt);
> }
>
> static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
> {
> struct rq *rq = rq_of_rt_se(rt_se);
>+ struct sched_rt_entity *last;
>+ unsigned int add_rt_nr = 0;
>+ unsigned int add_rr_nr = 0;
>+ int prio_change = 1;
>
> update_stats_dequeue_rt(rt_rq_of_se(rt_se), rt_se, flags);
>
>- dequeue_rt_stack(rt_se, flags);
>+ last = dequeue_rt_stack(rt_se, flags, 0);
>
> for_each_sched_rt_entity(rt_se) {
> struct rt_rq *rt_rq = group_rt_rq(rt_se);
>+ if (rt_rq && rt_rq->rt_nr_running) {
>+ if (on_rt_rq(rt_se)) {
>+ add_rq_nr_running(rt_se, add_rt_nr, add_rr_nr);
>+ } else {
>+ prio_change = __enqueue_rt_entity(rt_se, flags);
>+ if (prio_change == 0) {
>+ /*
>+ * If enqueue is successful and the priority of the rq has
>+ * not changed, then the parent node only needs to add the
>+ * count of the current rt_se. Otherwise, the parent node
>+ * will also need to enqueue.
>+ */
>+ add_rt_nr = rt_se_nr_running(rt_se);
>+ add_rr_nr = rt_se_rr_nr_running(rt_se);
>+ }
>+ }
>+ } else {
>+ add_rt_nr = add_rr_nr = 0;
>+ }
>
>- if (rt_rq && rt_rq->rt_nr_running)
>- __enqueue_rt_entity(rt_se, flags);
>+ /*
>+ * last is the rt_se of the last deletion or modification of the
>+ * count, so the subsequent rt_se does not need to be updated.
>+ */
>+ if (rt_se == last)
>+ break;
> }
>+
> enqueue_top_rt_rq(&rq->rt);
> }
>
>diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>index ef20c61004eb..821d65106d13 100644
>--- a/kernel/sched/sched.h
>+++ b/kernel/sched/sched.h
>@@ -2879,6 +2879,7 @@ extern void print_rt_stats(struct seq_file *m, int cpu);
> extern void print_dl_stats(struct seq_file *m, int cpu);
> extern void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq);
> extern void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq);
>+extern void print_rt_rq_task(struct seq_file *m, struct rt_rq *rt_rq);
> extern void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq);
>
> extern void resched_latency_warn(int cpu, u64 latency);
>--
>2.45.2
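To make the bottom-up strategy above concrete, here is a minimal userspace
sketch. It is only an illustration of the early-exit walk: rt_node, top_prio
and dequeue_bottom_up() are hypothetical simplified stand-ins for the kernel's
sched_rt_entity hierarchy, not the patched code itself.

    #include <stdio.h>

    struct rt_node {
        struct rt_node *parent;
        int top_prio;               /* cached highest prio at this level (lower = higher) */
        unsigned int nr_running;    /* tasks queued below this level */
    };

    /*
     * Remove @nr tasks of priority @prio queued below @se. Instead of
     * detaching every level top-down, walk bottom-up: once a level's
     * cached top priority is unaffected by the removal, stop detaching
     * and only fix up nr_running on the remaining ancestors.
     */
    static void dequeue_bottom_up(struct rt_node *se, int prio, unsigned int nr)
    {
        int detaching = 1;

        for (; se; se = se->parent) {
            se->nr_running -= nr;
            if (detaching && se->top_prio == prio)
                printf("detach and re-evaluate level (top_prio %d)\n",
                       se->top_prio);
            else
                detaching = 0;  /* counters only, from here upwards */
        }
    }

    int main(void)
    {
        struct rt_node root = { .parent = NULL,  .top_prio = 10, .nr_running = 5 };
        struct rt_node mid  = { .parent = &root, .top_prio = 20, .nr_running = 3 };
        struct rt_node leaf = { .parent = &mid,  .top_prio = 50, .nr_running = 1 };

        /* Removing a prio-50 task: only the leaf matches, so mid and
         * root merely see nr_running drop, with no detach/re-insert. */
        dequeue_bottom_up(&leaf, 50, 1);
        printf("root.nr_running=%u mid.nr_running=%u\n",
               root.nr_running, mid.nr_running);
        return 0;
    }

In dequeue_rt_stack() above, prio_changed plays the role of detaching, and
del_rt_nr/del_rr_nr carry the counter fix-up across the levels that no longer
need to be detached.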
Just a reminder: does anyone have any comments or feedback on the patch?
--
Best Regards,
Xavier
At 2024-07-25 14:21:03, "Xavier" <xavier_qy@163.com> wrote:
>
>Hi all,
>
>I would like to ask everyone for your opinions or thoughts on the RT group scheduling
>optimization. At this stage, is it ready to be merged into the corresponding branch?
>Thanks.
>
>--
>Best Regards,
>Xavier
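The print_rt_rq_task() helper added in kernel/sched/debug.c above drains a
copy of the two-word priority bitmap. The loop shape can be sketched in plain
C as follows; find_first_bit() here is a hypothetical stand-in for the
kernel's sched_find_first_bit(), and the prio values are arbitrary.

    #include <stdio.h>

    #define MAX_RT_PRIO     100
    #define BITS_PER_LONG   (8 * (int)sizeof(unsigned long))
    #define BITMAP_WORDS    ((MAX_RT_PRIO + BITS_PER_LONG - 1) / BITS_PER_LONG)

    /* Return the index of the first set bit, or nbits if none is set. */
    static int find_first_bit(const unsigned long *map, int nbits)
    {
        int words = (nbits + BITS_PER_LONG - 1) / BITS_PER_LONG;

        for (int i = 0; i < words; i++)
            if (map[i])
                return i * BITS_PER_LONG + __builtin_ctzl(map[i]);
        return nbits;
    }

    int main(void)
    {
        unsigned long bitmap[BITMAP_WORDS] = { 0 };
        int idx;

        bitmap[10 / BITS_PER_LONG] |= 1UL << (10 % BITS_PER_LONG);
        bitmap[70 / BITS_PER_LONG] |= 1UL << (70 % BITS_PER_LONG);

        /* Same walk as print_rt_rq_task(): find, drain, clear, repeat. */
        while ((idx = find_first_bit(bitmap, MAX_RT_PRIO)) < MAX_RT_PRIO) {
            printf("drain queue at prio %d\n", idx);
            bitmap[idx / BITS_PER_LONG] &= ~(1UL << (idx % BITS_PER_LONG));
        }
        return 0;
    }

Working on a copy is what lets the walk clear bits as it goes without ever
touching the live array->bitmap.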
Add test cases for RT group scheduling: create several RT infinite-loop
processes/threads and set them to the same or different priorities.
Place them in different RT task groups, run for a period of time,
and finally count the number of infinite-loop iterations for all tasks.
Signed-off-by: Xavier <xavier_qy@163.com>
---
MAINTAINERS | 7 +
tools/testing/selftests/sched/Makefile | 4 +-
tools/testing/selftests/sched/deadloop.c | 192 ++++++++++++++++++
.../selftests/sched/rt_group_sched_test.sh | 119 +++++++++++
4 files changed, 320 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/sched/deadloop.c
create mode 100755 tools/testing/selftests/sched/rt_group_sched_test.sh
diff --git a/MAINTAINERS b/MAINTAINERS
index 958e935449e5..f5cc821b8510 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -19463,6 +19463,13 @@ L: linux-remoteproc@vger.kernel.org
S: Maintained
F: drivers/tty/rpmsg_tty.c
+RT GROUP SCHED TEST
+M: Xavier <xavier_qy@163.com>
+L: linux-kernel@vger.kernel.org
+S: Maintained
+F: tools/testing/selftests/sched/deadloop.c
+F: tools/testing/selftests/sched/rt_group_sched_test.sh
+
RTL2830 MEDIA DRIVER
L: linux-media@vger.kernel.org
S: Orphan
diff --git a/tools/testing/selftests/sched/Makefile b/tools/testing/selftests/sched/Makefile
index 099ee9213557..96decb58bf35 100644
--- a/tools/testing/selftests/sched/Makefile
+++ b/tools/testing/selftests/sched/Makefile
@@ -8,7 +8,7 @@ CFLAGS += -O2 -Wall -g -I./ $(KHDR_INCLUDES) -Wl,-rpath=./ \
$(CLANG_FLAGS)
LDLIBS += -lpthread
-TEST_GEN_FILES := cs_prctl_test
-TEST_PROGS := cs_prctl_test
+TEST_GEN_FILES := cs_prctl_test deadloop
+TEST_PROGS := cs_prctl_test deadloop
include ../lib.mk
diff --git a/tools/testing/selftests/sched/deadloop.c b/tools/testing/selftests/sched/deadloop.c
new file mode 100644
index 000000000000..d850a3e2a0ab
--- /dev/null
+++ b/tools/testing/selftests/sched/deadloop.c
@@ -0,0 +1,192 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#define _GNU_SOURCE
+#include <stdlib.h>
+#include <sys/types.h>
+#include <stdio.h>
+#include <string.h>
+#include <errno.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <signal.h>
+
+/*
+ * Create multiple infinite loop threads based on the passed parameters
+ * Usage: deadloop num policy prio
+ * num: the number of child threads
+ * policy: the scheduling policy of the child threads, 0-fair, 1-fifo, 2-rr
+ * prio: the priority
+ * If this process receives SIGTERM, it prints the loop counts of all child
+ * threads to OUTPUT_FILE
+ *
+ * Date: June 27, 2024
+ * Author: Xavier <xavier_qy@163.com>
+ */
+
+#define OUTPUT_FILE "rt_group_sched_test.log"
+
+#if __GLIBC_PREREQ(2, 30) == 0
+#include <sys/syscall.h>
+static pid_t gettid(void)
+{
+ return syscall(SYS_gettid);
+}
+#endif
+
+#define do_err(x) \
+do { \
+ if ((x) < 0) { \
+ printf("test BUG_ON func %s, line %d %ld\n", \
+ __func__, __LINE__, (long)(x) \
+ ); \
+ while (1) \
+ sleep(1); \
+ } \
+} while (0)
+
+#define do_false(x) \
+do { \
+ if ((x) == 1) { \
+ printf("test BUG_ON func %s, line %d %d\n", \
+ __func__, __LINE__, (x) \
+ ); \
+ while (1) \
+ sleep(1); \
+ } \
+} while (0)
+
+
+struct thread_data {
+ pthread_t thread;
+ int index;
+ int pid;
+ unsigned long cnt;
+};
+
+static struct thread_data *pdata;
+static int thread_num = 1;
+
+static void create_thread_posix(void *entry, pthread_t *thread, int *para,
+ int policy, int prio)
+{
+ int ret;
+ struct sched_param param;
+ pthread_attr_t attr;
+
+ memset(&param, 0, sizeof(param));
+ ret = pthread_attr_init(&attr);
+ do_err(ret);
+
+ ret = pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
+ do_err(ret);
+
+ param.sched_priority = prio;
+
+ ret = pthread_attr_setschedpolicy(&attr, policy);
+ do_err(ret);
+
+ ret = pthread_attr_setschedparam(&attr, &param);
+ do_err(ret);
+
+ ret = pthread_create(thread, &attr, entry, para);
+ do_err(ret);
+}
+
+static void *dead_loop_entry(void *arg)
+{
+ int index = *(int *)arg;
+ struct sched_param param;
+ int cur = gettid();
+
+ sched_getparam(cur, &param);
+ pdata[index].pid = cur;
+ printf("cur:%d prio:%d\n", cur, param.sched_priority);
+
+ while (1) {
+ asm volatile("" ::: "memory");
+ pdata[index].cnt++;
+ }
+ return NULL;
+}
+
+static void handle_signal(int signal)
+{
+ int cnt = 0;
+
+ if (signal == SIGTERM) {
+ FILE *file = freopen(OUTPUT_FILE, "a", stdout);
+
+ if (file == NULL) {
+ perror("freopen");
+ exit(0);
+ }
+
+ while (cnt < thread_num) {
+ printf("pid:%d cnt:%ld\n", pdata[cnt].pid, pdata[cnt].cnt);
+ cnt++;
+ }
+ fclose(file);
+ exit(0);
+ }
+}
+
+static int dead_loop_create(int policy, int prio)
+{
+ int cnt = 0;
+ int ret;
+ void *status;
+ struct sched_param param;
+
+ param.sched_priority = prio;
+ pdata = malloc(thread_num * sizeof(struct thread_data));
+ do_false(!pdata);
+
+ if (policy) {
+ ret = sched_setscheduler(0, policy, &param);
+ do_err(ret);
+ }
+
+ while (cnt < thread_num) {
+ pdata[cnt].index = cnt;
+ create_thread_posix(dead_loop_entry, &pdata[cnt].thread,
+ &pdata[cnt].index, policy, prio);
+ cnt++;
+ }
+
+ signal(SIGTERM, handle_signal);
+
+ cnt = 0;
+ while (cnt < thread_num) {
+ pthread_join(pdata[cnt].thread, &status);
+ cnt++;
+ }
+
+ free(pdata);
+ return 0;
+}
+
+int main(int argc, char **argv)
+{
+ int policy = 2;
+ int prio = 50;
+
+ if (argc == 2)
+ thread_num = atoi(argv[1]);
+
+ if (argc == 3) {
+ thread_num = atoi(argv[1]);
+ policy = atoi(argv[2]);
+ if (policy > 0)
+ prio = 50;
+ }
+
+ if (argc == 4) {
+ thread_num = atoi(argv[1]);
+ policy = atoi(argv[2]);
+ prio = atoi(argv[3]);
+ }
+
+ dead_loop_create(policy, prio);
+
+ return 0;
+}
diff --git a/tools/testing/selftests/sched/rt_group_sched_test.sh b/tools/testing/selftests/sched/rt_group_sched_test.sh
new file mode 100755
index 000000000000..9031250a2684
--- /dev/null
+++ b/tools/testing/selftests/sched/rt_group_sched_test.sh
@@ -0,0 +1,119 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Test for rt group scheduling
+# Date: June 27, 2024
+# Author: Xavier <xavier_qy@163.com>
+
+# Record the list of child process PIDs
+PIDS=()
+
+# File for redirected output
+LOGFILE="rt_group_sched_test.log"
+
+# Cleanup function: kill all recorded child processes and unmount the cgroup
+function cleanup() {
+ echo "Cleaning up..."
+ for pid in "${PIDS[@]}"; do
+ if kill -0 $pid 2>/dev/null; then
+ kill -TERM $pid
+ fi
+ done
+
+ # Sleep for a while to ensure the processes are properly killed
+ sleep 2
+
+ # Unmount the cgroup filesystem
+ umount /sys/fs/cgroup/cpu 2>/dev/null
+ umount /sys/fs/cgroup 2>/dev/null
+ echo "Cleanup completed."
+
+ # Ensure the LOGFILE exists and is correct
+ if [ ! -f "$LOGFILE" ]; then
+ echo "$LOGFILE not found!"
+ exit 1
+ fi
+
+ # Initialize the total count variable
+ total=0
+
+ # Read matching lines and calculate the total sum
+ while IFS= read -r line
+ do
+ # Use grep to match lines containing 'pid:' and 'cnt:', and extract the value of cnt
+ if echo "$line" | grep -q '^pid:[[:digit:]]\+ cnt:[[:digit:]]\+'; then
+ cnt=$(echo "$line" | sed -n \
+ 's/^pid:[[:digit:]]\+ cnt:\([[:digit:]]\+\)/\1/p')
+ total=$((total + cnt))
+ fi
+ done < "$LOGFILE"
+
+ # Print the total sum
+ echo "Total cnt: $total"
+ echo "Finished processing."
+}
+
+# Capture actions when interrupted or terminated by a signal
+trap cleanup EXIT
+
+# Start the cgroup filesystem and create the necessary directories
+function setup_cgroups() {
+ mount -t tmpfs -o mode=755 cgroup_root /sys/fs/cgroup
+ mkdir -p /sys/fs/cgroup/cpu
+ mount -t cgroup -o cpu none /sys/fs/cgroup/cpu
+}
+
+# Create cgroup subdirectories and configure their settings
+function create_child_cgroup() {
+ local base_dir=$1
+ local name=$2
+ local rt_period=$3
+ local rt_runtime=$4
+ mkdir -p "$base_dir/$name"
+ echo $rt_period > "$base_dir/$name/cpu.rt_period_us"
+ echo $rt_runtime > "$base_dir/$name/cpu.rt_runtime_us"
+}
+# Launch a process and add it to the specified cgroup
+function launch_process() {
+ local process_name=$1
+
+ # Three parameters representing the number of child threads, scheduling policy, and priority
+ local args=$2
+ local cgroup_path=$3
+
+ # Launch the process
+ exec -a $process_name ./deadloop $args &
+ local pid=$!
+ PIDS+=($pid)
+
+ # Short sleep to ensure the process starts
+ sleep 1
+
+ # Check if the process started successfully
+ if ! pgrep -x $process_name > /dev/null; then
+ echo "Error: No process found with name $process_name."
+ exit 1
+ fi
+
+ echo $pid > "$cgroup_path/cgroup.procs"
+ echo "Process $process_name with PID $pid added to cgroup $cgroup_path"
+}
+
+# Main function running all tasks
+function main() {
+ echo "The test needs 30 seconds..."
+ rm -f "$LOGFILE"
+ setup_cgroups
+ create_child_cgroup "/sys/fs/cgroup/cpu" "child1" 1000000 800000
+ create_child_cgroup "/sys/fs/cgroup/cpu/child1" "child2" 1000000 700000
+ create_child_cgroup "/sys/fs/cgroup/cpu/child1/child2" "child3" 1000000 600000
+ launch_process "child1" "3 2 50" "/sys/fs/cgroup/cpu/child1"
+ launch_process "child2" "3 2 50" "/sys/fs/cgroup/cpu/child1/child2"
+ launch_process "child3" "1 2 50" "/sys/fs/cgroup/cpu/child1/child2/child3"
+ launch_process "tg_root" "1 2 50" "/sys/fs/cgroup/cpu"
+
+ # Run for 30 seconds
+ sleep 30
+}
+
+# Execute the main function
+main
--
2.45.2
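For a quick manual run of the tester on its own, outside the script, the
sequence below works as root from tools/testing/selftests/sched after a plain
make; the thread count, policy and priority are just example values:

    ./deadloop 4 2 50 &     # 4 SCHED_RR threads at priority 50
    pid=$!
    sleep 10
    kill -TERM $pid         # appends "pid:<tid> cnt:<loops>" lines to the log
    cat rt_group_sched_test.log

The full scenario, with the nested cpu.rt_runtime_us budgets, is what
rt_group_sched_test.sh drives; it assumes the legacy cgroup v1 cpu controller
can be mounted, as done in setup_cgroups().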
Add test cases for RT group scheduling: create several RT infinite-loop
processes/threads and set them to the same or different priorities.
Place them in different RT task groups, run for a period of time,
and finally count the number of infinite-loop iterations for all tasks.
Signed-off-by: Xavier <xavier_qy@163.com>
---
MAINTAINERS | 7 +
tools/testing/selftests/sched/Makefile | 4 +-
tools/testing/selftests/sched/deadloop.c | 192 ++++++++++++++++++
.../selftests/sched/rt_group_sched_test.sh | 119 +++++++++++
4 files changed, 320 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/sched/deadloop.c
create mode 100755 tools/testing/selftests/sched/rt_group_sched_test.sh
diff --git a/MAINTAINERS b/MAINTAINERS
index 43353b705988..d29effe57bf8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -19480,6 +19480,13 @@ L: linux-remoteproc@vger.kernel.org
S: Maintained
F: drivers/tty/rpmsg_tty.c
+RT GROUP SCHED TEST
+M: Xavier <xavier_qy@163.com>
+L: linux-kernel@vger.kernel.org
+S: Maintained
+F: tools/testing/selftests/sched/deadloop.c
+F: tools/testing/selftests/sched/rt_group_sched_test.sh
+
RTL2830 MEDIA DRIVER
L: linux-media@vger.kernel.org
S: Orphan
diff --git a/tools/testing/selftests/sched/Makefile b/tools/testing/selftests/sched/Makefile
index 099ee9213557..96decb58bf35 100644
--- a/tools/testing/selftests/sched/Makefile
+++ b/tools/testing/selftests/sched/Makefile
@@ -8,7 +8,7 @@ CFLAGS += -O2 -Wall -g -I./ $(KHDR_INCLUDES) -Wl,-rpath=./ \
$(CLANG_FLAGS)
LDLIBS += -lpthread
-TEST_GEN_FILES := cs_prctl_test
-TEST_PROGS := cs_prctl_test
+TEST_GEN_FILES := cs_prctl_test deadloop
+TEST_PROGS := cs_prctl_test deadloop
include ../lib.mk
diff --git a/tools/testing/selftests/sched/deadloop.c b/tools/testing/selftests/sched/deadloop.c
new file mode 100644
index 000000000000..d850a3e2a0ab
--- /dev/null
+++ b/tools/testing/selftests/sched/deadloop.c
@@ -0,0 +1,192 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#define _GNU_SOURCE
+#include <stdlib.h>
+#include <sys/types.h>
+#include <stdio.h>
+#include <string.h>
+#include <errno.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <signal.h>
+
+/*
+ * Create multiple infinite loop threads based on the passed parameters
+ * Usage: deadloop num policy prio
+ * num: the number of child threads
+ * policy: the scheduling policy of the child threads, 0-fair, 1-fifo, 2-rr
+ * prio: the priority
+ * If this process receives SIGTERM, it prints the loop counts of all child
+ * threads to OUTPUT_FILE
+ *
+ * Date: June 27, 2024
+ * Author: Xavier <xavier_qy@163.com>
+ */
+
+#define OUTPUT_FILE "rt_group_sched_test.log"
+
+#if __GLIBC_PREREQ(2, 30) == 0
+#include <sys/syscall.h>
+static pid_t gettid(void)
+{
+ return syscall(SYS_gettid);
+}
+#endif
+
+#define do_err(x) \
+do { \
+ if ((x) < 0) { \
+ printf("test BUG_ON func %s, line %d %ld\n", \
+ __func__, __LINE__, (long)(x) \
+ ); \
+ while (1) \
+ sleep(1); \
+ } \
+} while (0)
+
+#define do_false(x) \
+do { \
+ if ((x) == 1) { \
+ printf("test BUG_ON func %s, line %d %d\n", \
+ __func__, __LINE__, (x) \
+ ); \
+ while (1) \
+ sleep(1); \
+ } \
+} while (0)
+
+
+struct thread_data {
+ pthread_t thread;
+ int index;
+ int pid;
+ unsigned long cnt;
+};
+
+static struct thread_data *pdata;
+static int thread_num = 1;
+
+static void create_thread_posix(void *entry, pthread_t *thread, int *para,
+ int policy, int prio)
+{
+ int ret;
+ struct sched_param param;
+ pthread_attr_t attr;
+
+ memset(&param, 0, sizeof(param));
+ ret = pthread_attr_init(&attr);
+ do_err(ret);
+
+ ret = pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
+ do_err(ret);
+
+ param.sched_priority = prio;
+
+ ret = pthread_attr_setschedpolicy(&attr, policy);
+ do_err(ret);
+
+ ret = pthread_attr_setschedparam(&attr, &param);
+ do_err(ret);
+
+ ret = pthread_create(thread, &attr, entry, para);
+ do_err(ret);
+}
+
+static void *dead_loop_entry(void *arg)
+{
+ int index = *(int *)arg;
+ struct sched_param param;
+ int cur = gettid();
+
+ sched_getparam(cur, &param);
+ pdata[index].pid = cur;
+ printf("cur:%d prio:%d\n", cur, param.sched_priority);
+
+ while (1) {
+ asm volatile("" ::: "memory");
+ pdata[index].cnt++;
+ }
+ return NULL;
+}
+
+static void handle_signal(int signal)
+{
+ int cnt = 0;
+
+ if (signal == SIGTERM) {
+ FILE *file = freopen(OUTPUT_FILE, "a", stdout);
+
+ if (file == NULL) {
+ perror("freopen");
+ exit(0);
+ }
+
+ while (cnt < thread_num) {
+ printf("pid:%d cnt:%ld\n", pdata[cnt].pid, pdata[cnt].cnt);
+ cnt++;
+ }
+ fclose(file);
+ exit(0);
+ }
+}
+
+static int dead_loop_create(int policy, int prio)
+{
+ int cnt = 0;
+ int ret;
+ void *status;
+ struct sched_param param;
+
+ param.sched_priority = prio;
+ pdata = malloc(thread_num * sizeof(struct thread_data));
+ do_false(!pdata);
+
+ if (policy) {
+ ret = sched_setscheduler(0, policy, &param);
+ do_err(ret);
+ }
+
+ while (cnt < thread_num) {
+ pdata[cnt].index = cnt;
+ create_thread_posix(dead_loop_entry, &pdata[cnt].thread,
+ &pdata[cnt].index, policy, prio);
+ cnt++;
+ }
+
+ signal(SIGTERM, handle_signal);
+
+ cnt = 0;
+ while (cnt < thread_num) {
+ pthread_join(pdata[cnt].thread, &status);
+ cnt++;
+ }
+
+ free(pdata);
+ return 0;
+}
+
+int main(int argc, char **argv)
+{
+ int policy = 2;
+ int prio = 50;
+
+ if (argc == 2)
+ thread_num = atoi(argv[1]);
+
+ if (argc == 3) {
+ thread_num = atoi(argv[1]);
+ policy = atoi(argv[2]);
+ if (policy > 0)
+ prio = 50;
+ }
+
+ if (argc == 4) {
+ thread_num = atoi(argv[1]);
+ policy = atoi(argv[2]);
+ prio = atoi(argv[3]);
+ }
+
+ dead_loop_create(policy, prio);
+
+ return 0;
+}
diff --git a/tools/testing/selftests/sched/rt_group_sched_test.sh b/tools/testing/selftests/sched/rt_group_sched_test.sh
new file mode 100755
index 000000000000..9031250a2684
--- /dev/null
+++ b/tools/testing/selftests/sched/rt_group_sched_test.sh
@@ -0,0 +1,119 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Test for rt group scheduling
+# Date: June 27, 2024
+# Author: Xavier <xavier_qy@163.com>
+
+# Record the list of child process PIDs
+PIDS=()
+
+# File for redirected output
+LOGFILE="rt_group_sched_test.log"
+
+# Cleanup function: kill all recorded child processes and unmount the cgroup
+function cleanup() {
+ echo "Cleaning up..."
+ for pid in "${PIDS[@]}"; do
+ if kill -0 $pid 2>/dev/null; then
+ kill -TERM $pid
+ fi
+ done
+
+ # Sleep for a while to ensure the processes are properly killed
+ sleep 2
+
+ # Unmount the cgroup filesystem
+ umount /sys/fs/cgroup/cpu 2>/dev/null
+ umount /sys/fs/cgroup 2>/dev/null
+ echo "Cleanup completed."
+
+ # Ensure the LOGFILE exists and is correct
+ if [ ! -f "$LOGFILE" ]; then
+ echo "$LOGFILE not found!"
+ exit 1
+ fi
+
+ # Initialize the total count variable
+ total=0
+
+ # Read matching lines and calculate the total sum
+ while IFS= read -r line
+ do
+ # Use grep to match lines containing 'pid:' and 'cnt:', and extract the value of cnt
+ if echo "$line" | grep -q '^pid:[[:digit:]]\+ cnt:[[:digit:]]\+'; then
+ cnt=$(echo "$line" | sed -n \
+ 's/^pid:[[:digit:]]\+ cnt:\([[:digit:]]\+\)/\1/p')
+ total=$((total + cnt))
+ fi
+ done < "$LOGFILE"
+
+ # Print the total sum
+ echo "Total cnt: $total"
+ echo "Finished processing."
+}
+
+# Capture actions when interrupted or terminated by a signal
+trap cleanup EXIT
+
+# Start the cgroup filesystem and create the necessary directories
+function setup_cgroups() {
+ mount -t tmpfs -o mode=755 cgroup_root /sys/fs/cgroup
+ mkdir -p /sys/fs/cgroup/cpu
+ mount -t cgroup -o cpu none /sys/fs/cgroup/cpu
+}
+
+# Create cgroup subdirectories and configure their settings
+function create_child_cgroup() {
+ local base_dir=$1
+ local name=$2
+ local rt_period=$3
+ local rt_runtime=$4
+ mkdir -p "$base_dir/$name"
+ echo $rt_period > "$base_dir/$name/cpu.rt_period_us"
+ echo $rt_runtime > "$base_dir/$name/cpu.rt_runtime_us"
+}
+# Launch a process and add it to the specified cgroup
+function launch_process() {
+ local process_name=$1
+
+ # Three parameters representing the number of child threads, scheduling policy, and priority
+ local args=$2
+ local cgroup_path=$3
+
+ # Launch the process
+ exec -a $process_name ./deadloop $args &
+ local pid=$!
+ PIDS+=($pid)
+
+ # Short sleep to ensure the process starts
+ sleep 1
+
+ # Check if the process started successfully
+ if ! pgrep -x $process_name > /dev/null; then
+ echo "Error: No process found with name $process_name."
+ exit 1
+ fi
+
+ echo $pid > "$cgroup_path/cgroup.procs"
+ echo "Process $process_name with PID $pid added to cgroup $cgroup_path"
+}
+
+# Main function running all tasks
+function main() {
+ echo "The test needs 30 seconds..."
+ rm -f "$LOGFILE"
+ setup_cgroups
+ create_child_cgroup "/sys/fs/cgroup/cpu" "child1" 1000000 800000
+ create_child_cgroup "/sys/fs/cgroup/cpu/child1" "child2" 1000000 700000
+ create_child_cgroup "/sys/fs/cgroup/cpu/child1/child2" "child3" 1000000 600000
+ launch_process "child1" "3 2 50" "/sys/fs/cgroup/cpu/child1"
+ launch_process "child2" "3 2 50" "/sys/fs/cgroup/cpu/child1/child2"
+ launch_process "child3" "1 2 50" "/sys/fs/cgroup/cpu/child1/child2/child3"
+ launch_process "tg_root" "1 2 50" "/sys/fs/cgroup/cpu"
+
+ # Run for 30 seconds
+ sleep 30
+}
+
+# Execute the main function
+main
--
2.45.2
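One small aside on the log parsing in cleanup(): the per-line grep/sed loop
can be collapsed into a single awk pass. This is only an equivalent sketch,
not what the patch uses:

    total=$(awk -F'cnt:' '/^pid:[0-9]+ cnt:[0-9]+$/ { sum += $2 } END { print sum + 0 }' "$LOGFILE")
    echo "Total cnt: $total"

Both forms sum the cnt field of every well-formed "pid:... cnt:..." line; the
awk version just avoids spawning grep and sed once per line.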
This patch optimizes the enqueue and dequeue of rt_se; the strategy employs
a bottom-up removal approach. Specifically, when removing an rt_se at a
certain level, if it is determined that the highest priority of the rq
associated with that rt_se has not changed, there is no need to continue
removing rt_se at higher levels. At that point, only the total number
of removed rt_se needs to be recorded, and the rt_nr_running count of
the higher-level rqs needs to be decremented accordingly.
Signed-off-by: Xavier <xavier_qy@163.com>
---
kernel/sched/debug.c | 50 ++++++++
kernel/sched/rt.c | 278 ++++++++++++++++++++++++++++++++++++-------
kernel/sched/sched.h | 1 +
3 files changed, 289 insertions(+), 40 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index c1eb9a1afd13..282153397e02 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -712,6 +712,56 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
#endif
}
+static void print_rt_se(struct seq_file *m, struct sched_rt_entity *rt_se)
+{
+ struct task_struct *task;
+
+#ifdef CONFIG_RT_GROUP_SCHED
+ if (rt_se->my_q) {
+ SEQ_printf_task_group_path(m, rt_se->my_q->tg, "%s\n");
+ return;
+ }
+#endif
+ task = container_of(rt_se, struct task_struct, rt);
+ SEQ_printf(m, " prio-%d, pid-%d, %s\n", task->prio, task->pid, task->comm);
+}
+
+/* Must be called with the rq lock held. */
+void print_rt_rq_task(struct seq_file *m, struct rt_rq *rt_rq)
+{
+ struct rt_prio_array *array = &rt_rq->active;
+ struct sched_rt_entity *rt_se;
+ struct list_head *queue, *head;
+ unsigned long bitmap[2];
+ int idx;
+ int count = 0;
+
+ if (!rt_rq->rt_nr_running)
+ return;
+
+ memcpy(bitmap, array->bitmap, sizeof(unsigned long) * 2);
+ idx = sched_find_first_bit(bitmap);
+ WARN_ON_ONCE(idx >= MAX_RT_PRIO);
+
+ while (1) {
+ clear_bit(idx, bitmap);
+ queue = array->queue + idx;
+ head = queue;
+ queue = queue->next;
+ do {
+ rt_se = list_entry(queue, struct sched_rt_entity, run_list);
+ print_rt_se(m, rt_se);
+ queue = queue->next;
+ count++;
+ } while (queue != head);
+ idx = sched_find_first_bit(bitmap);
+ if (idx >= MAX_RT_PRIO)
+ break;
+ }
+
+ WARN_ON_ONCE(count != rt_rq->rt_nr_running);
+}
+
void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
{
#ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index aa4c1c874fa4..0673bce0c145 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1113,7 +1113,7 @@ void dec_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio) {}
#endif /* CONFIG_SMP */
#if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
-static void
+static int
inc_rt_prio(struct rt_rq *rt_rq, int prio)
{
int prev_prio = rt_rq->highest_prio.curr;
@@ -1122,9 +1122,11 @@ inc_rt_prio(struct rt_rq *rt_rq, int prio)
rt_rq->highest_prio.curr = prio;
inc_rt_prio_smp(rt_rq, prio, prev_prio);
+
+ return prev_prio > prio;
}
-static void
+static int
dec_rt_prio(struct rt_rq *rt_rq, int prio)
{
int prev_prio = rt_rq->highest_prio.curr;
@@ -1149,12 +1151,22 @@ dec_rt_prio(struct rt_rq *rt_rq, int prio)
}
dec_rt_prio_smp(rt_rq, prio, prev_prio);
+ if (rt_rq->highest_prio.curr > prio)
+ return prio;
+ else
+ return 0;
}
#else
-static inline void inc_rt_prio(struct rt_rq *rt_rq, int prio) {}
-static inline void dec_rt_prio(struct rt_rq *rt_rq, int prio) {}
+static inline int inc_rt_prio(struct rt_rq *rt_rq, int prio)
+{
+ return 0;
+}
+static inline int dec_rt_prio(struct rt_rq *rt_rq, int prio)
+{
+ return 0;
+}
#endif /* CONFIG_SMP || CONFIG_RT_GROUP_SCHED */
@@ -1218,28 +1230,31 @@ unsigned int rt_se_rr_nr_running(struct sched_rt_entity *rt_se)
}
static inline
-void inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
+int inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
{
int prio = rt_se_prio(rt_se);
+ int prio_change;
WARN_ON(!rt_prio(prio));
rt_rq->rt_nr_running += rt_se_nr_running(rt_se);
rt_rq->rr_nr_running += rt_se_rr_nr_running(rt_se);
- inc_rt_prio(rt_rq, prio);
+ prio_change = inc_rt_prio(rt_rq, prio);
inc_rt_group(rt_se, rt_rq);
+ return prio_change;
}
static inline
-void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
+int dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq, int prio)
{
+ int prio_changed;
WARN_ON(!rt_prio(rt_se_prio(rt_se)));
- WARN_ON(!rt_rq->rt_nr_running);
rt_rq->rt_nr_running -= rt_se_nr_running(rt_se);
rt_rq->rr_nr_running -= rt_se_rr_nr_running(rt_se);
- dec_rt_prio(rt_rq, rt_se_prio(rt_se));
+ prio_changed = dec_rt_prio(rt_rq, prio);
dec_rt_group(rt_se, rt_rq);
+ return prio_changed;
}
/*
@@ -1255,12 +1270,13 @@ static inline bool move_entity(unsigned int flags)
return true;
}
-static void __delist_rt_entity(struct sched_rt_entity *rt_se, struct rt_prio_array *array)
+static void __delist_rt_entity(struct sched_rt_entity *rt_se,
+ struct rt_prio_array *array, int last_prio)
{
list_del_init(&rt_se->run_list);
- if (list_empty(array->queue + rt_se_prio(rt_se)))
- __clear_bit(rt_se_prio(rt_se), array->bitmap);
+ if (list_empty(array->queue + last_prio))
+ __clear_bit(last_prio, array->bitmap);
rt_se->on_list = 0;
}
@@ -1371,7 +1387,12 @@ update_stats_dequeue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
}
}
-static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
+/*
+ * Returns: -1 indicates that rt_se was not enqueued, 0 indicates that the highest
+ * priority of the rq did not change after enqueue, and 1 indicates that the highest
+ * priority of the rq changed after enqueue.
+ */
+static int __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
struct rt_prio_array *array = &rt_rq->active;
@@ -1386,8 +1407,8 @@ static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flag
*/
if (group_rq && (rt_rq_throttled(group_rq) || !group_rq->rt_nr_running)) {
if (rt_se->on_list)
- __delist_rt_entity(rt_se, array);
- return;
+ __delist_rt_entity(rt_se, array, rt_se_prio(rt_se));
+ return -1;
}
if (move_entity(flags)) {
@@ -1402,73 +1423,250 @@ static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flag
}
rt_se->on_rq = 1;
- inc_rt_tasks(rt_se, rt_rq);
+ return inc_rt_tasks(rt_se, rt_rq);
}
-static void __dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
+/**
+ * __dequeue_rt_entity - delete rt_se from its rt_rq
+ *
+ * @rt_se: the entity to be deleted
+ * @last_prio: the highest priority of this rt_se before the previous round
+ * of deletion
+ * @flags: operation flags
+ *
+ * Returns: 0 indicates that the highest priority of the current rq did not
+ * change during this deletion; >0 indicates it changed, and the return value
+ * is the previous highest priority to use in the next round of deletion.
+ */
+static int __dequeue_rt_entity(struct sched_rt_entity *rt_se, int last_prio,
+ unsigned int flags)
{
struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
struct rt_prio_array *array = &rt_rq->active;
if (move_entity(flags)) {
WARN_ON_ONCE(!rt_se->on_list);
- __delist_rt_entity(rt_se, array);
+ __delist_rt_entity(rt_se, array, last_prio);
}
rt_se->on_rq = 0;
- dec_rt_tasks(rt_se, rt_rq);
+ return dec_rt_tasks(rt_se, rt_rq, last_prio);
+}
+
+static inline void dec_rq_nr_running(struct sched_rt_entity *rt_se,
+ unsigned int rt, unsigned int rr)
+{
+ struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
+
+ rt_rq->rt_nr_running -= rt;
+ rt_rq->rr_nr_running -= rr;
+}
+
+static inline void add_rq_nr_running(struct sched_rt_entity *rt_se,
+ unsigned int rt, unsigned int rr)
+{
+ struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
+
+ rt_rq->rt_nr_running += rt;
+ rt_rq->rr_nr_running += rr;
+}
+
+static inline bool on_top_rt_rq(struct sched_rt_entity *rt_se)
+{
+#ifdef CONFIG_RT_GROUP_SCHED
+ if (rt_se->parent)
+ return false;
+#endif
+ return true;
}
/*
- * Because the prio of an upper entry depends on the lower
- * entries, we must remove entries top - down.
+ * To optimize the enqueue and dequeue of rt_se, this strategy employs a
+ * bottom-up removal approach. Specifically, when removing an rt_se at a
+ * certain level, if it is determined that the highest priority of the rq
+ * associated with that rt_se has not changed, there is no need to continue
+ * removing rt_se at higher levels. At this point, only the total number
+ * of removed rt_se needs to be recorded, and the rt_nr_running count of
+ * higher-level rqs needs to be decremented accordingly.
+ *
+ * For enqueue operations, if an rt_se at a certain level is in the rq,
+ * it is still necessary to check the priority of the higher-level rq.
+ * If the priority of the higher-level rq is found to be lower than that
+ * of the rt_se to be added, it should be removed, as updating the highest
+ * priority of the rq during addition will cause the rq to be repositioned
+ * in the parent rq.
+ *
+ * Conversely, for dequeue operations, if an rt_se at a certain level is
+ * not in the rq, the operation can be exited immediately to reduce
+ * unnecessary checks and handling.
+ *
+ * The return value refers to the last rt_se that was removed for enqueue
+ * operations. And for dequeue operations, it refers to the last rt_se
+ * that was either removed or had its rt_nr_running updated.
*/
-static void dequeue_rt_stack(struct sched_rt_entity *rt_se, unsigned int flags)
+static struct sched_rt_entity *dequeue_rt_stack(struct sched_rt_entity *rt_se,
+ unsigned int flags, int for_enqueue)
{
- struct sched_rt_entity *back = NULL;
- unsigned int rt_nr_running;
+ struct sched_rt_entity *last = rt_se;
+ struct sched_rt_entity *origin = rt_se;
+ unsigned int del_rt_nr = 0;
+ unsigned int del_rr_nr = 0;
+ int prio_changed = rt_se_prio(rt_se);
+ int sub_on_rq = 1;
for_each_sched_rt_entity(rt_se) {
- rt_se->back = back;
- back = rt_se;
- }
-
- rt_nr_running = rt_rq_of_se(back)->rt_nr_running;
+ if (on_rt_rq(rt_se)) {
+ if (sub_on_rq) {
+ /*
+ * The number of tasks removed at the sub-levels must also be
+ * subtracted from the current rt_se's rq, since that rq no longer
+ * contains the removed tasks.
+ */
+ dec_rq_nr_running(rt_se, del_rt_nr, del_rr_nr);
+
+ if (prio_changed) {
+ /*
+ * If the removal of the lower-level rt_se causes the
+ * highest priority of the current rq to change, then the
+ * current rt_se also needs to be removed from its parent
+ * rq, and the number of deleted tasks should be
+ * accumulated.
+ */
+ del_rt_nr += rt_se_nr_running(rt_se);
+ del_rr_nr += rt_se_rr_nr_running(rt_se);
+ prio_changed = __dequeue_rt_entity(rt_se,
+ prio_changed, flags);
+ last = rt_se;
+ } else if (!for_enqueue) {
+ /* For dequeue, last may be an rt_se whose rt_nr_running alone was modified. */
+ last = rt_se;
+ }
+ } else {
+ /*
+ * This branch can only be reached on enqueue, since dequeue breaks
+ * out of the walk as soon as an rt_se is not on its rq.
+ * If the sub-level node is not on the rq and the current rt_se's
+ * priority is lower than the one being added, the current rt_se
+ * needs to be removed.
+ */
+ prio_changed = rt_se_prio(rt_se);
+ if (prio_changed > rt_se_prio(origin)) {
+ del_rt_nr += rt_se_nr_running(rt_se);
+ del_rr_nr += rt_se_rr_nr_running(rt_se);
+ prio_changed = __dequeue_rt_entity(rt_se,
+ prio_changed, flags);
+ last = rt_se;
+ } else {
+ prio_changed = 0;
+ }
+ }
- for (rt_se = back; rt_se; rt_se = rt_se->back) {
- if (on_rt_rq(rt_se))
- __dequeue_rt_entity(rt_se, flags);
+ /*
+ * If the current rt_se is on the top rt_rq, then the already deleted
+ * nodes, plus the count of the rt_rq where the current rt_se is
+ * located, need to be removed from the top rt_rq.
+ */
+ if (on_top_rt_rq(rt_se)) {
+ dequeue_top_rt_rq(rt_rq_of_se(rt_se),
+ del_rt_nr + rt_rq_of_se(rt_se)->rt_nr_running);
+ }
+ sub_on_rq = 1;
+ } else if (for_enqueue) {
+ /*
+ * For an enqueue operation, if a level is found not to be on its rq,
+ * the counts accumulated so far need to be reset to zero.
+ */
+ prio_changed = 0;
+ sub_on_rq = 0;
+ del_rt_nr = 0;
+ del_rr_nr = 0;
+
+ if (on_top_rt_rq(rt_se))
+ dequeue_top_rt_rq(rt_rq_of_se(rt_se),
+ rt_rq_of_se(rt_se)->rt_nr_running);
+ } else {
+ last = rt_se;
+ break;
+ }
}
- dequeue_top_rt_rq(rt_rq_of_se(back), rt_nr_running);
+ return last;
}
+
static void enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
struct rq *rq = rq_of_rt_se(rt_se);
+ struct sched_rt_entity *last;
+ unsigned int add_rt_nr = 0;
+ unsigned int add_rr_nr = 0;
+ int enqueue = 1;
+ int prio_change = 1;
update_stats_enqueue_rt(rt_rq_of_se(rt_se), rt_se, flags);
- dequeue_rt_stack(rt_se, flags);
- for_each_sched_rt_entity(rt_se)
- __enqueue_rt_entity(rt_se, flags);
+ last = dequeue_rt_stack(rt_se, flags, 1);
+ for_each_sched_rt_entity(rt_se) {
+ if (enqueue || !on_rt_rq(rt_se) || (prio_change == 1)) {
+ prio_change = __enqueue_rt_entity(rt_se, flags);
+ if (prio_change >= 0) {
+ add_rt_nr = rt_se_nr_running(rt_se);
+ add_rr_nr = rt_se_rr_nr_running(rt_se);
+ } else {
+ add_rt_nr = add_rr_nr = 0;
+ }
+ } else {
+ add_rq_nr_running(rt_se, add_rt_nr, add_rr_nr);
+ }
+
+ if (rt_se == last)
+ enqueue = 0;
+ }
+
enqueue_top_rt_rq(&rq->rt);
}
static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
struct rq *rq = rq_of_rt_se(rt_se);
+ struct sched_rt_entity *last;
+ unsigned int add_rt_nr = 0;
+ unsigned int add_rr_nr = 0;
+ int prio_change = 1;
update_stats_dequeue_rt(rt_rq_of_se(rt_se), rt_se, flags);
- dequeue_rt_stack(rt_se, flags);
-
+ last = dequeue_rt_stack(rt_se, flags, 0);
for_each_sched_rt_entity(rt_se) {
struct rt_rq *rt_rq = group_rt_rq(rt_se);
+ if (rt_rq && rt_rq->rt_nr_running) {
+ if (on_rt_rq(rt_se)) {
+ add_rq_nr_running(rt_se, add_rt_nr, add_rr_nr);
+ } else {
+ prio_change = __enqueue_rt_entity(rt_se, flags);
+ if (prio_change == 0) {
+ /*
+ * If enqueue is successful and the priority of the rq has
+ * not changed, then the parent node only needs to add the
+ * count of the current rt_se. Otherwise, the parent node
+ * will also need to enqueue.
+ */
+ add_rt_nr = rt_se_nr_running(rt_se);
+ add_rr_nr = rt_se_rr_nr_running(rt_se);
+ }
+ }
+ } else {
+ add_rt_nr = add_rr_nr = 0;
+ }
- if (rt_rq && rt_rq->rt_nr_running)
- __enqueue_rt_entity(rt_se, flags);
+ /*
+ * last is the last rt_se that was removed or whose count was
+ * modified, so the rt_se levels above it need no update.
+ */
+ if (rt_se == last)
+ break;
}
+
enqueue_top_rt_rq(&rq->rt);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a831af102070..b634153aacf0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2878,6 +2878,7 @@ extern void print_rt_stats(struct seq_file *m, int cpu);
extern void print_dl_stats(struct seq_file *m, int cpu);
extern void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq);
extern void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq);
+extern void print_rt_rq_task(struct seq_file *m, struct rt_rq *rt_rq);
extern void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq);
extern void resched_latency_warn(int cpu, u64 latency);
--
2.45.2
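For readers who want the shape of the optimization without the rt.c
details, here is a minimal user-space sketch of the bottom-up idea. The
types are hypothetical simplifications (one cached top priority and one
task counter per level), not the kernel's rt_se/rt_rq:

	/* Hypothetical simplified hierarchy -- not the kernel's rt_se/rt_rq. */
	struct level {
		struct level *parent;
		int top_prio;			/* best priority queued at this level */
		unsigned int nr_running;	/* tasks queued below this level */
	};

	/* Stand-in for rescanning the per-level priority array. */
	static void recompute_top_prio(struct level *lv)
	{
		/* ... scan lv's local queues and refresh lv->top_prio ... */
	}

	/*
	 * Bottom-up removal: every ancestor loses one task from its count,
	 * but an entity only needs to be repositioned in its parent's queue
	 * while the removal keeps changing the level's top priority.  Once
	 * the top priority stops changing, only counters are touched.
	 */
	static void dequeue_bottom_up(struct level *lv)
	{
		int prio_changed = 1;
		int old_top;

		for (; lv; lv = lv->parent) {
			lv->nr_running--;
			if (!prio_changed)
				continue;

			old_top = lv->top_prio;
			recompute_top_prio(lv);
			prio_changed = (lv->top_prio != old_top);
		}
	}

The old dequeue_rt_stack() always performed the full removal at every
level, top-down; the sketch marks the point where the new code can stop
early and fall back to count-only updates.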
Hello,
kernel test robot noticed "WARNING:at_kernel/sched/rt.c:#__enqueue_rt_entity" on:
commit: ed0ed14c2b47993c00c4b3cdceabef535bcef32b ("[PATCH-RT sched v2 1/2] RT SCHED: Optimize the enqueue and dequeue operations for rt_se")
url: https://github.com/intel-lab-lkp/linux/commits/Xavier/RT-SCHED-Optimize-the-enqueue-and-dequeue-operations-for-rt_se/20240630-173825
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git c793a62823d1ce8f70d9cfc7803e3ea436277cda
patch link: https://lore.kernel.org/all/20240629112812.243691-2-xavier_qy@163.com/
patch subject: [PATCH-RT sched v2 1/2] RT SCHED: Optimize the enqueue and dequeue operations for rt_se
in testcase: blktests
version: blktests-x86_64-775a058-1_20240702
with following parameters:
disk: 1SSD
test: block-group-01
compiler: gcc-13
test machine: 4 threads Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz (Skylake) with 32G memory
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202407041644.de55c25-oliver.sang@intel.com
[ 54.093440][ C2] ------------[ cut here ]------------
[ 54.094193][ T705] list_add double add: new=ffff888802a8abc0, prev=ffff888802a8abc0, next=ffff8887892c4dd0.
[ 54.098261][ C2] WARNING: CPU: 2 PID: 53 at kernel/sched/rt.c:1415 __enqueue_rt_entity (kernel/sched/rt.c:1415 (discriminator 1))
[ 54.103613][ T705] ------------[ cut here ]------------
[ 54.113477][ C2] Modules linked in: dm_multipath
[ 54.122743][ T705] kernel BUG at lib/list_debug.c:35!
[ 54.128080][ C2] btrfs blake2b_generic
[ 54.132987][ T705] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
[ 54.138148][ C2] xor zstd_compress
[ 54.142266][ T705] CPU: 3 PID: 705 Comm: multipathd Tainted: G S 6.10.0-rc1-00010-ged0ed14c2b47 #1
[ 54.149087][ C2] raid6_pq libcrc32c
[ 54.152852][ T705] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.8.1 12/05/2017
[ 54.163339][ C2] ipmi_devintf ipmi_msghandler
[ 54.167192][ T705] RIP: 0010:__list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
[ 54.175322][ C2] intel_rapl_msr intel_rapl_common
[ 54.180049][ T705] Code: 0b 48 89 f1 48 c7 c7 00 fa 26 84 48 89 de e8 d6 75 f2 fe 0f 0b 48 89 f2 48 89 d9 48 89 ee 48 c7 c7 80 fa 26 84 e8 bf 75 f2 fe <0f> 0b 48 89 f7 48 89 34 24 e8 11 cc 61 ff 48 8b 34 24 e9 71 ff ff
All code
========
0: 0b 48 89 or -0x77(%rax),%ecx
3: f1 icebp
4: 48 c7 c7 00 fa 26 84 mov $0xffffffff8426fa00,%rdi
b: 48 89 de mov %rbx,%rsi
e: e8 d6 75 f2 fe callq 0xfffffffffef275e9
13: 0f 0b ud2
15: 48 89 f2 mov %rsi,%rdx
18: 48 89 d9 mov %rbx,%rcx
1b: 48 89 ee mov %rbp,%rsi
1e: 48 c7 c7 80 fa 26 84 mov $0xffffffff8426fa80,%rdi
25: e8 bf 75 f2 fe callq 0xfffffffffef275e9
2a:* 0f 0b ud2 <-- trapping instruction
2c: 48 89 f7 mov %rsi,%rdi
2f: 48 89 34 24 mov %rsi,(%rsp)
33: e8 11 cc 61 ff callq 0xffffffffff61cc49
38: 48 8b 34 24 mov (%rsp),%rsi
3c: e9 .byte 0xe9
3d: 71 ff jno 0x3e
3f: ff .byte 0xff
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 48 89 f7 mov %rsi,%rdi
5: 48 89 34 24 mov %rsi,(%rsp)
9: e8 11 cc 61 ff callq 0xffffffffff61cc1f
e: 48 8b 34 24 mov (%rsp),%rsi
12: e9 .byte 0xe9
13: 71 ff jno 0x14
15: ff .byte 0xff
[ 54.186345][ C2] sd_mod t10_pi
[ 54.191424][ T705] RSP: 0018:ffffc90000327b38 EFLAGS: 00010046
[ 54.211022][ C2] x86_pkg_temp_thermal
[ 54.214447][ T705]
[ 54.220405][ C2] crc64_rocksoft_generic crc64_rocksoft
[ 54.224435][ T705] RAX: 0000000000000058 RBX: ffff8887892c4dd0 RCX: ffffffff82424f4e
[ 54.226632][ C2] intel_powerclamp crc64
[ 54.232145][ T705] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8887893b5380
[ 54.240012][ C2] coretemp sg
[ 54.244217][ T705] RBP: ffff888802a8abc0 R08: 0000000000000001 R09: fffff52000064f22
[ 54.252087][ C2] kvm_intel i915
[ 54.255330][ T705] R10: ffffc90000327917 R11: 205d324320202020 R12: ffff888802a8abc0
[ 54.263200][ C2] kvm crct10dif_pclmul
[ 54.266705][ T705] R13: ffff8887892c4dd0 R14: ffff888802a8ac00 R15: ffff8887892c4dd8
[ 54.274572][ C2] crc32_pclmul crc32c_intel
[ 54.278599][ T705] FS: 00007f1b015ee680(0000) GS:ffff888789380000(0000) knlGS:0000000000000000
[ 54.286469][ C2] drm_buddy ghash_clmulni_intel
[ 54.290934][ T705] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 54.299764][ C2] intel_gtt sha512_ssse3
[ 54.304580][ T705] CR2: 000055e6a99e25f8 CR3: 000000080473e006 CR4: 00000000003706f0
[ 54.311054][ C2] drm_display_helper
[ 54.315255][ T705] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 54.323124][ C2] rapl ttm
[ 54.326976][ T705] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 54.334845][ C2] drm_kms_helper
[ 54.337825][ T705] Call Trace:
[ 54.345696][ C2] ahci mei_wdt
[ 54.349201][ T705] <TASK>
[ 54.352357][ C2] intel_cstate wmi_bmof
[ 54.355687][ T705] ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
[ 54.358493][ C2] intel_uncore
[ 54.362610][ T705] ? do_trap (arch/x86/kernel/traps.c:114 arch/x86/kernel/traps.c:155)
[ 54.366202][ C2] binfmt_misc video
[ 54.369533][ T705] ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
[ 54.373650][ C2] libahci mei_me
[ 54.377418][ T705] ? do_error_trap (arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:176)
[ 54.383104][ C2] i2c_i801 wmi
[ 54.386607][ T705] ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
[ 54.391070][ C2] intel_pch_thermal i2c_smbus
[ 54.394400][ T705] ? handle_invalid_op (arch/x86/kernel/traps.c:214)
[ 54.400087][ C2] mei libata
[ 54.404727][ T705] ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
[ 54.409540][ C2] acpi_pad fuse
[ 54.412697][ T705] ? exc_invalid_op (arch/x86/kernel/traps.c:267)
[ 54.418385][ C2] loop drm
[ 54.421803][ T705] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:621)
[ 54.426355][ C2] dm_mod ip_tables
[ 54.429337][ T705] ? llist_add_batch (lib/llist.c:33 (discriminator 14))
[ 54.434240][ C2]
[ 54.437928][ T705] ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
[ 54.442661][ C2] CPU: 2 PID: 53 Comm: khugepaged Tainted: G S 6.10.0-rc1-00010-ged0ed14c2b47 #1
[ 54.444859][ T705] ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
[ 54.450557][ C2] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.8.1 12/05/2017
[ 54.460974][ T705] __enqueue_rt_entity (include/linux/list.h:150 (discriminator 1) include/linux/list.h:183 (discriminator 1) kernel/sched/rt.c:1419 (discriminator 1))
[ 54.466661][ C2] RIP: 0010:__enqueue_rt_entity (kernel/sched/rt.c:1415 (discriminator 1))
[ 54.474792][ T705] enqueue_rt_entity (kernel/sched/rt.c:1616)
[ 54.479778][ C2] Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 1f 03 00 00 49 8b bf 40 0a 00 00 44 89 ea 48 81 c7 b8 00 00 00 e8 15 72 05 00 e9 23 fa ff ff <0f> 0b e9 9b f6 ff ff 48 89 ee 48 89 df e8 8e d1 ff ff e9 f6 f5 ff
All code
========
0: fa cli
1: 48 c1 ea 03 shr $0x3,%rdx
5: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1)
9: 0f 85 1f 03 00 00 jne 0x32e
f: 49 8b bf 40 0a 00 00 mov 0xa40(%r15),%rdi
16: 44 89 ea mov %r13d,%edx
19: 48 81 c7 b8 00 00 00 add $0xb8,%rdi
20: e8 15 72 05 00 callq 0x5723a
25: e9 23 fa ff ff jmpq 0xfffffffffffffa4d
2a:* 0f 0b ud2 <-- trapping instruction
2c: e9 9b f6 ff ff jmpq 0xfffffffffffff6cc
31: 48 89 ee mov %rbp,%rsi
34: 48 89 df mov %rbx,%rdi
37: e8 8e d1 ff ff callq 0xffffffffffffd1ca
3c: e9 .byte 0xe9
3d: f6 f5 div %ch
3f: ff .byte 0xff
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: e9 9b f6 ff ff jmpq 0xfffffffffffff6a2
7: 48 89 ee mov %rbp,%rsi
a: 48 89 df mov %rbx,%rdi
d: e8 8e d1 ff ff callq 0xffffffffffffd1a0
12: e9 .byte 0xe9
13: f6 f5 div %ch
15: ff .byte 0xff
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240704/202407041644.de55c25-oliver.sang@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
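For context on the splat above: "list_add double add" is the
CONFIG_DEBUG_LIST report for inserting a list node that is already
linked at the insertion point (new == prev or new == next). Here it
suggests an rt_se whose on_list bookkeeping went stale was enqueued a
second time. A simplified user-space rendering of the check follows;
the kernel's version lives in lib/list_debug.c, and this is a
paraphrase, not the exact code:

	#include <stdbool.h>
	#include <stdio.h>

	struct list_head { struct list_head *next, *prev; };

	/* Simplified insertion check in the spirit of CONFIG_DEBUG_LIST. */
	static bool list_add_valid(struct list_head *new_node,
				   struct list_head *prev,
				   struct list_head *next)
	{
		if (new_node == prev || new_node == next) {
			fprintf(stderr,
				"list_add double add: new=%p, prev=%p, next=%p.\n",
				(void *)new_node, (void *)prev, (void *)next);
			return false;
		}
		/* the real check also validates prev/next linkage */
		return prev->next == next && next->prev == prev;
	}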
Add test cases for RT group scheduling: create several RT infinite-loop
processes/threads, set them to the same or different priorities, place
them in different RT task groups, let them run for a period of time,
and finally count the number of infinite-loop iterations for all tasks.
Signed-off-by: Xavier <xavier_qy@163.com>
---
MAINTAINERS | 7 +
tools/testing/selftests/sched/Makefile | 4 +-
tools/testing/selftests/sched/deadloop.c | 192 ++++++++++++++++++
.../selftests/sched/rt_group_sched_test.sh | 119 +++++++++++
4 files changed, 320 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/sched/deadloop.c
create mode 100755 tools/testing/selftests/sched/rt_group_sched_test.sh
diff --git a/MAINTAINERS b/MAINTAINERS
index 43353b705988..d29effe57bf8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -19480,6 +19480,13 @@ L: linux-remoteproc@vger.kernel.org
S: Maintained
F: drivers/tty/rpmsg_tty.c
+RT GROUP SCHED TEST
+M: Xavier <xavier_qy@163.com>
+L: linux-kernel@vger.kernel.org
+S: Maintained
+F: tools/testing/selftests/sched/deadloop.c
+F: tools/testing/selftests/sched/rt_group_sched_test.sh
+
RTL2830 MEDIA DRIVER
L: linux-media@vger.kernel.org
S: Orphan
diff --git a/tools/testing/selftests/sched/Makefile b/tools/testing/selftests/sched/Makefile
index 099ee9213557..96decb58bf35 100644
--- a/tools/testing/selftests/sched/Makefile
+++ b/tools/testing/selftests/sched/Makefile
@@ -8,7 +8,7 @@ CFLAGS += -O2 -Wall -g -I./ $(KHDR_INCLUDES) -Wl,-rpath=./ \
$(CLANG_FLAGS)
LDLIBS += -lpthread
-TEST_GEN_FILES := cs_prctl_test
-TEST_PROGS := cs_prctl_test
+TEST_GEN_FILES := cs_prctl_test deadloop
+TEST_PROGS := cs_prctl_test deadloop
include ../lib.mk
diff --git a/tools/testing/selftests/sched/deadloop.c b/tools/testing/selftests/sched/deadloop.c
new file mode 100644
index 000000000000..d850a3e2a0ab
--- /dev/null
+++ b/tools/testing/selftests/sched/deadloop.c
@@ -0,0 +1,192 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#define _GNU_SOURCE
+#include <stdlib.h>
+#include <sys/types.h>
+#include <stdio.h>
+#include <string.h>
+#include <errno.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <signal.h>
+#include <sched.h>
+
+/*
+ * Create multiple infinite-loop threads based on the passed parameters.
+ * Usage: deadloop num policy prio
+ * num: the number of child threads
+ * policy: the scheduling policy of the child threads, 0-fair, 1-fifo, 2-rr
+ * prio: the priority
+ * When this process receives SIGTERM, it prints the loop count of every
+ * child thread to OUTPUT_FILE.
+ *
+ * Date: June 27, 2024
+ * Author: Xavier <xavier_qy@163.com>
+ */
+
+#define OUTPUT_FILE "rt_group_sched_test.log"
+
+#if __GLIBC_PREREQ(2, 30) == 0
+#include <sys/syscall.h>
+static pid_t gettid(void)
+{
+ return syscall(SYS_gettid);
+}
+#endif
+
+#define do_err(x) \
+do { \
+ if ((x) < 0) { \
+ printf("test BUG_ON func %s, line %d %ld\n", \
+ __func__, __LINE__, (long)(x) \
+ ); \
+ while (1) \
+ sleep(1); \
+ } \
+} while (0)
+
+#define do_false(x) \
+do { \
+ if ((x) == 1) { \
+ printf("test BUG_ON func %s, line %d %d\n", \
+ __func__, __LINE__, (x) \
+ ); \
+ while (1) \
+ sleep(1); \
+ } \
+} while (0)
+
+
+struct thread_data {
+ pthread_t thread;
+ int index;
+ int pid;
+ unsigned long cnt;
+};
+
+static struct thread_data *pdata;
+static int thread_num = 1;
+
+static void create_thread_posix(void *entry, pthread_t *thread, int *para,
+ int policy, int prio)
+{
+ int ret;
+ struct sched_param param;
+ pthread_attr_t attr;
+
+ memset(&param, 0, sizeof(param));
+ ret = pthread_attr_init(&attr);
+ do_err(ret);
+
+ ret = pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
+ do_err(ret);
+
+ param.sched_priority = prio;
+
+ ret = pthread_attr_setschedpolicy(&attr, policy);
+ do_err(ret);
+
+ ret = pthread_attr_setschedparam(&attr, &param);
+ do_err(ret);
+
+ ret = pthread_create(thread, &attr, entry, para);
+ do_err(ret);
+}
+
+static void *dead_loop_entry(void *arg)
+{
+ int index = *(int *)arg;
+ struct sched_param param;
+ int cur = gettid();
+
+ sched_getparam(cur, &param);
+ pdata[index].pid = cur;
+ printf("cur:%d prio:%d\n", cur, param.sched_priority);
+
+ while (1) {
+ asm volatile("" ::: "memory");
+ pdata[index].cnt++;
+ }
+ return NULL;
+}
+
+static void handle_signal(int signal)
+{
+ int cnt = 0;
+
+ if (signal == SIGTERM) {
+ FILE *file = freopen(OUTPUT_FILE, "a", stdout);
+
+ if (file == NULL) {
+ perror("freopen");
+ exit(0);
+ }
+
+ while (cnt < thread_num) {
+ printf("pid:%d cnt:%ld\n", pdata[cnt].pid, pdata[cnt].cnt);
+ cnt++;
+ }
+ fclose(file);
+ exit(0);
+ }
+}
+
+static int dead_loop_create(int policy, int prio)
+{
+ int cnt = 0;
+ int ret;
+ void *status;
+ struct sched_param param;
+
+ param.sched_priority = prio;
+ pdata = malloc(thread_num * sizeof(struct thread_data));
+ do_false(!pdata);
+
+ if (policy) {
+ ret = sched_setscheduler(0, policy, &param);
+ do_err(ret);
+ }
+
+ while (cnt < thread_num) {
+ pdata[cnt].index = cnt;
+ create_thread_posix(dead_loop_entry, &pdata[cnt].thread,
+ &pdata[cnt].index, policy, prio);
+ cnt++;
+ }
+
+ signal(SIGTERM, handle_signal);
+
+ cnt = 0;
+ while (cnt < thread_num) {
+ pthread_join(pdata[cnt].thread, &status);
+ cnt++;
+ }
+
+ free(pdata);
+ return 0;
+}
+
+int main(int argc, char **argv)
+{
+ int policy = 2;
+ int prio = 50;
+
+ if (argc == 2)
+ thread_num = atoi(argv[1]);
+
+ if (argc == 3) {
+ thread_num = atoi(argv[1]);
+ policy = atoi(argv[2]);
+ if (policy > 0)
+ prio = 50;
+ }
+
+ if (argc == 4) {
+ thread_num = atoi(argv[1]);
+ policy = atoi(argv[2]);
+ prio = atoi(argv[3]);
+ }
+
+ dead_loop_create(policy, prio);
+
+ return 0;
+}
diff --git a/tools/testing/selftests/sched/rt_group_sched_test.sh b/tools/testing/selftests/sched/rt_group_sched_test.sh
new file mode 100755
index 000000000000..9031250a2684
--- /dev/null
+++ b/tools/testing/selftests/sched/rt_group_sched_test.sh
@@ -0,0 +1,119 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Test for rt group scheduling
+# Date: June 27, 2024
+# Author: Xavier <xavier_qy@163.com>
+
+# Record the list of child process PIDs
+PIDS=()
+
+# File for redirected output
+LOGFILE="rt_group_sched_test.log"
+
+# Cleanup function: kill all recorded child processes and unmount the cgroup
+function cleanup() {
+ echo "Cleaning up..."
+ for pid in "${PIDS[@]}"; do
+ if kill -0 $pid 2>/dev/null; then
+ kill -TERM $pid
+ fi
+ done
+
+ # Sleep for a while to ensure the processes are properly killed
+ sleep 2
+
+ # Unmount the cgroup filesystem
+ umount /sys/fs/cgroup/cpu 2>/dev/null
+ umount /sys/fs/cgroup 2>/dev/null
+ echo "Cleanup completed."
+
+ # Ensure the LOGFILE exists and is correct
+ if [ ! -f "$LOGFILE" ]; then
+ echo "$LOGFILE not found!"
+ exit 1
+ fi
+
+ # Initialize the total count variable
+ total=0
+
+ # Read matching lines and calculate the total sum
+ while IFS= read -r line
+ do
+ # Use grep to match lines containing 'pid:' and 'cnt:', and extract the value of cnt
+ if echo "$line" | grep -q '^pid:[[:digit:]]\+ cnt:[[:digit:]]\+'; then
+ cnt=$(echo "$line" | sed -n \
+ 's/^pid:[[:digit:]]\+ cnt:\([[:digit:]]\+\)/\1/p')
+ total=$((total + cnt))
+ fi
+ done < "$LOGFILE"
+
+ # Print the total sum
+ echo "Total cnt: $total"
+ echo "Finished processing."
+}
+
+# Capture actions when interrupted or terminated by a signal
+trap cleanup EXIT
+
+# Start the cgroup filesystem and create the necessary directories
+function setup_cgroups() {
+ mount -t tmpfs -o mode=755 cgroup_root /sys/fs/cgroup
+ mkdir -p /sys/fs/cgroup/cpu
+ mount -t cgroup -o cpu none /sys/fs/cgroup/cpu
+}
+
+# Create cgroup subdirectories and configure their settings
+function create_child_cgroup() {
+ local base_dir=$1
+ local name=$2
+ local rt_period=$3
+ local rt_runtime=$4
+ mkdir -p "$base_dir/$name"
+ echo $rt_period > "$base_dir/$name/cpu.rt_period_us"
+ echo $rt_runtime > "$base_dir/$name/cpu.rt_runtime_us"
+}
+# Launch a process and add it to the specified cgroup
+function launch_process() {
+ local process_name=$1
+
+ # Three parameters representing the number of child threads, scheduling policy, and priority
+ local args=$2
+ local cgroup_path=$3
+
+ # Launch the process
+ exec -a $process_name ./deadloop $args &
+ local pid=$!
+ PIDS+=($pid)
+
+ # Short sleep to ensure the process starts
+ sleep 1
+
+ # Check if the process started successfully
+ if ! pgrep -x $process_name > /dev/null; then
+ echo "Error: No process found with name $process_name."
+ exit 1
+ fi
+
+ echo $pid > "$cgroup_path/cgroup.procs"
+ echo "Process $process_name with PID $pid added to cgroup $cgroup_path"
+}
+
+# Main function running all tasks
+function main() {
+ echo "The test needs 30 seconds..."
+ rm -f "$LOGFILE"
+ setup_cgroups
+ create_child_cgroup "/sys/fs/cgroup/cpu" "child1" 1000000 800000
+ create_child_cgroup "/sys/fs/cgroup/cpu/child1" "child2" 1000000 700000
+ create_child_cgroup "/sys/fs/cgroup/cpu/child1/child2" "child3" 1000000 600000
+ launch_process "child1" "3 2 50" "/sys/fs/cgroup/cpu/child1"
+ launch_process "child2" "3 2 50" "/sys/fs/cgroup/cpu/child1/child2"
+ launch_process "child3" "1 2 50" "/sys/fs/cgroup/cpu/child1/child2/child3"
+ launch_process "tg_root" "1 2 50" "/sys/fs/cgroup/cpu"
+
+ # Run for 30 seconds
+ sleep 30
+}
+
+# Execute the main function
+main
--
2.45.2
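Assuming the selftests build as usual from the kernel tree, a run might
look roughly like this (root is needed because the script mounts
cgroups, and the kernel must have CONFIG_RT_GROUP_SCHED enabled):

	$ make -C tools/testing/selftests/sched
	$ cd tools/testing/selftests/sched

	# run one infinite-loop group by hand: 3 threads, SCHED_RR (2), prio 50
	$ sudo ./deadloop 3 2 50 &
	$ sudo kill -TERM $!	# appends per-thread counts to rt_group_sched_test.log

	# or run the whole grouped scenario for 30 seconds
	$ sudo ./rt_group_sched_test.sh
	The test needs 30 seconds...
	...
	Total cnt: <sum of all per-thread loop counts>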