During boot initialization, DL server parameters are initialized using the
default root domain before the proper scheduler domains and root domains
are built. This results in DL server parameters being tied to the default
root domain's bandwidth accounting instead of the actual root domain
assigned to the CPU after scheduler topology initialization.
When secondary CPUs are brought up, the dl_bw_cpus() accounting doesn't
properly track CPUs being added since the DL server was started too early
with the default root domain. Specifically, dl_bw_cpus() is called before
set_cpu_active() during secondary CPU bringup, causing it to not account
for the CPU being brought up in its capacity calculations. This causes
subsequent sysfs parameter updates to fail with -EBUSY due to bandwidth
accounting using the wrong root domain with zeroed total_bw.
This issue also causes under-utilization of system capacity. With the fix,
we see proper capacity initialization and scaling as CPUs come online - the
total system capacity increases from CPU 0 to CPU 1 and continues scaling
up as more CPUs are added (from cap=1024 initially to cap=8192 with 8
CPUs). Without the fix, the capacity initialization was incomplete since
dl_bw_cpus() runs before the CPU is marked active in set_cpu_active(),
leading to CPUs not being properly accounted for in the capacity
calculations.
Fix this by tracking the last root domain used for the DL server and
resetting the server parameters when the root domain changes. This ensures
bandwidth accounting uses the correct, fully initialized root domain after
the scheduler topology is built.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Aashish Sharma <shraash@google.com>
Cc: Shin Kawamura <kawasin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vineeth Remanan Pillai <vineeth@bitbyteword.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
include/linux/sched.h | 1 +
kernel/sched/deadline.c | 13 +++++++++----
2 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1d5cc3e50884..427d1da79d05 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -695,6 +695,7 @@ struct sched_dl_entity {
struct rq *rq;
dl_server_has_tasks_f server_has_tasks;
dl_server_pick_f server_pick_task;
+ struct root_domain *last_rd;
#ifdef CONFIG_RT_MUTEXES
/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d9d5a702f1a6..076b372f28b5 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1627,21 +1627,26 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
void dl_server_start(struct sched_dl_entity *dl_se)
{
struct rq *rq = dl_se->rq;
+ int cpu = cpu_of(rq);
+ struct root_domain *rd = cpu_rq(cpu)->rd;
/*
* XXX: the apply do not work fine at the init phase for the
* fair server because things are not yet set. We need to improve
* this before getting generic.
*/
- if (!dl_server(dl_se)) {
+ if (!dl_server(dl_se) || dl_se->last_rd != rd) {
u64 runtime = 50 * NSEC_PER_MSEC;
u64 period = 1000 * NSEC_PER_MSEC;
+ dl_se->last_rd = rd;
dl_server_apply_params(dl_se, runtime, period, 1);
- dl_se->dl_server = 1;
- dl_se->dl_defer = 1;
- setup_new_dl_entity(dl_se);
+ if (!dl_server(dl_se)) {
+ dl_se->dl_server = 1;
+ dl_se->dl_defer = 1;
+ setup_new_dl_entity(dl_se);
+ }
}
if (!dl_se->dl_runtime)
--
2.47.0.163.g1226f6d8fa-goog
Hi Joel,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on linus/master v6.12-rc5 next-20241030]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Joel-Fernandes-Google/dl_server-Reset-DL-server-params-when-rd-changes/20241030-065241
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20241029225116.3998487-1-joel%40joelfernandes.org
patch subject: [PATCH] dl_server: Reset DL server params when rd changes
config: arm-allnoconfig (https://download.01.org/0day-ci/archive/20241031/202410310046.qSHsY4dH-lkp@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 639a7ac648f1e50ccd2556e17d401c04f9cce625)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241031/202410310046.qSHsY4dH-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410310046.qSHsY4dH-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from kernel/sched/build_policy.c:19:
   In file included from include/linux/sched/isolation.h:5:
   In file included from include/linux/cpuset.h:17:
   In file included from include/linux/mm.h:2213:
   include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     518 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                                ~~~~~~~~~~~ ^ ~~~
   In file included from kernel/sched/build_policy.c:60:
>> kernel/sched/deadline.c:1631:40: error: no member named 'rd' in 'struct rq'
    1631 |         struct root_domain *rd = cpu_rq(cpu)->rd;
         |                                  ~~~~~~~~~~~ ^
   1 warning and 1 error generated.


vim +1631 kernel/sched/deadline.c

  1626	
  1627	void dl_server_start(struct sched_dl_entity *dl_se)
  1628	{
  1629		struct rq *rq = dl_se->rq;
  1630		int cpu = cpu_of(rq);
> 1631		struct root_domain *rd = cpu_rq(cpu)->rd;
  1632	
  1633		/*
  1634		 * XXX: the apply do not work fine at the init phase for the
  1635		 * fair server because things are not yet set. We need to improve
  1636		 * this before getting generic.
  1637		 */
  1638		if (!dl_server(dl_se) || dl_se->last_rd != rd) {
  1639			u64 runtime = 50 * NSEC_PER_MSEC;
  1640			u64 period = 1000 * NSEC_PER_MSEC;
  1641	
  1642			dl_se->last_rd = rd;
  1643			dl_server_apply_params(dl_se, runtime, period, 1);
  1644	
  1645			if (!dl_server(dl_se)) {
  1646				dl_se->dl_server = 1;
  1647				dl_se->dl_defer = 1;
  1648				setup_new_dl_entity(dl_se);
  1649			}
  1650		}
  1651	
  1652		if (!dl_se->dl_runtime)
  1653			return;
  1654	
  1655		enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
  1656		if (!dl_task(dl_se->rq->curr) || dl_entity_preempt(dl_se, &rq->curr->dl))
  1657			resched_curr(dl_se->rq);
  1658	}
  1659	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Joel,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on linus/master v6.12-rc5 next-20241030]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Joel-Fernandes-Google/dl_server-Reset-DL-server-params-when-rd-changes/20241030-065241
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20241029225116.3998487-1-joel%40joelfernandes.org
patch subject: [PATCH] dl_server: Reset DL server params when rd changes
config: openrisc-allnoconfig (https://download.01.org/0day-ci/archive/20241030/202410302318.EbfnSdKM-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241030/202410302318.EbfnSdKM-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410302318.EbfnSdKM-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from kernel/sched/build_policy.c:60:
   kernel/sched/deadline.c: In function 'dl_server_start':
>> kernel/sched/deadline.c:1631:47: error: 'struct rq' has no member named 'rd'; did you mean 'rt'?
    1631 |         struct root_domain *rd = cpu_rq(cpu)->rd;
         |                                               ^~
         |                                               rt


vim +1631 kernel/sched/deadline.c

  1626	
  1627	void dl_server_start(struct sched_dl_entity *dl_se)
  1628	{
  1629		struct rq *rq = dl_se->rq;
  1630		int cpu = cpu_of(rq);
> 1631		struct root_domain *rd = cpu_rq(cpu)->rd;
  1632	
  1633		/*
  1634		 * XXX: the apply do not work fine at the init phase for the
  1635		 * fair server because things are not yet set. We need to improve
  1636		 * this before getting generic.
  1637		 */
  1638		if (!dl_server(dl_se) || dl_se->last_rd != rd) {
  1639			u64 runtime = 50 * NSEC_PER_MSEC;
  1640			u64 period = 1000 * NSEC_PER_MSEC;
  1641	
  1642			dl_se->last_rd = rd;
  1643			dl_server_apply_params(dl_se, runtime, period, 1);
  1644	
  1645			if (!dl_server(dl_se)) {
  1646				dl_se->dl_server = 1;
  1647				dl_se->dl_defer = 1;
  1648				setup_new_dl_entity(dl_se);
  1649			}
  1650		}
  1651	
  1652		if (!dl_se->dl_runtime)
  1653			return;
  1654	
  1655		enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
  1656		if (!dl_task(dl_se->rq->curr) || dl_entity_preempt(dl_se, &rq->curr->dl))
  1657			resched_curr(dl_se->rq);
  1658	}
  1659	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
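Both robot reports point at the same problem: struct rq only has an rd member (and root domains exist at all) when CONFIG_SMP is set, so the new cpu_rq(cpu)->rd dereference cannot build on UP configurations such as allnoconfig. Purely as a hypothetical sketch of the constraint, not the fix eventually adopted in this thread, the root-domain tracking would have to compile away on !SMP kernels, along these lines:

/* Hypothetical sketch only: keep the root-domain tracking SMP-only. */
void dl_server_start(struct sched_dl_entity *dl_se)
{
	struct rq *rq = dl_se->rq;
	bool rd_changed = false;

#ifdef CONFIG_SMP
	/* Only SMP kernels have per-rq root domains to compare against. */
	rd_changed = dl_se->last_rd != rq->rd;
	dl_se->last_rd = rq->rd;
#endif

	if (!dl_server(dl_se) || rd_changed) {
		/* ... apply default parameters as in the patch above ... */
	}

	/* ... rest of dl_server_start() unchanged ... */
}

On UP there is only one implicit "domain" anyway, so there is nothing to re-track when topology is built; the sketch simply never sees rd_changed == true there.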
Hi Joel,

On 29/10/24 22:51, Joel Fernandes (Google) wrote:
> During boot initialization, DL server parameters are initialized using the
> default root domain before the proper scheduler domains and root domains
> are built. This results in DL server parameters being tied to the default
> root domain's bandwidth accounting instead of the actual root domain
> assigned to the CPU after scheduler topology initialization.
>
> When secondary CPUs are brought up, the dl_bw_cpus() accounting doesn't
> properly track CPUs being added since the DL server was started too early
> with the default root domain. Specifically, dl_bw_cpus() is called before
> set_cpu_active() during secondary CPU bringup, causing it to not account
> for the CPU being brought up in its capacity calculations. This causes
> subsequent sysfs parameter updates to fail with -EBUSY due to bandwidth
> accounting using the wrong root domain with zeroed total_bw.
>
> This issue also causes under-utilization of system capacity. With the fix,
> we see proper capacity initialization and scaling as CPUs come online - the
> total system capacity increases from CPU 0 to CPU 1 and continues scaling
> up as more CPUs are added (from cap=1024 initially to cap=8192 with 8
> CPUs). Without the fix, the capacity initialization was incomplete since
> dl_bw_cpus() runs before the CPU is marked active in set_cpu_active(),
> leading to CPUs not being properly accounted for in the capacity
> calculations.
>
> Fix this by tracking the last root domain used for the DL server and
> resetting the server parameters when the root domain changes. This ensures
> bandwidth accounting uses the correct, fully initialized root domain after
> the scheduler topology is built.

So, I'm trying to reproduce this issue, but currently not really seeing
it, sorry.

I'm on a 40 CPUs box and, even if I fiddle with hotplug, the numbers I
see from debug (bw, total_bw) seem sane and consistent with the fair
server settings.

Could you please provide additional info about how you reproduce the
issue? Maybe you have a test script around you could share?

Thanks!
Juri
On Wed, Oct 30, 2024 at 03:30:12PM +0100, Juri Lelli wrote:
> Hi Joel,

Hi Juri!

> On 29/10/24 22:51, Joel Fernandes (Google) wrote:
> > During boot initialization, DL server parameters are initialized using the
> > default root domain before the proper scheduler domains and root domains
> > are built. This results in DL server parameters being tied to the default
> > root domain's bandwidth accounting instead of the actual root domain
> > assigned to the CPU after scheduler topology initialization.
> >
> > When secondary CPUs are brought up, the dl_bw_cpus() accounting doesn't
> > properly track CPUs being added since the DL server was started too early
> > with the default root domain. Specifically, dl_bw_cpus() is called before
> > set_cpu_active() during secondary CPU bringup, causing it to not account
> > for the CPU being brought up in its capacity calculations. This causes
> > subsequent sysfs parameter updates to fail with -EBUSY due to bandwidth
> > accounting using the wrong root domain with zeroed total_bw.
> >
> > This issue also causes under-utilization of system capacity. With the fix,
> > we see proper capacity initialization and scaling as CPUs come online - the
> > total system capacity increases from CPU 0 to CPU 1 and continues scaling
> > up as more CPUs are added (from cap=1024 initially to cap=8192 with 8
> > CPUs). Without the fix, the capacity initialization was incomplete since
> > dl_bw_cpus() runs before the CPU is marked active in set_cpu_active(),
> > leading to CPUs not being properly accounted for in the capacity
> > calculations.
> >
> > Fix this by tracking the last root domain used for the DL server and
> > resetting the server parameters when the root domain changes. This ensures
> > bandwidth accounting uses the correct, fully initialized root domain after
> > the scheduler topology is built.
>
> So, I'm trying to reproduce this issue, but currently not really seeing
> it, sorry.
>
> I'm on a 40 CPUs box and, even if I fiddle with hotplug, the numbers I
> see from debug (bw, total_bw) seem sane and consistent with the fair
> server settings.
>
> Could you please provide additional info about how you reproduce the
> issue? Maybe you have a test script around you could share?

With some prints [1] in the kernel, we can see on boot:

$ dmesg|grep appl
[ 0.930337] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1
[ 0.949025] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1
[ 0.953026] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=2, cap=2048, init=1
[ 0.957024] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=3, cap=3072, init=1
[ 0.961023] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=4, cap=4096, init=1
[ 0.965030] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=5, cap=5120, init=1
[ 0.969024] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=6, cap=6144, init=1
[ 0.973024] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=7, cap=7168, init=1

For the 8th apply_params, the 8th CPU is not considered. This is because
set_cpu_active() for the 8th CPU has not yet happened as mentioned in the
commit message.

With the patch:

$ dmesg|grep appl
[ 0.961169] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1
[ 0.981936] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1
[ 0.985836] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=2, cap=2048, init=1
[ 0.989835] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=3, cap=3072, init=1
[ 0.993840] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=4, cap=4096, init=1
[ 0.997835] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=5, cap=5120, init=1
[ 1.001838] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=6, cap=6144, init=1
[ 1.005834] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=7, cap=7168, init=1

[ ... here somewhere rd changes as topology init finishes, then all the
params are re-applied, this time with the correct rd. ]

[ 1.009903] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
[ 1.012409] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
[ 1.014269] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
[ 1.019865] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
[ 1.054908] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
[ 1.081865] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
[ 1.108861] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
[ 1.136944] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1

The -EBUSY happens for our 5.15 backport. I see dl_b->total_bw to be 0
without my patch. Even if the -EBUSY doesn't happen for you (perhaps due to
compiler or other differences), shouldn't we use the correct rd for
apply_params? The dl_bw is tied to the rd via cpu_rq(cpu)->rd->dl_bw;
so if rd changes during boot initialization, the correct dl_bw has to be
updated AFAICS. Also if cpusets are used, the rd for a CPU may change.

Let me know if I missed something?

[1] ----------8<-----------
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d9d5a702f1a6..249f99e88b98 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -238,6 +238,12 @@ void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
 static inline bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap,
				 u64 old_bw, u64 new_bw)
 {
+	printk("dl_b->bw: %llu\n", dl_b->bw);
+	printk("cap: %lu, cap_scale(dl_b->bw, cap): %llu\n", cap, cap_scale(dl_b->bw, cap));
+	printk("dl_b->total_bw: %llu\n", dl_b->total_bw);
+	printk("old_bw: %llu\n", old_bw);
+	printk("new_bw: %llu\n", new_bw);
+
 	return dl_b->bw != -1 &&
	       cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
 }

@@ -1704,6 +1710,10 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
 	cpus = dl_bw_cpus(cpu);
 	cap = dl_bw_capacity(cpu);

+	printk("dl_server_apply_params: cpu=%d, runtime=%llu, period=%llu, cpus=%d, cap=%lu, init=%d\n",
+		cpu, runtime, period, cpus, cap, init);
+	printk("initial dl_b->total_bw=%llu, dl_b->bw=%llu\n", dl_b->total_bw, dl_b->bw);
+
 	if (__dl_overflow(dl_b, cap, old_bw, new_bw))
 		return -EBUSY;

-- 
2.47.0.163.g1226f6d8fa-goog
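As a quick sanity check on the numbers in this thread: the bandwidth values are the deadline code's fixed-point scaling of runtime/period. A minimal standalone sketch of that arithmetic, assuming the kernel's BW_SHIFT of 20, the 50ms/1s fair server defaults, and the default 950000/1000000 us sched_rt runtime/period, reproduces the new_bw=52428 and dl_b->bw=996147 values that show up in the debug output further down:

#include <stdint.h>
#include <stdio.h>

/* Mirrors the kernel's to_ratio(): bandwidth is runtime/period in fixed
 * point with BW_SHIFT = 20, so 1.0 corresponds to 1 << 20 = 1048576. */
#define BW_SHIFT 20

static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	if (period == 0)
		return 0;
	return (runtime << BW_SHIFT) / period;
}

int main(void)
{
	/* Fair server defaults: 50ms runtime every 1000ms (values in ns). */
	uint64_t new_bw = to_ratio(1000000000ULL, 50000000ULL);
	/* Global DL admission limit: sched_rt_{runtime,period}_us (in us). */
	uint64_t dl_bw  = to_ratio(1000000ULL, 950000ULL);

	/* Prints new_bw=52428 dl_b->bw=996147, matching the logs. */
	printf("new_bw=%llu dl_b->bw=%llu\n",
	       (unsigned long long)new_bw, (unsigned long long)dl_bw);
	return 0;
}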
On 30/10/24 19:50, Joel Fernandes wrote:

...

> With some prints [1] in the kernel, we can see on boot:
>
> $ dmesg|grep appl
> [ 0.930337] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1
> [ 0.949025] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1
> [ 0.953026] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=2, cap=2048, init=1
> [ 0.957024] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=3, cap=3072, init=1
> [ 0.961023] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=4, cap=4096, init=1
> [ 0.965030] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=5, cap=5120, init=1
> [ 0.969024] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=6, cap=6144, init=1
> [ 0.973024] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=7, cap=7168, init=1
>
> For the 8th apply_params, the 8th CPU is not considered. This is because
> set_cpu_active() for the 8th CPU has not yet happened as mentioned in the
> commit message.
>
> With the patch:
>
> $ dmesg|grep appl
> [ 0.961169] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1
> [ 0.981936] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1
> [ 0.985836] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=2, cap=2048, init=1
> [ 0.989835] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=3, cap=3072, init=1
> [ 0.993840] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=4, cap=4096, init=1
> [ 0.997835] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=5, cap=5120, init=1
> [ 1.001838] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=6, cap=6144, init=1
> [ 1.005834] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=7, cap=7168, init=1
>
> [ ... here somewhere rd changes as topology init finishes, then all the
> params are re-applied, this time with the correct rd. ]
>
> [ 1.009903] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
> [ 1.012409] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
> [ 1.014269] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
> [ 1.019865] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
> [ 1.054908] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
> [ 1.081865] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
> [ 1.108861] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
> [ 1.136944] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
>
> The -EBUSY happens for our 5.15 backport. I see dl_b->total_bw to be 0
> without my patch. Even if the -EBUSY doesn't happen for you (perhaps due to
> compiler or other differences), shouldn't we use the correct rd for
> apply_params? The dl_bw is tied to the rd via cpu_rq(cpu)->rd->dl_bw;

I think I am still seeing something different.

[ 0.184629] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=0, dl_b->bw=996147
[ 0.185498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=1024, cap_scale=996147 dl_bw->total_bw=0, old_bw=0, new_bw=52428
[ 0.371531] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=52428, dl_b->bw=996147
[ 0.372498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=1024, cap_scale=996147 dl_bw->total_bw=52428, old_bw=0, new_bw=52428
[ 0.373541] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=2, cap=2048, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=104856, dl_b->bw=996147
[ 0.374498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=2048, cap_scale=1992294 dl_bw->total_bw=104856, old_bw=0, new_bw=52428
[ 0.375507] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=3, cap=3072, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=157284, dl_b->bw=996147
[ 0.376498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=3072, cap_scale=2988441 dl_bw->total_bw=157284, old_bw=0, new_bw=52428
[ 0.377507] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=4, cap=4096, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=209712, dl_b->bw=996147
[ 0.378498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=4096, cap_scale=3984588 dl_bw->total_bw=209712, old_bw=0, new_bw=52428
[ 0.379505] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=5, cap=5120, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=262140, dl_b->bw=996147
[ 0.380498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=5120, cap_scale=4980735 dl_bw->total_bw=262140, old_bw=0, new_bw=52428
[ 0.381504] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=6, cap=6144, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=314568, dl_b->bw=996147
[ 0.382498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=6144, cap_scale=5976882 dl_bw->total_bw=314568, old_bw=0, new_bw=52428
[ 0.384527] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=7, cap=7168, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=366996, dl_b->bw=996147
[ 0.385498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=7168, cap_scale=6973029 dl_bw->total_bw=366996, old_bw=0, new_bw=52428

...

[ 0.388556] __dl_server_attach_root: cpu=0, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=52428, dl_b->bw=996147
[ 0.389507] __dl_server_attach_root: cpu=1, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=104856, dl_b->bw=996147
[ 0.390501] __dl_server_attach_root: cpu=2, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=157284, dl_b->bw=996147
[ 0.391503] __dl_server_attach_root: cpu=3, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=209712, dl_b->bw=996147
[ 0.392499] __dl_server_attach_root: cpu=4, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=262140, dl_b->bw=996147
[ 0.393499] __dl_server_attach_root: cpu=5, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=314568, dl_b->bw=996147
[ 0.394428] __dl_server_attach_root: cpu=6, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=366996, dl_b->bw=996147
[ 0.394499] __dl_server_attach_root: cpu=7, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=419424, dl_b->bw=996147

I added a printk in __dl_server_attach_root which is called after the
dynamic rd is built to transfer bandwidth to it.

__dl_server_attach_root came with d741f297bceaf ("sched/fair: Fair
server interface"), do you have this change in your backport?

> So if rd changes during boot initialization, the correct dl_bw has to be
> updated AFAICS. Also if cpusets are used, the rd for a CPU may change.

cpusets changes are something that I still need to double check. Will
do.
On Mon, Nov 04, 2024 at 11:54:36AM +0100, Juri Lelli wrote: > On 30/10/24 19:50, Joel Fernandes wrote: > > ... > > > With some prints [1] in the kernel, we can see on boot: > > > > $ dmesg|grep appl > > [ 0.930337] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1 > > [ 0.949025] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1 > > [ 0.953026] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=2, cap=2048, init=1 > > [ 0.957024] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=3, cap=3072, init=1 > > [ 0.961023] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=4, cap=4096, init=1 > > [ 0.965030] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=5, cap=5120, init=1 > > [ 0.969024] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=6, cap=6144, init=1 > > [ 0.973024] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=7, cap=7168, init=1 > > > > For the 8th apply_params, the 8th CPU is not considered. This is because > > set_cpu_active() for the 8th CPU has not yet happened as mentioned in commit > > message. > > > > With the patch: > > > > $ dmesg|grep appl > > [ 0.961169] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1 > > [ 0.981936] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1 > > [ 0.985836] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=2, cap=2048, init=1 > > [ 0.989835] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=3, cap=3072, init=1 > > [ 0.993840] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=4, cap=4096, init=1 > > [ 0.997835] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=5, cap=5120, init=1 > > [ 1.001838] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=6, cap=6144, init=1 > > [ 1.005834] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=7, cap=7168, init=1 > > > > [ ... here somewhere rd changes as topology init finishes, then all the > > params are replied, this time with the correct rd. ] > > > > [ 1.009903] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1 > > [ 1.012409] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1 > > [ 1.014269] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1 > > [ 1.019865] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1 > > [ 1.054908] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1 > > [ 1.081865] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1 > > [ 1.108861] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1 > > [ 1.136944] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1 > > > > The -EBUSY happens for our 5.15 backport. I see dl_b->total_bw to be 0 > > without my patch. Even if the -EBUSY doesn't happen for you (perhaps due to > > compiler or other differences), shouldn't we use the correct rd for > > apply_params? The dl_bw is tied to the rd via cpu_rq(cpu)->rd->dl_bw; > > I think I am still seeing something different. 
> > [ 0.184629] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=0, dl_b->bw=996147 > [ 0.185498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=1024, cap_scale=996147 dl_bw->total_bw=0, old_bw=0, new_bw=52428 > [ 0.371531] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=52428, dl_b->bw=996147 > [ 0.372498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=1024, cap_scale=996147 dl_bw->total_bw=52428, old_bw=0, new_bw=52428 > [ 0.373541] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=2, cap=2048, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=104856, dl_b->bw=996147 > [ 0.374498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=2048, cap_scale=1992294 dl_bw->total_bw=104856, old_bw=0, new_bw=52428 > [ 0.375507] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=3, cap=3072, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=157284, dl_b->bw=996147 > [ 0.376498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=3072, cap_scale=2988441 dl_bw->total_bw=157284, old_bw=0, new_bw=52428 > [ 0.377507] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=4, cap=4096, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=209712, dl_b->bw=996147 > [ 0.378498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=4096, cap_scale=3984588 dl_bw->total_bw=209712, old_bw=0, new_bw=52428 > [ 0.379505] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=5, cap=5120, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=262140, dl_b->bw=996147 > [ 0.380498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=5120, cap_scale=4980735 dl_bw->total_bw=262140, old_bw=0, new_bw=52428 > [ 0.381504] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=6, cap=6144, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=314568, dl_b->bw=996147 > [ 0.382498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=6144, cap_scale=5976882 dl_bw->total_bw=314568, old_bw=0, new_bw=52428 > [ 0.384527] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=7, cap=7168, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=366996, dl_b->bw=996147 > [ 0.385498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=7168, cap_scale=6973029 dl_bw->total_bw=366996, old_bw=0, new_bw=52428 > ... 
> [ 0.388556] __dl_server_attach_root: cpu=0, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=52428, dl_b->bw=996147 > [ 0.389507] __dl_server_attach_root: cpu=1, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=104856, dl_b->bw=996147 > [ 0.390501] __dl_server_attach_root: cpu=2, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=157284, dl_b->bw=996147 > [ 0.391503] __dl_server_attach_root: cpu=3, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=209712, dl_b->bw=996147 > [ 0.392499] __dl_server_attach_root: cpu=4, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=262140, dl_b->bw=996147 > [ 0.393499] __dl_server_attach_root: cpu=5, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=314568, dl_b->bw=996147 > [ 0.394428] __dl_server_attach_root: cpu=6, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=366996, dl_b->bw=996147 > [ 0.394499] __dl_server_attach_root: cpu=7, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=419424, dl_b->bw=996147 > > > I added a printk in __dl_server_attach_root which is called after the > dynamic rd is built to transfer bandwidth to it. > > __dl_server_attach_root came with d741f297bceaf ("sched/fair: Fair > server interface"), do you have this change in your backport? You nailed it! Our 5.15 backport appears to be slightly older and is missing this from topology.c as you mentioned. Thanks for clarifying! /* * Because the rq is not a task, dl_add_task_root_domain() did not * move the fair server bw to the rd if it already started. * Add it now. */ if (rq->fair_server.dl_server) __dl_server_attach_root(&rq->fair_server, rq); > > > So if rd changes during boot initialization, the correct dl_bw has to be > > updated AFAICS. Also if cpusets are used, the rd for a CPU may change. > > cpusets changes are something that I still need to double check. Will > do. > Sounds good, that would be good to verify. Thanks, - Joel
On 04/11/24 17:41, Joel Fernandes wrote:
> On Mon, Nov 04, 2024 at 11:54:36AM +0100, Juri Lelli wrote:

...

> > I added a printk in __dl_server_attach_root which is called after the
> > dynamic rd is built to transfer bandwidth to it.
> >
> > __dl_server_attach_root came with d741f297bceaf ("sched/fair: Fair
> > server interface"), do you have this change in your backport?
>
> You nailed it! Our 5.15 backport appears to be slightly older and is missing
> this from topology.c as you mentioned. Thanks for clarifying!
>
>	/*
>	 * Because the rq is not a task, dl_add_task_root_domain() did not
>	 * move the fair server bw to the rd if it already started.
>	 * Add it now.
>	 */
>	if (rq->fair_server.dl_server)
>		__dl_server_attach_root(&rq->fair_server, rq);
>
> > > So if rd changes during boot initialization, the correct dl_bw has to be
> > > updated AFAICS. Also if cpusets are used, the rd for a CPU may change.
> >
> > cpusets changes are something that I still need to double check. Will
> > do.
>
> Sounds good, that would be good to verify.

So, I played a little bit with it and came up with a simple set of ops
that point out an issue (default fedora server install):

echo Y >/sys/kernel/debug/sched/verbose

echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control

echo 0-7 > /sys/fs/cgroup/user.slice/cpuset.cpus
echo 6-7 > /sys/fs/cgroup/user.slice/cpuset.cpus.exclusive
echo root >/sys/fs/cgroup/user.slice/cpuset.cpus.partition

The domains are rebuilt correctly, but we end up with a null total_bw.

The conditional call above correctly takes care of adding back the dl_server
per-rq bandwidth when we pass from the single domain to the 2 exclusive
ones, but I noticed that we go through partition_sched_domains_locked()
twice for a single write of 'root', and the second pass, since it's not
actually destroying/rebuilding anything, resets total_bw without adding
the dl_server contribution back.

Now, I'm not completely sure why we need to go through
partition_sched_domains_locked() twice, as we have (it also looked like a
pattern from other call paths)

update_prstate()
  -> update_cpumasks_hier()
       -> rebuild_sched_domains_locked()  <- right at the end
  -> update_partition_sd_lb()
       -> rebuild_sched_domains_locked()  <- right after the above call

Removing the first call does indeed fix the issue and domains look OK,
but I'm pretty sure I'm missing all sorts of details and corner cases.

Waiman (now Cc-ed), maybe you can help here understanding why the two
back to back calls are needed?

Thanks!
Juri
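For reference, a simplified sketch (paraphrased, not verbatim mainline code) of the reset described above: on the path where partition_sched_domains_locked() decides an existing domain can be kept, the root domain survives but its DEADLINE bandwidth is cleared, and only task contributions are added back later through the update_tasks_root_domain() machinery, not the per-rq fair server bandwidth:

/*
 * Simplified sketch, not verbatim kernel code: the "match" path keeps
 * the existing root domain but drops its accumulated bandwidth.  Task
 * bandwidth gets re-added elsewhere; the fair server's per-rq bandwidth
 * does not, which is how total_bw can end up 0 when the domains are
 * "rebuilt" into an identical layout.
 */
void dl_clear_root_domain(struct root_domain *rd)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
	rd->dl_bw.total_bw = 0;
	raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
}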
On 11/6/24 11:08 AM, Juri Lelli wrote: > On 04/11/24 17:41, Joel Fernandes wrote: >> On Mon, Nov 04, 2024 at 11:54:36AM +0100, Juri Lelli wrote: > ... > >>> I added a printk in __dl_server_attach_root which is called after the >>> dynamic rd is built to transfer bandwidth to it. >>> >>> __dl_server_attach_root came with d741f297bceaf ("sched/fair: Fair >>> server interface"), do you have this change in your backport? >> You nailed it! Our 5.15 backport appears to be slightly older and is missing >> this from topology.c as you mentioned. Thanks for clarifying! >> >> >> /* >> * Because the rq is not a task, dl_add_task_root_domain() did not >> * move the fair server bw to the rd if it already started. >> * Add it now. >> */ >> if (rq->fair_server.dl_server) >> __dl_server_attach_root(&rq->fair_server, rq); >> >>>> So if rd changes during boot initialization, the correct dl_bw has to be >>>> updated AFAICS. Also if cpusets are used, the rd for a CPU may change. >>> cpusets changes are something that I still need to double check. Will >>> do. >>> >> Sounds good, that would be good to verify. > So, I played a little bit with it and came up with a simple set of ops > that point out an issue (default fedora server install): > > echo Y >/sys/kernel/debug/sched/verbose > > echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control > > echo 0-7 > /sys/fs/cgroup/user.slice/cpuset.cpus > echo 6-7 > /sys/fs/cgroup/user.slice/cpuset.cpus.exclusive > echo root >/sys/fs/cgroup/user.slice/cpuset.cpus.partition > > The domains are rebuilt correctly, but we end up with a null total_bw. > > The conditional call above takes care correctly of adding back dl_server > per-rq bandwidth when we pass from the single domain to the 2 exclusive > ones, but I noticed that we go through partition_sched_domains_locked() > twice for a single write of 'root' and the second one, since it's not > actually destroying/rebuilding anything, is resetting total_bw w/o > addition dl_server contribution back. > > Now, not completely sure why we need to go through partition_sched_ > domains_locked() twice, as we have (it also looked like a pattern from > other call paths) > > update_prstate() > -> update_cpumasks_hier() > -> rebuild_sched_domains_locked() <- right at the end > -> update_partition_sd_lb() > -> rebuild_sched_domains_locked() <- right after the above call > > Removing the first call does indeed fix the issue and domains look OK, > but I'm pretty sure I'm missing all sort of details and corner cases. > > Waiman (now Cc-ed), maybe you can help here understanding why the two > back to back calls are needed? Thanks for letting me know about this case. I am aware that rebuild_sched_domains_locked() can be called more than once. I have addressed the hotplug case, but it can happen in some other corner cases as well. The problem with multiple rebuild_sched_domains_locked() calls is the fact that intermediate ones may be called where the internal states may not be consistent. I am going to work on a fix to this issue by making sure that rebuild_sched_domains_locked() will only be called once. Cheers, Longman
On 11/6/24 1:05 PM, Waiman Long wrote: > On 11/6/24 11:08 AM, Juri Lelli wrote: >> On 04/11/24 17:41, Joel Fernandes wrote: >>> On Mon, Nov 04, 2024 at 11:54:36AM +0100, Juri Lelli wrote: >> ... >> >>>> I added a printk in __dl_server_attach_root which is called after the >>>> dynamic rd is built to transfer bandwidth to it. >>>> >>>> __dl_server_attach_root came with d741f297bceaf ("sched/fair: Fair >>>> server interface"), do you have this change in your backport? >>> You nailed it! Our 5.15 backport appears to be slightly older and is >>> missing >>> this from topology.c as you mentioned. Thanks for clarifying! >>> >>> >>> /* >>> * Because the rq is not a task, dl_add_task_root_domain() >>> did not >>> * move the fair server bw to the rd if it already started. >>> * Add it now. >>> */ >>> if (rq->fair_server.dl_server) >>> __dl_server_attach_root(&rq->fair_server, rq); >>> >>>>> So if rd changes during boot initialization, the correct dl_bw has >>>>> to be >>>>> updated AFAICS. Also if cpusets are used, the rd for a CPU may >>>>> change. >>>> cpusets changes are something that I still need to double check. Will >>>> do. >>>> >>> Sounds good, that would be good to verify. >> So, I played a little bit with it and came up with a simple set of ops >> that point out an issue (default fedora server install): >> >> echo Y >/sys/kernel/debug/sched/verbose >> >> echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control >> >> echo 0-7 > /sys/fs/cgroup/user.slice/cpuset.cpus >> echo 6-7 > /sys/fs/cgroup/user.slice/cpuset.cpus.exclusive >> echo root >/sys/fs/cgroup/user.slice/cpuset.cpus.partition >> >> The domains are rebuilt correctly, but we end up with a null total_bw. >> >> The conditional call above takes care correctly of adding back dl_server >> per-rq bandwidth when we pass from the single domain to the 2 exclusive >> ones, but I noticed that we go through partition_sched_domains_locked() >> twice for a single write of 'root' and the second one, since it's not >> actually destroying/rebuilding anything, is resetting total_bw w/o >> addition dl_server contribution back. >> >> Now, not completely sure why we need to go through partition_sched_ >> domains_locked() twice, as we have (it also looked like a pattern from >> other call paths) >> >> update_prstate() >> -> update_cpumasks_hier() >> -> rebuild_sched_domains_locked() <- right at the end >> -> update_partition_sd_lb() >> -> rebuild_sched_domains_locked() <- right after the above call >> >> Removing the first call does indeed fix the issue and domains look OK, >> but I'm pretty sure I'm missing all sort of details and corner cases. >> >> Waiman (now Cc-ed), maybe you can help here understanding why the two >> back to back calls are needed? > > Thanks for letting me know about this case. > > I am aware that rebuild_sched_domains_locked() can be called more than > once. I have addressed the hotplug case, but it can happen in some > other corner cases as well. The problem with multiple > rebuild_sched_domains_locked() calls is the fact that intermediate > ones may be called where the internal states may not be consistent. I > am going to work on a fix to this issue by making sure that > rebuild_sched_domains_locked() will only be called once. I am working on a set of cpuset patches to eliminate redundant rebuild_sched_domains_locked() calls. However, my cpuset test script fails after the change due to the presence of test cases where the only CPU in a 1-cpu partition is being offlined. 
So I sent out a sched/deadline patch [1] to work around this particular corner case. [1] https://lore.kernel.org/lkml/20241108042924.520458-1-longman@redhat.com/T/#u Apparently, the null total_bw bug caused by multiple rebuild_sched_domains_locked() calls masks this problem. Anyway, I should be able to post the cpuset patch series next week after further testing. Please review my sched/deadline patch to see if you are OK with this minor change. Cheers, Longman
On 11/7/24 11:40 PM, Waiman Long wrote: > On 11/6/24 1:05 PM, Waiman Long wrote: >> On 11/6/24 11:08 AM, Juri Lelli wrote: >>> On 04/11/24 17:41, Joel Fernandes wrote: >>>> On Mon, Nov 04, 2024 at 11:54:36AM +0100, Juri Lelli wrote: >>> ... >>> >>>>> I added a printk in __dl_server_attach_root which is called after the >>>>> dynamic rd is built to transfer bandwidth to it. >>>>> >>>>> __dl_server_attach_root came with d741f297bceaf ("sched/fair: Fair >>>>> server interface"), do you have this change in your backport? >>>> You nailed it! Our 5.15 backport appears to be slightly older and >>>> is missing >>>> this from topology.c as you mentioned. Thanks for clarifying! >>>> >>>> >>>> /* >>>> * Because the rq is not a task, dl_add_task_root_domain() >>>> did not >>>> * move the fair server bw to the rd if it already started. >>>> * Add it now. >>>> */ >>>> if (rq->fair_server.dl_server) >>>> __dl_server_attach_root(&rq->fair_server, rq); >>>> >>>>>> So if rd changes during boot initialization, the correct dl_bw >>>>>> has to be >>>>>> updated AFAICS. Also if cpusets are used, the rd for a CPU may >>>>>> change. >>>>> cpusets changes are something that I still need to double check. Will >>>>> do. >>>>> >>>> Sounds good, that would be good to verify. >>> So, I played a little bit with it and came up with a simple set of ops >>> that point out an issue (default fedora server install): >>> >>> echo Y >/sys/kernel/debug/sched/verbose >>> >>> echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control >>> >>> echo 0-7 > /sys/fs/cgroup/user.slice/cpuset.cpus >>> echo 6-7 > /sys/fs/cgroup/user.slice/cpuset.cpus.exclusive >>> echo root >/sys/fs/cgroup/user.slice/cpuset.cpus.partition >>> >>> The domains are rebuilt correctly, but we end up with a null total_bw. >>> >>> The conditional call above takes care correctly of adding back >>> dl_server >>> per-rq bandwidth when we pass from the single domain to the 2 exclusive >>> ones, but I noticed that we go through partition_sched_domains_locked() >>> twice for a single write of 'root' and the second one, since it's not >>> actually destroying/rebuilding anything, is resetting total_bw w/o >>> addition dl_server contribution back. >>> >>> Now, not completely sure why we need to go through partition_sched_ >>> domains_locked() twice, as we have (it also looked like a pattern from >>> other call paths) >>> >>> update_prstate() >>> -> update_cpumasks_hier() >>> -> rebuild_sched_domains_locked() <- right at the end >>> -> update_partition_sd_lb() >>> -> rebuild_sched_domains_locked() <- right after the above call >>> >>> Removing the first call does indeed fix the issue and domains look OK, >>> but I'm pretty sure I'm missing all sort of details and corner cases. >>> >>> Waiman (now Cc-ed), maybe you can help here understanding why the two >>> back to back calls are needed? >> >> Thanks for letting me know about this case. >> >> I am aware that rebuild_sched_domains_locked() can be called more >> than once. I have addressed the hotplug case, but it can happen in >> some other corner cases as well. The problem with multiple >> rebuild_sched_domains_locked() calls is the fact that intermediate >> ones may be called where the internal states may not be consistent. I >> am going to work on a fix to this issue by making sure that >> rebuild_sched_domains_locked() will only be called once. > > I am working on a set of cpuset patches to eliminate redundant > rebuild_sched_domains_locked() calls. 
However, my cpuset test script > fails after the change due to the presence of test cases where the > only CPU in a 1-cpu partition is being offlined. So I sent out a > sched/deadline patch [1] to work around this particular corner case. > > [1] > https://lore.kernel.org/lkml/20241108042924.520458-1-longman@redhat.com/T/#u > > Apparently, the null total_bw bug caused by multiple > rebuild_sched_domains_locked() calls masks this problem. > > Anyway, I should be able to post the cpuset patch series next week > after further testing. Please review my sched/deadline patch to see if > you are OK with this minor change. I have the patchset to enforce that rebuild_sched_domains_locked() will only be called at most once per cpuset operation. By adding some debug code to further study the null total_bw issue when cpuset.cpus.partition is being changed, I found that eliminating the redundant rebuild_sched_domains_locked() reduced the chance of hitting null total_bw, it did not eliminate it. By running my cpuset test script, I hit 250 cases of null total_bw with the v6.12-rc6 kernel. With my new cpuset patch applied, it reduces it to 120 cases of null total_bw. I will try to look further for the exact condition that triggers null total_bw generation. Cheers, Longman
On 11/8/24 10:30 PM, Waiman Long wrote: > I have the patchset to enforce that rebuild_sched_domains_locked() > will only be called at most once per cpuset operation. > > By adding some debug code to further study the null total_bw issue > when cpuset.cpus.partition is being changed, I found that eliminating > the redundant rebuild_sched_domains_locked() reduced the chance of > hitting null total_bw, it did not eliminate it. By running my cpuset > test script, I hit 250 cases of null total_bw with the v6.12-rc6 > kernel. With my new cpuset patch applied, it reduces it to 120 cases > of null total_bw. > > I will try to look further for the exact condition that triggers null > total_bw generation. After further testing, the 120 cases of null total_bw can be classified into the following 3 categories. 1) 51 cases when an isolated partition with isolated CPUs is created. Isolated CPU is not subjected to scheduling and so a total_bw of 0 is fine and not really a problem. 2) 67 cases when a nested partitions are being removed (A1 - A2). There is probably caused by some kind of race condtion. If I insert an artifical delay between the removal of A2 and A1, total_bw is fine. If there is no delay, I can see a null total_bw. That shouldn't really a problem in practice, though we may still need to figure out why. 2) Two cases where null total_bw is seen when a new partition is created by moving all the CPUs in the parent cgroup into its partition and the parent becomes a null partition with no CPU. The following example illustrates the steps. #!/bin/bash CGRP=/sys/fs/cgroup cd $CGRP echo +cpuset > cgroup.subtree_control mkdir A1 cd A1 echo 0-1 > cpuset.cpus echo root > cpuset.cpus.partition echo "A1 partition" echo +cpuset > cgroup.subtree_control mkdir A2 cd A2 echo 0-1 > cpuset.cpus echo root > cpuset.cpus.partition echo "A2 partition" cd .. echo "Remove A2" rmdir A2 cd .. echo "Remove A1" rmdir A1 In this corner case, there is actually no change in the set of sched domains. In this case, the sched domain set of CPUs 0-1 is being moved from partition A1 to A2 and vice versa in the removal of A2. This is similar to calling rebuild_sched_domains_locked() twice with the same input. I believe that is the condition that causes null total_bw. Now the question is why the deadline code behaves this way. It is probably a bug that needs to be addressed. Cheers, Longman
On 09/11/24 13:18, Waiman Long wrote: > On 11/8/24 10:30 PM, Waiman Long wrote: > > I have the patchset to enforce that rebuild_sched_domains_locked() will > > only be called at most once per cpuset operation. > > > > By adding some debug code to further study the null total_bw issue when > > cpuset.cpus.partition is being changed, I found that eliminating the > > redundant rebuild_sched_domains_locked() reduced the chance of hitting > > null total_bw, it did not eliminate it. By running my cpuset test > > script, I hit 250 cases of null total_bw with the v6.12-rc6 kernel. With > > my new cpuset patch applied, it reduces it to 120 cases of null > > total_bw. > > > > I will try to look further for the exact condition that triggers null > > total_bw generation. > > After further testing, the 120 cases of null total_bw can be classified into > the following 3 categories. > > 1) 51 cases when an isolated partition with isolated CPUs is created. > Isolated CPU is not subjected to scheduling and so a total_bw of 0 is fine > and not really a problem. > > 2) 67 cases when a nested partitions are being removed (A1 - A2). There is > probably caused by some kind of race condtion. If I insert an artifical > delay between the removal of A2 and A1, total_bw is fine. If there is no > delay, I can see a null total_bw. That shouldn't really a problem in > practice, though we may still need to figure out why. > > 2) Two cases where null total_bw is seen when a new partition is created by > moving all the CPUs in the parent cgroup into its partition and the parent > becomes a null partition with no CPU. The following example illustrates the > steps. > > #!/bin/bash > CGRP=/sys/fs/cgroup > cd $CGRP > echo +cpuset > cgroup.subtree_control > mkdir A1 > cd A1 > echo 0-1 > cpuset.cpus > echo root > cpuset.cpus.partition > echo "A1 partition" > echo +cpuset > cgroup.subtree_control > mkdir A2 > cd A2 > echo 0-1 > cpuset.cpus > echo root > cpuset.cpus.partition > echo "A2 partition" > cd .. > echo "Remove A2" > rmdir A2 > cd .. > echo "Remove A1" > rmdir A1 > > In this corner case, there is actually no change in the set of sched > domains. In this case, the sched domain set of CPUs 0-1 is being moved from > partition A1 to A2 and vice versa in the removal of A2. This is similar to > calling rebuild_sched_domains_locked() twice with the same input. I believe > that is the condition that causes null total_bw. > > Now the question is why the deadline code behaves this way. It is probably a > bug that needs to be addressed. Thanks for the analysis (and the patches). Will take a look, but I suspect it might be because in case domains are not destroyed, we clear them up (total_bw set to 0), but we don't add fair server contribution back because rqs are not attached to domains (as there have been alredy attached when such domains were created). Best, Juri
On 11/11/24 09:37, Juri Lelli wrote: > On 09/11/24 13:18, Waiman Long wrote: > > On 11/8/24 10:30 PM, Waiman Long wrote: > > > I have the patchset to enforce that rebuild_sched_domains_locked() will > > > only be called at most once per cpuset operation. > > > > > > By adding some debug code to further study the null total_bw issue when > > > cpuset.cpus.partition is being changed, I found that eliminating the > > > redundant rebuild_sched_domains_locked() reduced the chance of hitting > > > null total_bw, it did not eliminate it. By running my cpuset test > > > script, I hit 250 cases of null total_bw with the v6.12-rc6 kernel. With > > > my new cpuset patch applied, it reduces it to 120 cases of null > > > total_bw. > > > > > > I will try to look further for the exact condition that triggers null > > > total_bw generation. > > > > After further testing, the 120 cases of null total_bw can be classified into > > the following 3 categories. > > > > 1) 51 cases when an isolated partition with isolated CPUs is created. > > Isolated CPU is not subjected to scheduling and so a total_bw of 0 is fine > > and not really a problem. > > > > 2) 67 cases when a nested partitions are being removed (A1 - A2). There is > > probably caused by some kind of race condtion. If I insert an artifical > > delay between the removal of A2 and A1, total_bw is fine. If there is no > > delay, I can see a null total_bw. That shouldn't really a problem in > > practice, though we may still need to figure out why. > > > > 2) Two cases where null total_bw is seen when a new partition is created by > > moving all the CPUs in the parent cgroup into its partition and the parent > > becomes a null partition with no CPU. The following example illustrates the > > steps. > > > > #!/bin/bash > > CGRP=/sys/fs/cgroup > > cd $CGRP > > echo +cpuset > cgroup.subtree_control > > mkdir A1 > > cd A1 > > echo 0-1 > cpuset.cpus > > echo root > cpuset.cpus.partition > > echo "A1 partition" > > echo +cpuset > cgroup.subtree_control > > mkdir A2 > > cd A2 > > echo 0-1 > cpuset.cpus > > echo root > cpuset.cpus.partition > > echo "A2 partition" > > cd .. > > echo "Remove A2" > > rmdir A2 > > cd .. > > echo "Remove A1" > > rmdir A1 > > > > In this corner case, there is actually no change in the set of sched > > domains. In this case, the sched domain set of CPUs 0-1 is being moved from > > partition A1 to A2 and vice versa in the removal of A2. This is similar to > > calling rebuild_sched_domains_locked() twice with the same input. I believe > > that is the condition that causes null total_bw. > > > > Now the question is why the deadline code behaves this way. It is probably a > > bug that needs to be addressed. > > Thanks for the analysis (and the patches). Will take a look, but I > suspect it might be because in case domains are not destroyed, we clear > them up (total_bw set to 0), but we don't add fair server contribution > back because rqs are not attached to domains (as there have been alredy > attached when such domains were created). OK, I'm thinking maybe something like the below might help. It seems to do well with your test above, does it maybe fix the other cases you mentioned as well? 
---
 include/linux/sched/deadline.h |  1 +
 kernel/sched/deadline.c        | 12 ++++++++++++
 kernel/sched/topology.c        |  9 ++++++---
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index 3a912ab42bb5..a74fb0da0fa3 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -34,6 +34,7 @@ static inline bool dl_time_before(u64 a, u64 b)
 struct root_domain;
 extern void dl_add_task_root_domain(struct task_struct *p);
 extern void dl_clear_root_domain(struct root_domain *rd);
+extern void dl_restore_server_root_domain(struct root_domain *rd);
 
 #endif /* CONFIG_SMP */

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7884e566557c..acbd3039215f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2981,6 +2981,18 @@ void dl_add_task_root_domain(struct task_struct *p)
 	task_rq_unlock(rq, p, &rf);
 }
 
+void dl_restore_server_root_domain(struct root_domain *rd) {
+	int i;
+
+	guard(raw_spinlock_irqsave)(&rd->dl_bw.lock);
+	for_each_cpu(i, rd->span) {
+		struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server;
+
+		if (dl_server(dl_se))
+			rd->dl_bw.total_bw += dl_se->dl_bw;
+	}
+}
+
 void dl_clear_root_domain(struct root_domain *rd)
 {
 	unsigned long flags;

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9748a4c8d668..a0fcae07585e 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2721,12 +2721,15 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
 			/*
 			 * This domain won't be destroyed and as such
-			 * its dl_bw->total_bw needs to be cleared. It
-			 * will be recomputed in function
-			 * update_tasks_root_domain().
+			 * its dl_bw->total_bw needs to be cleared.
+			 * Tasks contribution will be then recomputed
+			 * in function dl_update_tasks_root_domain(),
+			 * dl_servers contribution in function
+			 * dl_restore_server_root_domain().
 			 */
 			rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
 			dl_clear_root_domain(rd);
+			dl_restore_server_root_domain(rd);
 			goto match1;
 		}
 	}
On 07/11/24 23:40, Waiman Long wrote: > On 11/6/24 1:05 PM, Waiman Long wrote: > > On 11/6/24 11:08 AM, Juri Lelli wrote: > > > On 04/11/24 17:41, Joel Fernandes wrote: > > > > On Mon, Nov 04, 2024 at 11:54:36AM +0100, Juri Lelli wrote: > > > ... > > > > > > > > I added a printk in __dl_server_attach_root which is called after the > > > > > dynamic rd is built to transfer bandwidth to it. > > > > > > > > > > __dl_server_attach_root came with d741f297bceaf ("sched/fair: Fair > > > > > server interface"), do you have this change in your backport? > > > > You nailed it! Our 5.15 backport appears to be slightly older > > > > and is missing > > > > this from topology.c as you mentioned. Thanks for clarifying! > > > > > > > > > > > > /* > > > > * Because the rq is not a task, > > > > dl_add_task_root_domain() did not > > > > * move the fair server bw to the rd if it already started. > > > > * Add it now. > > > > */ > > > > if (rq->fair_server.dl_server) > > > > __dl_server_attach_root(&rq->fair_server, rq); > > > > > > > > > > So if rd changes during boot initialization, the correct > > > > > > dl_bw has to be > > > > > > updated AFAICS. Also if cpusets are used, the rd for a > > > > > > CPU may change. > > > > > cpusets changes are something that I still need to double check. Will > > > > > do. > > > > > > > > > Sounds good, that would be good to verify. > > > So, I played a little bit with it and came up with a simple set of ops > > > that point out an issue (default fedora server install): > > > > > > echo Y >/sys/kernel/debug/sched/verbose > > > > > > echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control > > > > > > echo 0-7 > /sys/fs/cgroup/user.slice/cpuset.cpus > > > echo 6-7 > /sys/fs/cgroup/user.slice/cpuset.cpus.exclusive > > > echo root >/sys/fs/cgroup/user.slice/cpuset.cpus.partition > > > > > > The domains are rebuilt correctly, but we end up with a null total_bw. > > > > > > The conditional call above takes care correctly of adding back dl_server > > > per-rq bandwidth when we pass from the single domain to the 2 exclusive > > > ones, but I noticed that we go through partition_sched_domains_locked() > > > twice for a single write of 'root' and the second one, since it's not > > > actually destroying/rebuilding anything, is resetting total_bw w/o > > > addition dl_server contribution back. > > > > > > Now, not completely sure why we need to go through partition_sched_ > > > domains_locked() twice, as we have (it also looked like a pattern from > > > other call paths) > > > > > > update_prstate() > > > -> update_cpumasks_hier() > > > -> rebuild_sched_domains_locked() <- right at the end > > > -> update_partition_sd_lb() > > > -> rebuild_sched_domains_locked() <- right after the above call > > > > > > Removing the first call does indeed fix the issue and domains look OK, > > > but I'm pretty sure I'm missing all sort of details and corner cases. > > > > > > Waiman (now Cc-ed), maybe you can help here understanding why the two > > > back to back calls are needed? > > > > Thanks for letting me know about this case. > > > > I am aware that rebuild_sched_domains_locked() can be called more than > > once. I have addressed the hotplug case, but it can happen in some other > > corner cases as well. The problem with multiple > > rebuild_sched_domains_locked() calls is the fact that intermediate ones > > may be called where the internal states may not be consistent. I am > > going to work on a fix to this issue by making sure that > > rebuild_sched_domains_locked() will only be called once. 
> > I am working on a set of cpuset patches to eliminate redundant > rebuild_sched_domains_locked() calls. However, my cpuset test script fails > after the change due to the presence of test cases where the only CPU in a > 1-cpu partition is being offlined. So I sent out a sched/deadline patch [1] > to work around this particular corner case. > > [1] > https://lore.kernel.org/lkml/20241108042924.520458-1-longman@redhat.com/T/#u > > Apparently, the null total_bw bug caused by multiple > rebuild_sched_domains_locked() calls masks this problem. > > Anyway, I should be able to post the cpuset patch series next week after > further testing. Please review my sched/deadline patch to see if you are OK > with this minor change. Thank you! Will take a look. Best, Juri