During boot initialization, DL server parameters are initialized using the
default root domain before the proper scheduler domains and root domains
are built. This results in DL server parameters being tied to the default
root domain's bandwidth accounting instead of the actual root domain
assigned to the CPU after scheduler topology initialization.
When secondary CPUs are brought up, the dl_bw_cpus() accounting doesn't
properly track CPUs being added since the DL server was started too early
with the default root domain. Specifically, dl_bw_cpus() is called before
set_cpu_active() during secondary CPU bringup, causing it to not account
for the CPU being brought up in its capacity calculations. This causes
subsequent sysfs parameter updates to fail with -EBUSY due to bandwidth
accounting using the wrong root domain with zeroed total_bw.
This issue also causes under-utilization of system capacity. With the fix,
we see proper capacity initialization and scaling as CPUs come online - the
total system capacity increases from CPU 0 to CPU 1 and continues scaling
up as more CPUs are added (from cap=1024 initially to cap=8192 with 8
CPUs). Without the fix, the capacity initialization was incomplete since
dl_bw_cpus() runs before the CPU is marked active in set_cpu_active(),
leading to CPUs not being properly accounted for in the capacity
calculations.
Fix this by tracking the last root domain used for the DL server and
resetting the server parameters when the root domain changes. This ensures
bandwidth accounting uses the correct, fully initialized root domain after
the scheduler topology is built.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Aashish Sharma <shraash@google.com>
Cc: Shin Kawamura <kawasin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vineeth Remanan Pillai <vineeth@bitbyteword.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
include/linux/sched.h | 1 +
kernel/sched/deadline.c | 13 +++++++++----
2 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1d5cc3e50884..427d1da79d05 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -695,6 +695,7 @@ struct sched_dl_entity {
struct rq *rq;
dl_server_has_tasks_f server_has_tasks;
dl_server_pick_f server_pick_task;
+ struct root_domain *last_rd;
#ifdef CONFIG_RT_MUTEXES
/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d9d5a702f1a6..076b372f28b5 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1627,21 +1627,26 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
void dl_server_start(struct sched_dl_entity *dl_se)
{
struct rq *rq = dl_se->rq;
+ int cpu = cpu_of(rq);
+ struct root_domain *rd = cpu_rq(cpu)->rd;
/*
* XXX: the apply do not work fine at the init phase for the
* fair server because things are not yet set. We need to improve
* this before getting generic.
*/
- if (!dl_server(dl_se)) {
+ if (!dl_server(dl_se) || dl_se->last_rd != rd) {
u64 runtime = 50 * NSEC_PER_MSEC;
u64 period = 1000 * NSEC_PER_MSEC;
+ dl_se->last_rd = rd;
dl_server_apply_params(dl_se, runtime, period, 1);
- dl_se->dl_server = 1;
- dl_se->dl_defer = 1;
- setup_new_dl_entity(dl_se);
+ if (!dl_server(dl_se)) {
+ dl_se->dl_server = 1;
+ dl_se->dl_defer = 1;
+ setup_new_dl_entity(dl_se);
+ }
}
if (!dl_se->dl_runtime)
--
2.47.0.163.g1226f6d8fa-goog
Hi Joel,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on linus/master v6.12-rc5 next-20241030]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Joel-Fernandes-Google/dl_server-Reset-DL-server-params-when-rd-changes/20241030-065241
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20241029225116.3998487-1-joel%40joelfernandes.org
patch subject: [PATCH] dl_server: Reset DL server params when rd changes
config: arm-allnoconfig (https://download.01.org/0day-ci/archive/20241031/202410310046.qSHsY4dH-lkp@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 639a7ac648f1e50ccd2556e17d401c04f9cce625)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241031/202410310046.qSHsY4dH-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410310046.qSHsY4dH-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from kernel/sched/build_policy.c:19:
   In file included from include/linux/sched/isolation.h:5:
   In file included from include/linux/cpuset.h:17:
   In file included from include/linux/mm.h:2213:
   include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     518 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                                ~~~~~~~~~~~ ^ ~~~
   In file included from kernel/sched/build_policy.c:60:
>> kernel/sched/deadline.c:1631:40: error: no member named 'rd' in 'struct rq'
    1631 |         struct root_domain *rd = cpu_rq(cpu)->rd;
         |                                  ~~~~~~~~~~~ ^
   1 warning and 1 error generated.


vim +1631 kernel/sched/deadline.c

  1626	
  1627	void dl_server_start(struct sched_dl_entity *dl_se)
  1628	{
  1629		struct rq *rq = dl_se->rq;
  1630		int cpu = cpu_of(rq);
> 1631		struct root_domain *rd = cpu_rq(cpu)->rd;
  1632	
  1633		/*
  1634		 * XXX: the apply do not work fine at the init phase for the
  1635		 * fair server because things are not yet set. We need to improve
  1636		 * this before getting generic.
  1637		 */
  1638		if (!dl_server(dl_se) || dl_se->last_rd != rd) {
  1639			u64 runtime = 50 * NSEC_PER_MSEC;
  1640			u64 period = 1000 * NSEC_PER_MSEC;
  1641	
  1642			dl_se->last_rd = rd;
  1643			dl_server_apply_params(dl_se, runtime, period, 1);
  1644	
  1645			if (!dl_server(dl_se)) {
  1646				dl_se->dl_server = 1;
  1647				dl_se->dl_defer = 1;
  1648				setup_new_dl_entity(dl_se);
  1649			}
  1650		}
  1651	
  1652		if (!dl_se->dl_runtime)
  1653			return;
  1654	
  1655		enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
  1656		if (!dl_task(dl_se->rq->curr) || dl_entity_preempt(dl_se, &rq->curr->dl))
  1657			resched_curr(dl_se->rq);
  1658	}
  1659	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Joel,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on linus/master v6.12-rc5 next-20241030]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Joel-Fernandes-Google/dl_server-Reset-DL-server-params-when-rd-changes/20241030-065241
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20241029225116.3998487-1-joel%40joelfernandes.org
patch subject: [PATCH] dl_server: Reset DL server params when rd changes
config: openrisc-allnoconfig (https://download.01.org/0day-ci/archive/20241030/202410302318.EbfnSdKM-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241030/202410302318.EbfnSdKM-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410302318.EbfnSdKM-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from kernel/sched/build_policy.c:60:
   kernel/sched/deadline.c: In function 'dl_server_start':
>> kernel/sched/deadline.c:1631:47: error: 'struct rq' has no member named 'rd'; did you mean 'rt'?
    1631 |         struct root_domain *rd = cpu_rq(cpu)->rd;
         |                                               ^~
         |                                               rt


vim +1631 kernel/sched/deadline.c

  1626	
  1627	void dl_server_start(struct sched_dl_entity *dl_se)
  1628	{
  1629		struct rq *rq = dl_se->rq;
  1630		int cpu = cpu_of(rq);
> 1631		struct root_domain *rd = cpu_rq(cpu)->rd;
  1632	
  1633		/*
  1634		 * XXX: the apply do not work fine at the init phase for the
  1635		 * fair server because things are not yet set. We need to improve
  1636		 * this before getting generic.
  1637		 */
  1638		if (!dl_server(dl_se) || dl_se->last_rd != rd) {
  1639			u64 runtime = 50 * NSEC_PER_MSEC;
  1640			u64 period = 1000 * NSEC_PER_MSEC;
  1641	
  1642			dl_se->last_rd = rd;
  1643			dl_server_apply_params(dl_se, runtime, period, 1);
  1644	
  1645			if (!dl_server(dl_se)) {
  1646				dl_se->dl_server = 1;
  1647				dl_se->dl_defer = 1;
  1648				setup_new_dl_entity(dl_se);
  1649			}
  1650		}
  1651	
  1652		if (!dl_se->dl_runtime)
  1653			return;
  1654	
  1655		enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
  1656		if (!dl_task(dl_se->rq->curr) || dl_entity_preempt(dl_se, &rq->curr->dl))
  1657			resched_curr(dl_se->rq);
  1658	}
  1659	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
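Both robot reports point at the same problem: struct rq only has an rd member (and root domains exist at all) when CONFIG_SMP is set, so the new cpu_rq(cpu)->rd dereference cannot build on UP configurations such as allnoconfig. Purely as a hypothetical sketch of the constraint, not the fix eventually adopted in this thread, the root-domain tracking would have to compile away on !SMP kernels, along these lines:

/* Hypothetical sketch only: keep the root-domain tracking SMP-only. */
void dl_server_start(struct sched_dl_entity *dl_se)
{
	struct rq *rq = dl_se->rq;
	bool rd_changed = false;

#ifdef CONFIG_SMP
	/* Only SMP kernels have per-rq root domains to compare against. */
	rd_changed = dl_se->last_rd != rq->rd;
	dl_se->last_rd = rq->rd;
#endif

	if (!dl_server(dl_se) || rd_changed) {
		/* ... apply default parameters as in the patch above ... */
	}

	/* ... rest of dl_server_start() unchanged ... */
}

On UP there is only one implicit "domain" anyway, so there is nothing to re-track when topology is built; the sketch simply never sees rd_changed == true there.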
Hi Joel,

On 29/10/24 22:51, Joel Fernandes (Google) wrote:
> During boot initialization, DL server parameters are initialized using the
> default root domain before the proper scheduler domains and root domains
> are built. This results in DL server parameters being tied to the default
> root domain's bandwidth accounting instead of the actual root domain
> assigned to the CPU after scheduler topology initialization.
>
> When secondary CPUs are brought up, the dl_bw_cpus() accounting doesn't
> properly track CPUs being added since the DL server was started too early
> with the default root domain. Specifically, dl_bw_cpus() is called before
> set_cpu_active() during secondary CPU bringup, causing it to not account
> for the CPU being brought up in its capacity calculations. This causes
> subsequent sysfs parameter updates to fail with -EBUSY due to bandwidth
> accounting using the wrong root domain with zeroed total_bw.
>
> This issue also causes under-utilization of system capacity. With the fix,
> we see proper capacity initialization and scaling as CPUs come online - the
> total system capacity increases from CPU 0 to CPU 1 and continues scaling
> up as more CPUs are added (from cap=1024 initially to cap=8192 with 8
> CPUs). Without the fix, the capacity initialization was incomplete since
> dl_bw_cpus() runs before the CPU is marked active in set_cpu_active(),
> leading to CPUs not being properly accounted for in the capacity
> calculations.
>
> Fix this by tracking the last root domain used for the DL server and
> resetting the server parameters when the root domain changes. This ensures
> bandwidth accounting uses the correct, fully initialized root domain after
> the scheduler topology is built.

So, I'm trying to reproduce this issue, but currently not really seeing
it, sorry.

I'm on a 40 CPUs box and, even if I fiddle with hotplug, the numbers I
see from debug (bw, total_bw) seem sane and consistent with the fair
server settings.

Could you please provide additional info about how you reproduce the
issue? Maybe you have a test script around you could share?

Thanks!
Juri
On Wed, Oct 30, 2024 at 03:30:12PM +0100, Juri Lelli wrote:
> Hi Joel,

Hi Juri!

> On 29/10/24 22:51, Joel Fernandes (Google) wrote:
> > During boot initialization, DL server parameters are initialized using the
> > default root domain before the proper scheduler domains and root domains
> > are built. This results in DL server parameters being tied to the default
> > root domain's bandwidth accounting instead of the actual root domain
> > assigned to the CPU after scheduler topology initialization.
> >
> > When secondary CPUs are brought up, the dl_bw_cpus() accounting doesn't
> > properly track CPUs being added since the DL server was started too early
> > with the default root domain. Specifically, dl_bw_cpus() is called before
> > set_cpu_active() during secondary CPU bringup, causing it to not account
> > for the CPU being brought up in its capacity calculations. This causes
> > subsequent sysfs parameter updates to fail with -EBUSY due to bandwidth
> > accounting using the wrong root domain with zeroed total_bw.
> >
> > This issue also causes under-utilization of system capacity. With the fix,
> > we see proper capacity initialization and scaling as CPUs come online - the
> > total system capacity increases from CPU 0 to CPU 1 and continues scaling
> > up as more CPUs are added (from cap=1024 initially to cap=8192 with 8
> > CPUs). Without the fix, the capacity initialization was incomplete since
> > dl_bw_cpus() runs before the CPU is marked active in set_cpu_active(),
> > leading to CPUs not being properly accounted for in the capacity
> > calculations.
> >
> > Fix this by tracking the last root domain used for the DL server and
> > resetting the server parameters when the root domain changes. This ensures
> > bandwidth accounting uses the correct, fully initialized root domain after
> > the scheduler topology is built.
>
> So, I'm trying to reproduce this issue, but currently not really seeing
> it, sorry.
>
> I'm on a 40 CPUs box and, even if I fiddle with hotplug, the numbers I
> see from debug (bw, total_bw) seem sane and consistent with the fair
> server settings.
>
> Could you please provide additional info about how you reproduce the
> issue? Maybe you have a test script around you could share?

With some prints [1] in the kernel, we can see on boot:

$ dmesg|grep appl
[ 0.930337] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1
[ 0.949025] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1
[ 0.953026] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=2, cap=2048, init=1
[ 0.957024] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=3, cap=3072, init=1
[ 0.961023] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=4, cap=4096, init=1
[ 0.965030] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=5, cap=5120, init=1
[ 0.969024] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=6, cap=6144, init=1
[ 0.973024] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=7, cap=7168, init=1

For the 8th apply_params, the 8th CPU is not considered. This is because
set_cpu_active() for the 8th CPU has not yet happened as mentioned in the
commit message.

With the patch:

$ dmesg|grep appl
[ 0.961169] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1
[ 0.981936] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1
[ 0.985836] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=2, cap=2048, init=1
[ 0.989835] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=3, cap=3072, init=1
[ 0.993840] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=4, cap=4096, init=1
[ 0.997835] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=5, cap=5120, init=1
[ 1.001838] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=6, cap=6144, init=1
[ 1.005834] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=7, cap=7168, init=1

[ ... here somewhere rd changes as topology init finishes, then all the
params are re-applied, this time with the correct rd. ]

[ 1.009903] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
[ 1.012409] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
[ 1.014269] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
[ 1.019865] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
[ 1.054908] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
[ 1.081865] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
[ 1.108861] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
[ 1.136944] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1

The -EBUSY happens for our 5.15 backport. I see dl_b->total_bw to be 0
without my patch. Even if the -EBUSY doesn't happen for you (perhaps due to
compiler or other differences), shouldn't we use the correct rd for
apply_params? The dl_bw is tied to the rd via cpu_rq(cpu)->rd->dl_bw;
so if rd changes during boot initialization, the correct dl_bw has to be
updated AFAICS. Also if cpusets are used, the rd for a CPU may change.

Let me know if I missed something?

[1] ----------8<-----------
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d9d5a702f1a6..249f99e88b98 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -238,6 +238,12 @@ void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
 static inline bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap,
				 u64 old_bw, u64 new_bw)
 {
+	printk("dl_b->bw: %llu\n", dl_b->bw);
+	printk("cap: %lu, cap_scale(dl_b->bw, cap): %llu\n", cap, cap_scale(dl_b->bw, cap));
+	printk("dl_b->total_bw: %llu\n", dl_b->total_bw);
+	printk("old_bw: %llu\n", old_bw);
+	printk("new_bw: %llu\n", new_bw);
+
 	return dl_b->bw != -1 &&
	       cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
 }

@@ -1704,6 +1710,10 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
 	cpus = dl_bw_cpus(cpu);
 	cap = dl_bw_capacity(cpu);

+	printk("dl_server_apply_params: cpu=%d, runtime=%llu, period=%llu, cpus=%d, cap=%lu, init=%d\n",
+		cpu, runtime, period, cpus, cap, init);
+	printk("initial dl_b->total_bw=%llu, dl_b->bw=%llu\n", dl_b->total_bw, dl_b->bw);
+
 	if (__dl_overflow(dl_b, cap, old_bw, new_bw))
 		return -EBUSY;

-- 
2.47.0.163.g1226f6d8fa-goog
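As a quick sanity check on the numbers in this thread: the bandwidth values are the deadline code's fixed-point scaling of runtime/period. A minimal standalone sketch of that arithmetic, assuming the kernel's BW_SHIFT of 20, the 50ms/1s fair server defaults, and the default 950000/1000000 us sched_rt runtime/period, reproduces the new_bw=52428 and dl_b->bw=996147 values that show up in the debug output further down:

#include <stdint.h>
#include <stdio.h>

/* Mirrors the kernel's to_ratio(): bandwidth is runtime/period in fixed
 * point with BW_SHIFT = 20, so 1.0 corresponds to 1 << 20 = 1048576. */
#define BW_SHIFT 20

static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	if (period == 0)
		return 0;
	return (runtime << BW_SHIFT) / period;
}

int main(void)
{
	/* Fair server defaults: 50ms runtime every 1000ms (values in ns). */
	uint64_t new_bw = to_ratio(1000000000ULL, 50000000ULL);
	/* Global DL admission limit: sched_rt_{runtime,period}_us (in us). */
	uint64_t dl_bw  = to_ratio(1000000ULL, 950000ULL);

	/* Prints new_bw=52428 dl_b->bw=996147, matching the logs. */
	printf("new_bw=%llu dl_b->bw=%llu\n",
	       (unsigned long long)new_bw, (unsigned long long)dl_bw);
	return 0;
}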
On 30/10/24 19:50, Joel Fernandes wrote:

...

> With some prints [1] in the kernel, we can see on boot:
>
> $ dmesg|grep appl
> [ 0.930337] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1
> [ 0.949025] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1
> [ 0.953026] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=2, cap=2048, init=1
> [ 0.957024] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=3, cap=3072, init=1
> [ 0.961023] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=4, cap=4096, init=1
> [ 0.965030] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=5, cap=5120, init=1
> [ 0.969024] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=6, cap=6144, init=1
> [ 0.973024] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=7, cap=7168, init=1
>
> For the 8th apply_params, the 8th CPU is not considered. This is because
> set_cpu_active() for the 8th CPU has not yet happened as mentioned in the
> commit message.
>
> With the patch:
>
> $ dmesg|grep appl
> [ 0.961169] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1
> [ 0.981936] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1
> [ 0.985836] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=2, cap=2048, init=1
> [ 0.989835] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=3, cap=3072, init=1
> [ 0.993840] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=4, cap=4096, init=1
> [ 0.997835] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=5, cap=5120, init=1
> [ 1.001838] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=6, cap=6144, init=1
> [ 1.005834] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=7, cap=7168, init=1
>
> [ ... here somewhere rd changes as topology init finishes, then all the
> params are re-applied, this time with the correct rd. ]
>
> [ 1.009903] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
> [ 1.012409] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
> [ 1.014269] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
> [ 1.019865] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
> [ 1.054908] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
> [ 1.081865] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
> [ 1.108861] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
> [ 1.136944] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1
>
> The -EBUSY happens for our 5.15 backport. I see dl_b->total_bw to be 0
> without my patch. Even if the -EBUSY doesn't happen for you (perhaps due to
> compiler or other differences), shouldn't we use the correct rd for
> apply_params? The dl_bw is tied to the rd via cpu_rq(cpu)->rd->dl_bw;

I think I am still seeing something different.

[ 0.184629] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=0, dl_b->bw=996147
[ 0.185498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=1024, cap_scale=996147 dl_bw->total_bw=0, old_bw=0, new_bw=52428
[ 0.371531] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=52428, dl_b->bw=996147
[ 0.372498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=1024, cap_scale=996147 dl_bw->total_bw=52428, old_bw=0, new_bw=52428
[ 0.373541] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=2, cap=2048, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=104856, dl_b->bw=996147
[ 0.374498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=2048, cap_scale=1992294 dl_bw->total_bw=104856, old_bw=0, new_bw=52428
[ 0.375507] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=3, cap=3072, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=157284, dl_b->bw=996147
[ 0.376498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=3072, cap_scale=2988441 dl_bw->total_bw=157284, old_bw=0, new_bw=52428
[ 0.377507] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=4, cap=4096, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=209712, dl_b->bw=996147
[ 0.378498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=4096, cap_scale=3984588 dl_bw->total_bw=209712, old_bw=0, new_bw=52428
[ 0.379505] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=5, cap=5120, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=262140, dl_b->bw=996147
[ 0.380498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=5120, cap_scale=4980735 dl_bw->total_bw=262140, old_bw=0, new_bw=52428
[ 0.381504] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=6, cap=6144, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=314568, dl_b->bw=996147
[ 0.382498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=6144, cap_scale=5976882 dl_bw->total_bw=314568, old_bw=0, new_bw=52428
[ 0.384527] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=7, cap=7168, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=366996, dl_b->bw=996147
[ 0.385498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=7168, cap_scale=6973029 dl_bw->total_bw=366996, old_bw=0, new_bw=52428

...

[ 0.388556] __dl_server_attach_root: cpu=0, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=52428, dl_b->bw=996147
[ 0.389507] __dl_server_attach_root: cpu=1, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=104856, dl_b->bw=996147
[ 0.390501] __dl_server_attach_root: cpu=2, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=157284, dl_b->bw=996147
[ 0.391503] __dl_server_attach_root: cpu=3, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=209712, dl_b->bw=996147
[ 0.392499] __dl_server_attach_root: cpu=4, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=262140, dl_b->bw=996147
[ 0.393499] __dl_server_attach_root: cpu=5, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=314568, dl_b->bw=996147
[ 0.394428] __dl_server_attach_root: cpu=6, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=366996, dl_b->bw=996147
[ 0.394499] __dl_server_attach_root: cpu=7, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=419424, dl_b->bw=996147

I added a printk in __dl_server_attach_root which is called after the
dynamic rd is built to transfer bandwidth to it.

__dl_server_attach_root came with d741f297bceaf ("sched/fair: Fair
server interface"), do you have this change in your backport?

> So if rd changes during boot initialization, the correct dl_bw has to be
> updated AFAICS. Also if cpusets are used, the rd for a CPU may change.

cpusets changes are something that I still need to double check. Will
do.
On Mon, Nov 04, 2024 at 11:54:36AM +0100, Juri Lelli wrote: > On 30/10/24 19:50, Joel Fernandes wrote: > > ... > > > With some prints [1] in the kernel, we can see on boot: > > > > $ dmesg|grep appl > > [ 0.930337] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1 > > [ 0.949025] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1 > > [ 0.953026] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=2, cap=2048, init=1 > > [ 0.957024] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=3, cap=3072, init=1 > > [ 0.961023] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=4, cap=4096, init=1 > > [ 0.965030] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=5, cap=5120, init=1 > > [ 0.969024] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=6, cap=6144, init=1 > > [ 0.973024] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=7, cap=7168, init=1 > > > > For the 8th apply_params, the 8th CPU is not considered. This is because > > set_cpu_active() for the 8th CPU has not yet happened as mentioned in commit > > message. > > > > With the patch: > > > > $ dmesg|grep appl > > [ 0.961169] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1 > > [ 0.981936] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1 > > [ 0.985836] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=2, cap=2048, init=1 > > [ 0.989835] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=3, cap=3072, init=1 > > [ 0.993840] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=4, cap=4096, init=1 > > [ 0.997835] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=5, cap=5120, init=1 > > [ 1.001838] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=6, cap=6144, init=1 > > [ 1.005834] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=7, cap=7168, init=1 > > > > [ ... here somewhere rd changes as topology init finishes, then all the > > params are replied, this time with the correct rd. ] > > > > [ 1.009903] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1 > > [ 1.012409] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1 > > [ 1.014269] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1 > > [ 1.019865] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1 > > [ 1.054908] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1 > > [ 1.081865] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1 > > [ 1.108861] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1 > > [ 1.136944] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=8, cap=8192, init=1 > > > > The -EBUSY happens for our 5.15 backport. I see dl_b->total_bw to be 0 > > without my patch. Even if the -EBUSY doesn't happen for you (perhaps due to > > compiler or other differences), shouldn't we use the correct rd for > > apply_params? The dl_bw is tied to the rd via cpu_rq(cpu)->rd->dl_bw; > > I think I am still seeing something different. 
> > [ 0.184629] dl_server_apply_params: cpu=0, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=0, dl_b->bw=996147 > [ 0.185498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=1024, cap_scale=996147 dl_bw->total_bw=0, old_bw=0, new_bw=52428 > [ 0.371531] dl_server_apply_params: cpu=1, runtime=50000000, period=1000000000, cpus=1, cap=1024, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=52428, dl_b->bw=996147 > [ 0.372498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=1024, cap_scale=996147 dl_bw->total_bw=52428, old_bw=0, new_bw=52428 > [ 0.373541] dl_server_apply_params: cpu=2, runtime=50000000, period=1000000000, cpus=2, cap=2048, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=104856, dl_b->bw=996147 > [ 0.374498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=2048, cap_scale=1992294 dl_bw->total_bw=104856, old_bw=0, new_bw=52428 > [ 0.375507] dl_server_apply_params: cpu=3, runtime=50000000, period=1000000000, cpus=3, cap=3072, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=157284, dl_b->bw=996147 > [ 0.376498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=3072, cap_scale=2988441 dl_bw->total_bw=157284, old_bw=0, new_bw=52428 > [ 0.377507] dl_server_apply_params: cpu=4, runtime=50000000, period=1000000000, cpus=4, cap=4096, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=209712, dl_b->bw=996147 > [ 0.378498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=4096, cap_scale=3984588 dl_bw->total_bw=209712, old_bw=0, new_bw=52428 > [ 0.379505] dl_server_apply_params: cpu=5, runtime=50000000, period=1000000000, cpus=5, cap=5120, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=262140, dl_b->bw=996147 > [ 0.380498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=5120, cap_scale=4980735 dl_bw->total_bw=262140, old_bw=0, new_bw=52428 > [ 0.381504] dl_server_apply_params: cpu=6, runtime=50000000, period=1000000000, cpus=6, cap=6144, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=314568, dl_b->bw=996147 > [ 0.382498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=6144, cap_scale=5976882 dl_bw->total_bw=314568, old_bw=0, new_bw=52428 > [ 0.384527] dl_server_apply_params: cpu=7, runtime=50000000, period=1000000000, cpus=7, cap=7168, init=1, dl_b=ffffffffb837c080, dl_b->total_bw=366996, dl_b->bw=996147 > [ 0.385498] __dl_overflow: dl_b=ffffffffb837c080, dl_b->bw=996147, cap=7168, cap_scale=6973029 dl_bw->total_bw=366996, old_bw=0, new_bw=52428 > ... 
> [ 0.388556] __dl_server_attach_root: cpu=0, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=52428, dl_b->bw=996147 > [ 0.389507] __dl_server_attach_root: cpu=1, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=104856, dl_b->bw=996147 > [ 0.390501] __dl_server_attach_root: cpu=2, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=157284, dl_b->bw=996147 > [ 0.391503] __dl_server_attach_root: cpu=3, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=209712, dl_b->bw=996147 > [ 0.392499] __dl_server_attach_root: cpu=4, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=262140, dl_b->bw=996147 > [ 0.393499] __dl_server_attach_root: cpu=5, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=314568, dl_b->bw=996147 > [ 0.394428] __dl_server_attach_root: cpu=6, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=366996, dl_b->bw=996147 > [ 0.394499] __dl_server_attach_root: cpu=7, new_bw=52428, dl_b=ffff949bc080e040, dl_b->total_bw=419424, dl_b->bw=996147 > > > I added a printk in __dl_server_attach_root which is called after the > dynamic rd is built to transfer bandwidth to it. > > __dl_server_attach_root came with d741f297bceaf ("sched/fair: Fair > server interface"), do you have this change in your backport? You nailed it! Our 5.15 backport appears to be slightly older and is missing this from topology.c as you mentioned. Thanks for clarifying! /* * Because the rq is not a task, dl_add_task_root_domain() did not * move the fair server bw to the rd if it already started. * Add it now. */ if (rq->fair_server.dl_server) __dl_server_attach_root(&rq->fair_server, rq); > > > So if rd changes during boot initialization, the correct dl_bw has to be > > updated AFAICS. Also if cpusets are used, the rd for a CPU may change. > > cpusets changes are something that I still need to double check. Will > do. > Sounds good, that would be good to verify. Thanks, - Joel
On 04/11/24 17:41, Joel Fernandes wrote:
> On Mon, Nov 04, 2024 at 11:54:36AM +0100, Juri Lelli wrote:

...

> > I added a printk in __dl_server_attach_root which is called after the
> > dynamic rd is built to transfer bandwidth to it.
> >
> > __dl_server_attach_root came with d741f297bceaf ("sched/fair: Fair
> > server interface"), do you have this change in your backport?
>
> You nailed it! Our 5.15 backport appears to be slightly older and is missing
> this from topology.c as you mentioned. Thanks for clarifying!
>
>	/*
>	 * Because the rq is not a task, dl_add_task_root_domain() did not
>	 * move the fair server bw to the rd if it already started.
>	 * Add it now.
>	 */
>	if (rq->fair_server.dl_server)
>		__dl_server_attach_root(&rq->fair_server, rq);
>
> > > So if rd changes during boot initialization, the correct dl_bw has to be
> > > updated AFAICS. Also if cpusets are used, the rd for a CPU may change.
> >
> > cpusets changes are something that I still need to double check. Will
> > do.
>
> Sounds good, that would be good to verify.

So, I played a little bit with it and came up with a simple set of ops
that point out an issue (default fedora server install):

echo Y >/sys/kernel/debug/sched/verbose

echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control

echo 0-7 > /sys/fs/cgroup/user.slice/cpuset.cpus
echo 6-7 > /sys/fs/cgroup/user.slice/cpuset.cpus.exclusive
echo root >/sys/fs/cgroup/user.slice/cpuset.cpus.partition

The domains are rebuilt correctly, but we end up with a null total_bw.

The conditional call above correctly takes care of adding back the dl_server
per-rq bandwidth when we pass from the single domain to the 2 exclusive
ones, but I noticed that we go through partition_sched_domains_locked()
twice for a single write of 'root', and the second pass, since it's not
actually destroying/rebuilding anything, resets total_bw without adding
the dl_server contribution back.

Now, I'm not completely sure why we need to go through
partition_sched_domains_locked() twice, as we have (it also looked like a
pattern from other call paths)

update_prstate()
  -> update_cpumasks_hier()
       -> rebuild_sched_domains_locked()  <- right at the end
  -> update_partition_sd_lb()
       -> rebuild_sched_domains_locked()  <- right after the above call

Removing the first call does indeed fix the issue and domains look OK,
but I'm pretty sure I'm missing all sorts of details and corner cases.

Waiman (now Cc-ed), maybe you can help here understanding why the two
back to back calls are needed?

Thanks!
Juri
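For reference, a simplified sketch (paraphrased, not verbatim mainline code) of the reset described above: on the path where partition_sched_domains_locked() decides an existing domain can be kept, the root domain survives but its DEADLINE bandwidth is cleared, and only task contributions are added back later through the update_tasks_root_domain() machinery, not the per-rq fair server bandwidth:

/*
 * Simplified sketch, not verbatim kernel code: the "match" path keeps
 * the existing root domain but drops its accumulated bandwidth.  Task
 * bandwidth gets re-added elsewhere; the fair server's per-rq bandwidth
 * does not, which is how total_bw can end up 0 when the domains are
 * "rebuilt" into an identical layout.
 */
void dl_clear_root_domain(struct root_domain *rd)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
	rd->dl_bw.total_bw = 0;
	raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
}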
On 11/6/24 11:08 AM, Juri Lelli wrote: > On 04/11/24 17:41, Joel Fernandes wrote: >> On Mon, Nov 04, 2024 at 11:54:36AM +0100, Juri Lelli wrote: > ... > >>> I added a printk in __dl_server_attach_root which is called after the >>> dynamic rd is built to transfer bandwidth to it. >>> >>> __dl_server_attach_root came with d741f297bceaf ("sched/fair: Fair >>> server interface"), do you have this change in your backport? >> You nailed it! Our 5.15 backport appears to be slightly older and is missing >> this from topology.c as you mentioned. Thanks for clarifying! >> >> >> /* >> * Because the rq is not a task, dl_add_task_root_domain() did not >> * move the fair server bw to the rd if it already started. >> * Add it now. >> */ >> if (rq->fair_server.dl_server) >> __dl_server_attach_root(&rq->fair_server, rq); >> >>>> So if rd changes during boot initialization, the correct dl_bw has to be >>>> updated AFAICS. Also if cpusets are used, the rd for a CPU may change. >>> cpusets changes are something that I still need to double check. Will >>> do. >>> >> Sounds good, that would be good to verify. > So, I played a little bit with it and came up with a simple set of ops > that point out an issue (default fedora server install): > > echo Y >/sys/kernel/debug/sched/verbose > > echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control > > echo 0-7 > /sys/fs/cgroup/user.slice/cpuset.cpus > echo 6-7 > /sys/fs/cgroup/user.slice/cpuset.cpus.exclusive > echo root >/sys/fs/cgroup/user.slice/cpuset.cpus.partition > > The domains are rebuilt correctly, but we end up with a null total_bw. > > The conditional call above takes care correctly of adding back dl_server > per-rq bandwidth when we pass from the single domain to the 2 exclusive > ones, but I noticed that we go through partition_sched_domains_locked() > twice for a single write of 'root' and the second one, since it's not > actually destroying/rebuilding anything, is resetting total_bw w/o > addition dl_server contribution back. > > Now, not completely sure why we need to go through partition_sched_ > domains_locked() twice, as we have (it also looked like a pattern from > other call paths) > > update_prstate() > -> update_cpumasks_hier() > -> rebuild_sched_domains_locked() <- right at the end > -> update_partition_sd_lb() > -> rebuild_sched_domains_locked() <- right after the above call > > Removing the first call does indeed fix the issue and domains look OK, > but I'm pretty sure I'm missing all sort of details and corner cases. > > Waiman (now Cc-ed), maybe you can help here understanding why the two > back to back calls are needed? Thanks for letting me know about this case. I am aware that rebuild_sched_domains_locked() can be called more than once. I have addressed the hotplug case, but it can happen in some other corner cases as well. The problem with multiple rebuild_sched_domains_locked() calls is the fact that intermediate ones may be called where the internal states may not be consistent. I am going to work on a fix to this issue by making sure that rebuild_sched_domains_locked() will only be called once. Cheers, Longman
On 11/6/24 1:05 PM, Waiman Long wrote: > On 11/6/24 11:08 AM, Juri Lelli wrote: >> On 04/11/24 17:41, Joel Fernandes wrote: >>> On Mon, Nov 04, 2024 at 11:54:36AM +0100, Juri Lelli wrote: >> ... >> >>>> I added a printk in __dl_server_attach_root which is called after the >>>> dynamic rd is built to transfer bandwidth to it. >>>> >>>> __dl_server_attach_root came with d741f297bceaf ("sched/fair: Fair >>>> server interface"), do you have this change in your backport? >>> You nailed it! Our 5.15 backport appears to be slightly older and is >>> missing >>> this from topology.c as you mentioned. Thanks for clarifying! >>> >>> >>> /* >>> * Because the rq is not a task, dl_add_task_root_domain() >>> did not >>> * move the fair server bw to the rd if it already started. >>> * Add it now. >>> */ >>> if (rq->fair_server.dl_server) >>> __dl_server_attach_root(&rq->fair_server, rq); >>> >>>>> So if rd changes during boot initialization, the correct dl_bw has >>>>> to be >>>>> updated AFAICS. Also if cpusets are used, the rd for a CPU may >>>>> change. >>>> cpusets changes are something that I still need to double check. Will >>>> do. >>>> >>> Sounds good, that would be good to verify. >> So, I played a little bit with it and came up with a simple set of ops >> that point out an issue (default fedora server install): >> >> echo Y >/sys/kernel/debug/sched/verbose >> >> echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control >> >> echo 0-7 > /sys/fs/cgroup/user.slice/cpuset.cpus >> echo 6-7 > /sys/fs/cgroup/user.slice/cpuset.cpus.exclusive >> echo root >/sys/fs/cgroup/user.slice/cpuset.cpus.partition >> >> The domains are rebuilt correctly, but we end up with a null total_bw. >> >> The conditional call above takes care correctly of adding back dl_server >> per-rq bandwidth when we pass from the single domain to the 2 exclusive >> ones, but I noticed that we go through partition_sched_domains_locked() >> twice for a single write of 'root' and the second one, since it's not >> actually destroying/rebuilding anything, is resetting total_bw w/o >> addition dl_server contribution back. >> >> Now, not completely sure why we need to go through partition_sched_ >> domains_locked() twice, as we have (it also looked like a pattern from >> other call paths) >> >> update_prstate() >> -> update_cpumasks_hier() >> -> rebuild_sched_domains_locked() <- right at the end >> -> update_partition_sd_lb() >> -> rebuild_sched_domains_locked() <- right after the above call >> >> Removing the first call does indeed fix the issue and domains look OK, >> but I'm pretty sure I'm missing all sort of details and corner cases. >> >> Waiman (now Cc-ed), maybe you can help here understanding why the two >> back to back calls are needed? > > Thanks for letting me know about this case. > > I am aware that rebuild_sched_domains_locked() can be called more than > once. I have addressed the hotplug case, but it can happen in some > other corner cases as well. The problem with multiple > rebuild_sched_domains_locked() calls is the fact that intermediate > ones may be called where the internal states may not be consistent. I > am going to work on a fix to this issue by making sure that > rebuild_sched_domains_locked() will only be called once. I am working on a set of cpuset patches to eliminate redundant rebuild_sched_domains_locked() calls. However, my cpuset test script fails after the change due to the presence of test cases where the only CPU in a 1-cpu partition is being offlined. 
So I sent out a sched/deadline patch [1] to work around this particular corner case. [1] https://lore.kernel.org/lkml/20241108042924.520458-1-longman@redhat.com/T/#u Apparently, the null total_bw bug caused by multiple rebuild_sched_domains_locked() calls masks this problem. Anyway, I should be able to post the cpuset patch series next week after further testing. Please review my sched/deadline patch to see if you are OK with this minor change. Cheers, Longman
On 11/7/24 11:40 PM, Waiman Long wrote: > On 11/6/24 1:05 PM, Waiman Long wrote: >> On 11/6/24 11:08 AM, Juri Lelli wrote: >>> On 04/11/24 17:41, Joel Fernandes wrote: >>>> On Mon, Nov 04, 2024 at 11:54:36AM +0100, Juri Lelli wrote: >>> ... >>> >>>>> I added a printk in __dl_server_attach_root which is called after the >>>>> dynamic rd is built to transfer bandwidth to it. >>>>> >>>>> __dl_server_attach_root came with d741f297bceaf ("sched/fair: Fair >>>>> server interface"), do you have this change in your backport? >>>> You nailed it! Our 5.15 backport appears to be slightly older and >>>> is missing >>>> this from topology.c as you mentioned. Thanks for clarifying! >>>> >>>> >>>> /* >>>> * Because the rq is not a task, dl_add_task_root_domain() >>>> did not >>>> * move the fair server bw to the rd if it already started. >>>> * Add it now. >>>> */ >>>> if (rq->fair_server.dl_server) >>>> __dl_server_attach_root(&rq->fair_server, rq); >>>> >>>>>> So if rd changes during boot initialization, the correct dl_bw >>>>>> has to be >>>>>> updated AFAICS. Also if cpusets are used, the rd for a CPU may >>>>>> change. >>>>> cpusets changes are something that I still need to double check. Will >>>>> do. >>>>> >>>> Sounds good, that would be good to verify. >>> So, I played a little bit with it and came up with a simple set of ops >>> that point out an issue (default fedora server install): >>> >>> echo Y >/sys/kernel/debug/sched/verbose >>> >>> echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control >>> >>> echo 0-7 > /sys/fs/cgroup/user.slice/cpuset.cpus >>> echo 6-7 > /sys/fs/cgroup/user.slice/cpuset.cpus.exclusive >>> echo root >/sys/fs/cgroup/user.slice/cpuset.cpus.partition >>> >>> The domains are rebuilt correctly, but we end up with a null total_bw. >>> >>> The conditional call above takes care correctly of adding back >>> dl_server >>> per-rq bandwidth when we pass from the single domain to the 2 exclusive >>> ones, but I noticed that we go through partition_sched_domains_locked() >>> twice for a single write of 'root' and the second one, since it's not >>> actually destroying/rebuilding anything, is resetting total_bw w/o >>> addition dl_server contribution back. >>> >>> Now, not completely sure why we need to go through partition_sched_ >>> domains_locked() twice, as we have (it also looked like a pattern from >>> other call paths) >>> >>> update_prstate() >>> -> update_cpumasks_hier() >>> -> rebuild_sched_domains_locked() <- right at the end >>> -> update_partition_sd_lb() >>> -> rebuild_sched_domains_locked() <- right after the above call >>> >>> Removing the first call does indeed fix the issue and domains look OK, >>> but I'm pretty sure I'm missing all sort of details and corner cases. >>> >>> Waiman (now Cc-ed), maybe you can help here understanding why the two >>> back to back calls are needed? >> >> Thanks for letting me know about this case. >> >> I am aware that rebuild_sched_domains_locked() can be called more >> than once. I have addressed the hotplug case, but it can happen in >> some other corner cases as well. The problem with multiple >> rebuild_sched_domains_locked() calls is the fact that intermediate >> ones may be called where the internal states may not be consistent. I >> am going to work on a fix to this issue by making sure that >> rebuild_sched_domains_locked() will only be called once. > > I am working on a set of cpuset patches to eliminate redundant > rebuild_sched_domains_locked() calls. 
However, my cpuset test script > fails after the change due to the presence of test cases where the > only CPU in a 1-cpu partition is being offlined. So I sent out a > sched/deadline patch [1] to work around this particular corner case. > > [1] > https://lore.kernel.org/lkml/20241108042924.520458-1-longman@redhat.com/T/#u > > Apparently, the null total_bw bug caused by multiple > rebuild_sched_domains_locked() calls masks this problem. > > Anyway, I should be able to post the cpuset patch series next week > after further testing. Please review my sched/deadline patch to see if > you are OK with this minor change. I have the patchset to enforce that rebuild_sched_domains_locked() will only be called at most once per cpuset operation. By adding some debug code to further study the null total_bw issue when cpuset.cpus.partition is being changed, I found that eliminating the redundant rebuild_sched_domains_locked() reduced the chance of hitting null total_bw, it did not eliminate it. By running my cpuset test script, I hit 250 cases of null total_bw with the v6.12-rc6 kernel. With my new cpuset patch applied, it reduces it to 120 cases of null total_bw. I will try to look further for the exact condition that triggers null total_bw generation. Cheers, Longman
On 11/8/24 10:30 PM, Waiman Long wrote: > I have the patchset to enforce that rebuild_sched_domains_locked() > will only be called at most once per cpuset operation. > > By adding some debug code to further study the null total_bw issue > when cpuset.cpus.partition is being changed, I found that eliminating > the redundant rebuild_sched_domains_locked() reduced the chance of > hitting null total_bw, it did not eliminate it. By running my cpuset > test script, I hit 250 cases of null total_bw with the v6.12-rc6 > kernel. With my new cpuset patch applied, it reduces it to 120 cases > of null total_bw. > > I will try to look further for the exact condition that triggers null > total_bw generation. After further testing, the 120 cases of null total_bw can be classified into the following 3 categories. 1) 51 cases when an isolated partition with isolated CPUs is created. Isolated CPU is not subjected to scheduling and so a total_bw of 0 is fine and not really a problem. 2) 67 cases when a nested partitions are being removed (A1 - A2). There is probably caused by some kind of race condtion. If I insert an artifical delay between the removal of A2 and A1, total_bw is fine. If there is no delay, I can see a null total_bw. That shouldn't really a problem in practice, though we may still need to figure out why. 2) Two cases where null total_bw is seen when a new partition is created by moving all the CPUs in the parent cgroup into its partition and the parent becomes a null partition with no CPU. The following example illustrates the steps. #!/bin/bash CGRP=/sys/fs/cgroup cd $CGRP echo +cpuset > cgroup.subtree_control mkdir A1 cd A1 echo 0-1 > cpuset.cpus echo root > cpuset.cpus.partition echo "A1 partition" echo +cpuset > cgroup.subtree_control mkdir A2 cd A2 echo 0-1 > cpuset.cpus echo root > cpuset.cpus.partition echo "A2 partition" cd .. echo "Remove A2" rmdir A2 cd .. echo "Remove A1" rmdir A1 In this corner case, there is actually no change in the set of sched domains. In this case, the sched domain set of CPUs 0-1 is being moved from partition A1 to A2 and vice versa in the removal of A2. This is similar to calling rebuild_sched_domains_locked() twice with the same input. I believe that is the condition that causes null total_bw. Now the question is why the deadline code behaves this way. It is probably a bug that needs to be addressed. Cheers, Longman
On 09/11/24 13:18, Waiman Long wrote: > On 11/8/24 10:30 PM, Waiman Long wrote: > > I have the patchset to enforce that rebuild_sched_domains_locked() will > > only be called at most once per cpuset operation. > > > > By adding some debug code to further study the null total_bw issue when > > cpuset.cpus.partition is being changed, I found that eliminating the > > redundant rebuild_sched_domains_locked() reduced the chance of hitting > > null total_bw, it did not eliminate it. By running my cpuset test > > script, I hit 250 cases of null total_bw with the v6.12-rc6 kernel. With > > my new cpuset patch applied, it reduces it to 120 cases of null > > total_bw. > > > > I will try to look further for the exact condition that triggers null > > total_bw generation. > > After further testing, the 120 cases of null total_bw can be classified into > the following 3 categories. > > 1) 51 cases when an isolated partition with isolated CPUs is created. > Isolated CPU is not subjected to scheduling and so a total_bw of 0 is fine > and not really a problem. > > 2) 67 cases when a nested partitions are being removed (A1 - A2). There is > probably caused by some kind of race condtion. If I insert an artifical > delay between the removal of A2 and A1, total_bw is fine. If there is no > delay, I can see a null total_bw. That shouldn't really a problem in > practice, though we may still need to figure out why. > > 2) Two cases where null total_bw is seen when a new partition is created by > moving all the CPUs in the parent cgroup into its partition and the parent > becomes a null partition with no CPU. The following example illustrates the > steps. > > #!/bin/bash > CGRP=/sys/fs/cgroup > cd $CGRP > echo +cpuset > cgroup.subtree_control > mkdir A1 > cd A1 > echo 0-1 > cpuset.cpus > echo root > cpuset.cpus.partition > echo "A1 partition" > echo +cpuset > cgroup.subtree_control > mkdir A2 > cd A2 > echo 0-1 > cpuset.cpus > echo root > cpuset.cpus.partition > echo "A2 partition" > cd .. > echo "Remove A2" > rmdir A2 > cd .. > echo "Remove A1" > rmdir A1 > > In this corner case, there is actually no change in the set of sched > domains. In this case, the sched domain set of CPUs 0-1 is being moved from > partition A1 to A2 and vice versa in the removal of A2. This is similar to > calling rebuild_sched_domains_locked() twice with the same input. I believe > that is the condition that causes null total_bw. > > Now the question is why the deadline code behaves this way. It is probably a > bug that needs to be addressed. Thanks for the analysis (and the patches). Will take a look, but I suspect it might be because in case domains are not destroyed, we clear them up (total_bw set to 0), but we don't add fair server contribution back because rqs are not attached to domains (as there have been alredy attached when such domains were created). Best, Juri
On 11/11/24 09:37, Juri Lelli wrote: > On 09/11/24 13:18, Waiman Long wrote: > > On 11/8/24 10:30 PM, Waiman Long wrote: > > > I have the patchset to enforce that rebuild_sched_domains_locked() will > > > only be called at most once per cpuset operation. > > > > > > By adding some debug code to further study the null total_bw issue when > > > cpuset.cpus.partition is being changed, I found that eliminating the > > > redundant rebuild_sched_domains_locked() reduced the chance of hitting > > > null total_bw, it did not eliminate it. By running my cpuset test > > > script, I hit 250 cases of null total_bw with the v6.12-rc6 kernel. With > > > my new cpuset patch applied, it reduces it to 120 cases of null > > > total_bw. > > > > > > I will try to look further for the exact condition that triggers null > > > total_bw generation. > > > > After further testing, the 120 cases of null total_bw can be classified into > > the following 3 categories. > > > > 1) 51 cases when an isolated partition with isolated CPUs is created. > > Isolated CPU is not subjected to scheduling and so a total_bw of 0 is fine > > and not really a problem. > > > > 2) 67 cases when a nested partitions are being removed (A1 - A2). There is > > probably caused by some kind of race condtion. If I insert an artifical > > delay between the removal of A2 and A1, total_bw is fine. If there is no > > delay, I can see a null total_bw. That shouldn't really a problem in > > practice, though we may still need to figure out why. > > > > 2) Two cases where null total_bw is seen when a new partition is created by > > moving all the CPUs in the parent cgroup into its partition and the parent > > becomes a null partition with no CPU. The following example illustrates the > > steps. > > > > #!/bin/bash > > CGRP=/sys/fs/cgroup > > cd $CGRP > > echo +cpuset > cgroup.subtree_control > > mkdir A1 > > cd A1 > > echo 0-1 > cpuset.cpus > > echo root > cpuset.cpus.partition > > echo "A1 partition" > > echo +cpuset > cgroup.subtree_control > > mkdir A2 > > cd A2 > > echo 0-1 > cpuset.cpus > > echo root > cpuset.cpus.partition > > echo "A2 partition" > > cd .. > > echo "Remove A2" > > rmdir A2 > > cd .. > > echo "Remove A1" > > rmdir A1 > > > > In this corner case, there is actually no change in the set of sched > > domains. In this case, the sched domain set of CPUs 0-1 is being moved from > > partition A1 to A2 and vice versa in the removal of A2. This is similar to > > calling rebuild_sched_domains_locked() twice with the same input. I believe > > that is the condition that causes null total_bw. > > > > Now the question is why the deadline code behaves this way. It is probably a > > bug that needs to be addressed. > > Thanks for the analysis (and the patches). Will take a look, but I > suspect it might be because in case domains are not destroyed, we clear > them up (total_bw set to 0), but we don't add fair server contribution > back because rqs are not attached to domains (as there have been alredy > attached when such domains were created). OK, I'm thinking maybe something like the below might help. It seems to do well with your test above, does it maybe fix the other cases you mentioned as well? 
---
 include/linux/sched/deadline.h |  1 +
 kernel/sched/deadline.c        | 12 ++++++++++++
 kernel/sched/topology.c        |  9 ++++++---
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index 3a912ab42bb5..a74fb0da0fa3 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -34,6 +34,7 @@ static inline bool dl_time_before(u64 a, u64 b)
 struct root_domain;
 extern void dl_add_task_root_domain(struct task_struct *p);
 extern void dl_clear_root_domain(struct root_domain *rd);
+extern void dl_restore_server_root_domain(struct root_domain *rd);
 
 #endif /* CONFIG_SMP */

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7884e566557c..acbd3039215f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2981,6 +2981,18 @@ void dl_add_task_root_domain(struct task_struct *p)
 	task_rq_unlock(rq, p, &rf);
 }
 
+void dl_restore_server_root_domain(struct root_domain *rd) {
+	int i;
+
+	guard(raw_spinlock_irqsave)(&rd->dl_bw.lock);
+	for_each_cpu(i, rd->span) {
+		struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server;
+
+		if (dl_server(dl_se))
+			rd->dl_bw.total_bw += dl_se->dl_bw;
+	}
+}
+
 void dl_clear_root_domain(struct root_domain *rd)
 {
 	unsigned long flags;

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9748a4c8d668..a0fcae07585e 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2721,12 +2721,15 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
 			/*
 			 * This domain won't be destroyed and as such
-			 * its dl_bw->total_bw needs to be cleared. It
-			 * will be recomputed in function
-			 * update_tasks_root_domain().
+			 * its dl_bw->total_bw needs to be cleared.
+			 * Tasks contribution will be then recomputed
+			 * in function dl_update_tasks_root_domain(),
+			 * dl_servers contribution in function
+			 * dl_restore_server_root_domain().
 			 */
 			rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
 			dl_clear_root_domain(rd);
+			dl_restore_server_root_domain(rd);
 			goto match1;
 		}
 	}
On 07/11/24 23:40, Waiman Long wrote: > On 11/6/24 1:05 PM, Waiman Long wrote: > > On 11/6/24 11:08 AM, Juri Lelli wrote: > > > On 04/11/24 17:41, Joel Fernandes wrote: > > > > On Mon, Nov 04, 2024 at 11:54:36AM +0100, Juri Lelli wrote: > > > ... > > > > > > > > I added a printk in __dl_server_attach_root which is called after the > > > > > dynamic rd is built to transfer bandwidth to it. > > > > > > > > > > __dl_server_attach_root came with d741f297bceaf ("sched/fair: Fair > > > > > server interface"), do you have this change in your backport? > > > > You nailed it! Our 5.15 backport appears to be slightly older > > > > and is missing > > > > this from topology.c as you mentioned. Thanks for clarifying! > > > > > > > > > > > > /* > > > > * Because the rq is not a task, > > > > dl_add_task_root_domain() did not > > > > * move the fair server bw to the rd if it already started. > > > > * Add it now. > > > > */ > > > > if (rq->fair_server.dl_server) > > > > __dl_server_attach_root(&rq->fair_server, rq); > > > > > > > > > > So if rd changes during boot initialization, the correct > > > > > > dl_bw has to be > > > > > > updated AFAICS. Also if cpusets are used, the rd for a > > > > > > CPU may change. > > > > > cpusets changes are something that I still need to double check. Will > > > > > do. > > > > > > > > > Sounds good, that would be good to verify. > > > So, I played a little bit with it and came up with a simple set of ops > > > that point out an issue (default fedora server install): > > > > > > echo Y >/sys/kernel/debug/sched/verbose > > > > > > echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control > > > > > > echo 0-7 > /sys/fs/cgroup/user.slice/cpuset.cpus > > > echo 6-7 > /sys/fs/cgroup/user.slice/cpuset.cpus.exclusive > > > echo root >/sys/fs/cgroup/user.slice/cpuset.cpus.partition > > > > > > The domains are rebuilt correctly, but we end up with a null total_bw. > > > > > > The conditional call above takes care correctly of adding back dl_server > > > per-rq bandwidth when we pass from the single domain to the 2 exclusive > > > ones, but I noticed that we go through partition_sched_domains_locked() > > > twice for a single write of 'root' and the second one, since it's not > > > actually destroying/rebuilding anything, is resetting total_bw w/o > > > addition dl_server contribution back. > > > > > > Now, not completely sure why we need to go through partition_sched_ > > > domains_locked() twice, as we have (it also looked like a pattern from > > > other call paths) > > > > > > update_prstate() > > > -> update_cpumasks_hier() > > > -> rebuild_sched_domains_locked() <- right at the end > > > -> update_partition_sd_lb() > > > -> rebuild_sched_domains_locked() <- right after the above call > > > > > > Removing the first call does indeed fix the issue and domains look OK, > > > but I'm pretty sure I'm missing all sort of details and corner cases. > > > > > > Waiman (now Cc-ed), maybe you can help here understanding why the two > > > back to back calls are needed? > > > > Thanks for letting me know about this case. > > > > I am aware that rebuild_sched_domains_locked() can be called more than > > once. I have addressed the hotplug case, but it can happen in some other > > corner cases as well. The problem with multiple > > rebuild_sched_domains_locked() calls is the fact that intermediate ones > > may be called where the internal states may not be consistent. I am > > going to work on a fix to this issue by making sure that > > rebuild_sched_domains_locked() will only be called once. 
> > I am working on a set of cpuset patches to eliminate redundant > rebuild_sched_domains_locked() calls. However, my cpuset test script fails > after the change due to the presence of test cases where the only CPU in a > 1-cpu partition is being offlined. So I sent out a sched/deadline patch [1] > to work around this particular corner case. > > [1] > https://lore.kernel.org/lkml/20241108042924.520458-1-longman@redhat.com/T/#u > > Apparently, the null total_bw bug caused by multiple > rebuild_sched_domains_locked() calls masks this problem. > > Anyway, I should be able to post the cpuset patch series next week after > further testing. Please review my sched/deadline patch to see if you are OK > with this minor change. Thank you! Will take a look. Best, Juri