for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)
Posted by Frederic Weisbecker 1 year, 10 months ago
On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
> On Thu, Mar 21, 2024 at 05:47:59AM -0700, Paul E. McKenney wrote:
> [ 1484.955213] WARNING: CPU: 6 PID: 162 at kernel/time/hrtimer.c:1088 enqueue_hrtimer+0x6f/0x80
> [ 1484.962513] Modules linked in:
> [ 1484.966476] CPU: 6 PID: 162 Comm: rcu_torture_rea Not tainted 6.8.0 #25
> [ 1484.972975] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
> [ 1484.977653] RIP: 0010:enqueue_hrtimer+0x6f/0x80
> [ 1484.978679] Code: a3 05 75 6a b7 01 73 bd 48 8b 05 e4 47 b5 01 48 85 c0 74 0c 48 8b 78 08 48 89 ee e8 eb bd ff ff 48 8b 03 f6 40 10 10 75 a5 90 <0f> 0b 90 eb 9f 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
> [ 1484.983136] RSP: 0000:ffffa553805cfd70 EFLAGS: 00010046
> [ 1484.984386] RAX: ffff8c395f260440 RBX: ffff8c395f260480 RCX: ffff8c395f260440
> [ 1484.986121] RDX: 0000000000000000 RSI: ffff8c395f260480 RDI: ffffa553805cfde0
> [ 1484.987704] RBP: ffffa553805cfde0 R08: 0000000000000001 R09: 000000000000fc03
> [ 1484.989513] R10: 0000000000000001 R11: ffff8c3941248e40 R12: 0000000000000040
> [ 1484.991116] R13: ffff8c395f260480 R14: ffff8c395f260480 R15: ffff8c395f260440
> [ 1484.992835] FS:  0000000000000000(0000) GS:ffff8c395f380000(0000) knlGS:0000000000000000
> [ 1484.994683] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1484.995985] CR2: 0000000000000000 CR3: 000000001a62c000 CR4: 00000000000006f0
> [ 1484.997789] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1484.999376] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 1485.001177] Call Trace:
> [ 1485.002388]  <TASK>
> [ 1485.002923]  ? __warn+0x78/0x120
> [ 1485.003641]  ? enqueue_hrtimer+0x6f/0x80
> [ 1485.004641]  ? report_bug+0xf1/0x1d0
> [ 1485.005537]  ? handle_bug+0x40/0x70
> [ 1485.006318]  ? exc_invalid_op+0x13/0x70
> [ 1485.007167]  ? asm_exc_invalid_op+0x16/0x20
> [ 1485.008117]  ? enqueue_hrtimer+0x6f/0x80
> [ 1485.009131]  hrtimer_start_range_ns+0x258/0x2f0
> [ 1485.010153]  ? __pfx_rcu_torture_reader+0x10/0x10
> [ 1485.011185]  schedule_hrtimeout_range_clock+0x96/0x120
> [ 1485.012394]  ? __pfx_hrtimer_wakeup+0x10/0x10
> [ 1485.013502]  stutter_wait+0x7f/0x90
> [ 1485.014319]  rcu_torture_reader+0x10e/0x280
> [ 1485.015240]  ? __pfx_rcu_torture_timer+0x10/0x10
> [ 1485.016317]  ? kthread+0xc6/0xf0
> [ 1485.017169]  ? __pfx_rcu_torture_reader+0x10/0x10
> [ 1485.018215]  kthread+0xc6/0xf0
> [ 1485.018899]  ? __pfx_kthread+0x10/0x10
> [ 1485.019703]  ret_from_fork+0x2b/0x40
> [ 1485.020546]  ? __pfx_kthread+0x10/0x10
> [ 1485.021428]  ret_from_fork_asm+0x1b/0x30
> [ 1485.022295]  </TASK>
> 
> This happens within the following loop:
> 
> 	for_each_domain(cpu, sd) {
> 		for_each_cpu_and(i, sched_domain_span(sd), hk_mask) {
> 			if (cpu == i)
> 				continue;
> 
> 			if (!idle_cpu(i))
> 				return i;
> 		}
> 	}
> 
> An offline CPU is returned from there, which is not supposed to happen
> as this runs within an RCU read side critical section. But I can't find
> where those rq->sd pointers are modified when a CPU goes offline. The code
> path is hard to follow. Adding some scheduler people involved in topology
> just in case. Investigation continues...

So I just tried a simple test. I applied the following PoC patch, under the
assumption that right after the exit of takedown_cpu(), the target CPU is
offline and all CPUs must see the updates performed by the now dead target
CPU. I also expect that the target CPU should have removed itself from the
scheduling domains (though I don't know where that happens in the maze):

diff --git a/include/linux/sched/nohz.h b/include/linux/sched/nohz.h
index 6d67e9a5af6b..2e0580abc73a 100644
--- a/include/linux/sched/nohz.h
+++ b/include/linux/sched/nohz.h
@@ -9,6 +9,7 @@
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
 extern void nohz_balance_enter_idle(int cpu);
 extern int get_nohz_timer_target(void);
+extern void sched_assert_online_domains(int cpu);
 #else
 static inline void nohz_balance_enter_idle(int cpu) { }
 #endif
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 6c0a92ca6bb5..3424d1d40142 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1024,7 +1024,7 @@ void clear_tasks_mm_cpumask(int cpu)
 }
 
 /* Take this CPU down. */
-static int take_cpu_down(void *_param)
+int take_cpu_down(void *_param)
 {
 	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
 	enum cpuhp_state target = max((int)st->target, CPUHP_AP_OFFLINE);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d18c3969f90..57f483fc1037 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1055,6 +1055,22 @@ void resched_cpu(int cpu)
 
 #ifdef CONFIG_SMP
 #ifdef CONFIG_NO_HZ_COMMON
+void sched_assert_online_domains(int cpu)
+{
+	struct sched_domain *sd;
+	int i;
+
+	rcu_read_lock();
+	for_each_domain(cpu, sd) {
+		for_each_cpu(i, sched_domain_span(sd)) {
+			if (cpu == i)
+				continue;
+
+			WARN_ON_ONCE(cpu_is_offline(i));
+		}
+	}
+	rcu_read_unlock();
+}
 /*
  * In the semi idle case, use the nearest busy CPU for migrating timers
  * from an idle CPU.  This is good for power-savings.
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index cedb17ba158a..354806156bd9 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -22,6 +22,7 @@
 #include <linux/atomic.h>
 #include <linux/nmi.h>
 #include <linux/sched/wake_q.h>
+#include <linux/sched/nohz.h>
 
 /*
  * Structure to determine completion condition and record errors.  May
@@ -199,6 +200,8 @@ notrace void __weak stop_machine_yield(const struct cpumask *cpumask)
 	cpu_relax();
 }
 
+extern int take_cpu_down(void *_param);
+
 /* This is the cpu_stop function which stops the CPU. */
 static int multi_cpu_stop(void *data)
 {
@@ -255,6 +258,8 @@ static int multi_cpu_stop(void *data)
 	} while (curstate != MULTI_STOP_EXIT);
 
 	local_irq_restore(flags);
+	if (!is_active && msdata->fn == take_cpu_down)
+		sched_assert_online_domains(smp_processor_id());
 	return err;
 }
 
Then with that patch I ran TREE07, just some short iterations:

tools/testing/selftests/rcutorture/bin/kvm.sh --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2

And the warning triggers very quickly. This happens at least since v6.3 but maybe
earlier. Is this expected behaviour or am I right to assume that
for_each_domain()/sched_domain_span() shouldn't return an offline CPU?
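
For reference, for_each_domain() is essentially an RCU-protected walk up the
per-CPU domain hierarchy. A simplified sketch of what it expands to
(paraphrased from kernel/sched/sched.h, details vary across versions):

	#define for_each_domain(cpu, __sd) \
		for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
		     __sd; __sd = __sd->parent)

So the RCU read side only guarantees that the sched_domain we dereference
stays valid while we walk it, not that its span is still consistent with the
current online mask.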

Thanks.
Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)
Posted by Valentin Schneider 1 year, 10 months ago
I've just noticed this thread while cleaning up my inbox, apologies for the
delayed response.

On 22/03/24 14:22, Frederic Weisbecker wrote:
> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
>> On Thu, Mar 21, 2024 at 05:47:59AM -0700, Paul E. McKenney wrote:
>> [ 1484.955213] WARNING: CPU: 6 PID: 162 at kernel/time/hrtimer.c:1088 enqueue_hrtimer+0x6f/0x80
>> [ 1484.962513] Modules linked in:
>> [ 1484.966476] CPU: 6 PID: 162 Comm: rcu_torture_rea Not tainted 6.8.0 #25
>> [ 1484.972975] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
>> [ 1484.977653] RIP: 0010:enqueue_hrtimer+0x6f/0x80
>> [ 1484.978679] Code: a3 05 75 6a b7 01 73 bd 48 8b 05 e4 47 b5 01 48 85 c0 74 0c 48 8b 78 08 48 89 ee e8 eb bd ff ff 48 8b 03 f6 40 10 10 75 a5 90 <0f> 0b 90 eb 9f 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
>> [ 1484.983136] RSP: 0000:ffffa553805cfd70 EFLAGS: 00010046
>> [ 1484.984386] RAX: ffff8c395f260440 RBX: ffff8c395f260480 RCX: ffff8c395f260440
>> [ 1484.986121] RDX: 0000000000000000 RSI: ffff8c395f260480 RDI: ffffa553805cfde0
>> [ 1484.987704] RBP: ffffa553805cfde0 R08: 0000000000000001 R09: 000000000000fc03
>> [ 1484.989513] R10: 0000000000000001 R11: ffff8c3941248e40 R12: 0000000000000040
>> [ 1484.991116] R13: ffff8c395f260480 R14: ffff8c395f260480 R15: ffff8c395f260440
>> [ 1484.992835] FS:  0000000000000000(0000) GS:ffff8c395f380000(0000) knlGS:0000000000000000
>> [ 1484.994683] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1484.995985] CR2: 0000000000000000 CR3: 000000001a62c000 CR4: 00000000000006f0
>> [ 1484.997789] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 1484.999376] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 1485.001177] Call Trace:
>> [ 1485.002388]  <TASK>
>> [ 1485.002923]  ? __warn+0x78/0x120
>> [ 1485.003641]  ? enqueue_hrtimer+0x6f/0x80
>> [ 1485.004641]  ? report_bug+0xf1/0x1d0
>> [ 1485.005537]  ? handle_bug+0x40/0x70
>> [ 1485.006318]  ? exc_invalid_op+0x13/0x70
>> [ 1485.007167]  ? asm_exc_invalid_op+0x16/0x20
>> [ 1485.008117]  ? enqueue_hrtimer+0x6f/0x80
>> [ 1485.009131]  hrtimer_start_range_ns+0x258/0x2f0
>> [ 1485.010153]  ? __pfx_rcu_torture_reader+0x10/0x10
>> [ 1485.011185]  schedule_hrtimeout_range_clock+0x96/0x120
>> [ 1485.012394]  ? __pfx_hrtimer_wakeup+0x10/0x10
>> [ 1485.013502]  stutter_wait+0x7f/0x90
>> [ 1485.014319]  rcu_torture_reader+0x10e/0x280
>> [ 1485.015240]  ? __pfx_rcu_torture_timer+0x10/0x10
>> [ 1485.016317]  ? kthread+0xc6/0xf0
>> [ 1485.017169]  ? __pfx_rcu_torture_reader+0x10/0x10
>> [ 1485.018215]  kthread+0xc6/0xf0
>> [ 1485.018899]  ? __pfx_kthread+0x10/0x10
>> [ 1485.019703]  ret_from_fork+0x2b/0x40
>> [ 1485.020546]  ? __pfx_kthread+0x10/0x10
>> [ 1485.021428]  ret_from_fork_asm+0x1b/0x30
>> [ 1485.022295]  </TASK>
>> 
>> This happens within the following loop:
>> 
>> 	for_each_domain(cpu, sd) {
>> 		for_each_cpu_and(i, sched_domain_span(sd), hk_mask) {
>> 			if (cpu == i)
>> 				continue;
>> 
>> 			if (!idle_cpu(i))
>> 				return i;
>> 		}
>> 	}
>> 
>> An offline CPU is returned from there, which is not supposed to happen
>> as this runs within an RCU read side critical section. But I can't find
>> where those rq->sd pointers are modified when a CPU goes offline. The code
>> path is hard to follow. Adding some scheduler people involved in topology
>> just in case. Investigation continues...
>
> So I just tried a simple test. I applied the following PoC patch, under the
> assumption that right after the exit of takedown_cpu(), the target CPU is
> offline and all CPUs must see the updates performed by the now dead target
> CPU. I also expect that the target CPU should have removed itself from the
> scheduling domains (though I don't know where that happens in the maze):
>

If it's of any help:

during hot-unplug, the scheduler will start poking the sched_domains in
CPUHP_AP_ACTIVE:sched_cpu_deactivate(). We then get to, in the "worst" case
with CONFIG_CPUSETS=y:

sched_cpu_deactivate()
`\
  cpuset_cpu_inactive()
  `\
    cpuset_update_active_cpus() # delegated to a workqueue because cpusets are fun

cpuset_hotplug_workfn()
`\
  rebuild_sched_domains()
  `\
    rebuild_sched_domains_locked()

And there, eventually, we assign a NULL rq->sd to the offline CPU:

partition_sched_domains_locked()
`\
  detach_destroy_domains()
  `\
    for_each_cpu(i, cpu_map)
      cpu_attach_domain(NULL, &def_root_domain, i);
        rcu_assign_pointer(rq->sd, sd);

The steps can be followed with CONFIG_SCHED_DEBUG=y and sched_verbose on
the cmdline:

# echo 0 > /sys/devices/system/cpu/cpu3/online
[   23.173551] smpboot: CPU 3 is now offline

# That's the detach_destroy_domains() loop:
[   23.174125] CPU0 attaching NULL sched-domain.
[   23.174570] CPU1 attaching NULL sched-domain.
[   23.175008] CPU2 attaching NULL sched-domain.
[   23.175410] CPU3 attaching NULL sched-domain.

# That's the build_sched_domains() loop:
[   23.175767] CPU0 attaching sched-domain(s):
[   23.176091]  domain-0: span=0-2 level=MC
[   23.176381]   groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }
[   23.176805] CPU1 attaching sched-domain(s):
[   23.177127]  domain-0: span=0-2 level=MC
[   23.177426]   groups: 1:{ span=1 }, 2:{ span=2 }, 0:{ span=0 }
[   23.177849] CPU2 attaching sched-domain(s):
[   23.178171]  domain-0: span=0-2 level=MC
[   23.178482]   groups: 2:{ span=2 }, 0:{ span=0 }, 1:{ span=1 }
[   23.178902] root domain span: 0-2
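
Note the window this opens: because the rebuild is deferred to the cpuset
workqueue, a reader on CPU0-2 that walks its domains between "CPU 3 is now
offline" and the "attaching NULL sched-domain" lines above still dereferences
the old rq->sd, whose span includes the now-dead CPU3. That is, annotating
the PoC loop (illustration only):

	rcu_read_lock();
	for_each_domain(cpu, sd) {	/* old domains still published */
		for_each_cpu(i, sched_domain_span(sd)) {
			/* i can be the offline CPU3 in this window */
		}
	}
	rcu_read_unlock();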

> diff --git a/include/linux/sched/nohz.h b/include/linux/sched/nohz.h
> index 6d67e9a5af6b..2e0580abc73a 100644
> --- a/include/linux/sched/nohz.h
> +++ b/include/linux/sched/nohz.h
> @@ -9,6 +9,7 @@
>  #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
>  extern void nohz_balance_enter_idle(int cpu);
>  extern int get_nohz_timer_target(void);
> +extern void sched_assert_online_domains(int cpu);
>  #else
>  static inline void nohz_balance_enter_idle(int cpu) { }
>  #endif
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 6c0a92ca6bb5..3424d1d40142 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1024,7 +1024,7 @@ void clear_tasks_mm_cpumask(int cpu)
>  }
>  
>  /* Take this CPU down. */
> -static int take_cpu_down(void *_param)
> +int take_cpu_down(void *_param)
>  {
>  	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
>  	enum cpuhp_state target = max((int)st->target, CPUHP_AP_OFFLINE);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0d18c3969f90..57f483fc1037 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1055,6 +1055,22 @@ void resched_cpu(int cpu)
>  
>  #ifdef CONFIG_SMP
>  #ifdef CONFIG_NO_HZ_COMMON
> +void sched_assert_online_domains(int cpu)
> +{
> +	struct sched_domain *sd;
> +	int i;
> +
> +	rcu_read_lock();
> +	for_each_domain(cpu, sd) {
> +		for_each_cpu(i, sched_domain_span(sd)) {
> +			if (cpu == i)
> +				continue;
> +
> +			WARN_ON_ONCE(cpu_is_offline(i));
> +		}
> +	}
> +	rcu_read_unlock();
> +}
>  /*
>   * In the semi idle case, use the nearest busy CPU for migrating timers
>   * from an idle CPU.  This is good for power-savings.
> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index cedb17ba158a..354806156bd9 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -22,6 +22,7 @@
>  #include <linux/atomic.h>
>  #include <linux/nmi.h>
>  #include <linux/sched/wake_q.h>
> +#include <linux/sched/nohz.h>
>  
>  /*
>   * Structure to determine completion condition and record errors.  May
> @@ -199,6 +200,8 @@ notrace void __weak stop_machine_yield(const struct cpumask *cpumask)
>  	cpu_relax();
>  }
>  
> +extern int take_cpu_down(void *_param);
> +
>  /* This is the cpu_stop function which stops the CPU. */
>  static int multi_cpu_stop(void *data)
>  {
> @@ -255,6 +258,8 @@ static int multi_cpu_stop(void *data)
>  	} while (curstate != MULTI_STOP_EXIT);
>  
>  	local_irq_restore(flags);
> +	if (!is_active && msdata->fn == take_cpu_down)
> +		sched_assert_online_domains(smp_processor_id());
>  	return err;
>  }
>  
> Then with that patch I ran TREE07, just some short iterations:
>
> tools/testing/selftests/rcutorture/bin/kvm.sh --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2
>
> And the warning triggers very quickly. This happens at least since v6.3 but maybe
> earlier. Is this expected behaviour or am I right to assume that
> for_each_domain()/sched_domain_span() shouldn't return an offline CPU?
>

I would very much assume an offline CPU shouldn't show up in a
sched_domain_span().

Now, on top of the above, there's one more thing worth noting:
  cpu_up_down_serialize_trainwrecks()

This just flushes the cpuset work, so after that the sched_domain topology
should be sane. However I see it's invoked at the tail end of _cpu_down(),
IOW /after/ takedown_cpu() has run, which sounds too late. The comments
around this vs. lock ordering aren't very reassuring however, so I need to
look into this more.

Maybe as a "quick" test to see if this is the right culprit, you could try
that with CONFIG_CPUSETS=n? Because in that case the sched_domain update is
run within sched_cpu_deactivate().
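
(For reference, with CONFIG_CPUSETS=n the stub in include/linux/cpuset.h
rebuilds the default domains directly, roughly:

	static inline void cpuset_update_active_cpus(void)
	{
		partition_sched_domains(1, NULL, NULL);
	}

so no deferred work item is involved in that configuration.)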

> Thanks.
Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)
Posted by Thomas Gleixner 1 year, 10 months ago
On Tue, Mar 26 2024 at 17:46, Valentin Schneider wrote:
> On 22/03/24 14:22, Frederic Weisbecker wrote:
>> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
> Now, on top of the above, there's one more thing worth noting:
>   cpu_up_down_serialize_trainwrecks()
>
> This just flushes the cpuset work, so after that the sched_domain topology
> should be sane. However I see it's invoked at the tail end of _cpu_down(),
> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
> around this vs. lock ordering aren't very reassuring however, so I need to
> look into this more.

commit b22afcdf04c96ca58327784e280e10288cfd3303 has more information in
the change log.

TLDR: The problem is that cpusets have the lock order cpuset_mutex ->
cpu_hotplug_lock in the hotplug path for whatever silly reason. So
trying to flush the work from within the cpu hotplug lock write held
region is guaranteed to deadlock.
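
Schematically, the inversion looks like this (an illustrative sketch of the
ordering problem, not actual call sites):

	/* cpuset hotplug work item, with the ordering described above: */
	mutex_lock(&cpuset_mutex);
	cpus_read_lock();	/* blocks while hotplug holds the lock for write */

	/* hotplug write side, if it tried the obvious synchronous flush: */
	cpus_write_lock();
	flush_work(&cpuset_hotplug_work);	/* waits on the work above: deadlock */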

That's why all of this got deferred to a work queue. The flush at the
end of the hotplug code after dropping the hotplug lock is there to
prevent user space from seeing the CPU hotplug uevent before the work is
done. But of course between bringing the CPU offline and the flush the
kernel internal state is inconsistent.

There was an attempt to make the CPU hotplug path synchronous in commit
a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") which got
reverted with commit 2b729fe7f3 for the very wrong reason:

https://lore.kernel.org/all/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@lca.pw/T/#u

 (cpu_hotplug_lock){++++}-{0:0}:
  lock_acquire+0xe4/0x25c
  cpus_read_lock+0x50/0x154
  static_key_slow_inc+0x18/0x30
  mem_cgroup_css_alloc+0x824/0x8b0
  cgroup_apply_control_enable+0x1d8/0x56c
  cgroup_apply_control+0x40/0x344
  cgroup_subtree_control_write+0x664/0x69c
  cgroup_file_write+0x130/0x2e8
  kernfs_fop_write+0x228/0x32c
  __vfs_write+0x84/0x1d8
  vfs_write+0x13c/0x1b4
  ksys_write+0xb0/0x120

Instead of the revert this should have been fixed by doing:

+  cpus_read_lock();
   mutex_lock();
   mem_cgroup_css_alloc();
-  static_key_slow_inc();
+  static_key_slow_inc_cpuslocked();

Sorry that I did not notice this back then because I was too focussed on
fixing that uevent nonsense in a halfway sane way.

After that revert cpuset locking became completely insane.

cpuset_hotplug_cpus_read_trylock() is the most recent but also the most
advanced part of that insanity. Seriously this commit is just a tasteless
and disgusting demonstration of how to paper over a problem which is not
fully understood.

After staring at it some more (including the history which led up to
these insanities) I really think that the CPU hotplug path can be made
truly synchronous and the memory hotplug path should just get the lock
ordering correct.

Can we please fix this for real and get all of this insanity out of the
way including the nonsensical comments in the cpuset code:

 * Call with cpuset_mutex held.  Takes cpus_read_lock().

 ...

       lockdep_assert_cpus_held();
       lockdep_assert_held(&cpuset_mutex);

Oh well...

Thanks,

        tglx
Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)
Posted by Valentin Schneider 1 year, 10 months ago
On 27/03/24 21:42, Thomas Gleixner wrote:
> On Tue, Mar 26 2024 at 17:46, Valentin Schneider wrote:
>> On 22/03/24 14:22, Frederic Weisbecker wrote:
>>> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
>> Now, on top of the above, there's one more thing worth noting:
>>   cpu_up_down_serialize_trainwrecks()
>>
>> This just flushes the cpuset work, so after that the sched_domain topology
>> should be sane. However I see it's invoked at the tail end of _cpu_down(),
>> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
>> around this vs. lock ordering aren't very reassuring however, so I need to
>> look into this more.
>
> commit b22afcdf04c96ca58327784e280e10288cfd3303 has more information in
> the change log.
>
> TLDR: The problem is that cpusets have the lock order cpuset_mutex ->
> cpu_hotplug_lock in the hotplug path for whatever silly reason. So
> trying to flush the work from within the cpu hotplug lock write held
> region is guaranteed to deadlock.
>
> That's why all of this got deferred to a work queue. The flush at the
> end of the hotplug code after dropping the hotplug lock is there to
> prevent user space from seeing the CPU hotplug uevent before the work is
> done. But of course between bringing the CPU offline and the flush the
> kernel internal state is inconsistent.
>

Thanks for the summary!

> There was an attempt to make the CPU hotplug path synchronous in commit
> a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") which got
> reverted with commit 2b729fe7f3 for the very wrong reason:
>
> https://lore.kernel.org/all/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@lca.pw/T/#u
>
>  (cpu_hotplug_lock){++++}-{0:0}:
>   lock_acquire+0xe4/0x25c
>   cpus_read_lock+0x50/0x154
>   static_key_slow_inc+0x18/0x30
>   mem_cgroup_css_alloc+0x824/0x8b0
>   cgroup_apply_control_enable+0x1d8/0x56c
>   cgroup_apply_control+0x40/0x344
>   cgroup_subtree_control_write+0x664/0x69c
>   cgroup_file_write+0x130/0x2e8
>   kernfs_fop_write+0x228/0x32c
>   __vfs_write+0x84/0x1d8
>   vfs_write+0x13c/0x1b4
>   ksys_write+0xb0/0x120
>
> Instead of the revert this should have been fixed by doing:
>
> +  cpus_read_lock();
>    mutex_lock();
>    mem_cgroup_css_alloc();
> -  static_key_slow_inc();
> +  static_key_slow_inc_cpuslocked();
>

So looking at the state of things now, writing to the
cgroup.subtree_control file looks like: 

  cgroup_file_write()
  `\
    cgroup_subtree_control_write()
    `\
      cgroup_kn_lock_live()
      `\
      | cgroup_lock()
      | `\
      |   mutex_lock(&cgroup_mutex);
      |
      cgroup_apply_control()
      `\
        cgroup_apply_control_enable()
        `\
          css_create()
          `\
            ss->css_alloc() [mem_cgroup_css_alloc()]
            `\
              static_branch_inc()        

and same with cgroup_mkdir(). So if we want to fix the ordering that caused
the revert, we'd probably want to go for:

  static inline void cgroup_lock(void)
  {
+       cpus_read_lock();
	mutex_lock(&cgroup_mutex);
  }

  static inline void cgroup_unlock(void)
  {
	mutex_unlock(&cgroup_mutex);
+       cpus_read_unlock();
  }

+ a handful of +_cpuslocked() changes.
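
For instance, the allocation path from the lockdep splat would then be
converted along these lines (sketch only, assuming the splat's
static_key_slow_inc() is the socket key update in mem_cgroup_css_alloc()):

-	static_branch_inc(&memcg_sockets_enabled_key);
+	static_branch_inc_cpuslocked(&memcg_sockets_enabled_key);

since cpu_hotplug_lock would then already be held for read via cgroup_lock().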

As for cpuset, it looks like it's following this lock order:

	cpus_read_lock();
	mutex_lock(&cpuset_mutex);

Which AFAICT is what we want.

> Sorry that I did not notice this back then because I was too focussed on
> fixing that uevent nonsense in a halfway sane way.
>
> After that revert cpuset locking became completely insane.
>
> cpuset_hotplug_cpus_read_trylock() is the most recent but also the most
> advanced part of that insanity. Seriously this commit is just a tasteless
> and disgusting demonstration of how to paper over a problem which is not
> fully understood.
>
> After staring at it some more (including the history which led up to
> these insanities) I really think that the CPU hotplug path can be made
> truly synchronous and the memory hotplug path should just get the lock
> ordering correct.
>
> Can we please fix this for real and get all of this insanity out of the
> way

Yes please!

> including the nonsensical comments in the cpuset code:
>
>  * Call with cpuset_mutex held.  Takes cpus_read_lock().
>
>  ...
>
>        lockdep_assert_cpus_held();
>        lockdep_assert_held(&cpuset_mutex);
>
> Oh well...
>
> Thanks,
>
>         tglx
Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)
Posted by Tejun Heo 1 year, 10 months ago
(cc'ing Waiman and quoting whole body)

Hello,

On Thu, Mar 28, 2024 at 09:39:56PM +0100, Valentin Schneider wrote:
> On 27/03/24 21:42, Thomas Gleixner wrote:
> > On Tue, Mar 26 2024 at 17:46, Valentin Schneider wrote:
> >> On 22/03/24 14:22, Frederic Weisbecker wrote:
> >>> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
> >> Now, on top of the above, there's one more thing worth noting:
> >>   cpu_up_down_serialize_trainwrecks()
> >>
> >> This just flushes the cpuset work, so after that the sched_domain topology
> >> should be sane. However I see it's invoked at the tail end of _cpu_down(),
> >> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
> >> around this vs. lock ordering aren't very reassuring however, so I need to
> >> look into this more.
> >
> > commit b22afcdf04c96ca58327784e280e10288cfd3303 has more information in
> > the change log.
> >
> > TLDR: The problem is that cpusets have the lock order cpuset_mutex ->
> > cpu_hotplug_lock in the hotplug path for whatever silly reason. So
> > trying to flush the work from within the cpu hotplug lock write held
> > region is guaranteed to deadlock.
> >
> > That's why all of this got deferred to a work queue. The flush at the
> > end of the hotplug code after dropping the hotplug lock is there to
> > prevent user space from seeing the CPU hotplug uevent before the work is
> > done. But of course between bringing the CPU offline and the flush the
> > kernel internal state is inconsistent.
> >
> 
> Thanks for the summary!
> 
> > There was an attempt to make the CPU hotplug path synchronous in commit
> > a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") which got
> > reverted with commit 2b729fe7f3 for the very wrong reason:
> >
> > https://lore.kernel.org/all/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@lca.pw/T/#u
> >
> >  (cpu_hotplug_lock){++++}-{0:0}:
> >   lock_acquire+0xe4/0x25c
> >   cpus_read_lock+0x50/0x154
> >   static_key_slow_inc+0x18/0x30
> >   mem_cgroup_css_alloc+0x824/0x8b0
> >   cgroup_apply_control_enable+0x1d8/0x56c
> >   cgroup_apply_control+0x40/0x344
> >   cgroup_subtree_control_write+0x664/0x69c
> >   cgroup_file_write+0x130/0x2e8
> >   kernfs_fop_write+0x228/0x32c
> >   __vfs_write+0x84/0x1d8
> >   vfs_write+0x13c/0x1b4
> >   ksys_write+0xb0/0x120
> >
> > Instead of the revert this should have been fixed by doing:
> >
> > +  cpus_read_lock();
> >    mutex_lock();
> >    mem_cgroup_css_alloc();
> > -  static_key_slow_inc();
> > +  static_key_slow_inc_cpuslocked();
> >
> 
> So looking at the state of things now, writing to the
> cgroup.subtree_control file looks like: 
> 
>   cgroup_file_write()
>   `\
>     cgroup_subtree_control_write()
>     `\
>       cgroup_kn_lock_live()
>       `\
>       | cgroup_lock()
>       | `\
>       |   mutex_lock(&cgroup_mutex);
>       |
>       cgroup_apply_control()
>       `\
>         cgroup_apply_control_enable()
>         `\
>           css_create()
>           `\
>             ss->css_alloc() [mem_cgroup_css_alloc()]
>             `\
>               static_branch_inc()        
> 
> and same with cgroup_mkdir(). So if we want to fix the ordering that caused
> the revert, we'd probably want to go for:
> 
>   static inline void cgroup_lock(void)
>   {
> +       cpus_read_lock();
> 	mutex_lock(&cgroup_mutex);
>   }
> 
>   static inline void cgroup_unlock(void)
>   {
> 	mutex_unlock(&cgroup_mutex);
> +       cpus_read_unlock();
>   }
> 
> + a handful of +_cpuslocked() changes.
> 
> As for cpuset, it looks like it's following this lock order:
> 
> 	cpus_read_lock();
> 	mutex_lock(&cpuset_mutex);
> 
> Which AFAICT is what we want.
> 
> > Sorry that I did not notice this back then because I was too focussed on
> > fixing that uevent nonsense in a halfway sane way.
> >
> > After that revert cpuset locking became completely insane.
> >
> > cpuset_hotplug_cpus_read_trylock() is the most recent but also the most
> > advanced part of that insanity. Seriously this commit is just a tasteless
> > and disgusting demonstration of how to paper over a problem which is not
> > fully understood.
> >
> > After staring at it some more (including the history which led up to
> > these insanities) I really think that the CPU hotplug path can be made
> > truly synchronous and the memory hotplug path should just get the lock
> > ordering correct.
> >
> > Can we please fix this for real and get all of this insanity out of the
> > way
> 
> Yes please!

Yeah, making that operation synchronous would be nice. Waiman, you're a lot
more familiar with this part than I am. Can you please take a look and see
whether we can make the sched domain updates synchronous?

Thanks.

-- 
tejun
Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)
Posted by Waiman Long 1 year, 10 months ago
On 3/28/24 22:08, Tejun Heo wrote:
> (cc'ing Waiman and quoting whole body)
>
> Hello,
>
> On Thu, Mar 28, 2024 at 09:39:56PM +0100, Valentin Schneider wrote:
>> On 27/03/24 21:42, Thomas Gleixner wrote:
>>> On Tue, Mar 26 2024 at 17:46, Valentin Schneider wrote:
>>>> On 22/03/24 14:22, Frederic Weisbecker wrote:
>>>>> On Fri, Mar 22, 2024 at 12:32:26PM +0100, Frederic Weisbecker wrote:
>>>> Now, on top of the above, there's one more thing worth noting:
>>>>    cpu_up_down_serialize_trainwrecks()
>>>>
>>>> This just flushes the cpuset work, so after that the sched_domain topology
>>>> should be sane. However I see it's invoked at the tail end of _cpu_down(),
>>>> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
>>>> around this vs. lock ordering aren't very reassuring however, so I need to
>>>> look into this more.
>>> commit b22afcdf04c96ca58327784e280e10288cfd3303 has more information in
>>> the change log.
>>>
>>> TLDR: The problem is that cpusets have the lock order cpuset_mutex ->
>>> cpu_hotplug_lock in the hotplug path for whatever silly reason. So
>>> trying to flush the work from within the cpu hotplug lock write held
>>> region is guaranteed to deadlock.
>>>
>>> That's why all of this got deferred to a work queue. The flush at the
>>> end of the hotplug code after dropping the hotplug lock is there to
>>> prevent user space from seeing the CPU hotplug uevent before the work is
>>> done. But of course between bringing the CPU offline and the flush the
>>> kernel internal state is inconsistent.
>>>
>> Thanks for the summary!
>>
>>> There was an attempt to make the CPU hotplug path synchronous in commit
>>> a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") which got
>>> reverted with commit 2b729fe7f3 for the very wrong reason:
>>>
>>> https://lore.kernel.org/all/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@lca.pw/T/#u
>>>
>>>   (cpu_hotplug_lock){++++}-{0:0}:
>>>    lock_acquire+0xe4/0x25c
>>>    cpus_read_lock+0x50/0x154
>>>    static_key_slow_inc+0x18/0x30
>>>    mem_cgroup_css_alloc+0x824/0x8b0
>>>    cgroup_apply_control_enable+0x1d8/0x56c
>>>    cgroup_apply_control+0x40/0x344
>>>    cgroup_subtree_control_write+0x664/0x69c
>>>    cgroup_file_write+0x130/0x2e8
>>>    kernfs_fop_write+0x228/0x32c
>>>    __vfs_write+0x84/0x1d8
>>>    vfs_write+0x13c/0x1b4
>>>    ksys_write+0xb0/0x120
>>>
>>> Instead of the revert this should have been fixed by doing:
>>>
>>> +  cpus_read_lock();
>>>     mutex_lock();
>>>     mem_cgroup_css_alloc();
>>> -  static_key_slow_inc();
>>> +  static_key_slow_inc_cpuslocked();
>>>
>> So looking at the state of things now, writing to the
>> cgroup.subtree_control file looks like:
>>
>>    cgroup_file_write()
>>    `\
>>      cgroup_subtree_control_write()
>>      `\
>>        cgroup_kn_lock_live()
>>        `\
>>        | cgroup_lock()
>>        | `\
>>        |   mutex_lock(&cgroup_mutex);
>>        |
>>        cgroup_apply_control()
>>        `\
>>          cgroup_apply_control_enable()
>>          `\
>>            css_create()
>>            `\
>>              ss->css_alloc() [mem_cgroup_css_alloc()]
>>              `\
>>                static_branch_inc()
>>
>> and same with cgroup_mkdir(). So if we want to fix the ordering that caused
>> the revert, we'd probably want to go for:
>>
>>    static inline void cgroup_lock(void)
>>    {
>> +       cpus_read_lock();
>> 	mutex_lock(&cgroup_mutex);
>>    }
>>
>>    static inline void cgroup_unlock(void)
>>    {
>> 	mutex_unlock(&cgroup_mutex);
>> +       cpus_read_unlock();
>>    }
>>
>> + a handful of +_cpuslocked() changes.
>>
>> As for cpuset, it looks like it's following this lock order:
>>
>> 	cpus_read_lock();
>> 	mutex_lock(&cpuset_mutex);
>>
>> Which AFAICT is what we want.
>>
>>> Sorry that I did not notice this back then because I was too focussed on
>>> fixing that uevent nonsense in a halfway sane way.
>>>
>>> After that revert cpuset locking became completely insane.
>>>
>>> cpuset_hotplug_cpus_read_trylock() is the most recent but also the most
>>> advanced part of that insanity. Seriously this commit is just a tasteless
>>> and disgusting demonstration of how to paper over a problem which is not
>>> fully understood.
>>>
>>> After staring at it some more (including the history which led up to
>>> these insanities) I really think that the CPU hotplug path can be made
>>> truly synchronous and the memory hotplug path should just get the lock
>>> ordering correct.
>>>
>>> Can we please fix this for real and get all of this insanity out of the
>>> way
>> Yes please!
> Yeah, making that operation synchronous would be nice. Waiman, you're a lot
> more familiar with this part than I am. Can you please take a look and see
> whether we can make the sched domain updates synchronous?

Sure. I will take a look at how to make the cpu hotplug sched domain update
synchronous.

Cheers,
Longman
Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)
Posted by Frederic Weisbecker 1 year, 10 months ago
On Tue, Mar 26, 2024 at 05:46:07PM +0100, Valentin Schneider wrote:
> > Then with that patch I ran TREE07, just some short iterations:
> >
> > tools/testing/selftests/rcutorture/bin/kvm.sh --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2
> >
> > And the warning triggers very quickly. This happens at least since v6.3 but maybe
> > earlier. Is this expected behaviour or am I right to assume that
> > for_each_domain()/sched_domain_span() shouldn't return an offline CPU?
> >
> 
> I would very much assume an offline CPU shouldn't show up in a
> sched_domain_span().
> 
> Now, on top of the above, there's one more thing worth noting:
>   cpu_up_down_serialize_trainwrecks()
> 
> This just flushes the cpuset work, so after that the sched_domain topology
> should be sane. However I see it's invoked at the tail end of _cpu_down(),
> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
> around this vs. lock ordering aren't very reassuring however, so I need to
> look into this more.

Ouch...

> 
> Maybe as a "quick" test to see if this is the right culprit, you could try
> that with CONFIG_CPUSETS=n? Because in that case the sched_domain update is
> run within sched_cpu_deactivate().

I just tried and I fear that doesn't help. It still triggers even without
cpusets :-s

Thanks.
Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)
Posted by Valentin Schneider 1 year, 10 months ago
On 27/03/24 13:42, Frederic Weisbecker wrote:
> On Tue, Mar 26, 2024 at 05:46:07PM +0100, Valentin Schneider wrote:
>> > Then with that patch I ran TREE07, just some short iterations:
>> >
>> > tools/testing/selftests/rcutorture/bin/kvm.sh --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2
>> >
>> > And the warning triggers very quickly. This happens at least since v6.3 but maybe
>> > earlier. Is this expected behaviour or am I right to assume that
>> > for_each_domain()/sched_domain_span() shouldn't return an offline CPU?
>> >
>> 
>> I would very much assume an offline CPU shouldn't show up in a
>> sched_domain_span().
>> 
>> Now, on top of the above, there's one more thing worth noting:
>>   cpu_up_down_serialize_trainwrecks()
>> 
>> This just flushes the cpuset work, so after that the sched_domain topology
>> should be sane. However I see it's invoked at the tail end of _cpu_down(),
>> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
>> around this vs. lock ordering aren't very reassuring however, so I need to
>> look into this more.
>
> Ouch...
>
>> 
>> Maybe as a "quick" test to see if this is the right culprit, you could try
>> that with CONFIG_CPUSETS=n? Because in that case the sched_domain update is
>> run within sched_cpu_deactivate().
>
> I just tried and I fear that doesn't help. It still triggers even without
> cpusets :-s
>

What, you mean I can't always blame cgroups? What has the world come to?

That's interesting, it means the deferred work item isn't the (only)
issue. I'll grab your test patch and try to reproduce on TREE07.

> Thanks.
Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)
Posted by Valentin Schneider 1 year, 10 months ago
On 27/03/24 15:28, Valentin Schneider wrote:
> On 27/03/24 13:42, Frederic Weisbecker wrote:
>> On Tue, Mar 26, 2024 at 05:46:07PM +0100, Valentin Schneider wrote:
>>> > Then with that patch I ran TREE07, just some short iterations:
>>> >
>>> > tools/testing/selftests/rcutorture/bin/kvm.sh --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2
>>> >
>>> > And the warning triggers very quickly. This happens at least since v6.3 but maybe
>>> > earlier. Is this expected behaviour or am I right to assume that
>>> > for_each_domain()/sched_domain_span() shouldn't return an offline CPU?
>>> >
>>> 
>>> I would very much assume an offline CPU shouldn't show up in a
>>> sched_domain_span().
>>> 
>>> Now, on top of the above, there's one more thing worth noting:
>>>   cpu_up_down_serialize_trainwrecks()
>>> 
>>> This just flushes the cpuset work, so after that the sched_domain topology
>>> should be sane. However I see it's invoked at the tail end of _cpu_down(),
>>> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
>>> around this vs. lock ordering aren't very reassuring however, so I need to
>>> look into this more.
>>
>> Ouch...
>>
>>> 
>>> Maybe as a "quick" test to see if this is the right culprit, you could try
>>> that with CONFIG_CPUSETS=n? Because in that case the sched_domain update is
>>> run within sched_cpu_deactivate().
>>
>> I just tried and I fear that doesn't help. It still triggers even without
>> cpusets :-s
>>
>
> What, you mean I can't always blame cgroups? What has the world come to?
>
> That's interesting, it means the deferred work item isn't the (only)
> issue. I'll grab your test patch and try to reproduce on TREE07.
>

Unfortunately I haven't been able to trigger your warning with ~20 runs of
TREE07 & CONFIG_CPUSETS=n, however it does trigger reliably with
CONFIG_CPUSETS=y, so I'm back to thinking the cpuset work is a likely
culprit...

>> Thanks.
Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)
Posted by Frederic Weisbecker 1 year, 10 months ago
On Thu, Mar 28, 2024 at 03:08:08PM +0100, Valentin Schneider wrote:
> On 27/03/24 15:28, Valentin Schneider wrote:
> > On 27/03/24 13:42, Frederic Weisbecker wrote:
> >> On Tue, Mar 26, 2024 at 05:46:07PM +0100, Valentin Schneider wrote:
> >>> > Then with that patch I ran TREE07, just some short iterations:
> >>> >
> >>> > tools/testing/selftests/rcutorture/bin/kvm.sh --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2
> >>> >
> >>> > And the warning triggers very quickly. This happens at least since v6.3 but maybe
> >>> > earlier. Is this expected behaviour or am I right to assume that
> >>> > for_each_domain()/sched_domain_span() shouldn't return an offline CPU?
> >>> >
> >>> 
> >>> I would very much assume an offline CPU shouldn't show up in a
> >>> sched_domain_span().
> >>> 
> >>> Now, on top of the above, there's one more thing worth noting:
> >>>   cpu_up_down_serialize_trainwrecks()
> >>> 
> >>> This just flushes the cpuset work, so after that the sched_domain topology
> >>> should be sane. However I see it's invoked at the tail end of _cpu_down(),
> >>> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
> >>> around this vs. lock ordering aren't very reassuring however, so I need to
> >>> look into this more.
> >>
> >> Ouch...
> >>
> >>> 
> >>> Maybe as a "quick" test to see if this is the right culprit, you could try
> >>> that with CONFIG_CPUSETS=n? Because in that case the sched_domain update is
> >>> run within sched_cpu_deactivate().
> >>
> >> I just tried and I fear that doesn't help. It still triggers even without
> >> cpusets :-s
> >>
> >
> > What, you mean I can't always blame cgroups? What has the world come to?
> >
> > That's interesting, it means the deferred work item isn't the (only)
> > issue. I'll grab your test patch and try to reproduce on TREE07.
> >
> 
> Unfortunately I haven't been able to trigger your warning with ~20 runs of
> TREE07 & CONFIG_CPUSETS=n, however it does trigger reliably with
> CONFIG_CPUSETS=y, so I'm back to thinking the cpuset work is a likely
> culprit...

Funny, I just checked again and I can still reliably reproduce with:

./tools/testing/selftests/rcutorture/bin/kvm.sh --kconfig "CONFIG_CPUSETS=n CONFIG_PROC_PID_CPUSET=n" --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2

I'm thinking there might be several culprits... ;-)
Re: for_each_domain()/sched_domain_span() has offline CPUs (was Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full)
Posted by Valentin Schneider 1 year, 10 months ago
On 28/03/24 17:58, Frederic Weisbecker wrote:
> On Thu, Mar 28, 2024 at 03:08:08PM +0100, Valentin Schneider wrote:
>> On 27/03/24 15:28, Valentin Schneider wrote:
>> > On 27/03/24 13:42, Frederic Weisbecker wrote:
>> >> On Tue, Mar 26, 2024 at 05:46:07PM +0100, Valentin Schneider wrote:
>> >>> > Then with that patch I ran TREE07, just some short iterations:
>> >>> >
>> >>> > tools/testing/selftests/rcutorture/bin/kvm.sh --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2
>> >>> >
>> >>> > And the warning triggers very quickly. This happens at least since v6.3 but maybe
>> >>> > earlier. Is this expected behaviour or am I right to assume that
>> >>> > for_each_domain()/sched_domain_span() shouldn't return an offline CPU?
>> >>> >
>> >>> 
>> >>> I would very much assume an offline CPU shouldn't show up in a
>> >>> sched_domain_span().
>> >>> 
>> >>> Now, on top of the above, there's one more thing worth noting:
>> >>>   cpu_up_down_serialize_trainwrecks()
>> >>> 
>> >>> This just flushes the cpuset work, so after that the sched_domain topology
>> >>> should be sane. However I see it's invoked at the tail end of _cpu_down(),
>> >>> IOW /after/ takedown_cpu() has run, which sounds too late. The comments
>> >>> around this vs. lock ordering aren't very reassuring however, so I need to
>> >>> look into this more.
>> >>
>> >> Ouch...
>> >>
>> >>> 
>> >>> Maybe as a "quick" test to see if this is the right culprit, you could try
>> >>> that with CONFIG_CPUSETS=n? Because in that case the sched_domain update is
>> >>> run within sched_cpu_deactivate().
>> >>
>> >> I just tried and I fear that doesn't help. It still triggers even without
>> >> cpusets :-s
>> >>
>> >
>> > What, you mean I can't always blame cgroups? What has the world come to?
>> >
>> > That's interesting, it means the deferred work item isn't the (only)
>> > issue. I'll grab your test patch and try to reproduce on TREE07.
>> >
>> 
>> Unfortunately I haven't been able to trigger your warning with ~20 runs of
>> TREE07 & CONFIG_CPUSETS=n, however it does trigger reliably with
>> CONFIG_CPUSETS=y, so I'm back to thinking the cpuset work is a likely
>> culprit...
>
> Funny, I just checked again and I can still reliably reproduce with:
>
> ./tools/testing/selftests/rcutorture/bin/kvm.sh --kconfig "CONFIG_CPUSETS=n CONFIG_PROC_PID_CPUSET=n" --configs "10*TREE07" --allcpus --bootargs "rcutorture.onoff_interval=200" --duration 2
>
> I'm thinking there might be several culprits... ;-)

Hmm, frustrating that I can't seem to reproduce this...

Could you run this with CONFIG_SCHED_DEBUG=y and sched_verbose on the
cmdline? And maybe tweak the warning to show which CPU we are scanning the
sched_domain of and which one we found to be offline in the span.
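
Something like this on top of your PoC should do (untested sketch):

-			WARN_ON_ONCE(cpu_is_offline(i));
+			WARN_ONCE(cpu_is_offline(i),
+				  "CPU%d has offline CPU%d in its sched_domain span\n",
+				  cpu, i);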