stop_machine: Add stop_housekeeping_cpuslocked()

[RFC PATCH v1] stop_machine: Add stop_housekeeping_cpuslocked()

Posted by Costa Shulyupin 1 year, 1 month ago

CPU hotplug interferes with CPU isolation and introduces latency to
real-time tasks.

The test:

  rtla timerlat hist -c 1 -a 500 &
  echo 0 > /sys/devices/system/cpu/cpu2/online

The RTLA tool reveals the following blocking thread stack trace:

               -> multi_cpu_stop
               -> cpu_stopper_thread
               -> smpboot_thread_fn

This happens because multi_cpu_stop() disables interrupts for EACH online
CPU since takedown_cpu() indirectly invokes take_cpu_down() through
stop_machine_cpuslocked(). I'm omitting the detailed description of the
call chain.

Proposal: Limit the stop operation to housekeeping CPUs.

take_cpu_down() invokes with cpuhp_invoke_callback_range_nofail:
 - tick_cpu_dying()
 - hrtimers_cpu_dying()
 - smpcfd_dying_cpu()
 - x86_pmu_dying_cpu()
 - rcutree_dying_cpu()
 - sched_cpu_dying()
 - cache_ap_offline()

Which synchronizations do these functions require instead of stop_machine?

Passed standard regression tests:
 - https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6042
 - https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-10/-/merge_requests/89

Passes primitive CPU hotplug test without warnings.

What tests and test suites do you recommend?

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
---
 include/linux/stop_machine.h | 12 ++++++++++++
 kernel/cpu.c                 |  2 +-
 kernel/stop_machine.c        | 23 +++++++++++++++++++++++
 3 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index 3132262a404dc..4c9e709f174c8 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -140,6 +140,18 @@ int stop_machine_cpuslocked(cpu_stop_fn_t fn, void *data, const struct cpumask *
  */
 int stop_core_cpuslocked(unsigned int cpu, cpu_stop_fn_t fn, void *data);
 
+/**
+ * stop_housekeeping_cpuslocked: freeze housekeeping CPUs and run specified function.
+ * Unlike stop_machine_cpuslocked, it doesn't stop isolated CPUs.
+ * @fn: the function to run
+ * @data: the data ptr for the @fn()
+ * @cpus: the cpus to run the @fn() on (NULL = any online cpu)
+ *
+ * Must be called from with in a cpus_read_lock() protected
+ * region. Avoids nested calls to cpus_read_lock().
+ */
+int stop_housekeeping_cpuslocked(cpu_stop_fn_t fn, void *data, const struct cpumask *cpus);
+
 int stop_machine_from_inactive_cpu(cpu_stop_fn_t fn, void *data,
 				   const struct cpumask *cpus);
 #else	/* CONFIG_SMP || CONFIG_HOTPLUG_CPU */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 6d7c51e7c366c..052f48bebf816 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1310,7 +1310,7 @@ static int takedown_cpu(unsigned int cpu)
 	/*
 	 * So now all preempt/rcu users must observe !cpu_active().
 	 */
-	err = stop_machine_cpuslocked(take_cpu_down, NULL, cpumask_of(cpu));
+	err = stop_housekeeping_cpuslocked(take_cpu_down, NULL, cpumask_of(cpu));
 	if (err) {
 		/* CPU refused to die */
 		irq_unlock_sparse();
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index da821ce258ea7..1f36e44d42476 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -22,6 +22,7 @@
 #include <linux/atomic.h>
 #include <linux/nmi.h>
 #include <linux/sched/wake_q.h>
+#include <linux/sched/isolation.h>
 
 /*
  * Structure to determine completion condition and record errors.  May
@@ -619,6 +620,28 @@ int stop_machine_cpuslocked(cpu_stop_fn_t fn, void *data,
 	return stop_cpus(cpu_online_mask, multi_cpu_stop, &msdata);
 }
 
+int stop_housekeeping_cpuslocked(cpu_stop_fn_t fn, void *data, const struct cpumask *cpus)
+{
+	struct multi_stop_data msdata = { .fn = fn, .data = data, .active_cpus = cpus };
+	cpumask_var_t stop_mask;
+	int ret;
+
+	lockdep_assert_cpus_held();
+
+	if (!alloc_cpumask_var(&stop_mask, GFP_KERNEL))
+		return -ENOMEM;
+	cpumask_and(stop_mask, cpu_online_mask, housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
+	cpumask_or(stop_mask, stop_mask, cpus);
+	msdata.num_threads = cpumask_weight(stop_mask);
+
+	/* Set the initial state and stop online housekeeping cpus. */
+	set_state(&msdata, MULTI_STOP_PREPARE);
+	ret = stop_cpus(stop_mask, multi_cpu_stop, &msdata);
+	free_cpumask_var(stop_mask);
+
+	return ret;
+}
+
 int stop_machine(cpu_stop_fn_t fn, void *data, const struct cpumask *cpus)
 {
 	int ret;
-- 
2.47.0

Re: [RFC PATCH v1] stop_machine: Add stop_housekeeping_cpuslocked()

Posted by Srivatsa S. Bhat 1 year, 1 month ago

Hi Costa,

On Wed, Dec 18, 2024 at 07:15:31PM +0200, Costa Shulyupin wrote:
> CPU hotplug interferes with CPU isolation and introduces latency to
> real-time tasks.
> 
> The test:
> 
>   rtla timerlat hist -c 1 -a 500 &
>   echo 0 > /sys/devices/system/cpu/cpu2/online
> 
> The RTLA tool reveals the following blocking thread stack trace:
> 
>                -> multi_cpu_stop
>                -> cpu_stopper_thread
>                -> smpboot_thread_fn
> 
> This happens because multi_cpu_stop() disables interrupts for EACH online
> CPU since takedown_cpu() indirectly invokes take_cpu_down() through
> stop_machine_cpuslocked(). I'm omitting the detailed description of the
> call chain.
> 

I had explored removing stop-machine from the CPU hotplug offline path
a very long time ago:

https://lore.kernel.org/all/20130218123714.26245.61816.stgit@srivatsabhat.in.ibm.com/

Towards the tail end of that patchset is the actual change that
replaces the call to __stop_machine() with stop_one_cpus():

https://lore.kernel.org/all/20130218124431.26245.10956.stgit@srivatsabhat.in.ibm.com/

But before that, there were ~45 odd patches in the series to make sure
that all the existing CPU hotplug callbacks (at the time, in that
kernel version) relying on any implicit assumptions related to the
guarantees provided by stop_machine() were adequately addressed with
an alternative scheme before switching over to stop_one_cpu() for CPU
offlining.

> Proposal: Limit the stop operation to housekeeping CPUs.
> 
> take_cpu_down() invokes with cpuhp_invoke_callback_range_nofail:
>  - tick_cpu_dying()
>  - hrtimers_cpu_dying()
>  - smpcfd_dying_cpu()
>  - x86_pmu_dying_cpu()
>  - rcutree_dying_cpu()
>  - sched_cpu_dying()
>  - cache_ap_offline()
> 
> Which synchronizations do these functions require instead of stop_machine?
> 

I'd recommend taking a look at one such prior attempt to remove
stop_machine from CPU hotplug (shared above) for reference, as you
begin your analysis for the current kernel.

Regards,
Srivatsa
Microsoft Linux Systems Group

Re: [RFC PATCH v1] stop_machine: Add stop_housekeeping_cpuslocked()

Posted by Thomas Gleixner 1 year, 1 month ago

Costa!

On Wed, Dec 18 2024 at 19:15, Costa Shulyupin wrote:
> CPU hotplug interferes with CPU isolation and introduces latency to
> real-time tasks.
>
> The test:
>
>   rtla timerlat hist -c 1 -a 500 &
>   echo 0 > /sys/devices/system/cpu/cpu2/online
>
> The RTLA tool reveals the following blocking thread stack trace:
>
>                -> multi_cpu_stop
>                -> cpu_stopper_thread
>                -> smpboot_thread_fn
>
> This happens because multi_cpu_stop() disables interrupts for EACH online
> CPU since takedown_cpu() indirectly invokes take_cpu_down() through
> stop_machine_cpuslocked(). I'm omitting the detailed description of the
> call chain.

TLDR: It's well known that taking a CPU offline is done via
      stop_machine(), which requires all online CPUs to spin in the
      stopper thread.

> Proposal: Limit the stop operation to housekeeping CPUs.

You proposed other non-working solutions before. How is that any
different?

Especially after Peter told you how to proceed:

  https://lore.kernel.org/all/20241209095730.GG21636@noisy.programming.kicks-ass.net

He said after you tinkered with stop_one_cpu():

  "Audit the full cpu hotplug stack and determine who all relies on this
   'implicit' serialization, then proceed to provide alternative
   solutions for these sites. Then, at the very end move to
   stop_one_cpu()."

Where is that audit or the argument why you don't need the audit because
you now restrict stop_machine() to the housekeeping CPUs?

> take_cpu_down() invokes with cpuhp_invoke_callback_range_nofail:
>  - tick_cpu_dying()
>  - hrtimers_cpu_dying()
>  - smpcfd_dying_cpu()
>  - x86_pmu_dying_cpu()
>  - rcutree_dying_cpu()
>  - sched_cpu_dying()
>  - cache_ap_offline()

That's what happens to be called with your kernel config on your test
system. What about the other 50+ CPU hotplug states in that range, which
are .config or architecture dependent?

> Which synchronizations do these functions require instead of
> stop_machine?

So you propose a change without knowing what the consequences are and
expect us to do the audit (aka. your homework) and provide it to you on
a silver tablet?

Actually you are asking the wrong question. Peter told you to determine
which other parts of the kernel rely on that implicit serialization for
a reason:

  The callbacks on the unplugged CPU are not the main problem, but still
  they have to be audited one by one.

  The real problem is the state change caused by these callbacks, which
  is visible to the rest of the system. Which consequences has such a
  state change, if it is not serialized via stop_machine().

  And that's what Peter asked you to audit and provide, no?

> Passed standard regression tests:
>  - https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6042
>  - https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-10/-/merge_requests/89
>
> Passes primitive CPU hotplug test without warnings.

Q: What is the value of this statement?

A: Zero, as the change comes with zero analysis and the random centos
   regression tests based on primitive CPU hotplug tests are as useful
   as running hello_world().

> What tests and test suites do you recommend?

You are not really asking me to explain the usage of git grep and gurgle
search?

Let me give you a hint.

  You now propose to only stop the housekeeping CPUs because (my
  assumption) you assume that all isolated CPUs are happily computing
  along in user space or in a VM, which is equivalent from the kernel
  POV.

Q: How is that assumption justified?

A: Not at all. [1]

   There is no guarantee that isolated CPUs (user/vm) are not inside the
   kernel, are not coming back to the kernel while the unplug is in
   progress or come back to the kernel post unplug and observe stale
   state.

That's not something which can be answered by random test suites. That's
something which has to be audited and argued for correctness, which is
what you have been asked to provide, right?

The test suites are there to validate correctness and to eventually find
the odd corner cases you missed in your analysis and which were missed
by people reviewing it. They cannot replace upfront analysis as their
test coverage is not complete.

Your approach to solve this problem seems to be to throw random
sh^Wideas at the wall and see what sticks. That's not how it works.

You want this "feature", so it's your task to do the homework, i.e. the
analysis you have been asked for, and to argue why your changes are
correct under all circumstances and not only under a set of your made up
assumptions. Even if these assumptions match the reality of your use
case, you still have to provide the argument and the prove that there is
no other use case, which makes your assumptions moot.

For illustration let me go back to [1] above. Assume there are two
housekeeping CPUs and two isolated CPUs.

HKCPU0		HKCPU1		ISOLCPU0	ISOLCPU1

dont'care       don't care      user space      don't care      // context

unplug(ISOLCPU1)                
                                                CPU hotplug thread
                                                invokes all thread
                                                context callbacks

takedown_cpu()
  stop_machine(HKCPUS)

                stopper_thread()                stopper_thread()
  stopper_thread()				  local_irq_disable();
                                                  // All HK CPUs arrived
                                                  invoke_dying_callbacks()

                                 syscall/vmexit()
                                  ....
                                  smp_function_call(cpu_online_mask, wait=true)
                                    ....
                                    wait_for_completion()

                                                  set_cpu_offline(false);

ISOLCPU0 now waits forever because ISOLCPU1 cannot process the IPI as it
has interrupts disabled already and had not yet removed itself in the
CPU online mask wehn ISOLCPU0 sent the IPI.

That's one simple example which makes your assumption completely moot as
you cannot make any unconditionally valid assumptions about user space
or VM behaviour. There are tons of other examples, but that's not the
maintainers/reviewers job to explain them to you.

It's your job to explain to the maintainers/reviewers why these problems
do not exist according to your fact based analysis.

Please come back when you can provide such fact based analysis and
spare us the next random "brilliant" idea thrown at the wall.

Thanks,

        tglx

Re: [RFC PATCH v1] stop_machine: Add stop_housekeeping_cpuslocked()

Posted by Peter Zijlstra 1 year, 1 month ago

On Wed, Dec 18, 2024 at 07:15:31PM +0200, Costa Shulyupin wrote:

> Which synchronizations do these functions require instead of stop_machine?

*sigh*, so you're asking us to do your homework?

But clearly you're not realizing the scope of the thing: stop_machine()
serializes against every preempt_disable() region in the entire kernel.

So you're telling me there isn't a single preempt_disable() region in
the kernel that depends on being before stop_machine() in its entirety?

I know for a fact I've written some in the past 20 years -- what I don't
know if any of them still live and are still relying on it, because I've
also added synchronize_rcu_sched(), which later became synchronize_rcu()
which implies the same, in various parts of the hotplug machinery.

Anyway, without you doing some proper analysis, your patches are going
exactly nowhere.

Re: [RFC PATCH v1] stop_machine: Add stop_housekeeping_cpuslocked()

Posted by Frederic Weisbecker 1 year, 1 month ago

Le Wed, Dec 18, 2024 at 06:27:54PM +0100, Peter Zijlstra a écrit :
> On Wed, Dec 18, 2024 at 07:15:31PM +0200, Costa Shulyupin wrote:
> 
> > Which synchronizations do these functions require instead of stop_machine?
> 
> *sigh*, so you're asking us to do your homework?
> 
> But clearly you're not realizing the scope of the thing: stop_machine()
> serializes against every preempt_disable() region in the entire kernel.
> 
> So you're telling me there isn't a single preempt_disable() region in
> the kernel that depends on being before stop_machine() in its entirety?
> 
> I know for a fact I've written some in the past 20 years -- what I don't
> know if any of them still live and are still relying on it, because I've
> also added synchronize_rcu_sched(), which later became synchronize_rcu()
> which implies the same, in various parts of the hotplug machinery.
> 
> Anyway, without you doing some proper analysis, your patches are going
> exactly nowhere.

And so given the cost of such analysis and resulting possible patches, here
is an important question: is it worth the effort? What is the usecase of
shutting down a CPU while other isolated CPUs run critical isolated stuff?

Thanks.

Re: [RFC PATCH v1] stop_machine: Add stop_housekeeping_cpuslocked()

Posted by Costa Shulyupin 1 year, 1 month ago

On Wed, 18 Dec 2024 at 19:38, Frederic Weisbecker <frederic@kernel.org> wrote:
> And so given the cost of such analysis and resulting possible patches, here
> is an important question: is it worth the effort? What is the usecase of
> shutting down a CPU while other isolated CPUs run critical isolated stuff?

The goal is to implement dynamic CPU isolation for realtime tasks such as DPDK.

Thomas Gleixner in https://lore.kernel.org/lkml/87bjz2210r.ffs@tglx/ suggested:
"CPU hotplug solves this problem without any hackery. Take a CPU offline,
change the mask of that CPU and bring it online again. Repeat until all
CPU changes are done."

Unfortunately, CPU HP interferes with realtime tests and is
unsuitable for dynamic CPU isolation.

Meanwhile, the maximum number of hyperthreads is climbing to 1024,
increasing the demand for CPU HP.

Thank you,
Costa

Re: [RFC PATCH v1] stop_machine: Add stop_housekeeping_cpuslocked()

Posted by Frederic Weisbecker 1 year, 1 month ago

Le Mon, Jan 06, 2025 at 03:08:11PM +0200, Costa Shulyupin a écrit :
> On Wed, 18 Dec 2024 at 19:38, Frederic Weisbecker <frederic@kernel.org> wrote:
> > And so given the cost of such analysis and resulting possible patches, here
> > is an important question: is it worth the effort? What is the usecase of
> > shutting down a CPU while other isolated CPUs run critical isolated stuff?
> 
> The goal is to implement dynamic CPU isolation for realtime tasks such as DPDK.
> 
> Thomas Gleixner in https://lore.kernel.org/lkml/87bjz2210r.ffs@tglx/ suggested:
> "CPU hotplug solves this problem without any hackery. Take a CPU offline,
> change the mask of that CPU and bring it online again. Repeat until all
> CPU changes are done."
> 
> Unfortunately, CPU HP interferes with realtime tests and is
> unsuitable for dynamic CPU isolation.
> 
> Meanwhile, the maximum number of hyperthreads is climbing to 1024,
> increasing the demand for CPU HP.

I must confess I don't understand well your constraints. Why would you
change the set of isolated CPUs while running realtime tests (btw. did you mean
"tasks"?).

Do I understand it correctly that your server may run different kinds of
workloads concurrently, some of them isolated and some of them not, and these
workloads may be added / removed concurrently _anytime_? And therefore
a newly added isolated workload (which then adds CPUs to the isolated set)
mustn't disturb unrelated already running isolated workloads?

When you refer to realtime tasks, do you mean isolated?

Thanks.

> 
> Thank you,
> Costa
>

Re: [RFC PATCH v1] stop_machine: Add stop_housekeeping_cpuslocked()

Posted by Costa Shulyupin 1 year, 1 month ago

On Mon, 6 Jan 2025 at 23:25, Frederic Weisbecker <frederic@kernel.org> wrote:
> I must confess I don't understand well your constraints. Why would you
> change the set of isolated CPUs while running realtime tests (btw. did you mean
> "tasks"?).
>
> Do I understand it correctly that your server may run different kinds of
> workloads concurrently, some of them isolated and some of them not, and these
> workloads may be added / removed concurrently _anytime_? And therefore
> a newly added isolated workload (which then adds CPUs to the isolated set)
> mustn't disturb unrelated already running isolated workloads?
Right.

Our client telco companies deploy 5G RAN low-latency realtime
applications on our platform.

Typical System Stack:

Workload: DPDK runs in containers on isolated CPUs.
Orchestration: OpenShift/Kubernetes runs on housekeeping CPUs.
Operating System: Linux with real-time preemption and statically
configured CPU isolation.
Hardware: Intel CPUs with 64 or more hyperthreads.

Current Limitations:

Currently, CPU isolation can only be configured statically. Changing
CPU isolation requires modifying boot arguments and rebooting the
system. The objective is to enable dynamic CPU isolation configuration
without requiring a reboot.

We conduct various realtime latency tests, including RTLA, to ensure
latency remains below 30 µs.

> When you refer to realtime tasks, do you mean isolated?
Realtime tasks require realtime preemption and CPU isolation.

Thank you,
Costa