[PATCH 05/11] smp: Enable preemption early in smp_call_function_many_cond

Posted by Chuyi Zhou 6 days, 12 hours ago
Currently, smp_call_function_many_cond() disables preemption mainly for the
following reasons:

- To prevent the remote online CPU from going offline. Specifically, we
want to ensure that no new csds are queued after smpcfd_dying_cpu() has
finished. Therefore, preemption must be disabled until all necessary IPIs
are sent.

- To prevent migration to another CPU, which also implicitly prevents the
current CPU from going offline (since stop_machine requires preempting the
current task to execute offline callbacks). The same can be achieved with
migrate_disable(), since tasks must be migrated to other CPUs before
takedown_cpu().

- To protect the per-cpu cfd_data from concurrent modification by other
smp_call_*() on the current CPU. cfd_data contains cpumasks and per-cpu
csds. Before enqueueing a csd, we block on the csd_lock() to ensure the
previous async csd->func() has completed, and then initialize csd->func and
csd->info. After sending the IPI, we spin-wait for the remote CPU to call
csd_unlock(). Actually the csd_lock mechanism already guarantees csd
serialization. If preemption occurs during csd_lock_wait(), other concurrent
smp_call_function_many_cond() calls will simply block until the previous
csd->func() completes:

task A                    task B

csd->func = func_a
send ipis

                preempted by B
               --------------->
                        csd_lock(csd); // block until last
                                       // func_a finished

                        csd->func = func_b;
                        csd->info = info;
                            ...
                        send ipis

                switch back to A
                <---------------

csd_lock_wait(csd); // block until remote finishes func_*
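
For reference, a simplified sketch of the csd_lock()/csd_unlock() handshake
(following the helpers in kernel/smp.c, with the CSD_LOCK_WAIT_DEBUG
instrumentation omitted):

	/* Spin until the previous owner of this csd has released it. */
	static __always_inline void csd_lock_wait(call_single_data_t *csd)
	{
		smp_cond_load_acquire(&csd->node.u_flags, !(VAL & CSD_FLAG_LOCK));
	}

	static __always_inline void csd_lock(call_single_data_t *csd)
	{
		csd_lock_wait(csd);
		csd->node.u_flags |= CSD_FLAG_LOCK;

		/*
		 * Prevent the CPU from reordering the above assignment to
		 * ->u_flags with subsequent stores to csd->func / csd->info.
		 */
		smp_wmb();
	}

	/* Called on the remote CPU once csd->func() has run. */
	static __always_inline void csd_unlock(call_single_data_t *csd)
	{
		WARN_ON(!(csd->node.u_flags & CSD_FLAG_LOCK));

		/* Ensure func/info are no longer needed before releasing. */
		smp_store_release(&csd->node.u_flags, 0);
	}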

This patch uses migrate_disable() to protect the scope of
smp_call_function_many_cond() and enables preemption before csd_lock_wait().
This makes the potentially unpredictable csd_lock_wait() preemptible. Using
cpumask_stack avoids concurrent modification of cfd->cpumask, and we fall
back to the default logic if alloc_cpumask_var() fails.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 kernel/smp.c | 37 ++++++++++++++++++++++++++++++++-----
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 35948afced2e..af9cee7d4939 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -802,7 +802,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 					unsigned int scf_flags,
 					smp_cond_func_t cond_func)
 {
-	int cpu, last_cpu, this_cpu = smp_processor_id();
+	int cpu, last_cpu, this_cpu;
 	struct call_function_data *cfd;
 	bool wait = scf_flags & SCF_WAIT;
 	bool preemptible_wait = true;
@@ -811,11 +811,18 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 	int nr_cpus = 0;
 	bool run_remote = false;
 
-	lockdep_assert_preemption_disabled();
-
-	if (!alloc_cpumask_var(&cpumask_stack, GFP_ATOMIC))
+	if (!wait || !alloc_cpumask_var(&cpumask_stack, GFP_ATOMIC))
 		preemptible_wait = false;
 
+	/*
+	 * Prevent the current CPU from going offline.
+	 * Being migrated to another CPU and calling csd_lock_wait() may cause
+	 * UAF due to smpcfd_dead_cpu() during the current CPU offline process.
+	 */
+	migrate_disable();
+
+	this_cpu = get_cpu();
+
 	/*
 	 * Can deadlock when called with interrupts disabled.
 	 * We allow cpu's that are not yet online though, as no one else can
@@ -898,6 +905,22 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 		local_irq_restore(flags);
 	}
 
+	/*
+	 * We may block in csd_lock_wait() for a significant amount of time, especially
+	 * when interrupts are disabled or with a large number of remote CPUs.
+	 * Try to enable preemption before csd_lock_wait().
+	 *
+	 * - If @wait is true, we try to use the cpumask_stack instead of cfd->cpumask to
+	 * avoid concurrent modification from tasks on the same CPU. If alloc_cpumask_var()
+	 * returns false, fall back to the default logic.
+	 *
+	 * - If preemption occurs during csd_lock_wait, other concurrent
+	 * smp_call_function_many_cond() calls will simply block until the previous csd->func()
+	 * completes.
+	 */
+	if (preemptible_wait)
+		put_cpu();
+
 	if (run_remote && wait) {
 		for_each_cpu(cpu, cpumask) {
 			call_single_data_t *csd;
@@ -907,8 +930,12 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 		}
 	}
 
-	if (preemptible_wait)
+	if (!preemptible_wait)
+		put_cpu();
+	else
 		free_cpumask_var(cpumask_stack);
+
+	migrate_enable();
 }
 
 /**
-- 
2.20.1
Re: [PATCH 05/11] smp: Enable preemption early in smp_call_function_many_cond
Posted by Peter Zijlstra 4 days, 14 hours ago
On Tue, Feb 03, 2026 at 07:23:55PM +0800, Chuyi Zhou wrote:

> +	/*
> +	 * Prevent the current CPU from going offline.
> +	 * Being migrated to another CPU and calling csd_lock_wait() may cause
> +	 * UAF due to smpcfd_dead_cpu() during the current CPU offline process.
> +	 */
> +	migrate_disable();

This is horrible crap. migrate_disable() is *NOT* supposed to be used to
serialize cpu hotplug.
Re: [PATCH 05/11] smp: Enable preemption early in smp_call_function_many_cond
Posted by Chuyi Zhou 16 hours ago
On 2026/2/5 17:52, Peter Zijlstra wrote:
> On Tue, Feb 03, 2026 at 07:23:55PM +0800, Chuyi Zhou wrote:
> 
>> +	/*
>> +	 * Prevent the current CPU from going offline.
>> +	 * Being migrated to another CPU and calling csd_lock_wait() may cause
>> +	 * UAF due to smpcfd_dead_cpu() during the current CPU offline process.
>> +	 */
>> +	migrate_disable();
> 
> This is horrible crap. migrate_disable() is *NOT* supposed to be used to
> serialize cpu hotplug.



Here we can use rcu_read_lock() to replace migrate_disable()/cpus_read_lock(),
and in smpcfd_dead_cpu(), wait for all RCU read-side critical sections to
exit before releasing the per-CPU csd data.

This allows csd_lock_wait() to be preemptible and migratable, while 
avoiding concurrency issues between smpcfd_dead_cpu() and csd_lock_wait.
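
Roughly (an untested sketch of the idea, not a posted patch; callers that
queue csds would do so under rcu_read_lock()/rcu_read_unlock(), and the
hotplug callback would wait for those readers before freeing the per-CPU
csd data):

	int smpcfd_dead_cpu(unsigned int cpu)
	{
		struct call_function_data *cfd = &per_cpu(cfd_data, cpu);

		/*
		 * Wait for any smp_call_function_many_cond() reader that may
		 * still reference this CPU's cfd_data before freeing it.
		 */
		synchronize_rcu();

		free_cpumask_var(cfd->cpumask);
		free_cpumask_var(cfd->cpumask_ipi);
		free_percpu(cfd->csd);
		return 0;
	}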

Thanks.
Re: [PATCH 05/11] smp: Enable preemption early in smp_call_function_many_cond
Posted by Peter Zijlstra 4 days, 13 hours ago
On Thu, Feb 05, 2026 at 10:52:36AM +0100, Peter Zijlstra wrote:
> On Tue, Feb 03, 2026 at 07:23:55PM +0800, Chuyi Zhou wrote:
> 
> > +	/*
> > +	 * Prevent the current CPU from going offline.
> > +	 * Being migrated to another CPU and calling csd_lock_wait() may cause
> > +	 * UAF due to smpcfd_dead_cpu() during the current CPU offline process.
> > +	 */
> > +	migrate_disable();
> 
> This is horrible crap. migrate_disable() is *NOT* supposed to be used to
> serialize cpu hotplug.

This was too complicated or something?

--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -802,19 +802,20 @@ static void smp_call_function_many_cond(
 					unsigned int scf_flags,
 					smp_cond_func_t cond_func)
 {
-	int cpu, last_cpu, this_cpu = smp_processor_id();
-	struct call_function_data *cfd;
+	struct call_function_data *cfd = this_cpu_ptr(&cfd_data);
+	struct cpumask *cpumask = cfd->cpumask;
 	bool wait = scf_flags & SCF_WAIT;
-	bool preemptible_wait = true;
 	cpumask_var_t cpumask_stack;
-	struct cpumask *cpumask;
+	int cpu, last_cpu, this_cpu;
 	int nr_cpus = 0;
 	bool run_remote = false;
 
-	lockdep_assert_preemption_disabled();
+	if (wait && alloc_cpumask_var(&cpumask_stack, GFP_ATOMIC))
+		cpumask = cpumask_stack;
 
-	if (!alloc_cpumask_var(&cpumask_stack, GFP_ATOMIC))
-		preemptible_wait = false;
+	cpus_read_lock();
+	preempt_disable();
+	this_cpu = smp_processor_id();
 
 	/*
 	 * Can deadlock when called with interrupts disabled.
@@ -836,10 +837,6 @@ static void smp_call_function_many_cond(
 
 	/* Check if we need remote execution, i.e., any CPU excluding this one. */
 	if (cpumask_any_and_but(mask, cpu_online_mask, this_cpu) < nr_cpu_ids) {
-		cfd = this_cpu_ptr(&cfd_data);
-
-		cpumask = preemptible_wait ? cpumask_stack : cfd->cpumask;
-
 		cpumask_and(cpumask, mask, cpu_online_mask);
 		__cpumask_clear_cpu(this_cpu, cpumask);
 
@@ -897,6 +894,7 @@ static void smp_call_function_many_cond(
 		csd_do_func(func, info, NULL);
 		local_irq_restore(flags);
 	}
+	preempt_enable();
 
 	if (run_remote && wait) {
 		for_each_cpu(cpu, cpumask) {
@@ -907,8 +905,8 @@ static void smp_call_function_many_cond(
 		}
 	}
 
-	if (preemptible_wait)
-		free_cpumask_var(cpumask_stack);
+	cpus_read_unlock();
+	free_cpumask_var(cpumask_stack);
 }
 
 /**
Re: [PATCH 05/11] smp: Enable preemption early in smp_call_function_many_cond
Posted by Chuyi Zhou 4 days, 9 hours ago
Hi Peter,

On 2026/2/5 18:57, Peter Zijlstra wrote:
> On Thu, Feb 05, 2026 at 10:52:36AM +0100, Peter Zijlstra wrote:
>> On Tue, Feb 03, 2026 at 07:23:55PM +0800, Chuyi Zhou wrote:
>>
>>> +	/*
>>> +	 * Prevent the current CPU from going offline.
>>> +	 * Being migrated to another CPU and calling csd_lock_wait() may cause
>>> +	 * UAF due to smpcfd_dead_cpu() during the current CPU offline process.
>>> +	 */
>>> +	migrate_disable();
>>
>> This is horrible crap. migrate_disable() is *NOT* supposed to be used to
>> serialize cpu hotplug.
> 
> This was too complicated or something?
> 

Now most callers of smp_call*() explicitly use preempt_disable(). IIUC, 
if we want to use cpus_read_lock(), we first need to clean up all these 
preempt_disable() calls.

Maybe a stupid question: Why can't migrate_disable prevent CPU removal?

Before takedown_cpu(), all tasks need to be migrated to other CPUs, and 
all kthreads on that CPU must be parked, except the stopper thread and 
the hotplug thread.



> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -802,19 +802,20 @@ static void smp_call_function_many_cond(
>   					unsigned int scf_flags,
>   					smp_cond_func_t cond_func)
>   {
> -	int cpu, last_cpu, this_cpu = smp_processor_id();
> -	struct call_function_data *cfd;
> +	struct call_function_data *cfd = this_cpu_ptr(&cfd_data);
> +	struct cpumask *cpumask = cfd->cpumask;
>   	bool wait = scf_flags & SCF_WAIT;
> -	bool preemptible_wait = true;
>   	cpumask_var_t cpumask_stack;
> -	struct cpumask *cpumask;
> +	int cpu, last_cpu, this_cpu;
>   	int nr_cpus = 0;
>   	bool run_remote = false;
>   
> -	lockdep_assert_preemption_disabled();
> +	if (wait && alloc_cpumask_var(&cpumask_stack, GFP_ATOMIC))
> +		cpumask = cpumask_stack;
>   
> -	if (!alloc_cpumask_var(&cpumask_stack, GFP_ATOMIC))
> -		preemptible_wait = false;
> +	cpus_read_lock();
> +	preempt_disable();
> +	this_cpu = smp_processor_id();
>   
>   	/*
>   	 * Can deadlock when called with interrupts disabled.
> @@ -836,10 +837,6 @@ static void smp_call_function_many_cond(
>   
>   	/* Check if we need remote execution, i.e., any CPU excluding this one. */
>   	if (cpumask_any_and_but(mask, cpu_online_mask, this_cpu) < nr_cpu_ids) {
> -		cfd = this_cpu_ptr(&cfd_data);
> -
> -		cpumask = preemptible_wait ? cpumask_stack : cfd->cpumask;
> -
>   		cpumask_and(cpumask, mask, cpu_online_mask);
>   		__cpumask_clear_cpu(this_cpu, cpumask);
>   
> @@ -897,6 +894,7 @@ static void smp_call_function_many_cond(
>   		csd_do_func(func, info, NULL);
>   		local_irq_restore(flags);
>   	}
> +	preempt_enable();
>   
>   	if (run_remote && wait) {
>   		for_each_cpu(cpu, cpumask) {
> @@ -907,8 +905,8 @@ static void smp_call_function_many_cond(
>   		}
>   	}
>   
> -	if (preemptible_wait)
> -		free_cpumask_var(cpumask_stack);
> +	cpus_read_unlock();
> +	free_cpumask_var(cpumask_stack);
>   }
>   
>   /**
Re: [PATCH 05/11] smp: Enable preemption early in smp_call_function_many_cond
Posted by Peter Zijlstra 4 days, 9 hours ago
On Thu, Feb 05, 2026 at 10:29:51PM +0800, Chuyi Zhou wrote:
> Hi Peter,
> 
> 在 2026/2/5 18:57, Peter Zijlstra 写道:
> > On Thu, Feb 05, 2026 at 10:52:36AM +0100, Peter Zijlstra wrote:
> >> On Tue, Feb 03, 2026 at 07:23:55PM +0800, Chuyi Zhou wrote:
> >>
> >>> +	/*
> >>> +	 * Prevent the current CPU from going offline.
> >>> +	 * Being migrated to another CPU and calling csd_lock_wait() may cause
> >>> +	 * UAF due to smpcfd_dead_cpu() during the current CPU offline process.
> >>> +	 */
> >>> +	migrate_disable();
> >>
> >> This is horrible crap. migrate_disable() is *NOT* supposed to be used to
> >> serialize cpu hotplug.
> > 
> > This was too complicated or something?
> > 
> 
> Now most callers of smp_call*() explicitly use preempt_disable(). IIUC, 
> if we want to use cpus_read_lock(), we first need to clean up all these 
> preempt_disable() calls.
> 
> Maybe a stupid question: Why can't migrate_disable prevent CPU removal?

It can, but migrate_disable() is horrible, it should not be used if at
all possible.
Re: [PATCH 05/11] smp: Enable preemption early in smp_call_function_many_cond
Posted by Chuyi Zhou 3 days, 15 hours ago
Hi Peter,

On 2026/2/5 22:59, Peter Zijlstra wrote:
> On Thu, Feb 05, 2026 at 10:29:51PM +0800, Chuyi Zhou wrote:
>> Hi Peter,
>>
>> 在 2026/2/5 18:57, Peter Zijlstra 写道:
>>> On Thu, Feb 05, 2026 at 10:52:36AM +0100, Peter Zijlstra wrote:
>>>> On Tue, Feb 03, 2026 at 07:23:55PM +0800, Chuyi Zhou wrote:
>>>>
>>>>> +	/*
>>>>> +	 * Prevent the current CPU from going offline.
>>>>> +	 * Being migrated to another CPU and calling csd_lock_wait() may cause
>>>>> +	 * UAF due to smpcfd_dead_cpu() during the current CPU offline process.
>>>>> +	 */
>>>>> +	migrate_disable();
>>>>
>>>> This is horrible crap. migrate_disable() is *NOT* supposed to be used to
>>>> serialize cpu hotplug.
>>>
>>> This was too complicated or something?
>>>
>>
>> Now most callers of smp_call*() explicitly use preempt_disable(). IIUC,
>> if we want to use cpus_read_lock(), we first need to clean up all these
>> preempt_disable() calls.
>>
>> Maybe a stupid question: Why can't migrate_disable prevent CPU removal?
> 
> It can, but migrate_disable() is horrible, it should not be used if at
> all possible.


As you pointed out, using cpus_read_lock() is the simplest approach, and 
indeed, that was the first solution we considered.

However, 99% of callers have preemption disabled, and some of them even
invoke it while holding spinlocks (for example, we might trigger a TLB flush
while holding pte spinlocks).

It's difficult for us to eliminate all these preempt_disable() calls,
especially for callers that disable preemption for other purposes, which
makes using cpus_read_lock() almost impossible.

In our production environment, we observed that the overhead of
csd_lock_wait() can be as high as several milliseconds, and in extreme
cases can even exceed 10ms. Generally speaking, the time spent in
csd_lock_wait() far exceeds the overhead of sending the IPI.

Disabling preemption for the entire duration would obviously affect the 
preemption latency of high-priority tasks, which is unacceptable. This 
optimization primarily targets PREEMPT, although PREEMPT_RT can also 
benefit from it.

Compared to the cost of disabling preemption entirely, using
migrate_disable() here seems to be an acceptable trade-off.
Re: [PATCH 05/11] smp: Enable preemption early in smp_call_function_many_cond
Posted by Peter Zijlstra 3 days, 14 hours ago
On Fri, Feb 06, 2026 at 04:43:48PM +0800, Chuyi Zhou wrote:

> However, 99% of callers have preemption disabled, some of them even 
> invoking it within spin_locks (for example, we might trigger a TLB flush 
> while holding pte spinlocks).

Then 99% of the callers don't benefit from this and won't see your
latency reduction -- and will be broken on PREEMPT_RT, no?
Re: [PATCH 05/11] smp: Enable preemption early in smp_call_function_many_cond
Posted by Chuyi Zhou 3 days, 12 hours ago
Hello,

On 2026/2/6 17:47, Peter Zijlstra wrote:
> On Fri, Feb 06, 2026 at 04:43:48PM +0800, Chuyi Zhou wrote:
> 
>> However, 99% of callers have preemption disabled, some of them even
>> invoking it within spin_locks (for example, we might trigger a TLB flush
>> while holding pte spinlocks).
> 
> Then 99% of the callers don't benefit from this and won't see your
> latency reduction -- and will be broken on PREEMPT_RT, no?

Once we make the preemption logic self-contained within smp_call, the 
disabling of preemption by most callers becomes unnecessary and can 
therefore be removed. We can preserve the few instances where it is 
still meaningful or apply special optimizations to them, much like the 
subsequent patch does for arch_tlbbatch_flush/flush_tlb_mm_range.

Consequently, the vast majority of callers stand to benefit. For the RT 
kernel, this optimization provides additional benefits when smp_call is 
invoked within spinlocks, as it reduces the length of non-preemptible 
critical sections.

We can invoke alloc_cpumask_var() only when CONFIG_CPUMASK_OFFSTACK is
disabled, thereby avoiding breakage on RT.
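
For example (untested sketch, just to illustrate the idea: the
preemptible-wait path is only taken when cpumask_var_t lives on the stack,
so no GFP_ATOMIC allocation is needed and PREEMPT_RT is unaffected):

	cpumask_var_t cpumask_stack;
	bool preemptible_wait = false;

	if (wait && !IS_ENABLED(CONFIG_CPUMASK_OFFSTACK) &&
	    alloc_cpumask_var(&cpumask_stack, GFP_ATOMIC)) /* no-op, always true here */
		preemptible_wait = true;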

However, using cpus_read_lock requires us to absolutely guarantee that 
the current context is sleepable, which is difficult to ensure.