[PATCH v3 06/12] smp: Enable preemption early in smp_call_function_many_cond

Posted by Chuyi Zhou 2 weeks, 5 days ago
Now smp_call_function_many_cond() disables preemption mainly for the
following reasons:

- To prevent the remote online CPU from going offline. Specifically, we
want to ensure that no new csds are queued after smpcfd_dying_cpu() has
finished. Therefore, preemption must be disabled until all necessary IPIs
are sent.

- To prevent migration to another CPU, which also implicitly prevents the
current CPU from going offline (since stop_machine requires preempting the
current task to execute offline callbacks).

- To protect the per-cpu cfd_data from concurrent modification by other
smp_call_*() on the current CPU. cfd_data contains cpumasks and per-cpu
csds. Before enqueueing a csd, we block in csd_lock() to ensure the
previous async csd->func() has completed, and then initialize csd->func
and csd->info. After sending the IPI, we spin-wait for the remote CPU to
call csd_unlock(). The csd_lock mechanism therefore already guarantees
csd serialization: if preemption occurs during csd_lock_wait(), other
concurrent smp_call_function_many_cond() calls simply block until the
previous csd->func() completes:

task A                    task B

csd->func = func_a
send ipis

                preempted by B
               --------------->
                        csd_lock(csd); // block until last
                                       // func_a finished

                        csd->func = func_b;
                        csd->info = info;
                            ...
                        send ipis

                switch back to A
                <---------------

csd_lock_wait(csd); // block until remote finish func_*

This patch enables preemption before csd_lock_wait(), making the
potentially long and unpredictable wait in csd_lock_wait() preemptible
and migratable. Note that being migrated to another CPU while still
waiting in csd_lock_wait() may cause a UAF when smpcfd_dead_cpu() runs
as the original CPU goes offline. The previous patch used RCU to
synchronize csd_lock_wait() with smpcfd_dead_cpu() and prevent this UAF.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 kernel/smp.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 32c293d8be0e..18e7e4a8f1b6 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -801,7 +801,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 					smp_cond_func_t cond_func)
 {
 	bool preemptible_wait = !IS_ENABLED(CONFIG_CPUMASK_OFFSTACK);
-	int cpu, last_cpu, this_cpu = smp_processor_id();
+	int cpu, last_cpu, this_cpu;
 	struct call_function_data *cfd;
 	bool wait = scf_flags & SCF_WAIT;
 	cpumask_var_t cpumask_stack;
@@ -809,9 +809,9 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 	int nr_cpus = 0;
 	bool run_remote = false;
 
-	lockdep_assert_preemption_disabled();
-
 	rcu_read_lock();
+	this_cpu = get_cpu();
+
 	cfd = this_cpu_ptr(&cfd_data);
 	cpumask = cfd->cpumask;
 
@@ -898,6 +898,19 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 		local_irq_restore(flags);
 	}
 
+	/*
+	 * We may block in csd_lock_wait() for a significant amount of time,
+	 * especially when interrupts are disabled or with a large number of
+	 * remote CPUs. Try to enable preemption before csd_lock_wait().
+	 *
+	 * Use the cpumask_stack instead of cfd->cpumask to avoid concurrent
+	 * modification from tasks on the same CPU. If preemption occurs during
+	 * csd_lock_wait(), other concurrent smp_call_function_many_cond() calls
+	 * will simply block until the previous csd->func() completes.
+	 */
+	if (preemptible_wait)
+		put_cpu();
+
 	if (run_remote && wait) {
 		for_each_cpu(cpu, cpumask) {
 			call_single_data_t *csd;
@@ -907,9 +920,11 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 		}
 	}
 
-	rcu_read_unlock();
-	if (preemptible_wait)
+	if (!preemptible_wait)
+		put_cpu();
+	else
 		free_cpumask_var(cpumask_stack);
+	rcu_read_unlock();
 }
 
 /**
-- 
2.20.1
Re: [PATCH v3 06/12] smp: Enable preemption early in smp_call_function_many_cond
Posted by Sebastian Andrzej Siewior 2 weeks, 5 days ago
On 2026-03-18 12:56:32 [+0800], Chuyi Zhou wrote:
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -907,9 +920,11 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
>  		}
>  	}

So now I understand why we have this cpumask on stack.
Could we, on a preemptible kernel, where we have a preemption counter,
allocate a cpumask in the preemptible() case and use it here? If the
allocation fails, or we are not on a preemptible kernel, then we don't
do this optimized wait with preemption enabled.

There is no benefit in doing all this if the caller already has
preemption disabled.

> -	rcu_read_unlock();
> -	if (preemptible_wait)
> +	if (!preemptible_wait)
> +		put_cpu();
> +	else
>  		free_cpumask_var(cpumask_stack);
> +	rcu_read_unlock();
>  }
>  
>  /**

Sebastian
Re: [PATCH v3 06/12] smp: Enable preemption early in smp_call_function_many_cond
Posted by Chuyi Zhou 2 weeks, 4 days ago
On 2026/3/19 00:55, Sebastian Andrzej Siewior wrote:
> On 2026-03-18 12:56:32 [+0800], Chuyi Zhou wrote:
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -907,9 +920,11 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
>>   		}
>>   	}
> 
> So now I understand why we have this cpumask on stack.
> Could we, on a preemptible kernel, where we have a preemption counter,
> allocate a cpumask in the preemptible() case and use it here? If the
> allocation fails, or we are not on a preemptible kernel, then we don't
> do this optimized wait with preemption enabled.
> 
> There is no benefit in doing all this if the caller already has
> preemption disabled.
> 

IIUC, we can enable this feature only when
`IS_ENABLED(CONFIG_PREEMPTION) && preemptible()`.

This way, the optimization can also take effect with
CONFIG_CPUMASK_OFFSTACK=y, without breaking the RT principle that
forbids memory allocation inside preemption-disabled critical sections.

Thanks.

>> -	rcu_read_unlock();
>> -	if (preemptible_wait)
>> +	if (!preemptible_wait)
>> +		put_cpu();
>> +	else
>>   		free_cpumask_var(cpumask_stack);
>> +	rcu_read_unlock();
>>   }
>>   
>>   /**
> 
> Sebastian