kernel/smp: Improve smp_call_function_single() CSD-lock diagnostics

[PATCH] kernel/smp: Improve smp_call_function_single() CSD-lock diagnostics

Posted by Paul E. McKenney 2 weeks ago

Both smp_call_function() and smp_call_function_single() use per-CPU
call_single_data_t variable to hold the infamous CSD lock.  However,
while smp_call_function() acquires the destination CPU's CSD lock,
smp_call_function_single() instead uses the source CPU's CSD lock.
(These are two separate sets of CSD locks, cfd_data and csd_data,
respectively.)

This otherwise inexplicable pair of choices is explained by their
respective queueing properties.  If smp_call_function() where to
use the sending CPU's CSD lock, that would serialize the destination
CPUs' IPI handlers and result in long smp_call_function() latencies,
especially on systems with large numbers of CPUs.  For its part, if
smp_call_function_single() were to use the (single) destination CPU's
CSD lock, this would similarly serialize in the case where many CPUs
are sending IPIs to a single "victim" CPU.  Plus it would result in
higher levels of memory contention.

Except that if you don't have NMI-based stack tracing and you are working
with a weakly ordered system where remote unsynchronized stack traces are
especially unreliable, the improved debugging beats the improved queueing.
Keep in mind that this improved queueing only matters if a bunch of
CPUs are calling smp_call_function_single() concurrently for a single
"victim" CPU, which is not the common case.

Therefore, make smp_call_function_single() use the destination CPU's
csd_data instance in kernels built with CONFIG_CSD_LOCK_WAIT_DEBUG=y
where csdlock_debug_enabled is also set.  Otherwise, continue to use
the source CPU's csd_data.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/smp.c b/kernel/smp.c
index f349960f79cad..09521cb38a55c 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -377,6 +377,20 @@ static __always_inline void csd_unlock(call_single_data_t *csd)
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
 
+#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
+static call_single_data_t *get_single_csd_data(int cpu)
+{
+	if (static_branch_unlikely(&csdlock_debug_enabled))
+		return per_cpu_ptr(&csd_data, cpu);
+	return this_cpu_ptr(&csd_data);
+}
+#else
+static call_single_data_t *get_single_csd_data(int cpu)
+{
+	return this_cpu_ptr(&csd_data);
+}
+#endif
+
 void __smp_call_single_queue(int cpu, struct llist_node *node)
 {
 	/*
@@ -670,7 +684,7 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
 
 	csd = &csd_stack;
 	if (!wait) {
-		csd = this_cpu_ptr(&csd_data);
+		csd = get_single_csd_data(cpu);
 		csd_lock(csd);
 	}

Re: [PATCH] kernel/smp: Improve smp_call_function_single() CSD-lock diagnostics

Posted by Ulf Hansson 1 week, 2 days ago

On Fri, 20 Mar 2026 at 11:45, Paul E. McKenney <paulmck@kernel.org> wrote:
>
> Both smp_call_function() and smp_call_function_single() use per-CPU
> call_single_data_t variable to hold the infamous CSD lock.  However,
> while smp_call_function() acquires the destination CPU's CSD lock,
> smp_call_function_single() instead uses the source CPU's CSD lock.
> (These are two separate sets of CSD locks, cfd_data and csd_data,
> respectively.)
>
> This otherwise inexplicable pair of choices is explained by their
> respective queueing properties.  If smp_call_function() where to
> use the sending CPU's CSD lock, that would serialize the destination
> CPUs' IPI handlers and result in long smp_call_function() latencies,
> especially on systems with large numbers of CPUs.  For its part, if
> smp_call_function_single() were to use the (single) destination CPU's
> CSD lock, this would similarly serialize in the case where many CPUs
> are sending IPIs to a single "victim" CPU.  Plus it would result in
> higher levels of memory contention.
>
> Except that if you don't have NMI-based stack tracing and you are working
> with a weakly ordered system where remote unsynchronized stack traces are
> especially unreliable, the improved debugging beats the improved queueing.
> Keep in mind that this improved queueing only matters if a bunch of
> CPUs are calling smp_call_function_single() concurrently for a single
> "victim" CPU, which is not the common case.
>
> Therefore, make smp_call_function_single() use the destination CPU's
> csd_data instance in kernels built with CONFIG_CSD_LOCK_WAIT_DEBUG=y
> where csdlock_debug_enabled is also set.  Otherwise, continue to use
> the source CPU's csd_data.
>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

FWIW, feel free to add:

Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>

Kind regards
Uffe

>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index f349960f79cad..09521cb38a55c 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -377,6 +377,20 @@ static __always_inline void csd_unlock(call_single_data_t *csd)
>
>  static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
>
> +#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
> +static call_single_data_t *get_single_csd_data(int cpu)
> +{
> +       if (static_branch_unlikely(&csdlock_debug_enabled))
> +               return per_cpu_ptr(&csd_data, cpu);
> +       return this_cpu_ptr(&csd_data);
> +}
> +#else
> +static call_single_data_t *get_single_csd_data(int cpu)
> +{
> +       return this_cpu_ptr(&csd_data);
> +}
> +#endif
> +
>  void __smp_call_single_queue(int cpu, struct llist_node *node)
>  {
>         /*
> @@ -670,7 +684,7 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
>
>         csd = &csd_stack;
>         if (!wait) {
> -               csd = this_cpu_ptr(&csd_data);
> +               csd = get_single_csd_data(cpu);
>                 csd_lock(csd);
>         }
>

Re: [PATCH] kernel/smp: Improve smp_call_function_single() CSD-lock diagnostics

Posted by Paul E. McKenney 1 week, 2 days ago

On Wed, Mar 25, 2026 at 12:32:16PM +0100, Ulf Hansson wrote:
> On Fri, 20 Mar 2026 at 11:45, Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > Both smp_call_function() and smp_call_function_single() use per-CPU
> > call_single_data_t variable to hold the infamous CSD lock.  However,
> > while smp_call_function() acquires the destination CPU's CSD lock,
> > smp_call_function_single() instead uses the source CPU's CSD lock.
> > (These are two separate sets of CSD locks, cfd_data and csd_data,
> > respectively.)
> >
> > This otherwise inexplicable pair of choices is explained by their
> > respective queueing properties.  If smp_call_function() where to
> > use the sending CPU's CSD lock, that would serialize the destination
> > CPUs' IPI handlers and result in long smp_call_function() latencies,
> > especially on systems with large numbers of CPUs.  For its part, if
> > smp_call_function_single() were to use the (single) destination CPU's
> > CSD lock, this would similarly serialize in the case where many CPUs
> > are sending IPIs to a single "victim" CPU.  Plus it would result in
> > higher levels of memory contention.
> >
> > Except that if you don't have NMI-based stack tracing and you are working
> > with a weakly ordered system where remote unsynchronized stack traces are
> > especially unreliable, the improved debugging beats the improved queueing.
> > Keep in mind that this improved queueing only matters if a bunch of
> > CPUs are calling smp_call_function_single() concurrently for a single
> > "victim" CPU, which is not the common case.
> >
> > Therefore, make smp_call_function_single() use the destination CPU's
> > csd_data instance in kernels built with CONFIG_CSD_LOCK_WAIT_DEBUG=y
> > where csdlock_debug_enabled is also set.  Otherwise, continue to use
> > the source CPU's csd_data.
> >
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> 
> FWIW, feel free to add:
> 
> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>

Thank you!  I will apply this on my next rebase.

							Thanx, Paul

> Kind regards
> Uffe
> 
> >
> > diff --git a/kernel/smp.c b/kernel/smp.c
> > index f349960f79cad..09521cb38a55c 100644
> > --- a/kernel/smp.c
> > +++ b/kernel/smp.c
> > @@ -377,6 +377,20 @@ static __always_inline void csd_unlock(call_single_data_t *csd)
> >
> >  static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
> >
> > +#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
> > +static call_single_data_t *get_single_csd_data(int cpu)
> > +{
> > +       if (static_branch_unlikely(&csdlock_debug_enabled))
> > +               return per_cpu_ptr(&csd_data, cpu);
> > +       return this_cpu_ptr(&csd_data);
> > +}
> > +#else
> > +static call_single_data_t *get_single_csd_data(int cpu)
> > +{
> > +       return this_cpu_ptr(&csd_data);
> > +}
> > +#endif
> > +
> >  void __smp_call_single_queue(int cpu, struct llist_node *node)
> >  {
> >         /*
> > @@ -670,7 +684,7 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
> >
> >         csd = &csd_stack;
> >         if (!wait) {
> > -               csd = this_cpu_ptr(&csd_data);
> > +               csd = get_single_csd_data(cpu);
> >                 csd_lock(csd);
> >         }
> >

Re: [PATCH] kernel/smp: Improve smp_call_function_single() CSD-lock diagnostics

Posted by Paul E. McKenney 1 week ago

On Wed, Mar 25, 2026 at 01:59:50PM -0700, Paul E. McKenney wrote:
> On Wed, Mar 25, 2026 at 12:32:16PM +0100, Ulf Hansson wrote:
> > On Fri, 20 Mar 2026 at 11:45, Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > Both smp_call_function() and smp_call_function_single() use per-CPU
> > > call_single_data_t variable to hold the infamous CSD lock.  However,
> > > while smp_call_function() acquires the destination CPU's CSD lock,
> > > smp_call_function_single() instead uses the source CPU's CSD lock.
> > > (These are two separate sets of CSD locks, cfd_data and csd_data,
> > > respectively.)
> > >
> > > This otherwise inexplicable pair of choices is explained by their
> > > respective queueing properties.  If smp_call_function() where to
> > > use the sending CPU's CSD lock, that would serialize the destination
> > > CPUs' IPI handlers and result in long smp_call_function() latencies,
> > > especially on systems with large numbers of CPUs.  For its part, if
> > > smp_call_function_single() were to use the (single) destination CPU's
> > > CSD lock, this would similarly serialize in the case where many CPUs
> > > are sending IPIs to a single "victim" CPU.  Plus it would result in
> > > higher levels of memory contention.
> > >
> > > Except that if you don't have NMI-based stack tracing and you are working
> > > with a weakly ordered system where remote unsynchronized stack traces are
> > > especially unreliable, the improved debugging beats the improved queueing.
> > > Keep in mind that this improved queueing only matters if a bunch of
> > > CPUs are calling smp_call_function_single() concurrently for a single
> > > "victim" CPU, which is not the common case.
> > >
> > > Therefore, make smp_call_function_single() use the destination CPU's
> > > csd_data instance in kernels built with CONFIG_CSD_LOCK_WAIT_DEBUG=y
> > > where csdlock_debug_enabled is also set.  Otherwise, continue to use
> > > the source CPU's csd_data.
> > >
> > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > 
> > FWIW, feel free to add:
> > 
> > Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
> 
> Thank you!  I will apply this on my next rebase.

Except that -tip beat me to it, which is even better!  ;-)

https://lore.kernel.org/all/177452999316.1647592.118093660207781017.tip-bot2@tip-bot2/

							Thanx, Paul

[tip: smp/core] smp: Improve smp_call_function_single() CSD-lock diagnostics

Posted by tip-bot2 for Paul E. McKenney 1 week, 1 day ago

The following commit has been merged into the smp/core branch of tip:

Commit-ID:     b0473dcd4b1d7e2e44947e7ac1820c73a268821a
Gitweb:        https://git.kernel.org/tip/b0473dcd4b1d7e2e44947e7ac1820c73a268821a
Author:        Paul E. McKenney <paulmck@kernel.org>
AuthorDate:    Fri, 20 Mar 2026 03:45:36 -07:00
Committer:     Thomas Gleixner <tglx@kernel.org>
CommitterDate: Wed, 25 Mar 2026 20:11:30 +01:00

smp: Improve smp_call_function_single() CSD-lock diagnostics

Both smp_call_function() and smp_call_function_single() use per-CPU
call_single_data_t variable to hold the infamous CSD lock.  However,
while smp_call_function() acquires the destination CPU's CSD lock,
smp_call_function_single() instead uses the source CPU's CSD lock.
(These are two separate sets of CSD locks, cfd_data and csd_data,
respectively.)

This otherwise inexplicable pair of choices is explained by their
respective queueing properties.  If smp_call_function() where to
use the sending CPU's CSD lock, that would serialize the destination
CPUs' IPI handlers and result in long smp_call_function() latencies,
especially on systems with large numbers of CPUs.  For its part, if
smp_call_function_single() were to use the (single) destination CPU's
CSD lock, this would similarly serialize in the case where many CPUs
are sending IPIs to a single "victim" CPU.  Plus it would result in
higher levels of memory contention.

Except that if there is no NMI-based stack tracing on a weakly ordered
system where remote unsynchronized stack traces are especially unreliable,
the improved debugging beats the improved queueing.  This improved queueing
only matters if a bunch of CPUs are calling smp_call_function_single()
concurrently for a single "victim" CPU, which is not the common case.

Therefore, make smp_call_function_single() use the destination CPU's
csd_data instance in kernels built with CONFIG_CSD_LOCK_WAIT_DEBUG=y
where csdlock_debug_enabled is also set.  Otherwise, continue to use
the source CPU's csd_data.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://patch.msgid.link/25c2eb97-77c8-49a5-80ac-efe78dea272c@paulmck-laptop
---
 kernel/smp.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index bdbf145..0fc4ff2 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -377,6 +377,20 @@ static __always_inline void csd_unlock(call_single_data_t *csd)
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
 
+#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
+static call_single_data_t *get_single_csd_data(int cpu)
+{
+	if (static_branch_unlikely(&csdlock_debug_enabled))
+		return per_cpu_ptr(&csd_data, cpu);
+	return this_cpu_ptr(&csd_data);
+}
+#else
+static call_single_data_t *get_single_csd_data(int cpu)
+{
+	return this_cpu_ptr(&csd_data);
+}
+#endif
+
 void __smp_call_single_queue(int cpu, struct llist_node *node)
 {
 	/*
@@ -671,7 +685,7 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
 
 	csd = &csd_stack;
 	if (!wait) {
-		csd = this_cpu_ptr(&csd_data);
+		csd = get_single_csd_data(cpu);
 		csd_lock(csd);
 	}