Bulk CPU hotplug operations—such as switching SMT modes across all
cores—require hotplugging multiple CPUs in rapid succession. On large
systems, this process takes significant time, increasing as the number
of CPUs grows, leading to substantial delays on high-core-count
machines. Analysis [1] reveals that the majority of this time is spent
waiting for synchronize_rcu().
Expedite synchronize_rcu() during the hotplug path to accelerate the
operation. Since CPU hotplug is a user-initiated administrative task,
it should complete as quickly as possible.
Performance data on a PPC64 system with 400 CPUs:
+ ppc64_cpu --smt=1 (SMT8 to SMT1)
Before: real 1m14.792s
After: real 0m03.205s # ~23x improvement
+ ppc64_cpu --smt=8 (SMT1 to SMT8)
Before: real 2m27.695s
After: real 0m02.510s # ~58x improvement
Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b
[1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com
Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
---
include/linux/rcupdate.h | 3 +++
kernel/cpu.c | 2 ++
2 files changed, 5 insertions(+)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index c5b30054cd01..03c06cfb2b6d 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1192,6 +1192,9 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
extern int rcu_expedited;
extern int rcu_normal;
+extern void rcu_expedite_gp(void);
+extern void rcu_unexpedite_gp(void);
+
DEFINE_LOCK_GUARD_0(rcu,
do {
rcu_read_lock();
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 8df2d773fe3b..6b0d491d73f4 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -506,12 +506,14 @@ EXPORT_SYMBOL_GPL(cpus_read_unlock);
void cpus_write_lock(void)
{
+ rcu_expedite_gp();
percpu_down_write(&cpu_hotplug_lock);
}
void cpus_write_unlock(void)
{
percpu_up_write(&cpu_hotplug_lock);
+ rcu_unexpedite_gp();
}
void lockdep_assert_cpus_held(void)
--
2.52.0
On 12/01/26 3:13 pm, Vishal Chourasia wrote:
> Bulk CPU hotplug operations—such as switching SMT modes across all
> cores—require hotplugging multiple CPUs in rapid succession. On large
> systems, this process takes significant time, increasing as the number
> of CPUs grows, leading to substantial delays on high-core-count
> machines. Analysis [1] reveals that the majority of this time is spent
> waiting for synchronize_rcu().
>
> Expedite synchronize_rcu() during the hotplug path to accelerate the
> operation. Since CPU hotplug is a user-initiated administrative task,
> it should complete as quickly as possible.
>
> Performance data on a PPC64 system with 400 CPUs:
>
> + ppc64_cpu --smt=1 (SMT8 to SMT1)
> Before: real 1m14.792s
> After: real 0m03.205s # ~23x improvement
>
> + ppc64_cpu --smt=8 (SMT1 to SMT8)
> Before: real 2m27.695s
> After: real 0m02.510s # ~58x improvement
>
> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b
>
> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com
>
> Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
> ---
> include/linux/rcupdate.h | 3 +++
> kernel/cpu.c | 2 ++
> 2 files changed, 5 insertions(+)
>
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index c5b30054cd01..03c06cfb2b6d 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -1192,6 +1192,9 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
> extern int rcu_expedited;
> extern int rcu_normal;
>
> +extern void rcu_expedite_gp(void);
> +extern void rcu_unexpedite_gp(void);
> +
> DEFINE_LOCK_GUARD_0(rcu,
> do {
> rcu_read_lock();
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 8df2d773fe3b..6b0d491d73f4 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -506,12 +506,14 @@ EXPORT_SYMBOL_GPL(cpus_read_unlock);
>
> void cpus_write_lock(void)
> {
> + rcu_expedite_gp();
> percpu_down_write(&cpu_hotplug_lock);
> }
>
> void cpus_write_unlock(void)
> {
> percpu_up_write(&cpu_hotplug_lock);
> + rcu_unexpedite_gp();
> }
>
> void lockdep_assert_cpus_held(void)
Hi Vishal,
I verified this patch using the configuration described below.
Configuration:
• Kernel version: 6.19.0-rc5
• Number of CPUs: 2048
Using this setup, I evaluated the patch with both SMT enabled and SMT
disabled. The patch shows a significant improvement in the SMT=off case and
a measurable improvement in the SMT=on case.
The results indicate that when SMT is enabled, the system time is
noticeably higher. In contrast, with SMT disabled, no significant
increase in system time is observed.
SMT=ON -> sys 31m18.849s
SMT=OFF -> sys 0m0.087s
SMT Mode | Without Patch | With Patch | % Improvement |
------------------------------------------------------------------
SMT=off | 30m 53.194s | 6m 4.250s | +80.40% |
SMT=on | 49m 5.920s | 36m 50.386s | +25.01% |
Please add below tag:
Tested-by: Samir M <samir@linux.ibm.com>
Regards,
Samir
On Sun, Jan 18, 2026 at 05:08:44PM +0530, Samir M wrote:
> On 12/01/26 3:13 pm, Vishal Chourasia wrote:
> > Bulk CPU hotplug operations--such as switching SMT modes across all
> > cores--require hotplugging multiple CPUs in rapid succession. On large
> > systems, this process takes significant time, increasing as the number
> > of CPUs grows, leading to substantial delays on high-core-count
> > machines. Analysis [1] reveals that the majority of this time is spent
> > waiting for synchronize_rcu().
> >
> > Expedite synchronize_rcu() during the hotplug path to accelerate the
> > operation. Since CPU hotplug is a user-initiated administrative task,
> > it should complete as quickly as possible.
>
> Hi Vishal,
>
> I verified this patch using the configuration described below.
> Configuration:
> * Kernel version: 6.19.0-rc5
> * Number of CPUs: 2048
>
> SMT Mode | Without Patch | With Patch | % Improvement |
> ------------------------------------------------------------------
> SMT=off | 30m 53.194s | 6m 4.250s | +80.40% |
> SMT=on | 49m 5.920s | 36m 50.386s | +25.01% |
Hi Vishal, Samir,
Thanks for the testing on your large CPU count system.
Considering the SMT=on performance is still terrible, before we expedite
RCU, could we try the approach Peter suggested (avoiding repeated
lock/unlock)? I wrote a patch below.
git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git
tag: cpuhp-bulk-optimize-rfc-v1
I tested it lightly on rcutorture hotplug test and it passes. Please share
any performance results, thanks.
Also I'd like to use expediting of RCU as a last resort TBH, we should
optimize the outer operations that require RCU in the first place such as
Peter's suggestion since that will improve the overall efficiency of the
code. And if/when expediting RCU, Peter's other suggestion to not do it in
cpus_write_lock() and instead do it from cpuhp_smt_enable() also makes sense
to me.
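
As an aside, for anyone not familiar with these APIs: rcu_expedite_gp() and
rcu_unexpedite_gp() are essentially a system-wide nesting counter consulted by
rcu_gp_is_expedited(), so while the count is elevated every synchronize_rcu()
in the system runs expedited, not just the ones on the hotplug path, which is
part of why treating it as a last resort makes sense. A simplified sketch of
the mechanism, not a verbatim copy of kernel/rcu/update.c:

/* Simplified sketch; the real code lives in kernel/rcu/update.c. */
static atomic_t rcu_expedited_nesting;

void rcu_expedite_gp(void)
{
	atomic_inc(&rcu_expedited_nesting);	/* nestable, but global */
}

void rcu_unexpedite_gp(void)
{
	atomic_dec(&rcu_expedited_nesting);
}

bool rcu_gp_is_expedited(void)
{
	/* rcu_expedited is the boot/sysfs override declared in rcupdate.h. */
	return rcu_expedited || atomic_read(&rcu_expedited_nesting);
}

Because it is a plain counter, wrapping a whole batch in it nests safely, but
it is global state rather than anything scoped to the hotplug machinery.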
---8<-----------------------
From: Joel Fernandes <joelagnelf@nvidia.com>
Subject: [PATCH] cpuhp: Optimize batch SMT enable by reducing lock acquiring
Bulk CPU hotplug operations such as enabling SMT across all cores
require hotplugging multiple CPUs. The current implementation takes
cpus_write_lock() for each individual CPU causing multiple slow grace
period requests.
Therefore introduce cpu_up_locked() that assumes the caller already
holds cpus_write_lock(). The cpuhp_smt_enable() function is updated to
hold the lock once around the entire loop rather than for each CPU.
Link: https://lore.kernel.org/all/20260113090153.GS830755@noisy.programming.kicks-ass.net/
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
kernel/cpu.c | 40 +++++++++++++++++++++++++---------------
1 file changed, 25 insertions(+), 15 deletions(-)
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 8df2d773fe3b..4ce7deb236d7 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1623,34 +1623,31 @@ void cpuhp_online_idle(enum cpuhp_state state)
complete_ap_thread(st, true);
}
-/* Requires cpu_add_remove_lock to be held */
-static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
+/* Requires cpu_add_remove_lock and cpus_write_lock to be held. */
+static int cpu_up_locked(unsigned int cpu, int tasks_frozen,
+ enum cpuhp_state target)
{
struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
struct task_struct *idle;
int ret = 0;
- cpus_write_lock();
+ lockdep_assert_cpus_held();
- if (!cpu_present(cpu)) {
- ret = -EINVAL;
- goto out;
- }
+ if (!cpu_present(cpu))
+ return -EINVAL;
/*
* The caller of cpu_up() might have raced with another
* caller. Nothing to do.
*/
if (st->state >= target)
- goto out;
+ return 0;
if (st->state == CPUHP_OFFLINE) {
/* Let it fail before we try to bring the cpu up */
idle = idle_thread_get(cpu);
- if (IS_ERR(idle)) {
- ret = PTR_ERR(idle);
- goto out;
- }
+ if (IS_ERR(idle))
+ return PTR_ERR(idle);
/*
* Reset stale stack state from the last time this CPU was online.
@@ -1673,7 +1670,7 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
* return the error code..
*/
if (ret)
- goto out;
+ return ret;
}
/*
@@ -1683,7 +1680,16 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
*/
target = min((int)target, CPUHP_BRINGUP_CPU);
ret = cpuhp_up_callbacks(cpu, st, target);
-out:
+ return ret;
+}
+
+/* Requires cpu_add_remove_lock to be held */
+static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
+{
+ int ret;
+
+ cpus_write_lock();
+ ret = cpu_up_locked(cpu, tasks_frozen, target);
cpus_write_unlock();
arch_smt_update();
return ret;
@@ -2715,6 +2721,8 @@ int cpuhp_smt_enable(void)
int cpu, ret = 0;
cpu_maps_update_begin();
+ /* Hold cpus_write_lock() for entire batch operation. */
+ cpus_write_lock();
cpu_smt_control = CPU_SMT_ENABLED;
for_each_present_cpu(cpu) {
/* Skip online CPUs and CPUs on offline nodes */
@@ -2722,12 +2730,14 @@ int cpuhp_smt_enable(void)
continue;
if (!cpu_smt_thread_allowed(cpu) || !topology_is_core_online(cpu))
continue;
- ret = _cpu_up(cpu, 0, CPUHP_ONLINE);
+ ret = cpu_up_locked(cpu, 0, CPUHP_ONLINE);
if (ret)
break;
/* See comment in cpuhp_smt_disable() */
cpuhp_online_cpu_device(cpu);
}
+ cpus_write_unlock();
+ arch_smt_update();
cpu_maps_update_done();
return ret;
}
--
2.34.1
On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote:
> Performance data on a PPC64 system with 400 CPUs:
>
>+ ppc64_cpu --smt=1 (SMT8 to SMT1)
>Before: real 1m14.792s
>After: real 0m03.205s # ~23x improvement
>
>+ ppc64_cpu --smt=8 (SMT1 to SMT8)
>Before: real 2m27.695s
>After: real 0m02.510s # ~58x improvement
>
>Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b
On Mon, Jan 19, 2026 at 12:18:35AM -0500, Joel Fernandes wrote:
> Hi Vishal, Samir,
>
> Thanks for the testing on your large CPU count system.
>
> Considering the SMT=on performance is still terrible, before we expedite
> RCU, could we try the approach Peter suggested (avoiding repeated
> lock/unlock)? I wrote a patch below.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git
> tag: cpuhp-bulk-optimize-rfc-v1
>
> I tested it lightly on rcutorture hotplug test and it passes. Please share
> any performance results, thanks.
>
> Also I'd like to use expediting of RCU as a last resort TBH, we should
> optimize the outer operations that require RCU in the first place such as
> Peter's suggestion since that will improve the overall efficiency of the
> code. And if/when expediting RCU, Peter's other suggestion to not do it in
> cpus_write_lock() and instead do it from cpuhp_smt_enable() also makes sense
> to me.
>
> ---8<-----------------------
>
> From: Joel Fernandes <joelagnelf@nvidia.com>
> Subject: [PATCH] cpuhp: Optimize batch SMT enable by reducing lock acquiring
>
> Bulk CPU hotplug operations such as enabling SMT across all cores
> require hotplugging multiple CPUs. The current implementation takes
> cpus_write_lock() for each individual CPU causing multiple slow grace
> period requests.
>
> Therefore introduce cpu_up_locked() that assumes the caller already
> holds cpus_write_lock(). The cpuhp_smt_enable() function is updated to
> hold the lock once around the entire loop rather than for each CPU.
>
> Link: https://lore.kernel.org/all/20260113090153.GS830755@noisy.programming.kicks-ass.net/
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> ---
> kernel/cpu.c | 40 +++++++++++++++++++++++++---------------
> 1 file changed, 25 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 8df2d773fe3b..4ce7deb236d7 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1623,34 +1623,31 @@ void cpuhp_online_idle(enum cpuhp_state state)
> complete_ap_thread(st, true);
> }
> -/* Requires cpu_add_remove_lock to be held */
> -static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
> +/* Requires cpu_add_remove_lock and cpus_write_lock to be held. */
> +static int cpu_up_locked(unsigned int cpu, int tasks_frozen,
> + enum cpuhp_state target)
> {
> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
> struct task_struct *idle;
> int ret = 0;
> - cpus_write_lock();
> + lockdep_assert_cpus_held();
> - if (!cpu_present(cpu)) {
> - ret = -EINVAL;
> - goto out;
> - }
> + if (!cpu_present(cpu))
> + return -EINVAL;
> /*
> * The caller of cpu_up() might have raced with another
> * caller. Nothing to do.
> */
> if (st->state >= target)
> - goto out;
> + return 0;
> if (st->state == CPUHP_OFFLINE) {
> /* Let it fail before we try to bring the cpu up */
> idle = idle_thread_get(cpu);
> - if (IS_ERR(idle)) {
> - ret = PTR_ERR(idle);
> - goto out;
> - }
> + if (IS_ERR(idle))
> + return PTR_ERR(idle);
> /*
> * Reset stale stack state from the last time this CPU was online.
> @@ -1673,7 +1670,7 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
> * return the error code..
> */
> if (ret)
> - goto out;
> + return ret;
> }
> /*
> @@ -1683,7 +1680,16 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
> */
> target = min((int)target, CPUHP_BRINGUP_CPU);
> ret = cpuhp_up_callbacks(cpu, st, target);
> -out:
> + return ret;
> +}
> +
> +/* Requires cpu_add_remove_lock to be held */
> +static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
> +{
> + int ret;
> +
> + cpus_write_lock();
> + ret = cpu_up_locked(cpu, tasks_frozen, target);
> cpus_write_unlock();
> arch_smt_update();
> return ret;
> @@ -2715,6 +2721,8 @@ int cpuhp_smt_enable(void)
> int cpu, ret = 0;
> cpu_maps_update_begin();
> + /* Hold cpus_write_lock() for entire batch operation. */
> + cpus_write_lock();
> cpu_smt_control = CPU_SMT_ENABLED;
> for_each_present_cpu(cpu) {
> /* Skip online CPUs and CPUs on offline nodes */
> @@ -2722,12 +2730,14 @@ int cpuhp_smt_enable(void)
> continue;
> if (!cpu_smt_thread_allowed(cpu) || !topology_is_core_online(cpu))
> continue;
> - ret = _cpu_up(cpu, 0, CPUHP_ONLINE);
> + ret = cpu_up_locked(cpu, 0, CPUHP_ONLINE);
> if (ret)
> break;
> /* See comment in cpuhp_smt_disable() */
> cpuhp_online_cpu_device(cpu);
> }
> + cpus_write_unlock();
> + arch_smt_update();
> cpu_maps_update_done();
> return ret;
> }
> --
> 2.34.1
>
Hi Joel,
I tested the above patch on the 400 CPU machine that I had originally posted
the numbers for.
# time echo 1 > /sys/devices/system/cpu/smt/control
real 1m27.133s # Base
real 1m25.859s # With patch
# time echo 8 > /sys/devices/system/cpu/smt/control
real 1m0.682s # Base
real 1m3.423s # With patch
On 1/19/26 10:48 AM, Joel Fernandes wrote:
> On Sun, Jan 18, 2026 at 05:08:44PM +0530, Samir M wrote:
>> On 12/01/26 3:13 pm, Vishal Chourasia wrote:
>> > Bulk CPU hotplug operations--such as switching SMT modes across all
>> > cores--require hotplugging multiple CPUs in rapid succession. On large
>> > systems, this process takes significant time, increasing as the number
>> > of CPUs grows, leading to substantial delays on high-core-count
>> > machines. Analysis [1] reveals that the majority of this time is spent
>> > waiting for synchronize_rcu().
>> >
>> > Expedite synchronize_rcu() during the hotplug path to accelerate the
>> > operation. Since CPU hotplug is a user-initiated administrative task,
>> > it should complete as quickly as possible.
>>
>> Hi Vishal,
>>
>> I verified this patch using the configuration described below.
>> Configuration:
>> * Kernel version: 6.19.0-rc5
>> * Number of CPUs: 2048
>>
>> SMT Mode | Without Patch | With Patch | % Improvement |
>> ------------------------------------------------------------------
>> SMT=off | 30m 53.194s | 6m 4.250s | +80.40% |
>> SMT=on | 49m 5.920s | 36m 50.386s | +25.01% |
>
> Hi Vishal, Samir,
>
> Thanks for the testing on your large CPU count system.
>
> Considering the SMT=on performance is still terrible, before we expedite
> RCU, could we try the approach Peter suggested (avoiding repeated
> lock/unlock)? I wrote a patch below.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git
> tag: cpuhp-bulk-optimize-rfc-v1
>
> I tested it lightly on rcutorture hotplug test and it passes. Please share
> any performance results, thanks.
>
> Also I'd like to use expediting of RCU as a last resort TBH, we should
> optimize the outer operations that require RCU in the first place such as
> Peter's suggestion since that will improve the overall efficiency of the
> code. And if/when expediting RCU, Peter's other suggestion to not do it in
> cpus_write_lock() and instead do it from cpuhp_smt_enable() also makes
> sense
> to me.
>
> ---8<-----------------------
>
> From: Joel Fernandes <joelagnelf@nvidia.com>
> Subject: [PATCH] cpuhp: Optimize batch SMT enable by reducing lock
> acquiring
>
> Bulk CPU hotplug operations such as enabling SMT across all cores
> require hotplugging multiple CPUs. The current implementation takes
> cpus_write_lock() for each individual CPU causing multiple slow grace
> period requests.
>
> Therefore introduce cpu_up_locked() that assumes the caller already
> holds cpus_write_lock(). The cpuhp_smt_enable() function is updated to
> hold the lock once around the entire loop rather than for each CPU.
>
> Link: https://lore.kernel.org/all/20260113090153.GS830755@noisy.programming.kicks-ass.net/
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> ---
> kernel/cpu.c | 40 +++++++++++++++++++++++++---------------
> 1 file changed, 25 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 8df2d773fe3b..4ce7deb236d7 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1623,34 +1623,31 @@ void cpuhp_online_idle(enum cpuhp_state state)
> complete_ap_thread(st, true);
> }
>
> -/* Requires cpu_add_remove_lock to be held */
> -static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state
> target)
> +/* Requires cpu_add_remove_lock and cpus_write_lock to be held. */
> +static int cpu_up_locked(unsigned int cpu, int tasks_frozen,
> + enum cpuhp_state target)
> {
> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
> struct task_struct *idle;
> int ret = 0;
>
> - cpus_write_lock();
> + lockdep_assert_cpus_held();
>
> - if (!cpu_present(cpu)) {
> - ret = -EINVAL;
> - goto out;
> - }
> + if (!cpu_present(cpu))
> + return -EINVAL;
>
> /*
> * The caller of cpu_up() might have raced with another
> * caller. Nothing to do.
> */
> if (st->state >= target)
> - goto out;
> + return 0;
>
> if (st->state == CPUHP_OFFLINE) {
> /* Let it fail before we try to bring the cpu up */
> idle = idle_thread_get(cpu);
> - if (IS_ERR(idle)) {
> - ret = PTR_ERR(idle);
> - goto out;
> - }
> + if (IS_ERR(idle))
> + return PTR_ERR(idle);
>
> /*
> * Reset stale stack state from the last time this CPU was
> online.
> @@ -1673,7 +1670,7 @@ static int _cpu_up(unsigned int cpu, int
> tasks_frozen, enum cpuhp_state target)
> * return the error code..
> */
> if (ret)
> - goto out;
> + return ret;
> }
>
> /*
> @@ -1683,7 +1680,16 @@ static int _cpu_up(unsigned int cpu, int
> tasks_frozen, enum cpuhp_state target)
> */
> target = min((int)target, CPUHP_BRINGUP_CPU);
> ret = cpuhp_up_callbacks(cpu, st, target);
> -out:
> + return ret;
> +}
> +
> +/* Requires cpu_add_remove_lock to be held */
> +static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state
> target)
> +{
> + int ret;
> +
> + cpus_write_lock();
> + ret = cpu_up_locked(cpu, tasks_frozen, target);
> cpus_write_unlock();
> arch_smt_update();
> return ret;
> @@ -2715,6 +2721,8 @@ int cpuhp_smt_enable(void)
> int cpu, ret = 0;
>
> cpu_maps_update_begin();
> + /* Hold cpus_write_lock() for entire batch operation. */
> + cpus_write_lock();
> cpu_smt_control = CPU_SMT_ENABLED;
> for_each_present_cpu(cpu) {
> /* Skip online CPUs and CPUs on offline nodes */
> @@ -2722,12 +2730,14 @@ int cpuhp_smt_enable(void)
> continue;
> if (!cpu_smt_thread_allowed(cpu) || !
> topology_is_core_online(cpu))
> continue;
> - ret = _cpu_up(cpu, 0, CPUHP_ONLINE);
> + ret = cpu_up_locked(cpu, 0, CPUHP_ONLINE);
> if (ret)
> break;
> /* See comment in cpuhp_smt_disable() */
> cpuhp_online_cpu_device(cpu);
> }
> + cpus_write_unlock();
> + arch_smt_update();
> cpu_maps_update_done();
> return ret;
> }
What about cpuhp_smt_disable?
> On Jan 19, 2026, at 8:54 AM, Shrikanth Hegde wrote:
>
>
>> On 1/19/26 10:48 AM, Joel Fernandes wrote:
>>> On Sun, Jan 18, 2026 at 05:08:44PM +0530, Samir M wrote:
>>>> On 12/01/26 3:13 pm, Vishal Chourasia wrote:
>>> > Bulk CPU hotplug operations--such as switching SMT modes across all
>>> > cores--require hotplugging multiple CPUs in rapid succession. On large
>>> > systems, this process takes significant time, increasing as the number
>>> > of CPUs grows, leading to substantial delays on high-core-count
>>> > machines. Analysis [1] reveals that the majority of this time is spent
>>> > waiting for synchronize_rcu().
>>> >
>>> > Expedite synchronize_rcu() during the hotplug path to accelerate the
>>> > operation. Since CPU hotplug is a user-initiated administrative task,
>>> > it should complete as quickly as possible.
>>>
>>> Hi Vishal,
>>>
>>> I verified this patch using the configuration described below.
>>> Configuration:
>>> * Kernel version: 6.19.0-rc5
>>> * Number of CPUs: 2048
>>>
>>> SMT Mode | Without Patch | With Patch | % Improvement |
>>> ------------------------------------------------------------------
>>> SMT=off | 30m 53.194s | 6m 4.250s | +80.40% |
>>> SMT=on | 49m 5.920s | 36m 50.386s | +25.01% |
>> Hi Vishal, Samir,
>> Thanks for the testing on your large CPU count system.
>> Considering the SMT=on performance is still terrible, before we expedite RCU, could we try the approach Peter suggested (avoiding repeated
>> lock/unlock)? I wrote a patch below.
>> git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git
>> tag: cpuhp-bulk-optimize-rfc-v1
>> I tested it lightly on rcutorture hotplug test and it passes. Please share
>> any performance results, thanks.
>> Also I'd like to use expediting of RCU as a last resort TBH, we should
>> optimize the outer operations that require RCU in the first place such as
>> Peter's suggestion since that will improve the overall efficiency of the
>> code. And if/when expediting RCU, Peter's other suggestion to not do it in
>> cpus_write_lock() and instead do it from cpuhp_smt_enable() also makes sense
>> to me.
>> ---8<-----------------------
>> From: Joel Fernandes
>> Subject: [PATCH] cpuhp: Optimize batch SMT enable by reducing lock acquiring
>> Bulk CPU hotplug operations such as enabling SMT across all cores
>> require hotplugging multiple CPUs. The current implementation takes
>> cpus_write_lock() for each individual CPU causing multiple slow grace
>> period requests.
>> Therefore introduce cpu_up_locked() that assumes the caller already
>> holds cpus_write_lock(). The cpuhp_smt_enable() function is updated to
>> hold the lock once around the entire loop rather than for each CPU.
>> Link: https://lore.kernel.org/all/20260113090153.GS830755@noisy.programming.kicks-ass.net/
>> Suggested-by: Peter Zijlstra
>> Signed-off-by: Joel Fernandes
>> ---
>> kernel/cpu.c | 40 +++++++++++++++++++++++++---------------
>> 1 file changed, 25 insertions(+), 15 deletions(-)
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index 8df2d773fe3b..4ce7deb236d7 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -1623,34 +1623,31 @@ void cpuhp_online_idle(enum cpuhp_state state)
>> complete_ap_thread(st, true);
>> }
>> -/* Requires cpu_add_remove_lock to be held */
>> -static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
>> +/* Requires cpu_add_remove_lock and cpus_write_lock to be held. */
>> +static int cpu_up_locked(unsigned int cpu, int tasks_frozen,
>> + enum cpuhp_state target)
>> {
>> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>> struct task_struct *idle;
>> int ret = 0;
>> - cpus_write_lock();
>> + lockdep_assert_cpus_held();
>> - if (!cpu_present(cpu)) {
>> - ret = -EINVAL;
>> - goto out;
>> - }
>> + if (!cpu_present(cpu))
>> + return -EINVAL;
>> /*
>> * The caller of cpu_up() might have raced with another
>> * caller. Nothing to do.
>> */
>> if (st->state >= target)
>> - goto out;
>> + return 0;
>> if (st->state == CPUHP_OFFLINE) {
>> /* Let it fail before we try to bring the cpu up */
>> idle = idle_thread_get(cpu);
>> - if (IS_ERR(idle)) {
>> - ret = PTR_ERR(idle);
>> - goto out;
>> - }
>> + if (IS_ERR(idle))
>> + return PTR_ERR(idle);
>> /*
>> * Reset stale stack state from the last time this CPU was online.
>> @@ -1673,7 +1670,7 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
>> * return the error code..
>> */
>> if (ret)
>> - goto out;
>> + return ret;
>> }
>> /*
>> @@ -1683,7 +1680,16 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
>> */
>> target = min((int)target, CPUHP_BRINGUP_CPU);
>> ret = cpuhp_up_callbacks(cpu, st, target);
>> -out:
>> + return ret;
>> +}
>> +
>> +/* Requires cpu_add_remove_lock to be held */
>> +static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
>> +{
>> + int ret;
>> +
>> + cpus_write_lock();
>> + ret = cpu_up_locked(cpu, tasks_frozen, target);
>> cpus_write_unlock();
>> arch_smt_update();
>> return ret;
>> @@ -2715,6 +2721,8 @@ int cpuhp_smt_enable(void)
>> int cpu, ret = 0;
>> cpu_maps_update_begin();
>> + /* Hold cpus_write_lock() for entire batch operation. */
>> + cpus_write_lock();
>> cpu_smt_control = CPU_SMT_ENABLED;
>> for_each_present_cpu(cpu) {
>> /* Skip online CPUs and CPUs on offline nodes */
>> @@ -2722,12 +2730,14 @@ int cpuhp_smt_enable(void)
>> continue;
>> if (!cpu_smt_thread_allowed(cpu) || ! topology_is_core_online(cpu))
>> continue;
>> - ret = _cpu_up(cpu, 0, CPUHP_ONLINE);
>> + ret = cpu_up_locked(cpu, 0, CPUHP_ONLINE);
>> if (ret)
>> break;
>> /* See comment in cpuhp_smt_disable() */
>> cpuhp_online_cpu_device(cpu);
>> }
>> + cpus_write_unlock();
>> + arch_smt_update();
>> cpu_maps_update_done();
>> return ret;
>> }
>
> What about cpuhp_smt_disable?
Doing this is not that easy on the disable path, AFAICS. Considering that the
enable path was much worse in the performance tests, I wanted to contain it to
that. This does not have to be the only fix, though; it is one of the cures to
get there.
Thanks.
On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote:
> Bulk CPU hotplug operations—such as switching SMT modes across all
> cores—require hotplugging multiple CPUs in rapid succession. On large
> systems, this process takes significant time, increasing as the number
> of CPUs grows, leading to substantial delays on high-core-count
> machines. Analysis [1] reveals that the majority of this time is spent
> waiting for synchronize_rcu().
>
> Expedite synchronize_rcu() during the hotplug path to accelerate the
> operation. Since CPU hotplug is a user-initiated administrative task,
> it should complete as quickly as possible.
>
> Performance data on a PPC64 system with 400 CPUs:
>
> + ppc64_cpu --smt=1 (SMT8 to SMT1)
> Before: real 1m14.792s
> After: real 0m03.205s # ~23x improvement
>
> + ppc64_cpu --smt=8 (SMT1 to SMT8)
> Before: real 2m27.695s
> After: real 0m02.510s # ~58x improvement
>

But who cares? Its not like you'd *ever* do this, right?
Hello Peter,

On Mon, Jan 12, 2026 at 03:24:40PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote:
> > Bulk CPU hotplug operations—such as switching SMT modes across all
> > cores—require hotplugging multiple CPUs in rapid succession. On large
> > systems, this process takes significant time, increasing as the number
> > of CPUs grows, leading to substantial delays on high-core-count
> > machines. Analysis [1] reveals that the majority of this time is spent
> > waiting for synchronize_rcu().
> >
> > Expedite synchronize_rcu() during the hotplug path to accelerate the
> > operation. Since CPU hotplug is a user-initiated administrative task,
> > it should complete as quickly as possible.
> >
> > Performance data on a PPC64 system with 400 CPUs:
> >
> > + ppc64_cpu --smt=1 (SMT8 to SMT1)
> > Before: real 1m14.792s
> > After: real 0m03.205s # ~23x improvement
> >
> > + ppc64_cpu --smt=8 (SMT1 to SMT8)
> > Before: real 2m27.695s
> > After: real 0m02.510s # ~58x improvement
> >
>
> But who cares? Its not like you'd *ever* do this, right?

Users dynamically adjust SMT modes to optimize performance of the
workload being run. And, yes it doesn't happen too often, but when it
does, on machines with (>= 1920 CPUs) it takes more than 20 mins to
finish.

- vishal
On Mon, Jan 12, 2026 at 11:30:52PM +0530, Vishal Chourasia wrote:
> Hello Peter,
>
>
>
> On Mon, Jan 12, 2026 at 03:24:40PM +0100, Peter Zijlstra wrote:
> > On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote:
> > > Bulk CPU hotplug operations—such as switching SMT modes across all
> > > cores—require hotplugging multiple CPUs in rapid succession. On large
> > > systems, this process takes significant time, increasing as the number
> > > of CPUs grows, leading to substantial delays on high-core-count
> > > machines. Analysis [1] reveals that the majority of this time is spent
> > > waiting for synchronize_rcu().
> > >
> > > Expedite synchronize_rcu() during the hotplug path to accelerate the
> > > operation. Since CPU hotplug is a user-initiated administrative task,
> > > it should complete as quickly as possible.
> > >
> > > Performance data on a PPC64 system with 400 CPUs:
> > >
> > > + ppc64_cpu --smt=1 (SMT8 to SMT1)
> > > Before: real 1m14.792s
> > > After: real 0m03.205s # ~23x improvement
> > >
> > > + ppc64_cpu --smt=8 (SMT1 to SMT8)
> > > Before: real 2m27.695s
> > > After: real 0m02.510s # ~58x improvement
> > >
> >
> > But who cares? Its not like you'd *ever* do this, right?
> Users dynamically adjust SMT modes to optimize performance of the
> workload being run. And, yes it doesn't happen too often, but when it
> does, on machines with (>= 1920 CPUs) it takes more than 20 mins to
> finish.
Users cannot change this, it is root only.
Having to change SMT mode per workload seems quite insane; but whatever.
If you do have to put RCU hooks anywhere; I'd much rather see them in
cpuhp_smt_{en,dis}able(), such that they only affect the batch hotplug
case, rather than everything using cpus_write_lock().
Also note that there is a case to be made to optimize this batch hotplug
case; for one it makes no sense to take cpus_write_lock() over and over
and over again; if you can pull that out, just like it already lifted
cpu_maps_update_begin(), this would help.
And Joel has a point, in that it might make sense for RCU to behave
'better' under these conditions.
Expedite synchronize_rcu() during the cpuhp_smt_[enable|disable] path to
accelerate the operation.
Bulk CPU hotplug operations—such as switching SMT modes across all
cores—require hotplugging multiple CPUs in rapid succession. On large
systems, this process takes significant time, increasing as the number
of CPUs to hotplug during SMT switch grows, leading to substantial
delays on high-core-count machines. Analysis [1] reveals that the
majority of this time is spent waiting for synchronize_rcu().
SMT switch is a user-initiated administrative task; it should complete
as quickly as possible.
Performance data on a PPC64 system with 2048 CPUs:
+ ppc64_cpu --smt=1 (SMT8 to SMT1)
Before: real 30m53.194s
After: real 6m4.678s # ~5x improvement
+ ppc64_cpu --smt=8 (SMT1 to SMT8)
Before: real 49m5.920s
After: real 36m47.798s # ~1.3x improvement
[1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com
Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
Tested-by: Samir M <samir@linux.ibm.com>
---
include/linux/rcupdate.h | 3 +++
kernel/cpu.c | 4 ++++
2 files changed, 7 insertions(+)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index c5b30054cd01..03c06cfb2b6d 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1192,6 +1192,9 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
extern int rcu_expedited;
extern int rcu_normal;
+extern void rcu_expedite_gp(void);
+extern void rcu_unexpedite_gp(void);
+
DEFINE_LOCK_GUARD_0(rcu,
do {
rcu_read_lock();
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 8df2d773fe3b..a264d7170842 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2669,6 +2669,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
int cpu, ret = 0;
cpu_maps_update_begin();
+ rcu_expedite_gp();
for_each_online_cpu(cpu) {
if (topology_is_primary_thread(cpu))
continue;
@@ -2698,6 +2699,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
}
if (!ret)
cpu_smt_control = ctrlval;
+ rcu_unexpedite_gp();
cpu_maps_update_done();
return ret;
}
@@ -2716,6 +2718,7 @@ int cpuhp_smt_enable(void)
cpu_maps_update_begin();
cpu_smt_control = CPU_SMT_ENABLED;
+ rcu_expedite_gp();
for_each_present_cpu(cpu) {
/* Skip online CPUs and CPUs on offline nodes */
if (cpu_online(cpu) || !node_online(cpu_to_node(cpu)))
@@ -2728,6 +2731,7 @@ int cpuhp_smt_enable(void)
/* See comment in cpuhp_smt_disable() */
cpuhp_online_cpu_device(cpu);
}
+ rcu_unexpedite_gp();
cpu_maps_update_done();
return ret;
}
--
2.52.0
On Mon, Jan 19, 2026 at 04:17:40PM +0530, Vishal Chourasia wrote:
> Expedite synchronize_rcu() during the cpuhp_smt_[enable|disable] path to
> accelerate the operation.
>
> Bulk CPU hotplug operations—such as switching SMT modes across all
> cores—require hotplugging multiple CPUs in rapid succession. On large
> systems, this process takes significant time, increasing as the number
> of CPUs to hotplug during SMT switch grows, leading to substantial
> delays on high-core-count machines. Analysis [1] reveals that the
> majority of this time is spent waiting for synchronize_rcu().
>
You seem to have left out all the useful bits from your changelog again
:/
Anyway, ISTR Joel posted a patch hoisting a lock; it was a icky, but not
something we can't live with either.
Also, memory got jogged and I think something like the below will remove
2/3 of your rcu woes as well.
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 8df2d773fe3b..1365c19444b2 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2669,6 +2669,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
int cpu, ret = 0;
cpu_maps_update_begin();
+ rcu_sync_enter(&cpu_hotplug_lock.rss);
for_each_online_cpu(cpu) {
if (topology_is_primary_thread(cpu))
continue;
@@ -2698,6 +2699,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
}
if (!ret)
cpu_smt_control = ctrlval;
+ rcu_sync_exit(&cpu_hotplug_lock.rss);
cpu_maps_update_done();
return ret;
}
@@ -2715,6 +2717,7 @@ int cpuhp_smt_enable(void)
int cpu, ret = 0;
cpu_maps_update_begin();
+ rcu_sync_enter(&cpu_hotplug_lock.rss);
cpu_smt_control = CPU_SMT_ENABLED;
for_each_present_cpu(cpu) {
/* Skip online CPUs and CPUs on offline nodes */
@@ -2728,6 +2731,7 @@ int cpuhp_smt_enable(void)
/* See comment in cpuhp_smt_disable() */
cpuhp_online_cpu_device(cpu);
}
+ rcu_sync_exit(&cpu_hotplug_lock.rss);
cpu_maps_update_done();
return ret;
}
On 19/01/26 5:13 pm, Peter Zijlstra wrote:
> On Mon, Jan 19, 2026 at 04:17:40PM +0530, Vishal Chourasia wrote:
>> Expedite synchronize_rcu() during the cpuhp_smt_[enable|disable] path to
>> accelerate the operation.
>>
>> Bulk CPU hotplug operations—such as switching SMT modes across all
>> cores—require hotplugging multiple CPUs in rapid succession. On large
>> systems, this process takes significant time, increasing as the number
>> of CPUs to hotplug during SMT switch grows, leading to substantial
>> delays on high-core-count machines. Analysis [1] reveals that the
>> majority of this time is spent waiting for synchronize_rcu().
>>
> You seem to have left out all the useful bits from your changelog again
> :/
>
> Anyway, ISTR Joel posted a patch hoisting a lock; it was a icky, but not
> something we can't live with either.
>
> Also, memory got jogged and I think something like the below will remove
> 2/3 of your rcu woes as well.
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 8df2d773fe3b..1365c19444b2 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -2669,6 +2669,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
> int cpu, ret = 0;
>
> cpu_maps_update_begin();
> + rcu_sync_enter(&cpu_hotplug_lock.rss);
> for_each_online_cpu(cpu) {
> if (topology_is_primary_thread(cpu))
> continue;
> @@ -2698,6 +2699,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
> }
> if (!ret)
> cpu_smt_control = ctrlval;
> + rcu_sync_exit(&cpu_hotplug_lock.rss);
> cpu_maps_update_done();
> return ret;
> }
> @@ -2715,6 +2717,7 @@ int cpuhp_smt_enable(void)
> int cpu, ret = 0;
>
> cpu_maps_update_begin();
> + rcu_sync_enter(&cpu_hotplug_lock.rss);
> cpu_smt_control = CPU_SMT_ENABLED;
> for_each_present_cpu(cpu) {
> /* Skip online CPUs and CPUs on offline nodes */
> @@ -2728,6 +2731,7 @@ int cpuhp_smt_enable(void)
> /* See comment in cpuhp_smt_disable() */
> cpuhp_online_cpu_device(cpu);
> }
> + rcu_sync_exit(&cpu_hotplug_lock.rss);
> cpu_maps_update_done();
> return ret;
> }
Hi,
I verified this patch using the configuration described below.
Configuration:
• Kernel version: 6.19.0-rc6
• Number of CPUs: 1536
Earlier verification of an older version of this patch was performed on
a system with *2048 CPUs*. Due to system unavailability, the current
verification was carried out on a *different system.*
Using this setup, I evaluated the patch with both SMT enabled and SMT
disabled. The patch shows a significant improvement in the SMT=off case and
a measurable improvement in the SMT=on case.
The results indicate that when SMT is enabled, the system time is
noticeably higher. In contrast, with SMT disabled, no significant
increase in system time is observed.
SMT=ON -> sys 50m42.805s
SMT=OFF -> sys 0m0.064s
SMT Mode | Without Patch | With Patch | % Improvement |
------------------------------------------------------------------
SMT=off | 20m 32.210s | 5m 30.898s | +73.15% |
SMT=on | 62m 46.549s | 55m 45.671s | +11.18% |
Please add below tag:
Tested-by: Samir M <samir@linux.ibm.com>
Regards,
Samir
On 27/01/26 11:18 pm, Samir M wrote:
>
> On 19/01/26 5:13 pm, Peter Zijlstra wrote:
>> On Mon, Jan 19, 2026 at 04:17:40PM +0530, Vishal Chourasia wrote:
>>> Expedite synchronize_rcu() during the cpuhp_smt_[enable|disable]
>>> path to
>>> accelerate the operation.
>>>
>>> Bulk CPU hotplug operations—such as switching SMT modes across all
>>> cores—require hotplugging multiple CPUs in rapid succession. On large
>>> systems, this process takes significant time, increasing as the number
>>> of CPUs to hotplug during SMT switch grows, leading to substantial
>>> delays on high-core-count machines. Analysis [1] reveals that the
>>> majority of this time is spent waiting for synchronize_rcu().
>>>
>> You seem to have left out all the useful bits from your changelog again
>> :/
>>
>> Anyway, ISTR Joel posted a patch hoisting a lock; it was a icky, but not
>> something we can't live with either.
>>
>> Also, memory got jogged and I think something like the below will remove
>> 2/3 of your rcu woes as well.
>>
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index 8df2d773fe3b..1365c19444b2 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -2669,6 +2669,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control
>> ctrlval)
>> int cpu, ret = 0;
>> cpu_maps_update_begin();
>> + rcu_sync_enter(&cpu_hotplug_lock.rss);
>> for_each_online_cpu(cpu) {
>> if (topology_is_primary_thread(cpu))
>> continue;
>> @@ -2698,6 +2699,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control
>> ctrlval)
>> }
>> if (!ret)
>> cpu_smt_control = ctrlval;
>> + rcu_sync_exit(&cpu_hotplug_lock.rss);
>> cpu_maps_update_done();
>> return ret;
>> }
>> @@ -2715,6 +2717,7 @@ int cpuhp_smt_enable(void)
>> int cpu, ret = 0;
>> cpu_maps_update_begin();
>> + rcu_sync_enter(&cpu_hotplug_lock.rss);
>> cpu_smt_control = CPU_SMT_ENABLED;
>> for_each_present_cpu(cpu) {
>> /* Skip online CPUs and CPUs on offline nodes */
>> @@ -2728,6 +2731,7 @@ int cpuhp_smt_enable(void)
>> /* See comment in cpuhp_smt_disable() */
>> cpuhp_online_cpu_device(cpu);
>> }
>> + rcu_sync_exit(&cpu_hotplug_lock.rss);
>> cpu_maps_update_done();
>> return ret;
>> }
>
>
> Hi,
>
> I verified this patch using the configuration described below.
> Configuration:
> • Kernel version: 6.19.0-rc6
> • Number of CPUs: 1536
>
> Earlier verification of an older version of this patch was performed
> on a system with *2048 CPUs*. Due to system unavailability, the
> current verification was carried out on a *different system.*
>
>
> Using this setup, I evaluated the patch with both SMT enabled and SMT
> disabled. patch shows a significant improvement in the SMT=off case
> and a measurable improvement in the SMT=on case.
> The results indicate that when SMT is enabled, the system time is
> noticeably higher. In contrast, with SMT disabled, no significant
> increase in system time is observed.
>
> SMT=ON -> sys 50m42.805s
> SMT=OFF -> sys 0m0.064s
>
>
> SMT Mode | Without Patch | With Patch | % Improvement |
> ------------------------------------------------------------------
> SMT=off | 20m 32.210s | 5m 30.898s | +73.15% |
> SMT=on | 62m 46.549s | 55m 45.671s | +11.18% |
>
>
> Please add below tag:
Tested-by: Samir M <samir@linux.ibm.com>
>
> Regards,
> Samir
>
Hi All,
Apologies for the confusion in the previous report.
In the previous report, I used the b4 am command to apply the patch. As
a result, all patches in the mail thread were applied, rather than only
the intended one.
Consequently, the results that were posted earlier included changes from
multiple patches, and therefore cannot be considered valid for
evaluating this patch in isolation.
We have since tested Peter’s patch separately, applied in isolation.
Based on this testing, we did not observe any improvement in the SMT
timing details.
Configuration:
• Kernel version: 6.19.0-rc6
• Number of CPUs: 1536
SMT Mode | Without Patch | With Patch | % Improvement |
------------------------------------------------------------------
SMT=off | 20m 32.210s | 20m22.441s | +0.79% |
SMT=on | 62m 46.549s | 63m0.532s | -0.37% |
Regards,
Samir
On 27/01/26 11:18 pm, Samir M wrote:
>
> On 19/01/26 5:13 pm, Peter Zijlstra wrote:
>> On Mon, Jan 19, 2026 at 04:17:40PM +0530, Vishal Chourasia wrote:
>>> Expedite synchronize_rcu() during the cpuhp_smt_[enable|disable]
>>> path to
>>> accelerate the operation.
>>>
>>> Bulk CPU hotplug operations—such as switching SMT modes across all
>>> cores—require hotplugging multiple CPUs in rapid succession. On large
>>> systems, this process takes significant time, increasing as the number
>>> of CPUs to hotplug during SMT switch grows, leading to substantial
>>> delays on high-core-count machines. Analysis [1] reveals that the
>>> majority of this time is spent waiting for synchronize_rcu().
>>>
>> You seem to have left out all the useful bits from your changelog again
>> :/
>>
>> Anyway, ISTR Joel posted a patch hoisting a lock; it was a icky, but not
>> something we can't live with either.
>>
>> Also, memory got jogged and I think something like the below will remove
>> 2/3 of your rcu woes as well.
>>
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index 8df2d773fe3b..1365c19444b2 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -2669,6 +2669,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control
>> ctrlval)
>> int cpu, ret = 0;
>> cpu_maps_update_begin();
>> + rcu_sync_enter(&cpu_hotplug_lock.rss);
>> for_each_online_cpu(cpu) {
>> if (topology_is_primary_thread(cpu))
>> continue;
>> @@ -2698,6 +2699,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control
>> ctrlval)
>> }
>> if (!ret)
>> cpu_smt_control = ctrlval;
>> + rcu_sync_exit(&cpu_hotplug_lock.rss);
>> cpu_maps_update_done();
>> return ret;
>> }
>> @@ -2715,6 +2717,7 @@ int cpuhp_smt_enable(void)
>> int cpu, ret = 0;
>> cpu_maps_update_begin();
>> + rcu_sync_enter(&cpu_hotplug_lock.rss);
>> cpu_smt_control = CPU_SMT_ENABLED;
>> for_each_present_cpu(cpu) {
>> /* Skip online CPUs and CPUs on offline nodes */
>> @@ -2728,6 +2731,7 @@ int cpuhp_smt_enable(void)
>> /* See comment in cpuhp_smt_disable() */
>> cpuhp_online_cpu_device(cpu);
>> }
>> + rcu_sync_exit(&cpu_hotplug_lock.rss);
>> cpu_maps_update_done();
>> return ret;
>> }
>
>
> Hi,
>
> I verified this patch using the configuration described below.
> Configuration:
> • Kernel version: 6.19.0-rc6
> • Number of CPUs: 1536
>
> Earlier verification of an older version of this patch was performed
> on a system with *2048 CPUs*. Due to system unavailability, the
> current verification was carried out on a *different system.*
>
>
> Using this setup, I evaluated the patch with both SMT enabled and SMT
> disabled. patch shows a significant improvement in the SMT=off case
> and a measurable improvement in the SMT=on case.
> The results indicate that when SMT is enabled, the system time is
> noticeably higher. In contrast, with SMT disabled, no significant
> increase in system time is observed.
>
> SMT=ON -> sys 50m42.805s
> SMT=OFF -> sys 0m0.064s
>
>
> SMT Mode | Without Patch | With Patch | % Improvement |
> ------------------------------------------------------------------
> SMT=off | 20m 32.210s | 5m 30.898s | +73.15% |
> SMT=on | 62m 46.549s | 55m 45.671s | +11.18% |
>
>
> Please add below tag:
Tested-by: Samir M <samir@linux.ibm.com>
>
> Regards,
> Samir
>
Hi All,
For reference, I am updating the results for the Vishal and Rezki patches.
Configuration:
Kernel version: 6.19.0-rc6
Number of CPUs: 1536
Earlier verification of an older revision of this patch was performed on
a system with 2048 CPUs. Due to system unavailability, the current
verification was carried out on a different system. For reference, I am
updating the results corresponding to the patches verified using the
above configuration.
Patch:
https://lore.kernel.org/all/20260112094332.66006-2-vishalc@linux.ibm.com/
SMT Mode | Without Patch | With Patch | % Improvement |
------------------------------------------------------------------
SMT=off | 20m 32.210s | 5m 31.807s | +73.09% |
SMT=on | 62m 46.549s | 55m 48.801s | +11.08% |
SMT=ON -> sys 50m44.105s
SMT=OFF -> sys 0m0.035s
patch: https://lore.kernel.org/all/20260114183415.286489-1-urezki@gmail.com/
SMT Mode | Without Patch | With Patch | % Improvement |
------------------------------------------------------------------
SMT=off | 20m 32.210s | 18m 2.012s | +12.19% |
SMT=on | 62m 46.549s | 62m 35.076s | +0.30% |
SMT=ON -> sys 50m43.806s
SMT=OFF -> sys 0m0.109s
Regards,
Samir
Hi Peter.
On 1/19/26 5:13 PM, Peter Zijlstra wrote:
> On Mon, Jan 19, 2026 at 04:17:40PM +0530, Vishal Chourasia wrote:
>> Expedite synchronize_rcu() during the cpuhp_smt_[enable|disable] path to
>> accelerate the operation.
>>
>> Bulk CPU hotplug operations—such as switching SMT modes across all
>> cores—require hotplugging multiple CPUs in rapid succession. On large
>> systems, this process takes significant time, increasing as the number
>> of CPUs to hotplug during SMT switch grows, leading to substantial
>> delays on high-core-count machines. Analysis [1] reveals that the
>> majority of this time is spent waiting for synchronize_rcu().
>>
>
> You seem to have left out all the useful bits from your changelog again
> :/
>
> Anyway, ISTR Joel posted a patch hoisting a lock; it was a icky, but not
> something we can't live with either.
>
> Also, memory got jogged and I think something like the below will remove
> 2/3 of your rcu woes as well.
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 8df2d773fe3b..1365c19444b2 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -2669,6 +2669,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
> int cpu, ret = 0;
>
> cpu_maps_update_begin();
> + rcu_sync_enter(&cpu_hotplug_lock.rss);
> for_each_online_cpu(cpu) {
> if (topology_is_primary_thread(cpu))
> continue;
> @@ -2698,6 +2699,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
> }
> if (!ret)
> cpu_smt_control = ctrlval;
> + rcu_sync_exit(&cpu_hotplug_lock.rss);
> cpu_maps_update_done();
> return ret;
> }
> @@ -2715,6 +2717,7 @@ int cpuhp_smt_enable(void)
> int cpu, ret = 0;
>
> cpu_maps_update_begin();
> + rcu_sync_enter(&cpu_hotplug_lock.rss);
> cpu_smt_control = CPU_SMT_ENABLED;
> for_each_present_cpu(cpu) {
> /* Skip online CPUs and CPUs on offline nodes */
> @@ -2728,6 +2731,7 @@ int cpuhp_smt_enable(void)
> /* See comment in cpuhp_smt_disable() */
> cpuhp_online_cpu_device(cpu);
> }
> + rcu_sync_exit(&cpu_hotplug_lock.rss);
> cpu_maps_update_done();
> return ret;
> }
Currently, cpuhp_smt_[enable/disable] calls _cpu_up/_cpu_down
which does the same in cpus_write_lock/unlock. though it is per
cpu enable/disable one after another.
How hoisting this up will help?
On Mon, Jan 19, 2026 at 07:15:09PM +0530, Shrikanth Hegde wrote:
> Hi Peter.
>
> On 1/19/26 5:13 PM, Peter Zijlstra wrote:
> > On Mon, Jan 19, 2026 at 04:17:40PM +0530, Vishal Chourasia wrote:
> > > Expedite synchronize_rcu() during the cpuhp_smt_[enable|disable] path to
> > > accelerate the operation.
> > >
> > > Bulk CPU hotplug operations—such as switching SMT modes across all
> > > cores—require hotplugging multiple CPUs in rapid succession. On large
> > > systems, this process takes significant time, increasing as the number
> > > of CPUs to hotplug during SMT switch grows, leading to substantial
> > > delays on high-core-count machines. Analysis [1] reveals that the
> > > majority of this time is spent waiting for synchronize_rcu().
> > >
> >
> > You seem to have left out all the useful bits from your changelog again
> > :/
> >
> > Anyway, ISTR Joel posted a patch hoisting a lock; it was a icky, but not
> > something we can't live with either.
> >
> > Also, memory got jogged and I think something like the below will remove
> > 2/3 of your rcu woes as well.
> >
> > diff --git a/kernel/cpu.c b/kernel/cpu.c
> > index 8df2d773fe3b..1365c19444b2 100644
> > --- a/kernel/cpu.c
> > +++ b/kernel/cpu.c
> > @@ -2669,6 +2669,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
> > int cpu, ret = 0;
> > cpu_maps_update_begin();
> > + rcu_sync_enter(&cpu_hotplug_lock.rss);
> > for_each_online_cpu(cpu) {
> > if (topology_is_primary_thread(cpu))
> > continue;
> > @@ -2698,6 +2699,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
> > }
> > if (!ret)
> > cpu_smt_control = ctrlval;
> > + rcu_sync_exit(&cpu_hotplug_lock.rss);
> > cpu_maps_update_done();
> > return ret;
> > }
> > @@ -2715,6 +2717,7 @@ int cpuhp_smt_enable(void)
> > int cpu, ret = 0;
> > cpu_maps_update_begin();
> > + rcu_sync_enter(&cpu_hotplug_lock.rss);
> > cpu_smt_control = CPU_SMT_ENABLED;
> > for_each_present_cpu(cpu) {
> > /* Skip online CPUs and CPUs on offline nodes */
> > @@ -2728,6 +2731,7 @@ int cpuhp_smt_enable(void)
> > /* See comment in cpuhp_smt_disable() */
> > cpuhp_online_cpu_device(cpu);
> > }
> > + rcu_sync_exit(&cpu_hotplug_lock.rss);
> > cpu_maps_update_done();
> > return ret;
> > }
>
>
> Currently, cpuhp_smt_[enable/disable] calls _cpu_up/_cpu_down
> which does the same in cpus_write_lock/unlock. though it is per
> cpu enable/disable one after another.
>
> How hoisting this up will help?
By holding an extra rcu_sync reference, the percpu rwsem is kept in
the slow path, avoiding the rcu-sync on down_write(), which was very
prevalent per this:
https://lkml.kernel.org/r/aWU9HRcs4ghazIRg@linux.ibm.com
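
Put differently, here is a minimal, hypothetical sketch of the pattern in the
diff above (the rcu_sync and percpu-rwsem calls are the real APIs, but the
helper name and its exact shape are illustrative only): one rcu_sync reference
is taken up front, so each cpus_write_lock() taken inside the per-CPU
operations finds the grace period already established instead of waiting for a
new one every iteration.

/* Hypothetical helper illustrating the hoisted rcu_sync reference. */
static int smt_batch_online(void)
{
	int cpu, ret = 0;

	cpu_maps_update_begin();
	/* Pay for one grace period up front; readers drop to the slow path. */
	rcu_sync_enter(&cpu_hotplug_lock.rss);

	for_each_present_cpu(cpu) {
		/*
		 * _cpu_up() still takes cpus_write_lock(), but the
		 * rcu_sync_enter() inside percpu_down_write() now finds
		 * the sync already active and returns without blocking
		 * on another grace period.
		 */
		ret = _cpu_up(cpu, 0, CPUHP_ONLINE);
		if (ret)
			break;
	}

	rcu_sync_exit(&cpu_hotplug_lock.rss);
	cpu_maps_update_done();
	return ret;
}

This is orthogonal to expediting: it reduces how many grace periods the batch
waits for, rather than how long each individual grace period takes.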
[Edits] Added Suggested-by tag
Expedite synchronize_rcu() during the cpuhp_smt_[enable|disable] path to
accelerate the operation.
Bulk CPU hotplug operations—such as switching SMT modes across all
cores—require hotplugging multiple CPUs in rapid succession. On large
systems, this process takes significant time, increasing as the number
of CPUs to hotplug during SMT switch grows, leading to substantial
delays on high-core-count machines. Analysis [1] reveals that the
majority of this time is spent waiting for synchronize_rcu().
SMT switch is a user-initiated administrative task; it should complete
as quickly as possible.
Performance data on a PPC64 system with 2048 CPUs:
+ ppc64_cpu --smt=1 (SMT8 to SMT1)
Before: real 30m53.194s
After: real 6m4.678s # ~5x improvement
+ ppc64_cpu --smt=8 (SMT1 to SMT8)
Before: real 49m5.920s
After: real 36m47.798s # ~1.3x improvement
[1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com
[2] https://lore.kernel.org/all/20260113090153.GS830755@noisy.programming.kicks-ass.net/
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
Tested-by: Samir M <samir@linux.ibm.com>
---
include/linux/rcupdate.h | 3 +++
kernel/cpu.c | 4 ++++
2 files changed, 7 insertions(+)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index c5b30054cd01..03c06cfb2b6d 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1192,6 +1192,9 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
extern int rcu_expedited;
extern int rcu_normal;
+extern void rcu_expedite_gp(void);
+extern void rcu_unexpedite_gp(void);
+
DEFINE_LOCK_GUARD_0(rcu,
do {
rcu_read_lock();
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 8df2d773fe3b..a264d7170842 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2669,6 +2669,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
int cpu, ret = 0;
cpu_maps_update_begin();
+ rcu_expedite_gp();
for_each_online_cpu(cpu) {
if (topology_is_primary_thread(cpu))
continue;
@@ -2698,6 +2699,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
}
if (!ret)
cpu_smt_control = ctrlval;
+ rcu_unexpedite_gp();
cpu_maps_update_done();
return ret;
}
@@ -2716,6 +2718,7 @@ int cpuhp_smt_enable(void)
cpu_maps_update_begin();
cpu_smt_control = CPU_SMT_ENABLED;
+ rcu_expedite_gp();
for_each_present_cpu(cpu) {
/* Skip online CPUs and CPUs on offline nodes */
if (cpu_online(cpu) || !node_online(cpu_to_node(cpu)))
@@ -2728,6 +2731,7 @@ int cpuhp_smt_enable(void)
/* See comment in cpuhp_smt_disable() */
cpuhp_online_cpu_device(cpu);
}
+ rcu_unexpedite_gp();
cpu_maps_update_done();
return ret;
}
--
2.52.0
> On Jan 12, 2026, at 4:44 AM, Vishal Chourasia <vishalc@linux.ibm.com> wrote:
>
> Bulk CPU hotplug operations—such as switching SMT modes across all
> cores—require hotplugging multiple CPUs in rapid succession. On large
> systems, this process takes significant time, increasing as the number
> of CPUs grows, leading to substantial delays on high-core-count
> machines. Analysis [1] reveals that the majority of this time is spent
> waiting for synchronize_rcu().
>
> Expedite synchronize_rcu() during the hotplug path to accelerate the
> operation. Since CPU hotplug is a user-initiated administrative task,
> it should complete as quickly as possible.
When does the user initiate this in your system?
Hotplug should not be happening that often to begin with; it is a slow path that
depends on the disruptive stop-machine mechanism.
>
> Performance data on a PPC64 system with 400 CPUs:
>
> + ppc64_cpu --smt=1 (SMT8 to SMT1)
> Before: real 1m14.792s
> After: real 0m03.205s # ~23x improvement
>
> + ppc64_cpu --smt=8 (SMT1 to SMT8)
> Before: real 2m27.695s
> After: real 0m02.510s # ~58x improvement
This does look compelling, but could you provide more information about how this was tested - what does the ppc64_cpu binary do (how many hotplugs, how does the performance change with cycle count, etc.)?
Can you also run rcutorture testing? Some of the scenarios like TREE03 stress hotplug.
thanks,
- Joel
>
> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b
>
> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com
>
> Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
> ---
> include/linux/rcupdate.h | 3 +++
> kernel/cpu.c | 2 ++
> 2 files changed, 5 insertions(+)
>
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index c5b30054cd01..03c06cfb2b6d 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -1192,6 +1192,9 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
> extern int rcu_expedited;
> extern int rcu_normal;
>
> +extern void rcu_expedite_gp(void);
> +extern void rcu_unexpedite_gp(void);
> +
> DEFINE_LOCK_GUARD_0(rcu,
> do {
> rcu_read_lock();
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 8df2d773fe3b..6b0d491d73f4 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -506,12 +506,14 @@ EXPORT_SYMBOL_GPL(cpus_read_unlock);
>
> void cpus_write_lock(void)
> {
> + rcu_expedite_gp();
> percpu_down_write(&cpu_hotplug_lock);
> }
>
> void cpus_write_unlock(void)
> {
> percpu_up_write(&cpu_hotplug_lock);
> + rcu_unexpedite_gp();
> }
>
> void lockdep_assert_cpus_held(void)
> --
> 2.52.0
>
> On Jan 12, 2026, at 9:03 AM, Joel Fernandes <joelagnelf@nvidia.com> wrote:
>
>
>
>> On Jan 12, 2026, at 4:44 AM, Vishal Chourasia <vishalc@linux.ibm.com> wrote:
>>
>> Bulk CPU hotplug operations—such as switching SMT modes across all
>> cores—require hotplugging multiple CPUs in rapid succession. On large
>> systems, this process takes significant time, increasing as the number
>> of CPUs grows, leading to substantial delays on high-core-count
>> machines. Analysis [1] reveals that the majority of this time is spent
>> waiting for synchronize_rcu().
>>
>> Expedite synchronize_rcu() during the hotplug path to accelerate the
>> operation. Since CPU hotplug is a user-initiated administrative task,
>> it should complete as quickly as possible.
>
> When does the user initiate this in your system?
>
> Hotplug should not be happening that often to begin with, it is a slow path that
> depends on the disruptive stop-machine mechanism.
>
>>
>> Performance data on a PPC64 system with 400 CPUs:
>>
>> + ppc64_cpu --smt=1 (SMT8 to SMT1)
>> Before: real 1m14.792s
>> After: real 0m03.205s # ~23x improvement
>>
>> + ppc64_cpu --smt=8 (SMT1 to SMT8)
>> Before: real 2m27.695s
>> After: real 0m02.510s # ~58x improvement
>
> This does look compelling, but could you provide more information about how this was tested - what does the ppc binary do (how many hotplugs, how does the performance change with cycle count, etc.)?
>
> Can you also run rcutorture testing? Some of the scenarios like TREE03 stress hotplug.
Also, why not just use the expedited API at the callsite that is slow, rather than blanket expediting everything between hotplug lock and unlock? That is a more specific fix than this one, which applies more broadly to all operations. It appears the report you provided does identify the culprit callsite.
- Joel
>
> thanks,
>
> - Joel
>
>>
>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b
>>
>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com
>>
>> Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
>> ---
>> include/linux/rcupdate.h | 3 +++
>> kernel/cpu.c | 2 ++
>> 2 files changed, 5 insertions(+)
>>
>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>> index c5b30054cd01..03c06cfb2b6d 100644
>> --- a/include/linux/rcupdate.h
>> +++ b/include/linux/rcupdate.h
>> @@ -1192,6 +1192,9 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
>> extern int rcu_expedited;
>> extern int rcu_normal;
>>
>> +extern void rcu_expedite_gp(void);
>> +extern void rcu_unexpedite_gp(void);
>> +
>> DEFINE_LOCK_GUARD_0(rcu,
>> do {
>> rcu_read_lock();
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index 8df2d773fe3b..6b0d491d73f4 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -506,12 +506,14 @@ EXPORT_SYMBOL_GPL(cpus_read_unlock);
>>
>> void cpus_write_lock(void)
>> {
>> + rcu_expedite_gp();
>> percpu_down_write(&cpu_hotplug_lock);
>> }
>>
>> void cpus_write_unlock(void)
>> {
>> percpu_up_write(&cpu_hotplug_lock);
>> + rcu_unexpedite_gp();
>> }
>>
>> void lockdep_assert_cpus_held(void)
>> --
>> 2.52.0
>>
On Mon, Jan 12, 2026 at 02:20:44PM +0000, Joel Fernandes wrote:
[...]
> Also, why not just use the expedited API at the callsite that is slow,
> rather than blanket expediting everything between hotplug lock and unlock?
> That is a more specific fix than this one, which applies more broadly to
> all operations. It appears the report you provided does identify the
> culprit callsite.

Because hotplug is not a fast path; there is no expectation of
performance here.
> On Jan 12, 2026, at 9:24 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jan 12, 2026 at 02:20:44PM +0000, Joel Fernandes wrote:
>> Also, why not just use the expedited API at the callsite that is slow,
>> rather than blanket expediting everything between hotplug lock and unlock?
> [...]
>
> Because hotplug is not a fast path; there is no expectation of
> performance here.

Agreed, I was just wondering if it was incredibly slow or something. Looking forward to more justification from Vishal on usecase,

- Joel
Hello Joel, Peter
On Mon, Jan 12, 2026 at 02:37:14PM +0000, Joel Fernandes wrote:
>
>
> > On Jan 12, 2026, at 9:24 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Jan 12, 2026 at 02:20:44PM +0000, Joel Fernandes wrote:
> >>
> >>
> >>>> On Jan 12, 2026, at 9:03 AM, Joel Fernandes <joelagnelf@nvidia.com> wrote:
> >>>
> >>>
> >>>
> >>>> On Jan 12, 2026, at 4:44 AM, Vishal Chourasia <vishalc@linux.ibm.com> wrote:
> >>>>
> >>>> Bulk CPU hotplug operations—such as switching SMT modes across all
> >>>> cores—require hotplugging multiple CPUs in rapid succession. On large
> >>>> systems, this process takes significant time, increasing as the number
> >>>> of CPUs grows, leading to substantial delays on high-core-count
> >>>> machines. Analysis [1] reveals that the majority of this time is spent
> >>>> waiting for synchronize_rcu().
> >>>>
> >>>> Expedite synchronize_rcu() during the hotplug path to accelerate the
> >>>> operation. Since CPU hotplug is a user-initiated administrative task,
> >>>> it should complete as quickly as possible.
> >>>
> >>> When does the user initiate this in your system?
Workloads exhibit varying sensitivity to SMT levels. Users dynamically
adjust SMT modes to optimize performance.
> >>>
> >>> Hotplug should not be happening that often to begin with, it is a slow path that
> >>> depends on the disruptive stop-machine mechanism.
Yes, it doesn't happen too often, but when it does, on machines with
1920 or more CPUs it takes more than 20 minutes to finish.
> >>>
> >>>>
> >>>> Performance data on a PPC64 system with 400 CPUs:
> >>>>
> >>>> + ppc64_cpu --smt=1 (SMT8 to SMT1)
> >>>> Before: real 1m14.792s
> >>>> After: real 0m03.205s # ~23x improvement
> >>>>
> >>>> + ppc64_cpu --smt=8 (SMT1 to SMT8)
> >>>> Before: real 2m27.695s
> >>>> After: real 0m02.510s # ~58x improvement
> >>>
> >>> This does look compelling, but could you provide more information about how this was tested - what does the ppc binary do (how many hotplugs, how does the performance change with cycle count, etc.)?
The ppc64_cpu utility generates a list of target CPUs based on the
requested SMT state and writes to their corresponding sysfs online
entries.
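To make that concrete, the effect per target CPU is one write to its
online file, and each such write is a full hotplug operation in its own
right. A minimal sketch of that effect (illustration only, not the
actual ppc64_cpu source; the helper name and the CPU list below are made
up):

#include <stdio.h>

/* Hypothetical helper: write "1" or "0" to a CPU's sysfs online file. */
static int set_cpu_online(int cpu, int online)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d/online", cpu);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d", online);
	return fclose(f);
}

int main(void)
{
	/* Made-up target list; ppc64_cpu derives it from the SMT topology. */
	int secondaries[] = { 1, 2, 3, 4, 5, 6, 7 };
	unsigned int i;

	/* e.g. --smt=1 on one SMT8 core: offline the seven secondary threads. */
	for (i = 0; i < sizeof(secondaries) / sizeof(secondaries[0]); i++)
		set_cpu_online(secondaries[i], 0);
	return 0;
}

That lines up with the 350 hits per call stack further down: roughly one
hotplug operation per onlined CPU.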
Sorry, I didn't get your second question about the performance change
with cycle count.
> >>>
> >>> Can you also run rcutorture testing? Some of the scenarios like TREE03 stress hotplug.
Sure, I will get back with the numbers.
> >>
> >> Also, why not just use the expedited API at the callsite that is slow,
> >> rather than blanket expediting everything between hotplug lock and unlock?
> >> That is a more specific fix than this one, which applies more broadly to
> >> all operations. It appears the report you provided does identify the
> >> culprit callsite.
I initially attempted to replace synchronize_rcu() with
synchronize_rcu_expedited() at specific callsites. However, the primary
bottlenecks are within percpu_down_write(), called via _cpu_up() and
try_online_node(). Please refer to the callstack shared below. Since
percpu_down_write() is used throughout the kernel, modifying it directly
would force expedited grace periods on unrelated subsystems.
@[
synchronize_rcu+12
rcu_sync_enter+260
percpu_down_write+76
_cpu_up+140
cpu_up+440
cpu_subsys_online+128
device_online+176
online_store+220
dev_attr_store+52
sysfs_kf_write+120
kernfs_fop_write_iter+456
vfs_write+952
ksys_write+132
system_call_exception+292
system_call_vectored_common+348
]: 350
@[
synchronize_rcu+12
rcu_sync_enter+260
percpu_down_write+76
try_online_node+64
cpu_up+120
cpu_subsys_online+128
device_online+176
online_store+220
dev_attr_store+52
sysfs_kf_write+120
kernfs_fop_write_iter+456
vfs_write+952
ksys_write+132
system_call_exception+292
system_call_vectored_common+348
]: 350
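For context on why each of these operations ends up in synchronize_rcu():
both stacks end in percpu_down_write() on a percpu-rwsem (cpu_hotplug_lock
in the _cpu_up() stack, and another percpu-rwsem on the try_online_node()
path), and taking a percpu-rwsem for writing has to switch it out of its
reader-fast-path mode, which needs a grace period. Very roughly,
paraphrasing kernel/locking/percpu-rwsem.c and kernel/rcu/sync.c rather
than quoting them (the two helpers inside rcu_sync_enter() below are
placeholder names, not real functions):

void percpu_down_write(struct percpu_rw_semaphore *sem)
{
	rcu_sync_enter(&sem->rss);	/* may block in synchronize_rcu() */
	/* ... then block out readers and acquire the writer lock ... */
}

void rcu_sync_enter(struct rcu_sync *rsp)
{
	/*
	 * Simplified: if the rcu_sync state has decayed back to idle,
	 * which the 350 hits above show is what happens between these
	 * back-to-back hotplug operations, the writer pays a full
	 * normal grace period here.
	 */
	if (rcu_sync_state_is_idle(rsp))
		synchronize_rcu();
	else
		wait_for_already_started_gp(rsp);
}

With rcu_expedite_gp() in effect, that synchronize_rcu() takes the
expedited path instead, which is where the improvement comes from.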
> >
> > Because hotplug is not a fast path; there is no expectation of
> > performance here.
True.
>
> Agreed, I was just wondering if it was incredibly slow or something. Looking forward to more justification from Vishal on usecase,
>
> - Joel
>
>
> >
- vishalc
On 1/12/26 3:13 PM, Vishal Chourasia wrote:
> Bulk CPU hotplug operations—such as switching SMT modes across all
> cores—require hotplugging multiple CPUs in rapid succession. On large
> systems, this process takes significant time, increasing as the number
> of CPUs grows, leading to substantial delays on high-core-count
> machines. Analysis [1] reveals that the majority of this time is spent
> waiting for synchronize_rcu().
>
> Expedite synchronize_rcu() during the hotplug path to accelerate the
> operation. Since CPU hotplug is a user-initiated administrative task,
> it should complete as quickly as possible.
>
> Performance data on a PPC64 system with 400 CPUs:
>
> + ppc64_cpu --smt=1 (SMT8 to SMT1)
> Before: real 1m14.792s
> After: real 0m03.205s # ~23x improvement
>
> + ppc64_cpu --smt=8 (SMT1 to SMT8)
> Before: real 2m27.695s
> After: real 0m02.510s # ~58x improvement
>
> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b
>
Hi Vishal,
I tried on tip/master at 315f416d3e26.
It fails to apply. Is the rcu tree updated?
> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com
>
> Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
> ---
> include/linux/rcupdate.h | 3 +++
> kernel/cpu.c | 2 ++
> 2 files changed, 5 insertions(+)
>
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index c5b30054cd01..03c06cfb2b6d 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -1192,6 +1192,9 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
> extern int rcu_expedited;
> extern int rcu_normal;
>
> +extern void rcu_expedite_gp(void);
> +extern void rcu_unexpedite_gp(void);
> +
Why is extern needed? All it needs is declarations, no?
> DEFINE_LOCK_GUARD_0(rcu,
> do {
> rcu_read_lock();
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 8df2d773fe3b..6b0d491d73f4 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -506,12 +506,14 @@ EXPORT_SYMBOL_GPL(cpus_read_unlock);
>
> void cpus_write_lock(void)
> {
> + rcu_expedite_gp();
> percpu_down_write(&cpu_hotplug_lock);
> }
>
> void cpus_write_unlock(void)
> {
> percpu_up_write(&cpu_hotplug_lock);
> + rcu_unexpedite_gp();
> }
>
> void lockdep_assert_cpus_held(void)
Have you tested kexec path or suspend/resume path?
Seems like the counter can nest, but would be good to verify.
On 12/01/26 17:51, Shrikanth Hegde wrote:
>
>
> On 1/12/26 3:13 PM, Vishal Chourasia wrote:
>> Bulk CPU hotplug operations—such as switching SMT modes across all
>> cores—require hotplugging multiple CPUs in rapid succession. On large
>> systems, this process takes significant time, increasing as the number
>> of CPUs grows, leading to substantial delays on high-core-count
>> machines. Analysis [1] reveals that the majority of this time is spent
>> waiting for synchronize_rcu().
>>
>> Expedite synchronize_rcu() during the hotplug path to accelerate the
>> operation. Since CPU hotplug is a user-initiated administrative task,
>> it should complete as quickly as possible.
>>
>> Performance data on a PPC64 system with 400 CPUs:
>>
>> + ppc64_cpu --smt=1 (SMT8 to SMT1)
>> Before: real 1m14.792s
>> After: real 0m03.205s # ~23x improvement
>>
>> + ppc64_cpu --smt=8 (SMT1 to SMT8)
>> Before: real 2m27.695s
>> After: real 0m02.510s # ~58x improvement
>>
>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b
>>
>
> Hi Vishal,
>
> I tried on tip/master at 315f416d3e26.
> It fails to apply. Is the rcu tree updated?
I'm currently working off the GitHub mirror (github.com/torvalds/linux).
>
>
>> [1]
>> https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com
>>
>> Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
>> ---
>> include/linux/rcupdate.h | 3 +++
>> kernel/cpu.c | 2 ++
>> 2 files changed, 5 insertions(+)
>>
>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>> index c5b30054cd01..03c06cfb2b6d 100644
>> --- a/include/linux/rcupdate.h
>> +++ b/include/linux/rcupdate.h
>> @@ -1192,6 +1192,9 @@ rcu_head_after_call_rcu(struct rcu_head *rhp,
>> rcu_callback_t f)
>> extern int rcu_expedited;
>> extern int rcu_normal;
>> +extern void rcu_expedite_gp(void);
>> +extern void rcu_unexpedite_gp(void);
>> +
>
> Why is extern needed? All it needs is declarations, no?
They are already declared in kernel/rcu/rcu.h, which is a private RCU
header that kernel/cpu.c does not include. Since kernel/cpu.c already
includes linux/rcupdate.h, I added the extern declarations there.
>
>
>> DEFINE_LOCK_GUARD_0(rcu,
>> do {
>> rcu_read_lock();
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index 8df2d773fe3b..6b0d491d73f4 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -506,12 +506,14 @@ EXPORT_SYMBOL_GPL(cpus_read_unlock);
>> void cpus_write_lock(void)
>> {
>> + rcu_expedite_gp();
>> percpu_down_write(&cpu_hotplug_lock);
>> }
>> void cpus_write_unlock(void)
>> {
>> percpu_up_write(&cpu_hotplug_lock);
>> + rcu_unexpedite_gp();
>> }
>> void lockdep_assert_cpus_held(void)
>
> Have you tested kexec path or suspend/resume path?
I did test the kexec path by booting into another kernel via kexec.
But I didn't test suspend/resume.
> Seems like the counter can nest, but would be good to verify.
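On the nesting question: as I read kernel/rcu/update.c, the expedite
state is a plain nesting counter, so balanced pairs from different paths
compose; suspend/resume already uses the same pair from the RCU PM
notifier. A simplified paraphrase (not the exact code):

static atomic_t rcu_expedited_nesting;

void rcu_expedite_gp(void)
{
	atomic_inc(&rcu_expedited_nesting);
}

void rcu_unexpedite_gp(void)
{
	atomic_dec(&rcu_expedited_nesting);
}

/* Consulted by synchronize_rcu() when picking the expedited path. */
bool rcu_gp_is_expedited(void)
{
	return rcu_expedited || atomic_read(&rcu_expedited_nesting);
}

An unbalanced pair would leave the counter skewed, but nested balanced
pairs are fine.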
On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote:
> Expedite synchronize_rcu() during the hotplug path to accelerate the
> operation. Since CPU hotplug is a user-initiated administrative task,
> it should complete as quickly as possible.
[...]

Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
to speedup regular synchronize_rcu() call. But i am not saying that it would beat
your "expedited switch" improvement.

--
Uladzislau Rezki
Hi Vishal. Thanks for the patch.

On 1/12/26 3:38 PM, Uladzislau Rezki wrote:
> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote:
[...]
> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> to speedup regular synchronize_rcu() call. But i am not saying that it would beat
> your "expedited switch" improvement.
>

Hi Uladzislau.

Had a discussion on this at LPC, having in kernel solution is likely
better than having it in userspace.

- Having it in kernel would make it work across all archs. Why should
any user wait when one initiates the hotplug.

- userspace tools are spread across such as chcpu, ppc64_cpu etc.
though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online".
We will have to repeat the same in each tool.

- There is already /sys/kernel/rcu_expedited which is better if at all
we need to fallback to userspace.
Hello, Shrikanth!

> Had a discussion on this at LPC, having in kernel solution is likely
> better than having it in userspace.
>
> - Having it in kernel would make it work across all archs. Why should
> any user wait when one initiates the hotplug.
>
> - userspace tools are spread across such as chcpu, ppc64_cpu etc.
> though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online".
> We will have to repeat the same in each tool.
>
> - There is already /sys/kernel/rcu_expedited which is better if at all
> we need to fallback to userspace.
>
Sounds good to me. I agree it is better to bypass parameters.

--
Uladzislau Rezki
> On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
>
> Hello, Shrikanth!
>
>> Had a discussion on this at LPC, having in kernel solution is likely
>> better than having it in userspace.
[...]
> Sounds good to me. I agree it is better to bypass parameters.

Another way to make it in-kernel would be to make the RCU normal wake from GP optimization enabled for > 16 CPUs by default.

I was considering this, but I did not bring it up because I did not know that there are large systems that might benefit from it until now.

thanks,

- Joel
On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote:
[...]
> Another way to make it in-kernel would be to make the RCU normal wake from GP optimization enabled for > 16 CPUs by default.
>
> I was considering this, but I did not bring it up because I did not know that there are large systems that might benefit from it until now.
>
IMO, we can increase that threshold. 512/1024 is not a problem at all.
But as Paul mentioned, we should consider scalability enhancement. From
the other hand it is also probably worth to get into the state when we
really see them :)

--
Uladzislau Rezki
> On Jan 12, 2026, at 12:09 PM, Uladzislau Rezki <urezki@gmail.com> wrote:
[...]
> IMO, we can increase that threshold. 512/1024 is not a problem at all.
> But as Paul mentioned, we should consider scalability enhancement. From
> the other hand it is also probably worth to get into the state when we
> really see them :)

Instead of pegging to number of CPUs, perhaps the optimization should be dynamic? That is, default to it unless synchronize_rcu load is high, default to the sr_normal wake-up optimization. Of course carefully considering all corner cases, adequate testing and all that ;-)

Thanks.
On Mon, Jan 12, 2026 at 05:36:24PM +0000, Joel Fernandes wrote:
[...]
> Instead of pegging to number of CPUs, perhaps the optimization should be dynamic? That is, default to it unless synchronize_rcu load is high, default to the sr_normal wake-up optimization. Of course carefully considering all corner cases, adequate testing and all that ;-)
>
Honestly i do not see use cases when we are not up to speed to process
all callbacks in time keeping in mind that it is blocking context call.

How many of them should be in flight(blocked contexts) to make it starve... :)
According to my last evaluation it was ~64K.

Note i do not say that it should not be scaled.

--
Uladzislau Rezki
> On Jan 13, 2026, at 7:19 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
[...]
> Honestly i do not see use cases when we are not up to speed to process
> all callbacks in time keeping in mind that it is blocking context call.
>
> How many of them should be in flight(blocked contexts) to make it starve... :)
> According to my last evaluation it was ~64K.
>
> Note i do not say that it should not be scaled.

But you did not test that on large system with 1000s of CPUs right?

So the options I see are: either default to always using the optimization,
not just for less than 17 CPUs (what you are saying above). Or, do what I said
above (safer for system with 1000s of CPUs and less risky).

Let me know if I missed something. Thanks.
On Tue, Jan 13, 2026 at 12:44:10PM +0000, Joel Fernandes wrote:
[...]
> But you did not test that on large system with 1000s of CPUs right?
>
No, no. I do not have access to such systems.

> So the options I see are: either default to always using the optimization,
> not just for less than 17 CPUs (what you are saying above). Or, do what I said
> above (safer for system with 1000s of CPUs and less risky).
>
You mean introduce threshold and count how many nodes are in queue?
To me it sounds not optimal and looks like a temporary solution.

Long term wise, it is better to split it, i mean to scale.

Do you know who can test it on ~1000 CPUs system? So we have some figures.

What i have is 256 CPUs system i can test on.

--
Uladzislau Rezki
On 1/13/2026 9:17 AM, Uladzislau Rezki wrote:
[...]
> You mean introduce threshold and count how many nodes are in queue?

Yes.

> To me it sounds not optimal and looks like a temporary solution.

Not more sub-optimal than the existing 16 CPU hard-coded solution I suppose.

> Long term wise, it is better to split it, i mean to scale.

But the scalable solution is already there: the !synchronize_rcu_normal path,
right? And splitting the list won't help this use case anyway.

> Do you know who can test it on ~1000 CPUs system? So we have some figures.

I don't have such systems either. The most I can go is ~200+ CPUs. Perhaps the
folks on this thread have such systems as they mentioned 1900+ CPU systems. They
should be happy to test.

> What i have is 256 CPUs system i can test on.

Same boat. ;-)

thanks,

- Joel
On Tue, Jan 13, 2026 at 09:32:13AM -0500, Joel Fernandes wrote:
[...]
> > To me it sounds not optimal and looks like a temporary solution.
>
> Not more sub-optimal than the existing 16 CPU hard-coded solution I suppose.
>
It was trial testing :) Agree we should do something with it.

> > Long term wise, it is better to split it, i mean to scale.
>
> But the scalable solution is already there: the !synchronize_rcu_normal path,
> right? And splitting the list won't help this use case anyway.
>
Fair point.

> > Do you know who can test it on ~1000 CPUs system? So we have some figures.
>
> I don't have such systems either. The most I can go is ~200+ CPUs. Perhaps the
> folks on this thread have such systems as they mentioned 1900+ CPU systems. They
> should be happy to test.
>
> > What i have is 256 CPUs system i can test on.
>
> Same boat. ;-)
>
:)

--
Uladzislau Rezki
Hi.

On 1/13/26 8:02 PM, Joel Fernandes wrote:
[...]
> I don't have such systems either. The most I can go is ~200+ CPUs. Perhaps the
> folks on this thread have such systems as they mentioned 1900+ CPU systems. They
> should be happy to test.
>

Do you have a patch to try out? We can test it on these systems.

Note: Might take a while to test it, as those systems are bit tricky to
get.
On Tue, Jan 13, 2026 at 08:23:29PM +0530, Shrikanth Hegde wrote: > Hi. > > On 1/13/26 8:02 PM, Joel Fernandes wrote: > > > > > > > > > > > > > Another way to make it in-kernel would be to make the RCU normal wake from GP optimization enabled for > 16 CPUs by default. > > > > > > > > > > > > > > > > I was considering this, but I did not bring it up because I did not know that there are large systems that might benefit from it until now. > > > > > > > > > > > > > > > IMO, we can increase that threshold. 512/1024 is not a problem at all. > > > > > > > But as Paul mentioned, we should consider scalability enhancement. From > > > > > > > the other hand it is also probably worth to get into the state when we > > > > > > > really see them :) > > > > > > > > > > > > Instead of pegging to number of CPUs, perhaps the optimization should be dynamic? That is, default to it unless synchronize_rcu load is high, default to the sr_normal wake-up optimization. Of course carefully considering all corner cases, adequate testing and all that ;-) > > > > > > > > > > > Honestly i do not see use cases when we are not up to speed to process > > > > > all callbacks in time keeping in mind that it is blocking context call. > > > > > > > > > > How many of them should be in flight(blocked contexts) to make it starve... :) > > > > > According to my last evaluation it was ~64K. > > > > > > > > > > Note i do not say that it should not be scaled. > > > > > > > > But you did not test that on large system with 1000s of CPUs right? > > > > > > > No, no. I do not have access to such systems. > > > > > > > > > > > So the options I see are: either default to always using the optimization, > > > > not just for less than 17 CPUs (what you are saying above). Or, do what I said > > > > above (safer for system with 1000s of CPUs and less risky). > > > > > > > You mean introduce threshold and count how many nodes are in queue? > > > > Yes. > > > > > To me it sounds not optimal and looks like a temporary solution. > > > > Not more sub-optimal than the existing 16 CPU hard-coded solution I suppose. > > > > > > > > Long term wise, it is better to split it, i mean to scale. > > > > But the scalable solution is already there: the !synchronize_rcu_normal path, > > right? And splitting the list won't help this use case anyway. > > > > > > > > Do you know who can test it on ~1000 CPUs system? So we have some figures. > > > > I don't have such systems either. The most I can go is ~200+ CPUs. Perhaps the > > folks on this thread have such systems as they mentioned 1900+ CPU systems. They > > should be happy to test. > > > > Do you have a patch to try out? We can test it on these systems. > > > Note: Might take a while to test it, as those systems are bit tricky to > get. > Let me prepare something. I will come back. -- Uladzislau Rezki
On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote: > > > > On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > > > > Hello, Shrikanth! > > > >> > >>> On 1/12/26 3:38 PM, Uladzislau Rezki wrote: > >>> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote: > >>>> Bulk CPU hotplug operations—such as switching SMT modes across all > >>>> cores—require hotplugging multiple CPUs in rapid succession. On large > >>>> systems, this process takes significant time, increasing as the number > >>>> of CPUs grows, leading to substantial delays on high-core-count > >>>> machines. Analysis [1] reveals that the majority of this time is spent > >>>> waiting for synchronize_rcu(). > >>>> > >>>> Expedite synchronize_rcu() during the hotplug path to accelerate the > >>>> operation. Since CPU hotplug is a user-initiated administrative task, > >>>> it should complete as quickly as possible. > >>>> > >>>> Performance data on a PPC64 system with 400 CPUs: > >>>> > >>>> + ppc64_cpu --smt=1 (SMT8 to SMT1) > >>>> Before: real 1m14.792s > >>>> After: real 0m03.205s # ~23x improvement > >>>> > >>>> + ppc64_cpu --smt=8 (SMT1 to SMT8) > >>>> Before: real 2m27.695s > >>>> After: real 0m02.510s # ~58x improvement > >>>> > >>>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b > >>>> > >>>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com > >>>> > >>> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp > >>> to speedup regular synchronize_rcu() call. But i am not saying that it would beat > >>> your "expedited switch" improvement. > >>> > >> > >> Hi Uladzislau. > >> > >> Had a discussion on this at LPC, having in kernel solution is likely > >> better than having it in userspace. > >> > >> - Having it in kernel would make it work across all archs. Why should > >> any user wait when one initiates the hotplug. > >> > >> - userspace tools are spread across such as chcpu, ppc64_cpu etc. > >> though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online". > >> We will have to repeat the same in each tool. > >> > >> - There is already /sys/kernel/rcu_expedited which is better if at all > >> we need to fallback to userspace. > >> > > Sounds good to me. I agree it is better to bypass parameters. > > Another way to make it in-kernel would be to make the RCU normal wake from GP optimization enabled for > 16 CPUs by default. > > I was considering this, but I did not bring it up because I did not know that there are large systems that might benefit from it until now. This would require increasing the scalability of this optimization, right? Or am I thinking of the wrong optimization? ;-) Thanx, Paul
On 1/12/2026 11:48 AM, Paul E. McKenney wrote: > On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote: >> >> >>> On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote: >>> >>> Hello, Shrikanth! >>> >>>> >>>>> On 1/12/26 3:38 PM, Uladzislau Rezki wrote: >>>>> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote: >>>>>> Bulk CPU hotplug operations—such as switching SMT modes across all >>>>>> cores—require hotplugging multiple CPUs in rapid succession. On large >>>>>> systems, this process takes significant time, increasing as the number >>>>>> of CPUs grows, leading to substantial delays on high-core-count >>>>>> machines. Analysis [1] reveals that the majority of this time is spent >>>>>> waiting for synchronize_rcu(). >>>>>> >>>>>> Expedite synchronize_rcu() during the hotplug path to accelerate the >>>>>> operation. Since CPU hotplug is a user-initiated administrative task, >>>>>> it should complete as quickly as possible. >>>>>> >>>>>> Performance data on a PPC64 system with 400 CPUs: >>>>>> >>>>>> + ppc64_cpu --smt=1 (SMT8 to SMT1) >>>>>> Before: real 1m14.792s >>>>>> After: real 0m03.205s # ~23x improvement >>>>>> >>>>>> + ppc64_cpu --smt=8 (SMT1 to SMT8) >>>>>> Before: real 2m27.695s >>>>>> After: real 0m02.510s # ~58x improvement >>>>>> >>>>>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b >>>>>> >>>>>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com >>>>>> >>>>> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp >>>>> to speedup regular synchronize_rcu() call. But i am not saying that it would beat >>>>> your "expedited switch" improvement. >>>>> >>>> >>>> Hi Uladzislau. >>>> >>>> Had a discussion on this at LPC, having in kernel solution is likely >>>> better than having it in userspace. >>>> >>>> - Having it in kernel would make it work across all archs. Why should >>>> any user wait when one initiates the hotplug. >>>> >>>> - userspace tools are spread across such as chcpu, ppc64_cpu etc. >>>> though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online". >>>> We will have to repeat the same in each tool. >>>> >>>> - There is already /sys/kernel/rcu_expedited which is better if at all >>>> we need to fallback to userspace. >>>> >>> Sounds good to me. I agree it is better to bypass parameters. >> >> Another way to make it in-kernel would be to make the RCU normal wake >> from GP optimization enabled for > 16 CPUs by default.>> >> I was considering this, but I did not bring it up because I did not >> know that there are large systems that might benefit from it until now.> > This would require increasing the scalability of this optimization, > right? Or am I thinking of the wrong optimization? ;-) > Yes I think you are considering the correct one, the concern you have is regarding large number of wake ups initiated from the GP thread, correct? I was suggesting on the thread, a more dynamic approach where using synchronize_rcu_normal() until it gets overloaded with requests. One approach might be to measure the length of the rcu_state.srs_next to detect an overload condition, similar to qhimark? Or perhaps qhimark itself can be used. And under lightly loaded conditions, default to synchronize_rcu_normal() without checking for the 16 CPU count. Thoughts? thanks, - Joel
On Mon, Jan 12, 2026 at 05:24:40PM -0500, Joel Fernandes wrote: > > > On 1/12/2026 11:48 AM, Paul E. McKenney wrote: > > On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote: > >> > >> > >>> On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > >>> > >>> Hello, Shrikanth! > >>> > >>>> > >>>>> On 1/12/26 3:38 PM, Uladzislau Rezki wrote: > >>>>> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote: > >>>>>> Bulk CPU hotplug operations—such as switching SMT modes across all > >>>>>> cores—require hotplugging multiple CPUs in rapid succession. On large > >>>>>> systems, this process takes significant time, increasing as the number > >>>>>> of CPUs grows, leading to substantial delays on high-core-count > >>>>>> machines. Analysis [1] reveals that the majority of this time is spent > >>>>>> waiting for synchronize_rcu(). > >>>>>> > >>>>>> Expedite synchronize_rcu() during the hotplug path to accelerate the > >>>>>> operation. Since CPU hotplug is a user-initiated administrative task, > >>>>>> it should complete as quickly as possible. > >>>>>> > >>>>>> Performance data on a PPC64 system with 400 CPUs: > >>>>>> > >>>>>> + ppc64_cpu --smt=1 (SMT8 to SMT1) > >>>>>> Before: real 1m14.792s > >>>>>> After: real 0m03.205s # ~23x improvement > >>>>>> > >>>>>> + ppc64_cpu --smt=8 (SMT1 to SMT8) > >>>>>> Before: real 2m27.695s > >>>>>> After: real 0m02.510s # ~58x improvement > >>>>>> > >>>>>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b > >>>>>> > >>>>>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com > >>>>>> > >>>>> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp > >>>>> to speedup regular synchronize_rcu() call. But i am not saying that it would beat > >>>>> your "expedited switch" improvement. > >>>>> > >>>> > >>>> Hi Uladzislau. > >>>> > >>>> Had a discussion on this at LPC, having in kernel solution is likely > >>>> better than having it in userspace. > >>>> > >>>> - Having it in kernel would make it work across all archs. Why should > >>>> any user wait when one initiates the hotplug. > >>>> > >>>> - userspace tools are spread across such as chcpu, ppc64_cpu etc. > >>>> though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online". > >>>> We will have to repeat the same in each tool. > >>>> > >>>> - There is already /sys/kernel/rcu_expedited which is better if at all > >>>> we need to fallback to userspace. > >>>> > >>> Sounds good to me. I agree it is better to bypass parameters. > >> > >> Another way to make it in-kernel would be to make the RCU normal wake > >> from GP optimization enabled for > 16 CPUs by default.>> > >> I was considering this, but I did not bring it up because I did not > >> know that there are large systems that might benefit from it until now.> > > This would require increasing the scalability of this optimization, > > right? Or am I thinking of the wrong optimization? ;-) > > > Yes I think you are considering the correct one, the concern you have is > regarding large number of wake ups initiated from the GP thread, correct? > > I was suggesting on the thread, a more dynamic approach where using > synchronize_rcu_normal() until it gets overloaded with requests. One approach > might be to measure the length of the rcu_state.srs_next to detect an overload > condition, similar to qhimark? Or perhaps qhimark itself can be used. 
And under > lightly loaded conditions, default to synchronize_rcu_normal() without checking > for the 16 CPU count. > > Thoughts? Or maintain multiple lists. Systems with 1000+ CPUs can be a bit unforgiving of pretty much any form of contention. Thanx, Paul
> On Jan 12, 2026, at 7:01 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > On Mon, Jan 12, 2026 at 05:24:40PM -0500, Joel Fernandes wrote: >> >> >>> On 1/12/2026 11:48 AM, Paul E. McKenney wrote: >>> On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote: >>>> >>>> >>>>> On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote: >>>>> >>>>> Hello, Shrikanth! >>>>> >>>>>> >>>>>>> On 1/12/26 3:38 PM, Uladzislau Rezki wrote: >>>>>>> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote: >>>>>>>> Bulk CPU hotplug operations—such as switching SMT modes across all >>>>>>>> cores—require hotplugging multiple CPUs in rapid succession. On large >>>>>>>> systems, this process takes significant time, increasing as the number >>>>>>>> of CPUs grows, leading to substantial delays on high-core-count >>>>>>>> machines. Analysis [1] reveals that the majority of this time is spent >>>>>>>> waiting for synchronize_rcu(). >>>>>>>> >>>>>>>> Expedite synchronize_rcu() during the hotplug path to accelerate the >>>>>>>> operation. Since CPU hotplug is a user-initiated administrative task, >>>>>>>> it should complete as quickly as possible. >>>>>>>> >>>>>>>> Performance data on a PPC64 system with 400 CPUs: >>>>>>>> >>>>>>>> + ppc64_cpu --smt=1 (SMT8 to SMT1) >>>>>>>> Before: real 1m14.792s >>>>>>>> After: real 0m03.205s # ~23x improvement >>>>>>>> >>>>>>>> + ppc64_cpu --smt=8 (SMT1 to SMT8) >>>>>>>> Before: real 2m27.695s >>>>>>>> After: real 0m02.510s # ~58x improvement >>>>>>>> >>>>>>>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b >>>>>>>> >>>>>>>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com >>>>>>>> >>>>>>> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp >>>>>>> to speedup regular synchronize_rcu() call. But i am not saying that it would beat >>>>>>> your "expedited switch" improvement. >>>>>>> >>>>>> >>>>>> Hi Uladzislau. >>>>>> >>>>>> Had a discussion on this at LPC, having in kernel solution is likely >>>>>> better than having it in userspace. >>>>>> >>>>>> - Having it in kernel would make it work across all archs. Why should >>>>>> any user wait when one initiates the hotplug. >>>>>> >>>>>> - userspace tools are spread across such as chcpu, ppc64_cpu etc. >>>>>> though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online". >>>>>> We will have to repeat the same in each tool. >>>>>> >>>>>> - There is already /sys/kernel/rcu_expedited which is better if at all >>>>>> we need to fallback to userspace. >>>>>> >>>>> Sounds good to me. I agree it is better to bypass parameters. >>>> >>>> Another way to make it in-kernel would be to make the RCU normal wake >>>> from GP optimization enabled for > 16 CPUs by default.>> >>>> I was considering this, but I did not bring it up because I did not >>>> know that there are large systems that might benefit from it until now.> >>> This would require increasing the scalability of this optimization, >>> right? Or am I thinking of the wrong optimization? ;-) >>> >> Yes I think you are considering the correct one, the concern you have is >> regarding large number of wake ups initiated from the GP thread, correct? >> >> I was suggesting on the thread, a more dynamic approach where using >> synchronize_rcu_normal() until it gets overloaded with requests. One approach >> might be to measure the length of the rcu_state.srs_next to detect an overload >> condition, similar to qhimark? Or perhaps qhimark itself can be used. 
And under >> lightly loaded conditions, default to synchronize_rcu_normal() without checking >> for the 16 CPU count. >> >> Thoughts? > > Or maintain multiple lists. Systems with 1000+ CPUs can be a bit > unforgiving of pretty much any form of contention. Makes sense. We could also just have a single list but a much smaller threshold for switching synchronize_rcu_normal off. That would address the conveyor belt pattern Vishal expressed. thanks, - Joel > > Thanx, Paul
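For concreteness, here is a minimal sketch of the small-threshold gating idea discussed above. It is purely illustrative: srs_inflight, SRS_SMALL_THRESHOLD, and the two path callbacks are hypothetical names, not existing kernel symbols, and the threshold value is an assumption that would need tuning and testing.

#include <linux/atomic.h>

/* Hypothetical threshold; would need tuning on large systems. */
#define SRS_SMALL_THRESHOLD	8

static atomic_t srs_inflight = ATOMIC_INIT(0);

/*
 * Keep the normal wake-from-GP path enabled while only a handful of
 * synchronize_rcu() waiters are in flight; fall back once the count
 * grows.  Under the serialized hotplug "conveyor belt" the count stays
 * at ~1, so the optimization remains active regardless of CPU count.
 */
static bool srs_use_wake_from_gp(void)
{
	return atomic_read(&srs_inflight) < SRS_SMALL_THRESHOLD;
}

static void srs_wait_for_gp(void (*gp_wake_path)(void),
			    void (*fallback_path)(void))
{
	atomic_inc(&srs_inflight);
	if (srs_use_wake_from_gp())
		gp_wake_path();		/* queue on the GP-kthread wake list */
	else
		fallback_path();	/* classic per-caller wait */
	atomic_dec(&srs_inflight);
}

Note that the check is deliberately cheap and therefore racy: a burst of callers can all read a small count and enqueue at once, which is exactly the "piling on the list" concern raised further down in the thread. The sketch only shows where such a threshold would sit, not how to make it robust.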
On Tue, Jan 13, 2026 at 02:46:56AM +0000, Joel Fernandes wrote: > > > > On Jan 12, 2026, at 7:01 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Mon, Jan 12, 2026 at 05:24:40PM -0500, Joel Fernandes wrote: > >> > >> > >>> On 1/12/2026 11:48 AM, Paul E. McKenney wrote: > >>> On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote: > >>>> > >>>> > >>>>> On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > >>>>> > >>>>> Hello, Shrikanth! > >>>>> > >>>>>> > >>>>>>> On 1/12/26 3:38 PM, Uladzislau Rezki wrote: > >>>>>>> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote: > >>>>>>>> Bulk CPU hotplug operations—such as switching SMT modes across all > >>>>>>>> cores—require hotplugging multiple CPUs in rapid succession. On large > >>>>>>>> systems, this process takes significant time, increasing as the number > >>>>>>>> of CPUs grows, leading to substantial delays on high-core-count > >>>>>>>> machines. Analysis [1] reveals that the majority of this time is spent > >>>>>>>> waiting for synchronize_rcu(). > >>>>>>>> > >>>>>>>> Expedite synchronize_rcu() during the hotplug path to accelerate the > >>>>>>>> operation. Since CPU hotplug is a user-initiated administrative task, > >>>>>>>> it should complete as quickly as possible. > >>>>>>>> > >>>>>>>> Performance data on a PPC64 system with 400 CPUs: > >>>>>>>> > >>>>>>>> + ppc64_cpu --smt=1 (SMT8 to SMT1) > >>>>>>>> Before: real 1m14.792s > >>>>>>>> After: real 0m03.205s # ~23x improvement > >>>>>>>> > >>>>>>>> + ppc64_cpu --smt=8 (SMT1 to SMT8) > >>>>>>>> Before: real 2m27.695s > >>>>>>>> After: real 0m02.510s # ~58x improvement > >>>>>>>> > >>>>>>>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b > >>>>>>>> > >>>>>>>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com > >>>>>>>> > >>>>>>> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp > >>>>>>> to speedup regular synchronize_rcu() call. But i am not saying that it would beat > >>>>>>> your "expedited switch" improvement. > >>>>>>> > >>>>>> > >>>>>> Hi Uladzislau. > >>>>>> > >>>>>> Had a discussion on this at LPC, having in kernel solution is likely > >>>>>> better than having it in userspace. > >>>>>> > >>>>>> - Having it in kernel would make it work across all archs. Why should > >>>>>> any user wait when one initiates the hotplug. > >>>>>> > >>>>>> - userspace tools are spread across such as chcpu, ppc64_cpu etc. > >>>>>> though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online". > >>>>>> We will have to repeat the same in each tool. > >>>>>> > >>>>>> - There is already /sys/kernel/rcu_expedited which is better if at all > >>>>>> we need to fallback to userspace. > >>>>>> > >>>>> Sounds good to me. I agree it is better to bypass parameters. > >>>> > >>>> Another way to make it in-kernel would be to make the RCU normal wake > >>>> from GP optimization enabled for > 16 CPUs by default.>> > >>>> I was considering this, but I did not bring it up because I did not > >>>> know that there are large systems that might benefit from it until now.> > >>> This would require increasing the scalability of this optimization, > >>> right? Or am I thinking of the wrong optimization? ;-) > >>> > >> Yes I think you are considering the correct one, the concern you have is > >> regarding large number of wake ups initiated from the GP thread, correct? 
> >> > >> I was suggesting on the thread, a more dynamic approach where using > >> synchronize_rcu_normal() until it gets overloaded with requests. One approach > >> might be to measure the length of the rcu_state.srs_next to detect an overload > >> condition, similar to qhimark? Or perhaps qhimark itself can be used. And under > >> lightly loaded conditions, default to synchronize_rcu_normal() without checking > >> for the 16 CPU count. > >> > >> Thoughts? > > > > Or maintain multiple lists. Systems with 1000+ CPUs can be a bit > > unforgiving of pretty much any form of contention. > > Makes sense. We could also just have a single list but a much smaller threshold for switching synchronize_rcu_normal off. > > That would address the conveyor belt pattern Vishal expressed. On a system with more than 1,000 CPUs, any single list needs to be handled extremely carefully to avoid contention of one sort or another. At that many CPUs, the default rule is instead "never have just one of anything". Unless that "just one" is protected by some contention-avoidance scheme, for example, like the rcu_node tree protects the root rcu_node structure and the rcu_state structure from contention. Thanx, Paul
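To illustrate the "never have just one of anything" direction, the following is a rough, hypothetical sketch of sharding the waiter queue instead of keeping a single rcu_state.srs_next-style list; none of these identifiers exist in the kernel, and the shard count is an arbitrary assumption.

#include <linux/cache.h>
#include <linux/llist.h>
#include <linux/smp.h>

#define SRS_NR_SHARDS	16	/* hypothetical; could scale with nr_cpu_ids */

struct srs_shard {
	struct llist_head head;
} ____cacheline_aligned_in_smp;

static struct srs_shard srs_shards[SRS_NR_SHARDS];

/* Callers spread their enqueues across shards to avoid one hot cacheline. */
static void srs_add_waiter(struct llist_node *node)
{
	int idx = raw_smp_processor_id() % SRS_NR_SHARDS;

	llist_add(node, &srs_shards[idx].head);
}

/* GP-kthread side: drain every shard and wake each queued waiter. */
static void srs_drain_and_wake(void (*wake_one)(struct llist_node *))
{
	int i;

	for (i = 0; i < SRS_NR_SHARDS; i++) {
		struct llist_node *pos, *n;

		llist_for_each_safe(pos, n, llist_del_all(&srs_shards[i].head))
			wake_one(pos);
	}
}

A real implementation would also have to preserve the ordering between waiters and grace periods that the single-list design provides naturally; this only shows how contention on one list head could be spread out.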
Hi. On 1/13/26 8:16 AM, Joel Fernandes wrote: > > >>>>> Another way to make it in-kernel would be to make the RCU normal wake >>>>> from GP optimization enabled for > 16 CPUs by default.>> >>>>> I was considering this, but I did not bring it up because I did not >>>>> know that there are large systems that might benefit from it until now.> >>>> This would require increasing the scalability of this optimization, >>>> right? Or am I thinking of the wrong optimization? ;-) >>>> >>> Yes I think you are considering the correct one, the concern you have is >>> regarding large number of wake ups initiated from the GP thread, correct? >>> >>> I was suggesting on the thread, a more dynamic approach where using >>> synchronize_rcu_normal() until it gets overloaded with requests. One approach >>> might be to measure the length of the rcu_state.srs_next to detect an overload >>> condition, similar to qhimark? Or perhaps qhimark itself can be used. And under >>> lightly loaded conditions, default to synchronize_rcu_normal() without checking >>> for the 16 CPU count. >>> >>> Thoughts? >> >> Or maintain multiple lists. Systems with 1000+ CPUs can be a bit >> unforgiving of pretty much any form of contention. > > Makes sense. We could also just have a single list but a much smaller threshold for switching synchronize_rcu_normal off. > > That would address the conveyor belt pattern Vishal expressed. > > thanks, > > - Joel > Wouldn't that make most of the sync_rcu calls on large system with synchronize_rcu_normal off? Whats the cost of doing this? (Me not knowing much about rcu internals)
> On Jan 12, 2026, at 11:55 PM, Shrikanth Hegde <sshegde@linux.ibm.com> wrote: > > Hi. > > On 1/13/26 8:16 AM, Joel Fernandes wrote: > > >>>>>> Another way to make it in-kernel would be to make the RCU normal wake >>>>>> from GP optimization enabled for > 16 CPUs by default.>> >>>>>> I was considering this, but I did not bring it up because I did not >>>>>> know that there are large systems that might benefit from it until now.> >>>>> This would require increasing the scalability of this optimization, >>>>> right? Or am I thinking of the wrong optimization? ;-) >>>>> >>>> Yes I think you are considering the correct one, the concern you have is >>>> regarding large number of wake ups initiated from the GP thread, correct? >>>> >>>> I was suggesting on the thread, a more dynamic approach where using >>>> synchronize_rcu_normal() until it gets overloaded with requests. One approach >>>> might be to measure the length of the rcu_state.srs_next to detect an overload >>>> condition, similar to qhimark? Or perhaps qhimark itself can be used. And under >>>> lightly loaded conditions, default to synchronize_rcu_normal() without checking >>>> for the 16 CPU count. >>>> >>>> Thoughts? >>> >>> Or maintain multiple lists. Systems with 1000+ CPUs can be a bit >>> unforgiving of pretty much any form of contention. >> Makes sense. We could also just have a single list but a much smaller threshold for switching synchronize_rcu_normal off. >> That would address the conveyor belt pattern Vishal expressed. >> thanks, >> - Joel > > Wouldn't that make most of the sync_rcu calls on large system > with synchronize_rcu_normal off? It would and that is expected. > > Whats the cost of doing this? There is no cost, that is the point right. The scalability issue Paul is referring to is the large number of wake ups. You wont have that if the number of synchronous callers is small. - Joel > > (Me not knowing much about rcu internals)
On Tue, Jan 13, 2026 at 08:57:20AM +0000, Joel Fernandes wrote: > > > > On Jan 12, 2026, at 11:55 PM, Shrikanth Hegde <sshegde@linux.ibm.com> wrote: > > > > Hi. > > > > On 1/13/26 8:16 AM, Joel Fernandes wrote: > > > > > >>>>>> Another way to make it in-kernel would be to make the RCU normal wake > >>>>>> from GP optimization enabled for > 16 CPUs by default.>> > >>>>>> I was considering this, but I did not bring it up because I did not > >>>>>> know that there are large systems that might benefit from it until now.> > >>>>> This would require increasing the scalability of this optimization, > >>>>> right? Or am I thinking of the wrong optimization? ;-) > >>>>> > >>>> Yes I think you are considering the correct one, the concern you have is > >>>> regarding large number of wake ups initiated from the GP thread, correct? > >>>> > >>>> I was suggesting on the thread, a more dynamic approach where using > >>>> synchronize_rcu_normal() until it gets overloaded with requests. One approach > >>>> might be to measure the length of the rcu_state.srs_next to detect an overload > >>>> condition, similar to qhimark? Or perhaps qhimark itself can be used. And under > >>>> lightly loaded conditions, default to synchronize_rcu_normal() without checking > >>>> for the 16 CPU count. > >>>> > >>>> Thoughts? > >>> > >>> Or maintain multiple lists. Systems with 1000+ CPUs can be a bit > >>> unforgiving of pretty much any form of contention. > >> Makes sense. We could also just have a single list but a much smaller threshold for switching synchronize_rcu_normal off. > >> That would address the conveyor belt pattern Vishal expressed. > >> thanks, > >> - Joel > > > > Wouldn't that make most of the sync_rcu calls on large system > > with synchronize_rcu_normal off? > > It would and that is expected. > > > > > Whats the cost of doing this? > > There is no cost, that is the point right. The scalability issue Paul is referring to is the > large number of wake ups. You wont have that if the number of synchronous callers is small. Also the contention involved in the list management, if there is still only the one list. Thanx, Paul
On 1/13/2026 11:00 PM, Paul E. McKenney wrote: > On Tue, Jan 13, 2026 at 08:57:20AM +0000, Joel Fernandes wrote: >> >> >>> On Jan 12, 2026, at 11:55 PM, Shrikanth Hegde <sshegde@linux.ibm.com> wrote: >>> >>> Hi. >>> >>> On 1/13/26 8:16 AM, Joel Fernandes wrote: >>> >>> >>>>>>>> Another way to make it in-kernel would be to make the RCU normal wake >>>>>>>> from GP optimization enabled for > 16 CPUs by default.>> >>>>>>>> I was considering this, but I did not bring it up because I did not >>>>>>>> know that there are large systems that might benefit from it until now.> >>>>>>> This would require increasing the scalability of this optimization, >>>>>>> right? Or am I thinking of the wrong optimization? ;-) >>>>>>> >>>>>> Yes I think you are considering the correct one, the concern you have is >>>>>> regarding large number of wake ups initiated from the GP thread, correct? >>>>>> >>>>>> I was suggesting on the thread, a more dynamic approach where using >>>>>> synchronize_rcu_normal() until it gets overloaded with requests. One approach >>>>>> might be to measure the length of the rcu_state.srs_next to detect an overload >>>>>> condition, similar to qhimark? Or perhaps qhimark itself can be used. And under >>>>>> lightly loaded conditions, default to synchronize_rcu_normal() without checking >>>>>> for the 16 CPU count. >>>>>> >>>>>> Thoughts? >>>>> >>>>> Or maintain multiple lists. Systems with 1000+ CPUs can be a bit >>>>> unforgiving of pretty much any form of contention. >>>> Makes sense. We could also just have a single list but a much smaller threshold for switching synchronize_rcu_normal off. >>>> That would address the conveyor belt pattern Vishal expressed. >>>> thanks, >>>> - Joel >>> >>> Wouldn't that make most of the sync_rcu calls on large system >>> with synchronize_rcu_normal off? >> >> It would and that is expected. >> >>> >>> Whats the cost of doing this? >> >> There is no cost, that is the point right. The scalability issue Paul is referring to is the >> large number of wake ups. You wont have that if the number of synchronous callers is small. > > Also the contention involved in the list management, if there is still > only the one list. > Even if the number of synchronize_rcu() in flight is a small number? like < 10. To clarify, I meant keeping the threshold that small in favor of the list contention issue you're raising. Thanks! - Joel
On Wed, Jan 14, 2026 at 03:54:44AM -0500, Joel Fernandes wrote: > > > On 1/13/2026 11:00 PM, Paul E. McKenney wrote: > > On Tue, Jan 13, 2026 at 08:57:20AM +0000, Joel Fernandes wrote: > >> > >> > >>> On Jan 12, 2026, at 11:55 PM, Shrikanth Hegde <sshegde@linux.ibm.com> wrote: > >>> > >>> Hi. > >>> > >>> On 1/13/26 8:16 AM, Joel Fernandes wrote: > >>> > >>> > >>>>>>>> Another way to make it in-kernel would be to make the RCU normal wake > >>>>>>>> from GP optimization enabled for > 16 CPUs by default.>> > >>>>>>>> I was considering this, but I did not bring it up because I did not > >>>>>>>> know that there are large systems that might benefit from it until now.> > >>>>>>> This would require increasing the scalability of this optimization, > >>>>>>> right? Or am I thinking of the wrong optimization? ;-) > >>>>>>> > >>>>>> Yes I think you are considering the correct one, the concern you have is > >>>>>> regarding large number of wake ups initiated from the GP thread, correct? > >>>>>> > >>>>>> I was suggesting on the thread, a more dynamic approach where using > >>>>>> synchronize_rcu_normal() until it gets overloaded with requests. One approach > >>>>>> might be to measure the length of the rcu_state.srs_next to detect an overload > >>>>>> condition, similar to qhimark? Or perhaps qhimark itself can be used. And under > >>>>>> lightly loaded conditions, default to synchronize_rcu_normal() without checking > >>>>>> for the 16 CPU count. > >>>>>> > >>>>>> Thoughts? > >>>>> > >>>>> Or maintain multiple lists. Systems with 1000+ CPUs can be a bit > >>>>> unforgiving of pretty much any form of contention. > >>>> Makes sense. We could also just have a single list but a much smaller threshold for switching synchronize_rcu_normal off. > >>>> That would address the conveyor belt pattern Vishal expressed. > >>>> thanks, > >>>> - Joel > >>> > >>> Wouldn't that make most of the sync_rcu calls on large system > >>> with synchronize_rcu_normal off? > >> > >> It would and that is expected. > >> > >>> > >>> Whats the cost of doing this? > >> > >> There is no cost, that is the point right. The scalability issue Paul is referring to is the > >> large number of wake ups. You wont have that if the number of synchronous callers is small. > > > > Also the contention involved in the list management, if there is still > > only the one list. > > > Even if the number of synchronize_rcu() in flight is a small number? like < 10. > To clarify, I meant keeping the threshold that small in favor of the list > contention issue you're raising. Maybe? One remaining concern is the reads of the counter. Another remaining concern is the possibility of large numbers of CPUs reading the counter, seeing that it is small, then all piling on the list. But it looks like you might get some testing on a large system, so nothing quite like finding out. If it breaks, you guys get to fix it on whatever schedule is indicated at that time. ;-) Thanx, Paul
On Mon, Jan 12, 2026 at 08:48:42AM -0800, Paul E. McKenney wrote: > On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote: > > > > > > > On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > Hello, Shrikanth! > > > > > >> > > >>> On 1/12/26 3:38 PM, Uladzislau Rezki wrote: > > >>> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote: > > >>>> Bulk CPU hotplug operations—such as switching SMT modes across all > > >>>> cores—require hotplugging multiple CPUs in rapid succession. On large > > >>>> systems, this process takes significant time, increasing as the number > > >>>> of CPUs grows, leading to substantial delays on high-core-count > > >>>> machines. Analysis [1] reveals that the majority of this time is spent > > >>>> waiting for synchronize_rcu(). > > >>>> > > >>>> Expedite synchronize_rcu() during the hotplug path to accelerate the > > >>>> operation. Since CPU hotplug is a user-initiated administrative task, > > >>>> it should complete as quickly as possible. > > >>>> > > >>>> Performance data on a PPC64 system with 400 CPUs: > > >>>> > > >>>> + ppc64_cpu --smt=1 (SMT8 to SMT1) > > >>>> Before: real 1m14.792s > > >>>> After: real 0m03.205s # ~23x improvement > > >>>> > > >>>> + ppc64_cpu --smt=8 (SMT1 to SMT8) > > >>>> Before: real 2m27.695s > > >>>> After: real 0m02.510s # ~58x improvement > > >>>> > > >>>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b > > >>>> > > >>>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com > > >>>> > > >>> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp > > >>> to speedup regular synchronize_rcu() call. But i am not saying that it would beat > > >>> your "expedited switch" improvement. > > >>> > > >> > > >> Hi Uladzislau. > > >> > > >> Had a discussion on this at LPC, having in kernel solution is likely > > >> better than having it in userspace. > > >> > > >> - Having it in kernel would make it work across all archs. Why should > > >> any user wait when one initiates the hotplug. > > >> > > >> - userspace tools are spread across such as chcpu, ppc64_cpu etc. > > >> though internally most do "0/1 > /sys/devices/system/cpu/cpuN/online". > > >> We will have to repeat the same in each tool. > > >> > > >> - There is already /sys/kernel/rcu_expedited which is better if at all > > >> we need to fallback to userspace. > > >> > > > Sounds good to me. I agree it is better to bypass parameters. > > > > Another way to make it in-kernel would be to make the RCU normal wake from GP optimization enabled for > 16 CPUs by default. > > > > I was considering this, but I did not bring it up because I did not know that there are large systems that might benefit from it until now. > > This would require increasing the scalability of this optimization, > right? Or am I thinking of the wrong optimization? ;-) > I tested this before. I noticed that after 64K of simultaneous synchronize_rcu() calls the scalability is required. Everything less was faster with a new approach. I can retest. Should i? :) -- Uladzsislau Rezki
Hello Joel, Paul, Uladzislau,
On Mon, Jan 12, 2026 at 06:05:30PM +0100, Uladzislau Rezki wrote:
> On Mon, Jan 12, 2026 at 08:48:42AM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote:
> > >
> > >
> > > > On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > >>
> > > > Sounds good to me. I agree it is better to bypass parameters.
> > >
> > > Another way to make it in-kernel would be to make the RCU normal wake from GP optimization enabled for > 16 CPUs by default.
> > >
> > > I was considering this, but I did not bring it up because I did not know that there are large systems that might benefit from it until now.
> >
> > This would require increasing the scalability of this optimization,
> > right? Or am I thinking of the wrong optimization? ;-)
> >
> I tested this before. I noticed that after 64K of simultaneous
> synchronize_rcu() calls the scalability is required. Everything
> less was faster with a new approach.
It is worth noting that bulk CPU hotplug represents a different stress
pattern than the "simultaneous call" scenario mentioned above.
In a large-scale hotplug event (like an SMT mode switch), we aren't
necessarily seeing thousands of simultaneous synchronize_rcu() calls.
Instead, because CPU hotplug operations are serialized, we see a
"conveyor belt" of sequential calls. One synchronize_rcu() blocks, the
hotplug state machine waits, it unblocks, and then the next call is
triggered shortly after.
The bottleneck here isn't RCU scalability under concurrent load, but
rather the accumulated latency of hundreds of sequential grace periods.
For example, on pSeries, onlining 350 out of 400 CPUs triggers exactly
350 synchronize_rcu() calls at each of three different points in the
hotplug state machine, about 1050 calls in total. Even though they
happen one at a time, the sheer volume makes the total operation time
prohibitive.
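As a rough back-of-the-envelope check (treating the quoted --smt=8
"Before" time of 2m27.695s as dominated by these waits), ~1050
sequential synchronize_rcu() calls over ~148 seconds works out to
roughly 140 ms per call. That is on the order of one or two normal
grace periods each (exact grace-period length depends on HZ and system
load), which fits the accumulated-sequential-latency picture above
rather than any concurrent overload.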
The following call stacks were collected during an SMT mode switch
where 350 out of 400 CPUs were onlined:
@[
synchronize_rcu+12
cpuidle_pause_and_lock+120
pseries_cpuidle_cpu_online+88
cpuhp_invoke_callback+500
cpuhp_thread_fun+316
smpboot_thread_fn+512
kthread+308
start_kernel_thread+20
]: 350
@[
synchronize_rcu+12
rcu_sync_enter+260
percpu_down_write+76
_cpu_up+140
cpu_up+440
cpu_subsys_online+128
device_online+176
online_store+220
dev_attr_store+52
sysfs_kf_write+120
kernfs_fop_write_iter+456
vfs_write+952
ksys_write+132
system_call_exception+292
system_call_vectored_common+348
]: 350
@[
synchronize_rcu+12
rcu_sync_enter+260
percpu_down_write+76
try_online_node+64
cpu_up+120
cpu_subsys_online+128
device_online+176
online_store+220
dev_attr_store+52
sysfs_kf_write+120
kernfs_fop_write_iter+456
vfs_write+952
ksys_write+132
system_call_exception+292
system_call_vectored_common+348
]: 350
The following call stacks were collected during an SMT mode switch
where 350 out of 400 CPUs were offlined:
@[
synchronize_rcu+12
rcu_sync_enter+260
percpu_down_write+76
_cpu_down+188
__cpu_down_maps_locked+44
work_for_cpu_fn+56
process_one_work+508
worker_thread+840
kthread+308
start_kernel_thread+20
]: 1
@[
synchronize_rcu+12
sched_cpu_deactivate+244
cpuhp_invoke_callback+500
cpuhp_thread_fun+316
smpboot_thread_fn+512
kthread+308
start_kernel_thread+20
]: 350
@[
synchronize_rcu+12
cpuidle_pause_and_lock+120
pseries_cpuidle_cpu_dead+88
cpuhp_invoke_callback+500
__cpuhp_invoke_callback_range+200
_cpu_down+412
__cpu_down_maps_locked+44
work_for_cpu_fn+56
process_one_work+508
worker_thread+840
kthread+308
start_kernel_thread+20
]: 350
- vishalc
On Mon, Jan 12, 2026 at 11:57:41PM +0530, Vishal Chourasia wrote: > Hello Joel, Paul, Uladzislau, > > On Mon, Jan 12, 2026 at 06:05:30PM +0100, Uladzislau Rezki wrote: > > On Mon, Jan 12, 2026 at 08:48:42AM -0800, Paul E. McKenney wrote: > > > On Mon, Jan 12, 2026 at 04:09:49PM +0000, Joel Fernandes wrote: > > > > > > > > > > > > > On Jan 12, 2026, at 7:57 AM, Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > > >> > > > > > Sounds good to me. I agree it is better to bypass parameters. > > > > > > > > Another way to make it in-kernel would be to make the RCU normal wake from GP optimization enabled for > 16 CPUs by default. > > > > > > > > I was considering this, but I did not bring it up because I did not know that there are large systems that might benefit from it until now. > > > > > > This would require increasing the scalability of this optimization, > > > right? Or am I thinking of the wrong optimization? ;-) > > > > > I tested this before. I noticed that after 64K of simultaneous > > synchronize_rcu() calls the scalability is required. Everything > > less was faster with a new approach. > > It is worth noting that bulk CPU hotplug represents a different stress > pattern than the "simultaneous call" scenario mentioned above. > > In a large-scale hotplug event (like a SMT mode switch), we aren't > necessarily seeing thousands of simultaneous synchronize_rcu() calls. > Instead, because CPU hotplug operations are serialized, we see a > "conveyor belt" of sequential calls. One synchronize_rcu() blocks, the > hotplug state machine waits, it unblocks, and then the next call is > triggered shortly after. > > The bottleneck here isn't RCU scalability under concurrent load, but > rather the accumulated latency of hundreds of sequential Grace Periods. > > For example, on pSeries, onlining 350 out of 400 CPUs triggers exactly > 350 calls at three different points in the hotplug state machine. Even > though they happen one at a time, the sheer volume makes the total > operation time prohibitive. 
> > Following callstack was collected during SMT mode switch where 350 out > of 400 CPUs were onlined, > > @[ > synchronize_rcu+12 > cpuidle_pause_and_lock+120 > pseries_cpuidle_cpu_online+88 > cpuhp_invoke_callback+500 > cpuhp_thread_fun+316 > smpboot_thread_fn+512 > kthread+308 > start_kernel_thread+20 > ]: 350 > @[ > synchronize_rcu+12 > rcu_sync_enter+260 > percpu_down_write+76 > _cpu_up+140 > cpu_up+440 > cpu_subsys_online+128 > device_online+176 > online_store+220 > dev_attr_store+52 > sysfs_kf_write+120 > kernfs_fop_write_iter+456 > vfs_write+952 > ksys_write+132 > system_call_exception+292 > system_call_vectored_common+348 > ]: 350 > @[ > synchronize_rcu+12 > rcu_sync_enter+260 > percpu_down_write+76 > try_online_node+64 > cpu_up+120 > cpu_subsys_online+128 > device_online+176 > online_store+220 > dev_attr_store+52 > sysfs_kf_write+120 > kernfs_fop_write_iter+456 > vfs_write+952 > ksys_write+132 > system_call_exception+292 > system_call_vectored_common+348 > ]: 350 > > Following callstack was collected during SMT mode switch where 350 out > of 400 CPUs where offlined, > > @[ > synchronize_rcu+12 > rcu_sync_enter+260 > percpu_down_write+76 > _cpu_down+188 > __cpu_down_maps_locked+44 > work_for_cpu_fn+56 > process_one_work+508 > worker_thread+840 > kthread+308 > start_kernel_thread+20 > ]: 1 > @[ > synchronize_rcu+12 > sched_cpu_deactivate+244 > cpuhp_invoke_callback+500 > cpuhp_thread_fun+316 > smpboot_thread_fn+512 > kthread+308 > start_kernel_thread+20 > ]: 350 > @[ > synchronize_rcu+12 > cpuidle_pause_and_lock+120 > pseries_cpuidle_cpu_dead+88 > cpuhp_invoke_callback+500 > __cpuhp_invoke_callback_range+200 > _cpu_down+412 > __cpu_down_maps_locked+44 > work_for_cpu_fn+56 > process_one_work+508 > worker_thread+840 > kthread+308 > start_kernel_thread+20 > ]: 350 I still suggest that you test on a big system. There are other sources of synchronize_rcu() calls than just CPU hotplug. ;-) Thanx, Paul
Hi Uladzislau,

On 12/01/26 15:38, Uladzislau Rezki wrote:
> On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote:
>> Performance data on a PPC64 system with 400 CPUs:
>>
>> + ppc64_cpu --smt=1 (SMT8 to SMT1)
>> Before: real 1m14.792s
>> After: real 0m03.205s # ~23x improvement
>>
>> + ppc64_cpu --smt=8 (SMT1 to SMT8)
>> Before: real 2m27.695s
>> After: real 0m02.510s # ~58x improvement
>>
>> Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b
>>
>> [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com
>>
> Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> to speedup regular synchronize_rcu() call. But i am not saying that it would beat
> your "expedited switch" improvement.

# echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp

After setting,

# time ppc64_cpu --smt=1;
real 1m10.726s # Run 1
real 1m12.530s # Run 2

# time ppc64_cpu --smt=8
real 0m36.661s # Run 1
real 0m41.401s # Run 2

Regards,
vishalc
Hello, Vishalc!

> Hi Uladzislau,
>
> On 12/01/26 15:38, Uladzislau Rezki wrote:
> > On Mon, Jan 12, 2026 at 03:13:33PM +0530, Vishal Chourasia wrote:
> > > Performance data on a PPC64 system with 400 CPUs:
> > >
> > > + ppc64_cpu --smt=1 (SMT8 to SMT1)
> > > Before: real 1m14.792s
> > > After: real 0m03.205s # ~23x improvement
> > >
> > > + ppc64_cpu --smt=8 (SMT1 to SMT8)
> > > Before: real 2m27.695s
> > > After: real 0m02.510s # ~58x improvement
> > >
> > > Above numbers were collected on Linux 6.19.0-rc4-00310-g755bc1335e3b
> > >
> > > [1] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com
> > >
> > Also you can try: echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> > to speedup regular synchronize_rcu() call. But i am not saying that it would beat
> > your "expedited switch" improvement.
>
> # echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
>
> After setting,
>
> # time ppc64_cpu --smt=1;
> real 1m10.726s # Run 1
> real 1m12.530s # Run 2
>
> # time ppc64_cpu --smt=8
> real 0m36.661s # Run 1
> real 0m41.401s # Run 2
>
Thanks. "ppc64_cpu --smt=1" is about the same; I assume that is the
offlining case. "ppc64_cpu --smt=8", the onlining case, does see a
difference (~5x). But your real "0m02.510s" is hard to beat even by
activating the "rcu_normal_wake_from_gp" option.

--
Uladzislau Rezki