[PATCH v4 0/1] cpuhp: Expedite RCU when toggling system-wide SMT mode

Vishal Chourasia posted 1 patch 1 month, 1 week ago
include/linux/rcupdate.h | 8 ++++++++
kernel/cpu.c             | 4 ++++
kernel/rcu/rcu.h         | 4 ----
3 files changed, 12 insertions(+), 4 deletions(-)
Posted by Vishal Chourasia 1 month, 1 week ago
Hello All,

Switching the SMT mode on a system with a large CPU count takes close to
an hour. Initial debugging traced the delay to the CPU hotplug subsystem
blocking on numerous synchronize_rcu() calls. Simply enabling system-wide
RCU expediting reduced the switch time to 5-6 minutes. Since then,
several approaches have been explored; some had side effects of their
own and others did not work as expected.

Approaches explored:

1. Expediting individual CPU hotplug operations by wrapping
_cpu_up()/_cpu_down() with rcu_expedite_gp()/rcu_unexpedite_gp() [0].
Peter suggested expediting only when an SMT switch is triggered via the
sysfs control interface, not for individual hotplug operations [1].

2. Replacing synchronize_rcu() calls in the CPU hotplug codepath with
their expedited variants. This is not viable because one
synchronize_rcu() is invoked inside cpus_write_lock(), which is shared
with other kernel subsystems [5].

3. Hoisting cpus_write_lock() to be taken once for the entire SMT switch
operation instead of per-CPU [3][4]. On large systems where the SMT
switch can still take 5-6 minutes, holding the lock for that duration
causes hung task splats and starves other subsystems depending on the
read lock.

4. Peter has also suggested using rcu_sync_{enter|exit}(), which does not
help as is, but can be paired with approach 2 above.

Current approach: expedite RCU grace periods around the SMT switch
operation in the sysfs control interface path, per Peter's suggestion
[1], with Aboorva's analysis confirming synchronize_rcu() as the
bottleneck [2].
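
For readers who want a feel for the shape of the current approach, here
is a minimal userspace model, not the patch itself. It assumes that
rcu_expedite_gp()/rcu_unexpedite_gp() behave as a nesting counter and
that a grace-period wait takes the expedited path while that counter is
non-zero. Every model_* name, the wait counters, and the "odd CPUs are
secondary threads" rule are hypothetical and exist only for
illustration.

```c
/* Userspace model only; none of this is kernel code. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for a nesting count of "please expedite" requests. */
static atomic_int expedited_nesting;

/* Models rcu_expedite_gp(): take one expedite request. */
static void model_rcu_expedite_gp(void)
{
	atomic_fetch_add(&expedited_nesting, 1);
}

/* Models rcu_unexpedite_gp(): drop one request; calls must be paired. */
static void model_rcu_unexpedite_gp(void)
{
	int prev = atomic_fetch_sub(&expedited_nesting, 1);

	assert(prev > 0);	/* an unpaired call would underflow */
}

/* True while at least one expedite request is outstanding. */
static bool model_gp_is_expedited(void)
{
	return atomic_load(&expedited_nesting) > 0;
}

/* Hypothetical counters recording which path each wait took. */
static int expedited_waits;
static int normal_waits;

/* Models a synchronize_rcu() call: pick the path, do not actually wait. */
static void model_synchronize_rcu(void)
{
	if (model_gp_is_expedited())
		expedited_waits++;
	else
		normal_waits++;
}

/* Hypothetical hotplug step: one grace-period wait per CPU taken down. */
static void model_cpu_down(int cpu)
{
	(void)cpu;
	model_synchronize_rcu();
}

/*
 * Hypothetical SMT-disable path: expedite once around the whole loop,
 * rather than inside each per-CPU operation. Odd-numbered CPUs stand in
 * for secondary threads.
 */
static void model_smt_disable(int nr_cpus)
{
	model_rcu_expedite_gp();
	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		if (cpu % 2)
			model_cpu_down(cpu);
	}
	model_rcu_unexpedite_gp();
}
```

In this model, a lone model_cpu_down() call takes the normal path, while
every wait issued from inside model_smt_disable() takes the expedited
path, and the counter returns to zero afterwards. That mirrors the
intent stated above: individual hotplug operations are left alone and
only the sysfs-triggered SMT switch is expedited.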

[0] https://lore.kernel.org/all/20260218083915.660252-2-vishalc@linux.ibm.com
[1] https://lore.kernel.org/all/20260113090153.GS830755@noisy.programming.kicks-ass.net/
[2] https://lore.kernel.org/all/5f2ab8a44d685701fe36cdaa8042a1aef215d10d.camel@linux.vnet.ibm.com
[3] https://lore.kernel.org/all/20260119114333.GI1890602@noisy.programming.kicks-ass.net/
[4] https://lore.kernel.org/all/ba470918-0ad9-4548-9161-826948462f73@linux.ibm.com/
[5] https://lore.kernel.org/all/804E7B47-F515-4592-B12E-84AD251EB07D@nvidia.com/
[6] https://lore.kernel.org/all/e2cca734-9191-4073-ba9d-936014498645@linux.ibm.com/

Vishal Chourasia (1):
  cpuhp: Expedite RCU when toggling system-wide SMT mode

 include/linux/rcupdate.h | 8 ++++++++
 kernel/cpu.c             | 4 ++++
 kernel/rcu/rcu.h         | 4 ----
 3 files changed, 12 insertions(+), 4 deletions(-)

-- 
2.54.0
Re: [PATCH v4 0/1] cpuhp: Expedite RCU when toggling system-wide SMT mode
Posted by Vishal Chourasia 5 days, 3 hours ago
Hi All,

Gentle ping.
Should I send another version with the tags?

Thanks,
vishalc

On 07/05/26 11:09, Vishal Chourasia wrote:
> [...]