[PATCH v3 2/2] cpuhp: Expedite RCU grace periods during SMT operations

Posted by Vishal Chourasia 4 weeks ago
Expedite synchronize_rcu() during SMT mode switch operations initiated
via the /sys/devices/system/cpu/smt/control interface.

SMT mode switches, e.g. from SMT 8 to SMT 1 or vice versa, are
user-driven operations and should therefore complete as quickly as
possible. Switching SMT states involves iterating over a list of CPUs
and performing hotplug operations on each. These transitions were found
to take a significantly long time to complete, particularly on
high-core-count systems.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
---
 include/linux/rcupdate.h | 8 ++++++++
 kernel/cpu.c             | 4 ++++
 kernel/rcu/rcu.h         | 4 ----
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 7729fef249e1..61b80c29d53b 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1190,6 +1190,14 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
 extern int rcu_expedited;
 extern int rcu_normal;
 
+#ifdef CONFIG_TINY_RCU
+static inline void rcu_expedite_gp(void) { }
+static inline void rcu_unexpedite_gp(void) { }
+#else
+void rcu_expedite_gp(void);
+void rcu_unexpedite_gp(void);
+#endif
+
 DEFINE_LOCK_GUARD_0(rcu, rcu_read_lock(), rcu_read_unlock())
 DECLARE_LOCK_GUARD_0_ATTRS(rcu, __acquires_shared(RCU), __releases_shared(RCU))
 
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 62e209eda78c..1377a68d6f47 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2682,6 +2682,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
 		ret = -EBUSY;
 		goto out;
 	}
+	rcu_expedite_gp();
 	/* Hold cpus_write_lock() for entire batch operation. */
 	cpus_write_lock();
 	for_each_online_cpu(cpu) {
@@ -2714,6 +2715,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
 	if (!ret)
 		cpu_smt_control = ctrlval;
 	cpus_write_unlock();
+	rcu_unexpedite_gp();
 	arch_smt_update();
 out:
 	cpu_maps_update_done();
@@ -2733,6 +2735,7 @@ int cpuhp_smt_enable(void)
 	int cpu, ret = 0;
 
 	cpu_maps_update_begin();
+	rcu_expedite_gp();
 	/* Hold cpus_write_lock() for entire batch operation. */
 	cpus_write_lock();
 	cpu_smt_control = CPU_SMT_ENABLED;
@@ -2749,6 +2752,7 @@ int cpuhp_smt_enable(void)
 		cpuhp_online_cpu_device(cpu);
 	}
 	cpus_write_unlock();
+	rcu_unexpedite_gp();
 	arch_smt_update();
 	cpu_maps_update_done();
 	return ret;
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index dc5d614b372c..41a0d262e964 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -512,8 +512,6 @@ do {									\
 static inline bool rcu_gp_is_normal(void) { return true; }
 static inline bool rcu_gp_is_expedited(void) { return false; }
 static inline bool rcu_async_should_hurry(void) { return false; }
-static inline void rcu_expedite_gp(void) { }
-static inline void rcu_unexpedite_gp(void) { }
 static inline void rcu_async_hurry(void) { }
 static inline void rcu_async_relax(void) { }
 static inline bool rcu_cpu_online(int cpu) { return true; }
@@ -521,8 +519,6 @@ static inline bool rcu_cpu_online(int cpu) { return true; }
 bool rcu_gp_is_normal(void);     /* Internal RCU use. */
 bool rcu_gp_is_expedited(void);  /* Internal RCU use. */
 bool rcu_async_should_hurry(void);  /* Internal RCU use. */
-void rcu_expedite_gp(void);
-void rcu_unexpedite_gp(void);
 void rcu_async_hurry(void);
 void rcu_async_relax(void);
 void rcupdate_announce_bootup_oddness(void);
-- 
2.53.0
Re: [PATCH v3 2/2] cpuhp: Expedite RCU grace periods during SMT operations
Posted by Joel Fernandes 2 weeks, 6 days ago
On Wed, Feb 18, 2026 at 02:09:18PM +0530, Vishal Chourasia wrote:
> Expedite synchronize_rcu during the SMT mode switch operation when
> initiated via /sys/devices/system/cpu/smt/control interface
> 
> SMT mode switch operation i.e. between SMT 8 to SMT 1 or vice versa and
> others are user driven operations and therefore should complete as soon
> as possible. Switching SMT states involves iterating over a list of CPUs
> and performing hotplug operations. It was found these transitions took
> significantly large amount of time to complete particularly on
> high-core-count systems.
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
> ---
>  include/linux/rcupdate.h | 8 ++++++++
>  kernel/cpu.c             | 4 ++++
>  kernel/rcu/rcu.h         | 4 ----
>  3 files changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 7729fef249e1..61b80c29d53b 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -1190,6 +1190,14 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
>  extern int rcu_expedited;
>  extern int rcu_normal;
>  
> +#ifdef CONFIG_TINY_RCU
> +static inline void rcu_expedite_gp(void) { }
> +static inline void rcu_unexpedite_gp(void) { }
> +#else
> +void rcu_expedite_gp(void);
> +void rcu_unexpedite_gp(void);
> +#endif
> +
>  DEFINE_LOCK_GUARD_0(rcu, rcu_read_lock(), rcu_read_unlock())
>  DECLARE_LOCK_GUARD_0_ATTRS(rcu, __acquires_shared(RCU), __releases_shared(RCU))
>  
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 62e209eda78c..1377a68d6f47 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -2682,6 +2682,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
>  		ret = -EBUSY;
>  		goto out;
>  	}
> +	rcu_expedite_gp();

After the locking-related changes in patch 1, is expediting still required? I
am just a bit concerned that we are papering over the real issue of overuse
of synchronize_rcu() (which, IIRC, we discussed in earlier versions of these
patches; reducing the number of lock acquire/release cycles was supposed to
help.)

Could you provide more justification for why expediting these sections is
required if the locking concerns were addressed? It would be great if you
could provide performance numbers with only the first patch applied, without
the second. That way we can quantify this patch's effect.

thanks,

--
Joel Fernandes


>  	/* Hold cpus_write_lock() for entire batch operation. */
>  	cpus_write_lock();
>  	for_each_online_cpu(cpu) {
> @@ -2714,6 +2715,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
>  	if (!ret)
>  		cpu_smt_control = ctrlval;
>  	cpus_write_unlock();
> +	rcu_unexpedite_gp();
>  	arch_smt_update();
>  out:
>  	cpu_maps_update_done();
> @@ -2733,6 +2735,7 @@ int cpuhp_smt_enable(void)
>  	int cpu, ret = 0;
>  
>  	cpu_maps_update_begin();
> +	rcu_expedite_gp();
>  	/* Hold cpus_write_lock() for entire batch operation. */
>  	cpus_write_lock();
>  	cpu_smt_control = CPU_SMT_ENABLED;
> @@ -2749,6 +2752,7 @@ int cpuhp_smt_enable(void)
>  		cpuhp_online_cpu_device(cpu);
>  	}
>  	cpus_write_unlock();
> +	rcu_unexpedite_gp();
>  	arch_smt_update();
>  	cpu_maps_update_done();
>  	return ret;
> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> index dc5d614b372c..41a0d262e964 100644
> --- a/kernel/rcu/rcu.h
> +++ b/kernel/rcu/rcu.h
> @@ -512,8 +512,6 @@ do {									\
>  static inline bool rcu_gp_is_normal(void) { return true; }
>  static inline bool rcu_gp_is_expedited(void) { return false; }
>  static inline bool rcu_async_should_hurry(void) { return false; }
> -static inline void rcu_expedite_gp(void) { }
> -static inline void rcu_unexpedite_gp(void) { }
>  static inline void rcu_async_hurry(void) { }
>  static inline void rcu_async_relax(void) { }
>  static inline bool rcu_cpu_online(int cpu) { return true; }
> @@ -521,8 +519,6 @@ static inline bool rcu_cpu_online(int cpu) { return true; }
>  bool rcu_gp_is_normal(void);     /* Internal RCU use. */
>  bool rcu_gp_is_expedited(void);  /* Internal RCU use. */
>  bool rcu_async_should_hurry(void);  /* Internal RCU use. */
> -void rcu_expedite_gp(void);
> -void rcu_unexpedite_gp(void);
>  void rcu_async_hurry(void);
>  void rcu_async_relax(void);
>  void rcupdate_announce_bootup_oddness(void);
> -- 
> 2.53.0
>
Re: [PATCH v3 2/2] cpuhp: Expedite RCU grace periods during SMT operations
Posted by Samir M 2 weeks, 2 days ago
On 27/02/26 6:43 am, Joel Fernandes wrote:
> On Wed, Feb 18, 2026 at 02:09:18PM +0530, Vishal Chourasia wrote:
>> Expedite synchronize_rcu during the SMT mode switch operation when
>> initiated via /sys/devices/system/cpu/smt/control interface
>>
>> SMT mode switch operation i.e. between SMT 8 to SMT 1 or vice versa and
>> others are user driven operations and therefore should complete as soon
>> as possible. Switching SMT states involves iterating over a list of CPUs
>> and performing hotplug operations. It was found these transitions took
>> significantly large amount of time to complete particularly on
>> high-core-count systems.
>>
>> Suggested-by: Peter Zijlstra <peterz@infradead.org>
>> Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
>> ---
>>   include/linux/rcupdate.h | 8 ++++++++
>>   kernel/cpu.c             | 4 ++++
>>   kernel/rcu/rcu.h         | 4 ----
>>   3 files changed, 12 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>> index 7729fef249e1..61b80c29d53b 100644
>> --- a/include/linux/rcupdate.h
>> +++ b/include/linux/rcupdate.h
>> @@ -1190,6 +1190,14 @@ rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
>>   extern int rcu_expedited;
>>   extern int rcu_normal;
>>   
>> +#ifdef CONFIG_TINY_RCU
>> +static inline void rcu_expedite_gp(void) { }
>> +static inline void rcu_unexpedite_gp(void) { }
>> +#else
>> +void rcu_expedite_gp(void);
>> +void rcu_unexpedite_gp(void);
>> +#endif
>> +
>>   DEFINE_LOCK_GUARD_0(rcu, rcu_read_lock(), rcu_read_unlock())
>>   DECLARE_LOCK_GUARD_0_ATTRS(rcu, __acquires_shared(RCU), __releases_shared(RCU))
>>   
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index 62e209eda78c..1377a68d6f47 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -2682,6 +2682,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
>>   		ret = -EBUSY;
>>   		goto out;
>>   	}
>> +	rcu_expedite_gp();
> After the locking related changes in patch 1, is expediting still required? I
> am just a bit concerned that we are papering over the real issue of over
> usage of synchronize_rcu() (which IIRC we discussed in earlier versions of
> the patches that reducing the number of lock acquire/release was supposed to
> help.)
>
> Could you provide more justification of why expediting these sections is
> required if the locking concerns were addressed? It would be great if you can
> provide performance numbers with only the first patch and without the second
> patch. That way we can quantify this patch.
>
> thanks,
>
> --
> Joel Fernandes
>
Hi Vishal/Joel,


Configuration:
    * Kernel version: 7.0.0-rc1
    * Number of CPUs: 1536

I have tested the two patches below applied together and observed
improvements.
Patch 1:
https://lore.kernel.org/all/20260218083915.660252-4-vishalc@linux.ibm.com/

Patch 2:
https://lore.kernel.org/all/20260218083915.660252-6-vishalc@linux.ibm.com/

SMT Mode    | Without Patch (Base) | Both patches applied | % Improvement |
-------------------------------------------------------------------------|
SMT=off     | 16m 13.956s          |     6m 18.435s       |  +61.14 %     |
SMT=on      | 12m 0.982s           |     5m 59.576s       |  +50.10 %     |

When I tested patch 1 below on its own, I did not observe any
improvement for either smt=on or smt=off. However, in the smt=off
scenario, I encountered hung-task splats (with call traces) in which
some threads were blocked on cpus_read_lock. Please also refer to the
attached call trace below.
Patch 1:
https://lore.kernel.org/all/20260218083915.660252-4-vishalc@linux.ibm.com/

SMT Mode    | Without Patch (Base) | Just patch 1 applied | % Improvement |
-------------------------------------------------------------------------|
SMT=off     | 16m 13.956s          |     16m 9.793s       |  +0.43 %      |
SMT=on      | 12m 0.982s           |     12m 19.494s      |  -2.57 %      |


Call traces:
12377] [  T8746]    Tainted: G      E 7.0.0-rc1-150700.51-default-dirty #1
[ 1477.612384] [  T8746] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1477.612389] [  T8746] task:systemd     state:D stack:0   pid:1   tgid:1   ppid:0   task_flags:0x400100 flags:0x00040000
[ 1477.612397] [  T8746] Call Trace:
[ 1477.612399] [  T8746] [c00000000cc0f4f0] [0000000000100000] 0x100000 (unreliable)
[ 1477.612416] [  T8746] [c00000000cc0f6a0] [c00000000001fe5c] __switch_to+0x1dc/0x290
[ 1477.612425] [  T8746] [c00000000cc0f6f0] [c0000000012598ac] __schedule+0x40c/0x1a70
[ 1477.612433] [  T8746] [c00000000cc0f840] [c00000000125af58] schedule+0x48/0x1a0
[ 1477.612439] [  T8746] [c00000000cc0f870] [c0000000002e27b8] percpu_rwsem_wait+0x198/0x200
[ 1477.612445] [  T8746] [c00000000cc0f8f0] [c000000001262930] __percpu_down_read+0xb0/0x210
[ 1477.612449] [  T8746] [c00000000cc0f930] [c00000000022f400] cpus_read_lock+0xc0/0xd0
[ 1477.612456] [  T8746] [c00000000cc0f950] [c0000000003a6398] cgroup_procs_write_start+0x328/0x410
[ 1477.612462] [  T8746] [c00000000cc0fa00] [c0000000003a9620] __cgroup_procs_write+0x70/0x2c0
[ 1477.612468] [  T8746] [c00000000cc0fac0] [c0000000003a98e8] cgroup_procs_write+0x28/0x50
[ 1477.612473] [  T8746] [c00000000cc0faf0] [c0000000003a1624] cgroup_file_write+0xb4/0x240
[ 1477.612478] [  T8746] [c00000000cc0fb50] [c000000000853ba8] kernfs_fop_write_iter+0x1a8/0x2a0
[ 1477.612485] [  T8746] [c00000000cc0fba0] [c000000000733d5c] vfs_write+0x27c/0x540
[ 1477.612491] [  T8746] [c00000000cc0fc50] [c000000000734350] ksys_write+0x80/0x150
[ 1477.612495] [  T8746] [c00000000cc0fca0] [c000000000032898] system_call_exception+0x148/0x320
[ 1477.612500] [  T8746] [c00000000cc0fe50] [c00000000000d6a0] system_call_common+0x160/0x2c4
[ 1477.612506] [  T8746] ---- interrupt: c00 at 0x7fffa8f73df4
[ 1477.612509] [  T8746] NIP: 00007fffa8f73df4 LR: 00007fffa8eb6144 CTR: 0000000000000000
[ 1477.612512] [  T8746] REGS: c00000000cc0fe80 TRAP: 0c00 Tainted: G       E    (7.0.0-rc1-150700.51-default-dirty)
[ 1477.612515] [  T8746] MSR: 800000000000d033 <SF,EE,PR,ME,IR,DR,RI,LE> CR: 28002288 XER: 00000000



Regards,
Samir
>>   	/* Hold cpus_write_lock() for entire batch operation. */
>>   	cpus_write_lock();
>>   	for_each_online_cpu(cpu) {
>> @@ -2714,6 +2715,7 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
>>   	if (!ret)
>>   		cpu_smt_control = ctrlval;
>>   	cpus_write_unlock();
>> +	rcu_unexpedite_gp();
>>   	arch_smt_update();
>>   out:
>>   	cpu_maps_update_done();
>> @@ -2733,6 +2735,7 @@ int cpuhp_smt_enable(void)
>>   	int cpu, ret = 0;
>>   
>>   	cpu_maps_update_begin();
>> +	rcu_expedite_gp();
>>   	/* Hold cpus_write_lock() for entire batch operation. */
>>   	cpus_write_lock();
>>   	cpu_smt_control = CPU_SMT_ENABLED;
>> @@ -2749,6 +2752,7 @@ int cpuhp_smt_enable(void)
>>   		cpuhp_online_cpu_device(cpu);
>>   	}
>>   	cpus_write_unlock();
>> +	rcu_unexpedite_gp();
>>   	arch_smt_update();
>>   	cpu_maps_update_done();
>>   	return ret;
>> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
>> index dc5d614b372c..41a0d262e964 100644
>> --- a/kernel/rcu/rcu.h
>> +++ b/kernel/rcu/rcu.h
>> @@ -512,8 +512,6 @@ do {									\
>>   static inline bool rcu_gp_is_normal(void) { return true; }
>>   static inline bool rcu_gp_is_expedited(void) { return false; }
>>   static inline bool rcu_async_should_hurry(void) { return false; }
>> -static inline void rcu_expedite_gp(void) { }
>> -static inline void rcu_unexpedite_gp(void) { }
>>   static inline void rcu_async_hurry(void) { }
>>   static inline void rcu_async_relax(void) { }
>>   static inline bool rcu_cpu_online(int cpu) { return true; }
>> @@ -521,8 +519,6 @@ static inline bool rcu_cpu_online(int cpu) { return true; }
>>   bool rcu_gp_is_normal(void);     /* Internal RCU use. */
>>   bool rcu_gp_is_expedited(void);  /* Internal RCU use. */
>>   bool rcu_async_should_hurry(void);  /* Internal RCU use. */
>> -void rcu_expedite_gp(void);
>> -void rcu_unexpedite_gp(void);
>>   void rcu_async_hurry(void);
>>   void rcu_async_relax(void);
>>   void rcupdate_announce_bootup_oddness(void);
>> -- 
>> 2.53.0
>>

Re: [PATCH v3 2/2] cpuhp: Expedite RCU grace periods during SMT operations
Posted by Vishal Chourasia 1 week, 5 days ago
On Mon, Mar 02, 2026 at 05:17:16PM +0530, Samir M wrote:
> 
> On 27/02/26 6:43 am, Joel Fernandes wrote:
> > On Wed, Feb 18, 2026 at 02:09:18PM +0530, Vishal Chourasia wrote:
> > > Expedite synchronize_rcu during the SMT mode switch operation when
> > > initiated via /sys/devices/system/cpu/smt/control interface
> > >
> > After the locking related changes in patch 1, is expediting still required? I
Yes.
> > am just a bit concerned that we are papering over the real issue of over
> > usage of synchronize_rcu() (which IIRC we discussed in earlier versions of
> > the patches that reducing the number of lock acquire/release was supposed to
> > help.)
At present, I am not sure about the underlying issue. So far, what I have
found is that when synchronize_rcu() is invoked, it marks the start of a
new grace period, say with number A. The thread invoking synchronize_rcu()
blocks until all CPUs have reported a quiescent state (QS) for GP A. An
RCU grace-period kthread runs periodically, looping over a list of CPUs to
determine whether all of them have reported a QS. In the trace, I find
some CPUs reporting a QS for a sequence number far in the past, e.g.
A - N where N > 10.

> > 
> > Could you provide more justification of why expediting these sections is
> > required if the locking concerns were addressed? It would be great if you can
> > provide performance numbers with only the first patch and without the second
> > patch. That way we can quantify this patch.
> > 
> > 
> SMT Mode    | Without Patch(Base) | both patch applied | % Improvement  |
> ------------------------------------------------------------------------|
> SMT=off     | 16m 13.956s         |     6m 18.435s     |  +61.14 %      |
> SMT=on      | 12m 0.982s          |     5m 59.576s     |  +50.10 %      |
> 
> When I tested the below patch independently, I did not observe any
> improvements for either smt=on or smt=off. However, in the smt=off scenario,
> I encountered hung task splats (with call traces), where some threads were
> blocked on cpus_read_lock. Please also refer to the attached call trace
> below.
> Patch 1:
> https://lore.kernel.org/all/20260218083915.660252-4-vishalc@linux.ibm.com/
> 
> SMT Mode    | Without Patch(Base) | just patch 1 applied   | % Improvement |
> ----------------------------------------------------------------------------|
> SMT=off     | 16m 13.956s         |     16m 9.793s         |  +0.43 %      |
> SMT=on      | 12m 0.982s          |     12m 19.494s        |  -2.57 %      |
> 
> 
> Call traces:
> 12377] [  T8746]    Tainted: G      E 7.0.0-rc1-150700.51-default-dirty #1
> [ 1477.612384] [  T8746] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 1477.612389] [  T8746] task:systemd     state:D stack:0   pid:1  tgid:1 
>  ppid:0   task_flags:0x400100 flags:0x00040000
> [ 1477.612397] [  T8746] Call Trace:
> [ 1477.612399] [  T8746] [c00000000cc0f4f0] [0000000000100000] 0x100000
> (unreliable)
> [ 1477.612416] [  T8746] [c00000000cc0f6a0] [c00000000001fe5c]
> __switch_to+0x1dc/0x290
> [ 1477.612425] [  T8746] [c00000000cc0f6f0] [c0000000012598ac]
> __schedule+0x40c/0x1a70
> [ 1477.612433] [  T8746] [c00000000cc0f840] [c00000000125af58]
> schedule+0x48/0x1a0
> [ 1477.612439] [  T8746] [c00000000cc0f870] [c0000000002e27b8]
> percpu_rwsem_wait+0x198/0x200
> [ 1477.612445] [  T8746] [c00000000cc0f8f0] [c000000001262930]
> __percpu_down_read+0xb0/0x210
> [ 1477.612449] [  T8746] [c00000000cc0f930] [c00000000022f400]
> cpus_read_lock+0xc0/0xd0
> [ 1477.612456] [  T8746] [c00000000cc0f950] [c0000000003a6398]
> cgroup_procs_write_start+0x328/0x410
> [ 1477.612462] [  T8746] [c00000000cc0fa00] [c0000000003a9620]
> __cgroup_procs_write+0x70/0x2c0
> [ 1477.612468] [  T8746] [c00000000cc0fac0] [c0000000003a98e8]
> cgroup_procs_write+0x28/0x50
> [ 1477.612473] [  T8746] [c00000000cc0faf0] [c0000000003a1624]
> cgroup_file_write+0xb4/0x240
> [ 1477.612478] [  T8746] [c00000000cc0fb50] [c000000000853ba8]
> kernfs_fop_write_iter+0x1a8/0x2a0
> [ 1477.612485] [  T8746] [c00000000cc0fba0] [c000000000733d5c]
> vfs_write+0x27c/0x540
> [ 1477.612491] [  T8746] [c00000000cc0fc50] [c000000000734350]
> ksys_write+0x80/0x150
> [ 1477.612495] [  T8746] [c00000000cc0fca0] [c000000000032898]
> system_call_exception+0x148/0x320
> [ 1477.612500] [  T8746] [c00000000cc0fe50] [c00000000000d6a0]
> system_call_common+0x160/0x2c4
> [ 1477.612506] [  T8746] ---- interrupt: c00 at 0x7fffa8f73df4
> [ 1477.612509] [  T8746] NIP: 00007fffa8f73df4 LR: 00007fffa8eb6144 CTR:
> 0000000000000000
> [ 1477.612512] [  T8746] REGS: c00000000cc0fe80 TRAP: 0c00 Tainted: G     
> E    (7.0.0-rc1-150700.51-default-dirty)
> [ 1477.612515] [  T8746] MSR: 800000000000d033 <SF,EE,PR,ME,IR,DR,RI,LE> CR:
> 28002288 XER: 00000000
> 
> 

Default timeout is set to 8 mins.

$ grep . /proc/sys/kernel/hung_task_timeout_secs
/proc/sys/kernel/hung_task_timeout_secs:480

Now that cpus_write_lock is taken only once, and an SMT mode switch can
take tens of minutes to complete before relinquishing the lock, threads
waiting on cpus_read_lock will be blocked for that entire duration.

Although no splats were observed in the "both patches applied" case,
the issue still remains.

regards,
vishal
Re: [PATCH v3 2/2] cpuhp: Expedite RCU grace periods during SMT operations
Posted by Paul E. McKenney 1 week, 5 days ago
On Fri, Mar 06, 2026 at 11:14:13AM +0530, Vishal Chourasia wrote:
> On Mon, Mar 02, 2026 at 05:17:16PM +0530, Samir M wrote:
> > 
> > On 27/02/26 6:43 am, Joel Fernandes wrote:
> > > On Wed, Feb 18, 2026 at 02:09:18PM +0530, Vishal Chourasia wrote:
> > > > Expedite synchronize_rcu during the SMT mode switch operation when
> > > > initiated via /sys/devices/system/cpu/smt/control interface
> > > >
> > > After the locking related changes in patch 1, is expediting still required? I
> Yes.
> > > am just a bit concerned that we are papering over the real issue of over
> > > usage of synchronize_rcu() (which IIRC we discussed in earlier versions of
> > > the patches that reducing the number of lock acquire/release was supposed to
> > > help.)
> At present, I am not sure about the underlying issue. So far what I have
> found is when synchronize_rcu() is invoked, it marks the start of a new
> grace period number, say A. Thread invoking synchronize_rcu() blocks
> until all CPUs have reported QS for GP "A". There is a rcu grace period
> kthread that runs periodically looping over a CPU list to figure out all
> CPUs have reported QS. In the trace, I find some CPUs reporting QS for
> sequence number way back in the past for ex. A - N where N is > 10.

This can happen when a CPU goes idle for multiple grace periods, then
wakes up in the middle of a later grace period.  This is (or at least is
supposed to be) harmless because a quiescent state was reported on that
CPU's behalf when RCU noticed that it was idle.  The report is quashed
when RCU notices that the quiescent state being reported is for a grace
period that has already completed.  Grace-period counter wrap is handled
by the infamous ->gpwrap field in the rcu_data structure.

I have seen N having four digits, with deep embedded devices being most
likely to have extremely large values of N.

							Thanx, Paul

> > > Could you provide more justification of why expediting these sections is
> > > required if the locking concerns were addressed? It would be great if you can
> > > provide performance numbers with only the first patch and without the second
> > > patch. That way we can quantify this patch.
> > > 
> > > 
> > SMT Mode    | Without Patch(Base) | both patch applied | % Improvement  |
> > ------------------------------------------------------------------------|
> > SMT=off     | 16m 13.956s         |     6m 18.435s     |  +61.14 %      |
> > SMT=on      | 12m 0.982s          |     5m 59.576s     |  +50.10 %      |
> > 
> > When I tested the below patch independently, I did not observe any
> > improvements for either smt=on or smt=off. However, in the smt=off scenario,
> > I encountered hung task splats (with call traces), where some threads were
> > blocked on cpus_read_lock. Please also refer to the attached call trace
> > below.
> > Patch 1:
> > https://lore.kernel.org/all/20260218083915.660252-4-vishalc@linux.ibm.com/
> > 
> > SMT Mode    | Without Patch(Base) | just patch 1 applied   | % Improvement |
> > ----------------------------------------------------------------------------|
> > SMT=off     | 16m 13.956s         |     16m 9.793s         |  +0.43 %      |
> > SMT=on      | 12m 0.982s          |     12m 19.494s        |  -2.57 %      |
> > 
> > 
> > Call traces:
> > 12377] [  T8746]    Tainted: G      E 7.0.0-rc1-150700.51-default-dirty #1
> > [ 1477.612384] [  T8746] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [ 1477.612389] [  T8746] task:systemd     state:D stack:0   pid:1  tgid:1 
> >  ppid:0   task_flags:0x400100 flags:0x00040000
> > [ 1477.612397] [  T8746] Call Trace:
> > [ 1477.612399] [  T8746] [c00000000cc0f4f0] [0000000000100000] 0x100000
> > (unreliable)
> > [ 1477.612416] [  T8746] [c00000000cc0f6a0] [c00000000001fe5c]
> > __switch_to+0x1dc/0x290
> > [ 1477.612425] [  T8746] [c00000000cc0f6f0] [c0000000012598ac]
> > __schedule+0x40c/0x1a70
> > [ 1477.612433] [  T8746] [c00000000cc0f840] [c00000000125af58]
> > schedule+0x48/0x1a0
> > [ 1477.612439] [  T8746] [c00000000cc0f870] [c0000000002e27b8]
> > percpu_rwsem_wait+0x198/0x200
> > [ 1477.612445] [  T8746] [c00000000cc0f8f0] [c000000001262930]
> > __percpu_down_read+0xb0/0x210
> > [ 1477.612449] [  T8746] [c00000000cc0f930] [c00000000022f400]
> > cpus_read_lock+0xc0/0xd0
> > [ 1477.612456] [  T8746] [c00000000cc0f950] [c0000000003a6398]
> > cgroup_procs_write_start+0x328/0x410
> > [ 1477.612462] [  T8746] [c00000000cc0fa00] [c0000000003a9620]
> > __cgroup_procs_write+0x70/0x2c0
> > [ 1477.612468] [  T8746] [c00000000cc0fac0] [c0000000003a98e8]
> > cgroup_procs_write+0x28/0x50
> > [ 1477.612473] [  T8746] [c00000000cc0faf0] [c0000000003a1624]
> > cgroup_file_write+0xb4/0x240
> > [ 1477.612478] [  T8746] [c00000000cc0fb50] [c000000000853ba8]
> > kernfs_fop_write_iter+0x1a8/0x2a0
> > [ 1477.612485] [  T8746] [c00000000cc0fba0] [c000000000733d5c]
> > vfs_write+0x27c/0x540
> > [ 1477.612491] [  T8746] [c00000000cc0fc50] [c000000000734350]
> > ksys_write+0x80/0x150
> > [ 1477.612495] [  T8746] [c00000000cc0fca0] [c000000000032898]
> > system_call_exception+0x148/0x320
> > [ 1477.612500] [  T8746] [c00000000cc0fe50] [c00000000000d6a0]
> > system_call_common+0x160/0x2c4
> > [ 1477.612506] [  T8746] ---- interrupt: c00 at 0x7fffa8f73df4
> > [ 1477.612509] [  T8746] NIP: 00007fffa8f73df4 LR: 00007fffa8eb6144 CTR:
> > 0000000000000000
> > [ 1477.612512] [  T8746] REGS: c00000000cc0fe80 TRAP: 0c00 Tainted: G     
> > E    (7.0.0-rc1-150700.51-default-dirty)
> > [ 1477.612515] [  T8746] MSR: 800000000000d033 <SF,EE,PR,ME,IR,DR,RI,LE> CR:
> > 28002288 XER: 00000000
> > 
> > 
> 
> Default timeout is set to 8 mins.
> 
> $ grep . /proc/sys/kernel/hung_task_timeout_secs
> /proc/sys/kernel/hung_task_timeout_secs:480
> 
> Now that cpus_write_lock is taken once, and SMT mode switch can take
> tens of minutes to complete and relinquish the lock, threads waiting on 
> cpus_read_lock will be blocked for this entire duration.
> 
> Although there were no splats observed for "both patch applied" case
> the issue still remains.
> 
> regards,
> vishal