[RFC PATCH 0/3] thermal: Add CPU hotplug cooling driver

John Madieu posted 3 patches 11 months ago
arch/arm64/boot/dts/renesas/r9a09g047.dtsi |  13 +
drivers/thermal/Kconfig                    |  12 +
drivers/thermal/Makefile                   |   1 +
drivers/thermal/cpuplug_cooling.c          | 363 +++++++++++++++++++++
drivers/thermal/thermal_of.c               |   1 +
drivers/thermal/thermal_trace.h            |   2 +
drivers/thermal/thermal_trip.c             |   1 +
include/uapi/linux/thermal.h               |   1 +
tools/thermal/tmon/tmon.h                  |   1 +
tools/thermal/tmon/tui.c                   |   3 +-
10 files changed, 397 insertions(+), 1 deletion(-)
create mode 100644 drivers/thermal/cpuplug_cooling.c
[RFC PATCH 0/3] thermal: Add CPU hotplug cooling driver
Posted by John Madieu 11 months ago

This patch series introduces a new thermal cooling driver that implements CPU
hotplug-based thermal management. The driver dynamically takes CPUs offline
during thermal excursions to reduce power consumption and prevent overheating,
while maintaining system stability by keeping at least one CPU online. 

1- Problem Statement

Modern SoCs require robust thermal management to prevent overheating under heavy
workloads. Existing cooling mechanisms like frequency scaling may not always
provide sufficient thermal relief, especially in multi-core systems where
per-core thermal contributions can be significant. 

2- Solution Overview 

The driver:

 - Integrates with the Linux thermal framework as a cooling device  
 - Registers per-CPU cooling devices that respond to thermal trip points  
 - Uses CPU hotplug operations to reduce thermal load  
 - Maintains system stability by never taking the boot CPU offline,
 regardless of which CPUs are specified in the cooling device list.
 - Implements proper state tracking and cleanup
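
The behaviour in the list above can be modelled in plain userspace C. This is only an illustrative sketch, not the code from the patch: the struct, the function names, the four-CPU count and the choice of CPU0 as boot CPU are all assumptions made for the example, and a bitmask stands in for real hotplug calls.

```c
#include <stdbool.h>

#define MODEL_NR_CPUS  4  /* hypothetical core count */
#define MODEL_BOOT_CPU 0  /* assumption: CPU0 is the boot CPU */

/* Toy stand-in for the online mask; bit n set means CPU n is online. */
struct cpuplug_model {
	unsigned int online_mask;
};

/* Start with every modelled CPU online. */
static void model_init(struct cpuplug_model *m)
{
	m->online_mask = (1u << MODEL_NR_CPUS) - 1;
}

static bool model_cpu_online(const struct cpuplug_model *m, int cpu)
{
	return (m->online_mask >> cpu) & 1u;
}

/*
 * Apply a cooling state to one CPU: 0 = keep online, 1 = take offline.
 * A request to offline the boot CPU is silently ignored.
 * Returns 0 on success (including the ignored case), -1 on bad arguments.
 */
static int model_set_state(struct cpuplug_model *m, int cpu, unsigned long state)
{
	if (cpu < 0 || cpu >= MODEL_NR_CPUS || state > 1)
		return -1;
	if (state == 1 && cpu == MODEL_BOOT_CPU)
		return 0; /* boot CPU stays online */
	if (state == 1)
		m->online_mask &= ~(1u << cpu);
	else
		m->online_mask |= 1u << cpu;
	return 0;
}
```

In this model a cooling device has exactly two states per CPU, which is why the state tracking reduces to one bit per CPU.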

Key Features:   

 - Dynamic CPU online/offline management based on thermal thresholds  
 - Device tree-based configuration via thermal zones and trip points  
 - Hysteresis support through thermal governor interactions  
 - Safe handling of CPU state transitions during module load/unload  
 - Compatibility with existing thermal management frameworks

Testing    

 - Verified on Renesas RZ/G3E platforms with multi-core CPU configurations  
 - Validated thermal response by injecting emulated temperatures (emul_temp)
 - Confirmed proper interaction with other cooling devices
 - Verified support for 'plug' type trace events
 - Tested with step_wise governor
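
For anyone wanting to reproduce the emul_temp step, a small userspace helper along these lines could be used. It is a sketch under assumptions: the function names are made up for this example, the zone index has to be looked up on the target, and the emul_temp attribute only exists when the kernel is built with CONFIG_THERMAL_EMULATION.

```c
#include <stdio.h>

/*
 * Build "<base>/thermal_zone<zone>/emul_temp" into buf.
 * Returns 0 on success, -1 if the path does not fit.
 */
static int emul_temp_path(char *buf, size_t len, const char *base, int zone)
{
	int n = snprintf(buf, len, "%s/thermal_zone%d/emul_temp", base, zone);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write a temperature in millidegrees Celsius to the given attribute.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
static int emul_temp_write(const char *path, int millicelsius)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", millicelsius) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A typical call would build the path with base "/sys/class/thermal" and then write a value just above the trip under test, for example 111000 for a 110000 trip. Writing the attribute normally requires root.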

As the 'hot' type is already used for user space notification, I've chosen
'plug' for this new type. Suggestions on this are welcome. Here is an example
of a 'thermal-zones' node that integrates the 'plug' type:

```
thermal-zones {
	cpu-thermal {
		polling-delay = <1000>;
		polling-delay-passive = <250>;
		thermal-sensors = <&tsu>;

		cooling-maps {
			map0 {
				trip = <&target>;
				cooling-device = <&cpu0 0 3>, <&cpu3 0 3>;
				contribution = <1024>;
			};

			map1 {
				trip = <&trip_emergency>;
				cooling-device = <&cpu1 0 1>, <&cpu2 0 1>;
				contribution = <1024>;
			};

		};

		trips {
			target: trip-point {
				temperature = <95000>;
				hysteresis = <1000>;
				type = "passive";
			};

			trip_emergency: emergency {
				temperature = <110000>;
				hysteresis = <1000>;
				type = "plug";
			};

			sensor_crit: sensor-crit {
				temperature = <120000>;
				hysteresis = <1000>;
				type = "critical";
			};
		};
	};
};
```
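
To make the temperature/hysteresis pair of the 'plug' trip concrete, here is a minimal sketch of how such a pair is usually interpreted: the trip engages at the trip temperature and only releases once the temperature has dropped below trip minus hysteresis. The function name is hypothetical and the exact boundary handling inside the thermal core and governors may differ from this model.

```c
#include <stdbool.h>

/* Values copied from the 'plug' trip above, in millidegrees Celsius. */
#define PLUG_TRIP_MC 110000
#define PLUG_HYST_MC 1000

/*
 * Decide whether the trip should be considered active.
 * When it was inactive, it engages at the trip temperature.
 * When it was already active, it stays active until the temperature
 * falls below (trip - hysteresis).
 */
static bool trip_is_active(int temp_mc, bool was_active)
{
	int threshold = was_active ? PLUG_TRIP_MC - PLUG_HYST_MC : PLUG_TRIP_MC;

	return temp_mc >= threshold;
}
```

With these numbers, a reading of 109500 does not engage the trip on the way up, but keeps it engaged on the way down, so CPUs would not be toggled on every small fluctuation around 110 degrees.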

Dependencies    

 - Requires standard thermal framework components (CONFIG_THERMAL)  
 - Depends on CPU hotplug support (CONFIG_HOTPLUG_CPU)  
 - Assumes device tree contains appropriate thermal zone definitions

This series also depends on [1], specifically on patch 6/7,
"arm64: dts: renesas: r9a09g047: Add TSU node".


3- Notes for Reviewers

 - Focus areas: Thermal framework integration, CPU state management, and error handling  
 - Feedback on device tree binding requirements is particularly welcome  
 - Suggestions for interaction improvements with other governors are appreciated

I look forward to your feedback and guidance on this contribution.

[1] https://patchwork.kernel.org/project/linux-clk/cover/20250227122453.30480-1-john.madieu.xa@bp.renesas.com/

Regards,
John


John Madieu (3):
  thermal/cpuplog_cooling: Add CPU hotplug cooling driver
  tmon: Add support for THERMAL_TRIP_PLUG type
  arm64: dts: renesas: r9a09g047: Add thermal hotplug trip point

 arch/arm64/boot/dts/renesas/r9a09g047.dtsi |  13 +
 drivers/thermal/Kconfig                    |  12 +
 drivers/thermal/Makefile                   |   1 +
 drivers/thermal/cpuplug_cooling.c          | 363 +++++++++++++++++++++
 drivers/thermal/thermal_of.c               |   1 +
 drivers/thermal/thermal_trace.h            |   2 +
 drivers/thermal/thermal_trip.c             |   1 +
 include/uapi/linux/thermal.h               |   1 +
 tools/thermal/tmon/tmon.h                  |   1 +
 tools/thermal/tmon/tui.c                   |   3 +-
 10 files changed, 397 insertions(+), 1 deletion(-)
 create mode 100644 drivers/thermal/cpuplug_cooling.c

-- 
2.25.1
Re: [RFC PATCH 0/3] thermal: Add CPU hotplug cooling driver
Posted by Rafael J. Wysocki 10 months, 4 weeks ago
On Sun, Mar 9, 2025 at 1:13 PM John Madieu
<john.madieu.xa@bp.renesas.com> wrote:
>
> This patch series introduces a new thermal cooling driver that implements CPU
> hotplug-based thermal management. The driver dynamically takes CPUs offline
> during thermal excursions to reduce power consumption and prevent overheating,
> while maintaining system stability by keeping at least one CPU online.

So as far as I am concerned, this is a total no-go.  CPU offline is
not designed to be triggered from within a driver.

> 1- Problem Statement
>
> Modern SoCs require robust thermal management to prevent overheating under heavy
> workloads. Existing cooling mechanisms like frequency scaling may not always
> provide sufficient thermal relief, especially in multi-core systems where
> per-core thermal contributions can be significant.

What about idle injection?

> 2- Solution Overview
>
> The driver:
>
>  - Integrates with the Linux thermal framework as a cooling device
>  - Registers per-CPU cooling devices that respond to thermal trip points
>  - Uses CPU hotplug operations to reduce thermal load
>  - Maintains system stability by preserving the boot CPU from being put offline,
>  regardless the CPUs that are specified in cooling device list.
>  - Implements proper state tracking and cleanup
>
> Key Features:
>
>  - Dynamic CPU online/offline management based on thermal thresholds
>  - Device tree-based configuration via thermal zones and trip points

So DT-only.  Not nice.

>  - Hysteresis support through thermal governor interactions

I'd rather not combine thermal governors with CPU offline.

>  - Safe handling of CPU state transitions during module load/unload

Are you sure that it is really safe?

>  - Compatibility with existing thermal management frameworks

I'm not sure about this.

So one of the things that CPU offline does, which you are probably not
aware of, is break CPU affinity, which is a very brutal thing for
user space if it is not expecting that to happen.  It also migrates
interrupts between CPUs, which may confuse things as well.  So don't do it
from the kernel, really.
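
To illustrate the affinity point, here is a minimal userspace sketch of the kind of pinning that is at stake. The helper names are made up for this example; the calls themselves are the standard glibc sched_setaffinity()/sched_getaffinity() wrappers. A task pinned like this has nowhere to run if its only allowed CPU is taken offline, so the kernel has to override the mask the program asked for.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Return the lowest-numbered CPU the calling thread may run on, or -1. */
static int first_allowed_cpu(void)
{
	cpu_set_t set;
	int cpu;

	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return -1;
	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			return cpu;
	return -1;
}

/* Restrict the calling thread to a single CPU. Returns 0 on success. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}

/* Number of CPUs in the calling thread's affinity mask, or -1 on error. */
static int allowed_cpu_count(void)
{
	cpu_set_t set;

	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return -1;
	return CPU_COUNT(&set);
}
```

A program that calls pin_to_cpu() once at startup and never rechecks allowed_cpu_count() will not notice that its placement has changed underneath it.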

Thanks, Rafael