[PATCH v2] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid

Rafael J. Wysocki posted 1 patch 1 year, 5 months ago
[PATCH v2] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
Posted by Rafael J. Wysocki 1 year, 5 months ago
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
if zone temperature is invalid") caused __thermal_zone_device_update()
to return early if the current thermal zone temperature was invalid.

This was done to avoid running handle_thermal_trip() and governor
callbacks in that case which led to confusion.  However, it went too
far because monitor_thermal_zone() still needs to be called even when
the zone temperature is invalid to ensure that it will be updated
eventually in case thermal polling is enabled and the driver has no
other means to notify the core of zone temperature changes (for example,
it does not register an interrupt handler or ACPI notifier).

Also if the .set_trips() zone callback is expected to set up monitoring
interrupts for a thermal zone, it has to be provided with valid
boundaries and that can only happen if the zone temperature is known.

Accordingly, to ensure that __thermal_zone_device_update() will
run again after a failing zone temperature check, make it call
monitor_thermal_zone() regardless of whether or not the zone
temperature is valid and make the latter schedule a thermal zone
temperature update if the zone temperature is invalid even if
polling is not enabled for the thermal zone.

Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/thermal/thermal_core.c |    5 ++++-
 drivers/thermal/thermal_core.h |    6 ++++++
 2 files changed, 10 insertions(+), 1 deletion(-)

Index: linux-pm/drivers/thermal/thermal_core.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.c
+++ linux-pm/drivers/thermal/thermal_core.c
@@ -300,6 +300,8 @@ static void monitor_thermal_zone(struct
 		thermal_zone_device_set_polling(tz, tz->passive_delay_jiffies);
 	else if (tz->polling_delay_jiffies)
 		thermal_zone_device_set_polling(tz, tz->polling_delay_jiffies);
+	else if (tz->temperature == THERMAL_TEMP_INVALID)
+		thermal_zone_device_set_polling(tz, msecs_to_jiffies(THERMAL_RECHECK_DELAY_MS));
 }
 
 static struct thermal_governor *thermal_get_tz_governor(struct thermal_zone_device *tz)
@@ -514,7 +516,7 @@ void __thermal_zone_device_update(struct
 	update_temperature(tz);
 
 	if (tz->temperature == THERMAL_TEMP_INVALID)
-		return;
+		goto monitor;
 
 	tz->notify_event = event;
 
@@ -536,6 +538,7 @@ void __thermal_zone_device_update(struct
 
 	thermal_debug_update_trip_stats(tz);
 
+monitor:
 	monitor_thermal_zone(tz);
 }
 
Index: linux-pm/drivers/thermal/thermal_core.h
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.h
+++ linux-pm/drivers/thermal/thermal_core.h
@@ -133,6 +133,12 @@ struct thermal_zone_device {
 	struct thermal_trip_desc trips[] __counted_by(num_trips);
 };
 
+/*
+ * Default delay after a failing thermal zone temperature check before
+ * attempting to check it again.
+ */
+#define THERMAL_RECHECK_DELAY_MS	100
+
 /* Default Thermal Governor */
 #if defined(CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE)
 #define DEFAULT_THERMAL_GOVERNOR       "step_wise"
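
The delay selection that the patched monitor_thermal_zone() performs can be modelled in user space. The following is only a minimal sketch, not kernel code: struct zone_model, MODEL_TEMP_INVALID, MODEL_RECHECK_MS and model_next_update_ms() are hypothetical names, and the condition guarding the passive branch is an assumption because the hunk above shows only the body of that branch. The 100 ms value is taken from THERMAL_RECHECK_DELAY_MS in the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's THERMAL_TEMP_INVALID. */
#define MODEL_TEMP_INVALID	INT_MIN
/* Same value as THERMAL_RECHECK_DELAY_MS in the patch. */
#define MODEL_RECHECK_MS	100

/* Hypothetical, simplified view of the zone fields that matter here. */
struct zone_model {
	int temperature;		/* MODEL_TEMP_INVALID if the last read failed */
	int passive;			/* assumed: nonzero while passive cooling is active */
	unsigned int passive_ms;	/* passive polling delay */
	unsigned int polling_ms;	/* regular polling delay, 0 = polling disabled */
};

/* Delay until the next scheduled zone update, or 0 if none is scheduled. */
static unsigned int model_next_update_ms(const struct zone_model *z)
{
	assert(z != NULL);

	if (z->passive)
		return z->passive_ms;
	if (z->polling_ms)
		return z->polling_ms;
	/* Branch added by the patch: retry even though polling is disabled. */
	if (z->temperature == MODEL_TEMP_INVALID)
		return MODEL_RECHECK_MS;
	return 0;
}
```

In this model, a zone with polling disabled and an invalid temperature is rechecked after 100 ms, while the same zone with a valid temperature is not scheduled at all.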
Re: [PATCH v2] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
Posted by Neil Armstrong 1 year, 5 months ago
Hi,

On 28/06/2024 14:10, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
> if zone temperature is invalid") caused __thermal_zone_device_update()
> to return early if the current thermal zone temperature was invalid.
> 
> This was done to avoid running handle_thermal_trip() and governor
> callbacks in that case which led to confusion.  However, it went too
> far because monitor_thermal_zone() still needs to be called even when
> the zone temperature is invalid to ensure that it will be updated
> eventually in case thermal polling is enabled and the driver has no
> other means to notify the core of zone temperature changes (for example,
> it does not register an interrupt handler or ACPI notifier).
> 
> Also if the .set_trips() zone callback is expected to set up monitoring
> interrupts for a thermal zone, it has to be provided with valid
> boundaries and that can only happen if the zone temperature is known.
> 
> Accordingly, to ensure that __thermal_zone_device_update() will
> run again after a failing zone temperature check, make it call
> monitor_thermal_zone() regardless of whether or not the zone
> temperature is valid and make the latter schedule a thermal zone
> temperature update if the zone temperature is invalid even if
> polling is not enabled for the thermal zone.
> 
> Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
> Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

With this patch applied on next-20240702, Qualcomm HDK8350, HDK8450, QRD8550, HDK8560, QRD8650 & HDK8650 print the following message in a loop:

thermal thermal_zoneXX: failed to read out thermal zone (-19)

Boot logs with the ARM64 defconfig:
https://git.codelinaro.org/linaro/qcomlt/ci/staging/cdba-tester/-/jobs/152439#L1393
https://git.codelinaro.org/linaro/qcomlt/ci/staging/cdba-tester/-/jobs/152440#L2200
https://git.codelinaro.org/linaro/qcomlt/ci/staging/cdba-tester/-/jobs/152442#L2828
https://git.codelinaro.org/linaro/qcomlt/ci/staging/cdba-tester/-/jobs/152441#L1862
https://git.codelinaro.org/linaro/qcomlt/ci/staging/cdba-tester/-/jobs/152443#L1776
https://git.codelinaro.org/linaro/qcomlt/ci/staging/cdba-tester/-/jobs/152444#L1723

Result of git bisect:
# bad: [82e4255305c554b0bb18b7ccf2db86041b4c8b6e] Add linux-next specific files for 20240702
# good: [22a40d14b572deb80c0648557f4bd502d7e83826] Linux 6.10-rc6
git bisect start 'FETCH_HEAD' 'v6.10-rc6'
# bad: [f6dfcf0e9567b57b93f2564966d9177f0d8dbe05] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git
git bisect bad f6dfcf0e9567b57b93f2564966d9177f0d8dbe05
# good: [7f86ae0c2dc19fea7be1da29b2bf03f085463ae7] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git
git bisect good 7f86ae0c2dc19fea7be1da29b2bf03f085463ae7
# bad: [077d5bbd75dd12af2096c96846ffc78ab5dd65b1] Merge branch 'devfreq-next' of git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux.git
git bisect bad 077d5bbd75dd12af2096c96846ffc78ab5dd65b1
# good: [271bcaf753d0afe2bd0386ab1e98132ee65b61ca] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux.git
git bisect good 271bcaf753d0afe2bd0386ab1e98132ee65b61ca
# good: [9758a2ee5316a6f8736ab4fd39a6f6176aa057ec] Merge branch 'hwmon-next' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging.git
git bisect good 9758a2ee5316a6f8736ab4fd39a6f6176aa057ec
# good: [e6bd69ea345045520bd63487b85a4b5676aff76b] Merge branch 'master' of git://linuxtv.org/mchehab/media-next.git
git bisect good e6bd69ea345045520bd63487b85a4b5676aff76b
# good: [46398edfb36e2882be5e86ea563b2db9138ae499] Merge branches 'pm-cpuidle' and 'pm-powercap' into linux-next
git bisect good 46398edfb36e2882be5e86ea563b2db9138ae499
# good: [d3927cbc52eed166f74ea7e031ed6384cc3d4d5f] Merge branch 'thermal-intel' into linux-next
git bisect good d3927cbc52eed166f74ea7e031ed6384cc3d4d5f
# good: [ce84b7beeb524e7b20983838687862454ba54df7] cpufreq: sti: add missing MODULE_DEVICE_TABLE entry for stih418
git bisect good ce84b7beeb524e7b20983838687862454ba54df7
# bad: [fcf61315d38d41f4e55856b179f9e5538e299ef4] Merge branch 'thermal-fixes' into linux-next
git bisect bad fcf61315d38d41f4e55856b179f9e5538e299ef4
# good: [4262b8d782a74c7cf7b8b94ed9e4fcb94e856d1e] dt-bindings: thermal: mediatek: Fix thermal zone definition for MT8186
git bisect good 4262b8d782a74c7cf7b8b94ed9e4fcb94e856d1e
# good: [7eeb114a635a04bea2fa7d57cedbf374c714d29e] dt-bindings: thermal: convert hisilicon-thermal.txt to dt-schema
git bisect good 7eeb114a635a04bea2fa7d57cedbf374c714d29e
# good: [107ac0d49ae6a86b4986146b9a612294f7e34406] Merge branch 'thermal/linux-next' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/thermal/linux into linux-next
git bisect good 107ac0d49ae6a86b4986146b9a612294f7e34406
# bad: [5725f40698b9ba7f84fbfee25b9059ba044c4b86] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
git bisect bad 5725f40698b9ba7f84fbfee25b9059ba044c4b86
# first bad commit: [5725f40698b9ba7f84fbfee25b9059ba044c4b86] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid

#regzbot introduced: 5725f40698b9ba7f84fbfee25b9059ba044c4b86

Thanks,
Neil
Re: [PATCH v2] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
Posted by Daniel Lezcano 1 year, 5 months ago
Hi Neil,

it seems there is actually something wrong with the driver.

There can be a moment when the sensor is not yet initialized, for
various reasons, so reading the temperature fails. The routine will
just retry until the sensor is ready.

These errors suggest to me that the sensor for this specific thermal
zone is never ready, which may be the root cause of your issue. IMO the
change is just exposing this problem.


On 03/07/2024 12:54, Neil Armstrong wrote:
> Hi,
> 
> This patch on next-20240702 makes Qualcomm HDK8350, HDK8450, QRD8550, 
> HDK8560, QRD8650 & HDK8650 output in loop:
> 
> thermal thermal_zoneXX: failed to read out thermal zone (-19)
> 
> Boot logs or ARM64 defconfig:
> https://git.codelinaro.org/linaro/qcomlt/ci/staging/cdba-tester/-/jobs/152439#L1393
> https://git.codelinaro.org/linaro/qcomlt/ci/staging/cdba-tester/-/jobs/152440#L2200
> https://git.codelinaro.org/linaro/qcomlt/ci/staging/cdba-tester/-/jobs/152442#L2828
> https://git.codelinaro.org/linaro/qcomlt/ci/staging/cdba-tester/-/jobs/152441#L1862
> https://git.codelinaro.org/linaro/qcomlt/ci/staging/cdba-tester/-/jobs/152443#L1776
> https://git.codelinaro.org/linaro/qcomlt/ci/staging/cdba-tester/-/jobs/152444#L1723
> 
> Thanks,
> Neil

-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

Re: [PATCH v2] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
Posted by neil.armstrong@linaro.org 1 year, 5 months ago
Hi,

On 03/07/2024 14:25, Daniel Lezcano wrote:
> 
> Hi Neil,
> 
> it seems there is something wrong with the driver actually.
> 
> There can be a moment where the sensor is not yet initialized for different reason, so reading the temperature fails. The routine will just retry until the sensor gets ready.
> 
> Having these errors seem to me that the sensor for this specific thermal zone is never ready which may be the root cause of your issue. The change is spotting this problem IMO.

Probably, but the message gets printed every second until system shutdown, and only for a single thermal_zone.

Using v1 of Rafael's patch makes the message disappear completely.

Neil


Re: [PATCH v2] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
Posted by Daniel Lezcano 1 year, 5 months ago
On 03/07/2024 14:43, neil.armstrong@linaro.org wrote:
> Hi,
> 
> On 03/07/2024 14:25, Daniel Lezcano wrote:
>>
>> Hi Neil,
>>
>> it seems there is something wrong with the driver actually.
>>
>> There can be a moment where the sensor is not yet initialized for 
>> different reason, so reading the temperature fails. The routine will 
>> just retry until the sensor gets ready.
>>
>> Having these errors seem to me that the sensor for this specific 
>> thermal zone is never ready which may be the root cause of your issue. 
>> The change is spotting this problem IMO.
> 
> Probably, but it gets printed every second until system shutdown, but 
> only for a single thermal_zone.
> 
> Using v1 of Rafael's patch makes the message disappear completely.

Yes, probably because you have the thermal zone polling delay set to
zero, so the read fails the first time and the core no longer tries
again. The V1 is an incomplete fix.

Very likely the problem is in the sensor platform driver, or in the
thermal zone description in the device tree, which describes a
non-functional thermal zone.



Re: [PATCH v2] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
Posted by Rafael J. Wysocki 1 year, 5 months ago
On Wed, Jul 3, 2024 at 4:00 PM Daniel Lezcano <daniel.lezcano@linaro.org> wrote:
>
> On 03/07/2024 14:43, neil.armstrong@linaro.org wrote:
> > Hi,
> >
> > On 03/07/2024 14:25, Daniel Lezcano wrote:
> >>
> >> Hi Neil,
> >>
> >> it seems there is something wrong with the driver actually.
> >>
> >> There can be a moment where the sensor is not yet initialized for
> >> different reason, so reading the temperature fails. The routine will
> >> just retry until the sensor gets ready.
> >>
> >> Having these errors seem to me that the sensor for this specific
> >> thermal zone is never ready which may be the root cause of your issue.
> >> The change is spotting this problem IMO.
> >
> > Probably, but it gets printed every second until system shutdown, but
> > only for a single thermal_zone.
> >
> > Using v1 of Rafael's patch makes the message disappear completely.
>
> Yes, because you have probably the thermal zone polling delay set to
> zero, thus it fails the first time and does no longer try to set it up
> again. The V1 is an incomplete fix.
>
> Very likely the problem is in the sensor platform driver, or in the
> thermal zone description in the device tree which describes a non
> functional thermal zone.

I agree, but polling this useless thermal zone forever is not
particularly useful.

I was kind of afraid that something like this would happen, but then I
didn't want to complicate the patch unnecessarily until I knew that it
really would happen.

So attached is a modification of the $subject patch that will double
the temperature recheck delay after every failed attempt to get the
zone temperature and it will give up eventually (in this particular
version, after the recheck delay exceeds 30 s).

I would appreciate it if you could give it a go (obviously, by replacing the
$subject one with it).
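
The attachment itself is not reproduced in this thread, so the following is only a rough user-space sketch of the backoff described above, under stated assumptions: the delay starts at the 100 ms of THERMAL_RECHECK_DELAY_MS, is doubled after every failed read, and retrying stops once the doubled delay would exceed 30 s. The names INITIAL_RECHECK_MS, MAX_RECHECK_MS, backoff_next_ms() and backoff_count_attempts() are hypothetical, and the real patch may treat the threshold differently.

```c
#include <assert.h>

/* Same value as THERMAL_RECHECK_DELAY_MS in the $subject patch. */
#define INITIAL_RECHECK_MS	100
/* Give-up threshold mentioned in the message above (30 s). */
#define MAX_RECHECK_MS		30000

/*
 * Return the delay to use for the following retry after one more failed
 * read, or 0 if the doubled delay exceeds the threshold (give up).
 */
static unsigned int backoff_next_ms(unsigned int current_ms)
{
	unsigned int next_ms = current_ms * 2;

	return next_ms > MAX_RECHECK_MS ? 0 : next_ms;
}

/* Count the reads attempted before giving up when every read fails. */
static unsigned int backoff_count_attempts(void)
{
	unsigned int delay_ms = INITIAL_RECHECK_MS;
	unsigned int attempts = 1;	/* the initial failed read */

	while (delay_ms) {
		attempts++;		/* retry after delay_ms, which fails again */
		delay_ms = backoff_next_ms(delay_ms);
	}

	return attempts;
}
```

Under this reading, a zone whose sensor never becomes ready is polled a bounded number of times instead of forever, which keeps the "failed to read out thermal zone" message from repeating until shutdown.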
Re: [PATCH v2] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
Posted by neil.armstrong@linaro.org 1 year, 5 months ago
On 03/07/2024 16:00, Daniel Lezcano wrote:
> On 03/07/2024 14:43, neil.armstrong@linaro.org wrote:
>> Hi,
>>
>> On 03/07/2024 14:25, Daniel Lezcano wrote:
>>>
>>> Hi Neil,
>>>
>>> it seems there is something wrong with the driver actually.
>>>
>>> There can be a moment where the sensor is not yet initialized for different reason, so reading the temperature fails. The routine will just retry until the sensor gets ready.
>>>
>>> Having these errors seem to me that the sensor for this specific thermal zone is never ready which may be the root cause of your issue. The change is spotting this problem IMO.
>>
>> Probably, but it gets printed every second until system shutdown, but only for a single thermal_zone.
>>
>> Using v1 of Rafael's patch makes the message disappear completely.
> 
> Yes, because you have probably the thermal zone polling delay set to zero, thus it fails the first time and does no longer try to set it up again. The V1 is an incomplete fix.
> 
> Very likely the problem is in the sensor platform driver, or in the thermal zone description in the device tree which describes a non functional thermal zone.
> 

It was at 0 but the delay was removed recently:
https://lore.kernel.org/all/20240510-topic-msm-polling-cleanup-v2-0-436ca4218da2@linaro.org/

That doesn't explain it because only the latest platforms print this error message.

Neil
Re: [PATCH v2] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
Posted by Daniel Lezcano 1 year, 5 months ago
On 03/07/2024 16:42, neil.armstrong@linaro.org wrote:
> On 03/07/2024 16:00, Daniel Lezcano wrote:
>> On 03/07/2024 14:43, neil.armstrong@linaro.org wrote:
>>> Hi,
>>>
>>> On 03/07/2024 14:25, Daniel Lezcano wrote:
>>>>
>>>> Hi Neil,
>>>>
>>>> it seems there is something wrong with the driver actually.
>>>>
>>>> There can be a moment where the sensor is not yet initialized for 
>>>> different reason, so reading the temperature fails. The routine will 
>>>> just retry until the sensor gets ready.
>>>>
>>>> Having these errors seem to me that the sensor for this specific 
>>>> thermal zone is never ready which may be the root cause of your 
>>>> issue. The change is spotting this problem IMO.
>>>
>>> Probably, but it gets printed every second until system shutdown, but 
>>> only for a single thermal_zone.
>>>
>>> Using v1 of Rafael's patch makes the message disappear completely.
>>
>> Yes, because you have probably the thermal zone polling delay set to 
>> zero, thus it fails the first time and does no longer try to set it up 
>> again. The V1 is an incomplete fix.
>>
>> Very likely the problem is in the sensor platform driver, or in the 
>> thermal zone description in the device tree which describes a non 
>> functional thermal zone.
>>
> 
> It was at 0 but the delay was removed recently:
> https://lore.kernel.org/all/20240510-topic-msm-polling-cleanup-v2-0-436ca4218da2@linaro.org/

Yes, these changes are because another change did:

commit 488164006a281986d95abbc4b26e340c19c4c85b
Author: Konrad Dybcio <konrad.dybcio@linaro.org>

     thermal/of: Assume polling-delay(-passive) 0 when absent

diff --git a/drivers/thermal/thermal_of.c b/drivers/thermal/thermal_of.c

> That doesn't explain it because only the last platforms have this error 
> message printed.

Let me recap.

It has been reported that if a thermal zone with zero delay fails to 
initialize because the sensor returns an error, there is no further 
attempt to initialize it and the thermal zone won't be functional.

The provided fix will periodically read the sensor temperature until 
there is a valid temperature. When there is a valid temperature, then 
the interrupts are set for the previous and the next temperature 
thresholds. That leads to the end of the routine of initializing the 
thermal zone and cancels the timer.

On the platforms you reported, the delay is zero (before and after the 
'polling cleanup').

My hypothesis is the following:

The thermal-zone29 describes a sensor which does not operate.

Before the patch:

On the first attempt to initialize it, the temperature is invalid. 
Because the delay is zero, the routine stops and there are no more 
attempts to initialize it. Nothing will happen to this thermal zone and 
it will stay stuck silently. So at this point, the thermal zone is 
broken and you don't notice it.

After the patch:

The initialization routine is constantly retrying to init the thermal zone.
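
To make the before/after difference concrete, here is a simplified
user-space model of the scheduling decision. It is only a sketch: struct
zone_model and next_check_delay_ms() are hypothetical names, TEMP_INVALID
is a stand-in for the kernel's THERMAL_TEMP_INVALID, and passive polling
is left out for brevity.

```c
#include <stdbool.h>

#define TEMP_INVALID		(-274000)	/* stand-in for THERMAL_TEMP_INVALID */
#define RECHECK_DELAY_MS	100		/* THERMAL_RECHECK_DELAY_MS in the patch */

/* Hypothetical, simplified model of a thermal zone. */
struct zone_model {
	int temperature;		/* millidegrees Celsius, or TEMP_INVALID */
	unsigned int polling_delay_ms;	/* 0 means no polling (interrupt driven) */
};

/*
 * Return the delay before the next temperature check, or 0 if no
 * timer gets armed and the zone waits for an interrupt instead.
 */
unsigned int next_check_delay_ms(const struct zone_model *tz, bool with_fix)
{
	if (tz->polling_delay_ms)
		return tz->polling_delay_ms;

	/* With the fix, an invalid temperature forces a recheck. */
	if (with_fix && tz->temperature == TEMP_INVALID)
		return RECHECK_DELAY_MS;

	return 0;
}
```

For a zone with zero polling delay and an invalid temperature, this model
returns 0 without the fix (the zone stays stuck) and 100 with it (the zone
keeps being rechecked until a valid temperature shows up).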

-------------------

If you revert the fix and try to read thermal zone 29, the read should 
always fail and return an error.

If I'm correct, then I suggest identifying which thermal zone is 29 
(type file), identifying the node name in the DT, finding the tsens 
channel and double-checking whether it really describes an existing sensor.



-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

Re: [PATCH v2] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
Posted by neil.armstrong@linaro.org 1 year, 5 months ago
On 03/07/2024 17:12, Daniel Lezcano wrote:
> On 03/07/2024 16:42, neil.armstrong@linaro.org wrote:
>> On 03/07/2024 16:00, Daniel Lezcano wrote:
>>> On 03/07/2024 14:43, neil.armstrong@linaro.org wrote:
>>>> Hi,
>>>>
>>>> On 03/07/2024 14:25, Daniel Lezcano wrote:
>>>>>
>>>>> Hi Neil,
>>>>>
>>>>> it seems there is something wrong with the driver actually.
>>>>>
>>>>> There can be a moment where the sensor is not yet initialized for different reason, so reading the temperature fails. The routine will just retry until the sensor gets ready.
>>>>>
>>>>> Having these errors seem to me that the sensor for this specific thermal zone is never ready which may be the root cause of your issue. The change is spotting this problem IMO.
>>>>
>>>> Probably, but it gets printed every second until system shutdown, but only for a single thermal_zone.
>>>>
>>>> Using v1 of Rafael's patch makes the message disappear completely.
>>>
>>> Yes, because you have probably the thermal zone polling delay set to zero, thus it fails the first time and does no longer try to set it up again. The V1 is an incomplete fix.
>>>
>>> Very likely the problem is in the sensor platform driver, or in the thermal zone description in the device tree which describes a non functional thermal zone.
>>>
>>
>> It was at 0 but the delay was removed recently:
>> https://lore.kernel.org/all/20240510-topic-msm-polling-cleanup-v2-0-436ca4218da2@linaro.org/
> 
> Yes, these changes are because another change did:
> 
> commit 488164006a281986d95abbc4b26e340c19c4c85b
> Author: Konrad Dybcio <konrad.dybcio@linaro.org>
> 
>      thermal/of: Assume polling-delay(-passive) 0 when absent
> 
> diff --git a/drivers/thermal/thermal_of.c b/drivers/thermal/thermal_of.c
> 
>> That doesn't explain it because only the last platforms have this error message printed.
> 
> Let me recap.
> 
> It has been reported if a thermal-zone with zero delay fails to initialize because the sensor returns an error, then there is no more attempt to initialize it and the thermal zone won't be functional.
> 
> The provided fix will periodically read the sensor temperature until there is a valid temperature. When there is a valid temperature, then the interrupts are set for the previous and the next temperature thresholds. That leads to the end of the routine of initializing the thermal zone and cancels the timer.
> 
> The platforms you reported, the delay is zero (before and after the 'polling cleanup').
> 
> My hypothesis is the following:
> 
> The thermal-zone29 describes a sensor which does not operate.
> 
> Before the patch:
> 
> First attempt to initialize it, the temperature is invalid, then because the delay is zero, the routine stops, and there is no more attempts to initialize it. Nothing will happen to this thermal zone and it will stay stuck silently. So at this point, the thermal zone is broken and you don't notice it.
> 
> After the patch:
> 
> The initialization routine is constantly retrying to init the thermal zone.
> 
> -------------------
> 
> If you revert the fix and you try to read the thermal zone 29, it should always fail to return an error.
> 
> If I'm correct, then I suggest to identify what thermal zone is 29 (type file), identify the node name in the DT, find the tsens channel and double check if it really describes an existing sensor
> 
> 
> 
OK, I just found out: it's the `qcom-battmgr-bat` thermal zone, and in CI we do not have the firmware, so the
temperature is never available. This is why it fails in a loop.

Before this patch it would fail silently, but would be useless if we start the firmware too late.

So since it's firmware based, valid data could arrive very late in the boot stage, and sending an
error message in a loop until the firmware is started doesn't seem right.

I think Rafael's new patch is good, but perhaps it should send an error when it finally stops monitoring.

Neil


Re: [PATCH v2] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
Posted by Rafael J. Wysocki 1 year, 5 months ago
On Thu, Jul 4, 2024 at 9:39 AM <neil.armstrong@linaro.org> wrote:
>
> On 03/07/2024 17:12, Daniel Lezcano wrote:
> > On 03/07/2024 16:42, neil.armstrong@linaro.org wrote:
> >> On 03/07/2024 16:00, Daniel Lezcano wrote:
> >>> On 03/07/2024 14:43, neil.armstrong@linaro.org wrote:
> >>>> Hi,
> >>>>
> >>>> On 03/07/2024 14:25, Daniel Lezcano wrote:
> >>>>>
> >>>>> Hi Neil,
> >>>>>
> >>>>> it seems there is something wrong with the driver actually.
> >>>>>
> >>>>> There can be a moment where the sensor is not yet initialized for different reason, so reading the temperature fails. The routine will just retry until the sensor gets ready.
> >>>>>
> >>>>> Having these errors seem to me that the sensor for this specific thermal zone is never ready which may be the root cause of your issue. The change is spotting this problem IMO.
> >>>>
> >>>> Probably, but it gets printed every second until system shutdown, but only for a single thermal_zone.
> >>>>
> >>>> Using v1 of Rafael's patch makes the message disappear completely.
> >>>
> >>> Yes, because you have probably the thermal zone polling delay set to zero, thus it fails the first time and does no longer try to set it up again. The V1 is an incomplete fix.
> >>>
> >>> Very likely the problem is in the sensor platform driver, or in the thermal zone description in the device tree which describes a non functional thermal zone.
> >>>
> >>
> >> It was at 0 but the delay was removed recently:
> >> https://lore.kernel.org/all/20240510-topic-msm-polling-cleanup-v2-0-436ca4218da2@linaro.org/
> >
> > Yes, these changes are because another change did:
> >
> > commit 488164006a281986d95abbc4b26e340c19c4c85b
> > Author: Konrad Dybcio <konrad.dybcio@linaro.org>
> >
> >      thermal/of: Assume polling-delay(-passive) 0 when absent
> >
> > diff --git a/drivers/thermal/thermal_of.c b/drivers/thermal/thermal_of.c
> >
> >> That doesn't explain it because only the last platforms have this error message printed.
> >
> > Let me recap.
> >
> > It has been reported if a thermal-zone with zero delay fails to initialize because the sensor returns an error, then there is no more attempt to initialize it and the thermal zone won't be functional.
> >
> > The provided fix will periodically read the sensor temperature until there is a valid temperature. When there is a valid temperature, then the interrupts are set for the previous and the next temperature thresholds. That leads to the end of the routine of initializing the thermal zone and cancels the timer.
> >
> > The platforms you reported, the delay is zero (before and after the 'polling cleanup').
> >
> > My hypothesis is the following:
> >
> > The thermal-zone29 describes a sensor which does not operate.
> >
> > Before the patch:
> >
> > First attempt to initialize it, the temperature is invalid, then because the delay is zero, the routine stops, and there is no more attempts to initialize it. Nothing will happen to this thermal zone and it will stay stuck silently. So at this point, the thermal zone is broken and you don't notice it.
> >
> > After the patch:
> >
> > The initialization routine is constantly retrying to init the thermal zone.
> >
> > -------------------
> >
> > If you revert the fix and you try to read the thermal zone 29, it should always fail to return an error.
> >
> > If I'm correct, then I suggest to identify what thermal zone is 29 (type file), identify the node name in the DT, find the tsens channel and double check if it really describes an existing sensor
> >
> >
> >
> OK I just found out, it's the `qcom-battmgr-bat` thermal zone, and in CI we do not have the firmwares so the
> temperature is never available, this is why it fails in a loop.
>
> Before this patch it would fail silently, but would be useless if we start the firmware too late.
>
> So since it's firmware based, valid data could arrive very late in the boot stage, and sending an
> error message in a loop until the firmware isn't started doesn't seem right.
>
> I think Rafael's new patch is good, but perhaps it should send an error when it finally stops monitoring.

Do you mean do something in addition to printing the message?  It can
do a couple of things.  For instance, it could disable the thermal
zone which would also cause a netlink message to be sent.  However,
I'd rather send another patch for this for the next cycle because we
are late in the current one and I'd rather stay on the conservative
side of things ATM.

Or do you mean the pr_info() log level is too low for this message?

Anyway, I'm going to submit the patch officially as is and please feel
free to send comments on that submission.
Re: [PATCH v2] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
Posted by Daniel Lezcano 1 year, 5 months ago
On 04/07/2024 09:39, neil.armstrong@linaro.org wrote:

[ ... ]

> OK I just found out, it's the `qcom-battmgr-bat` thermal zone, and in CI 
> we do not have the firmwares so the
> temperature is never available, this is why it fails in a loop.
> 
> Before this patch it would fail silently, but would be useless if we 
> start the firmware too late.
> 
> So since it's firmware based, valid data could arrive very late in the 
> boot stage, and sending an
> error message in a loop until the firmware isn't started doesn't seem 
> right.

Yeah, there was a similar bug with iwlwifi. They fixed it by registering 
the thermal zone after the firmware was successfully loaded.

Is it possible to do the same?

> I think Rafael's new patch is good, but perhaps it should send an error 
> when it finally stops monitoring.




Re: [PATCH v2] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
Posted by neil.armstrong@linaro.org 1 year, 5 months ago
On 04/07/2024 09:57, Daniel Lezcano wrote:
> On 04/07/2024 09:39, neil.armstrong@linaro.org wrote:
> 
> [ ... ]
> 
>> OK I just found out, it's the `qcom-battmgr-bat` thermal zone, and in CI we do not have the firmwares so the
>> temperature is never available, this is why it fails in a loop.
>>
>> Before this patch it would fail silently, but would be useless if we start the firmware too late.
>>
>> So since it's firmware based, valid data could arrive very late in the boot stage, and sending an
>> error message in a loop until the firmware isn't started doesn't seem right.
> 
> Yeah, there was a similar bug with iwlwifi. They fixed it by registering the thermal zone after the firmware was successfully loaded.
> 
> Is that possible to do the same ?

The thermal zone is indirect, it's registered via power_supply_core.

An attempt was made to delay registering the power supply, since it caused issues in suspend/resume,
but it was reverted because it would require much more work:
https://lore.kernel.org/all/20240123160053.18331-1-johan+linaro@kernel.org/

It seems we should return -EAGAIN instead of -ENODEV in qcom_battmgr_bat_get_property().

But I think power_supply_read_temp() should return -EAGAIN on -ENODEV, since that is the return
code used when a power supply isn't initialized.
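
Something along these lines, as a user-space sketch only. The helper
read_temp_retryable(), the temp_getter_fn type and the two fake getters
are hypothetical and do not exist in the kernel; the real change would
live in power_supply_read_temp() or in the battmgr driver.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a power supply property getter. */
typedef int (*temp_getter_fn)(void *ctx, int *temp);

/*
 * Read the temperature through 'get' and report "supply not
 * initialized yet" (-ENODEV) as a transient condition (-EAGAIN),
 * so the caller can retry later instead of treating it as fatal.
 */
int read_temp_retryable(temp_getter_fn get, void *ctx, int *temp)
{
	int ret = get(ctx, temp);

	if (ret == -ENODEV)
		return -EAGAIN;

	return ret;
}

/* Fake getter: the firmware has not been started yet. */
int fake_get_not_ready(void *ctx, int *temp)
{
	(void)ctx;
	(void)temp;
	return -ENODEV;
}

/* Fake getter: the firmware is up and a (made-up) value is available. */
int fake_get_ok(void *ctx, int *temp)
{
	(void)ctx;
	*temp = 250;
	return 0;
}
```

Any other error code is passed through unchanged, so a genuinely broken
supply would still be reported as such.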

Neil

> 
>> I think Rafael's new patch is good, but perhaps it should send an error when it finally stops monitoring.
> 
> 
>
Re: [PATCH v2] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
Posted by Rafael J. Wysocki 1 year, 5 months ago
Hi,

On Wed, Jul 3, 2024 at 1:04 PM Neil Armstrong <neil.armstrong@linaro.org> wrote:
>
> Hi,
>
> On 28/06/2024 14:10, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
> > if zone temperature is invalid") caused __thermal_zone_device_update()
> > to return early if the current thermal zone temperature was invalid.
> >
> > This was done to avoid running handle_thermal_trip() and governor
> > callbacks in that case which led to confusion.  However, it went too
> > far because monitor_thermal_zone() still needs to be called even when
> > the zone temperature is invalid to ensure that it will be updated
> > eventually in case thermal polling is enabled and the driver has no
> > other means to notify the core of zone temperature changes (for example,
> > it does not register an interrupt handler or ACPI notifier).
> >
> > Also if the .set_trips() zone callback is expected to set up monitoring
> > interrupts for a thermal zone, it has to be provided with valid
> > boundaries and that can only happen if the zone temperature is known.
> >
> > Accordingly, to ensure that __thermal_zone_device_update() will
> > run again after a failing zone temperature check, make it call
> > monitor_thermal_zone() regardless of whether or not the zone
> > temperature is valid and make the latter schedule a thermal zone
> > temperature update if the zone temperature is invalid even if
> > polling is not enabled for the thermal zone.
> >
> > Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
> > Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >   drivers/thermal/thermal_core.c |    5 ++++-
> >   drivers/thermal/thermal_core.h |    6 ++++++
> >   2 files changed, 10 insertions(+), 1 deletion(-)
> >
> > Index: linux-pm/drivers/thermal/thermal_core.c
> > ===================================================================
> > --- linux-pm.orig/drivers/thermal/thermal_core.c
> > +++ linux-pm/drivers/thermal/thermal_core.c
> > @@ -300,6 +300,8 @@ static void monitor_thermal_zone(struct
> >               thermal_zone_device_set_polling(tz, tz->passive_delay_jiffies);
> >       else if (tz->polling_delay_jiffies)
> >               thermal_zone_device_set_polling(tz, tz->polling_delay_jiffies);
> > +     else if (tz->temperature == THERMAL_TEMP_INVALID)
> > +             thermal_zone_device_set_polling(tz, msecs_to_jiffies(THERMAL_RECHECK_DELAY_MS));
> >   }
> >
> >   static struct thermal_governor *thermal_get_tz_governor(struct thermal_zone_device *tz)
> > @@ -514,7 +516,7 @@ void __thermal_zone_device_update(struct
> >       update_temperature(tz);
> >
> >       if (tz->temperature == THERMAL_TEMP_INVALID)
> > -             return;
> > +             goto monitor;
> >
> >       tz->notify_event = event;
> >
> > @@ -536,6 +538,7 @@ void __thermal_zone_device_update(struct
> >
> >       thermal_debug_update_trip_stats(tz);
> >
> > +monitor:
> >       monitor_thermal_zone(tz);
> >   }
> >
> > Index: linux-pm/drivers/thermal/thermal_core.h
> > ===================================================================
> > --- linux-pm.orig/drivers/thermal/thermal_core.h
> > +++ linux-pm/drivers/thermal/thermal_core.h
> > @@ -133,6 +133,12 @@ struct thermal_zone_device {
> >       struct thermal_trip_desc trips[] __counted_by(num_trips);
> >   };
> >
> > +/*
> > + * Default delay after a failing thermal zone temperature check before
> > + * attempting to check it again.
> > + */
> > +#define THERMAL_RECHECK_DELAY_MS     100
> > +
> >   /* Default Thermal Governor */
> >   #if defined(CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE)
> >   #define DEFAULT_THERMAL_GOVERNOR       "step_wise"
> >
> >
> >
> >
>
> This patch on next-20240702 makes Qualcomm HDK8350, HDK8450, QRD8550, HDK8560, QRD8650 & HDK8650 output in loop:
>
> thermal thermal_zoneXX: failed to read out thermal zone (-19)

Is the loop endless?  If not, how many times does the message get printed?

If I'm not mistaken, it would be printed at least once without the
commit in question.  Can you please check that?

Also, can you check the previous version of the patch in question:

https://lore.kernel.org/linux-pm/2745114.mvXUDI8C0e@rjwysocki.net/

and see if it has the same problem (just apply it instead of the $subject one).

Thanks!
Re: [PATCH v2] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
Posted by neil.armstrong@linaro.org 1 year, 5 months ago
Hi,

On 03/07/2024 13:29, Rafael J. Wysocki wrote:
> Hi,
> 
> On Wed, Jul 3, 2024 at 1:04 PM Neil Armstrong <neil.armstrong@linaro.org> wrote:
>>
>> Hi,
>>
>> On 28/06/2024 14:10, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
>>> if zone temperature is invalid") caused __thermal_zone_device_update()
>>> to return early if the current thermal zone temperature was invalid.
>>>
>>> This was done to avoid running handle_thermal_trip() and governor
>>> callbacks in that case which led to confusion.  However, it went too
>>> far because monitor_thermal_zone() still needs to be called even when
>>> the zone temperature is invalid to ensure that it will be updated
>>> eventually in case thermal polling is enabled and the driver has no
>>> other means to notify the core of zone temperature changes (for example,
>>> it does not register an interrupt handler or ACPI notifier).
>>>
>>> Also if the .set_trips() zone callback is expected to set up monitoring
>>> interrupts for a thermal zone, it has to be provided with valid
>>> boundaries and that can only happen if the zone temperature is known.
>>>
>>> Accordingly, to ensure that __thermal_zone_device_update() will
>>> run again after a failing zone temperature check, make it call
>>> monitor_thermal_zone() regardless of whether or not the zone
>>> temperature is valid and make the latter schedule a thermal zone
>>> temperature update if the zone temperature is invalid even if
>>> polling is not enabled for the thermal zone.
>>>
>>> Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
>>> Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> ---
>>>    drivers/thermal/thermal_core.c |    5 ++++-
>>>    drivers/thermal/thermal_core.h |    6 ++++++
>>>    2 files changed, 10 insertions(+), 1 deletion(-)
>>>
>>> Index: linux-pm/drivers/thermal/thermal_core.c
>>> ===================================================================
>>> --- linux-pm.orig/drivers/thermal/thermal_core.c
>>> +++ linux-pm/drivers/thermal/thermal_core.c
>>> @@ -300,6 +300,8 @@ static void monitor_thermal_zone(struct
>>>                thermal_zone_device_set_polling(tz, tz->passive_delay_jiffies);
>>>        else if (tz->polling_delay_jiffies)
>>>                thermal_zone_device_set_polling(tz, tz->polling_delay_jiffies);
>>> +     else if (tz->temperature == THERMAL_TEMP_INVALID)
>>> +             thermal_zone_device_set_polling(tz, msecs_to_jiffies(THERMAL_RECHECK_DELAY_MS));
>>>    }
>>>
>>>    static struct thermal_governor *thermal_get_tz_governor(struct thermal_zone_device *tz)
>>> @@ -514,7 +516,7 @@ void __thermal_zone_device_update(struct
>>>        update_temperature(tz);
>>>
>>>        if (tz->temperature == THERMAL_TEMP_INVALID)
>>> -             return;
>>> +             goto monitor;
>>>
>>>        tz->notify_event = event;
>>>
>>> @@ -536,6 +538,7 @@ void __thermal_zone_device_update(struct
>>>
>>>        thermal_debug_update_trip_stats(tz);
>>>
>>> +monitor:
>>>        monitor_thermal_zone(tz);
>>>    }
>>>
>>> Index: linux-pm/drivers/thermal/thermal_core.h
>>> ===================================================================
>>> --- linux-pm.orig/drivers/thermal/thermal_core.h
>>> +++ linux-pm/drivers/thermal/thermal_core.h
>>> @@ -133,6 +133,12 @@ struct thermal_zone_device {
>>>        struct thermal_trip_desc trips[] __counted_by(num_trips);
>>>    };
>>>
>>> +/*
>>> + * Default delay after a failing thermal zone temperature check before
>>> + * attempting to check it again.
>>> + */
>>> +#define THERMAL_RECHECK_DELAY_MS     100
>>> +
>>>    /* Default Thermal Governor */
>>>    #if defined(CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE)
>>>    #define DEFAULT_THERMAL_GOVERNOR       "step_wise"
>>>
>>>
>>>
>>>
>>
>> This patch on next-20240702 makes Qualcomm HDK8350, HDK8450, QRD8550, HDK8560, QRD8650 & HDK8650 output in loop:
>>
>> thermal thermal_zoneXX: failed to read out thermal zone (-19)
> 
> Is the loop endless?  If not, how many times does the message get printed?

It gets printed indefinitely.

> 
> If I'm not mistaken, it would be printed at least once without the
> commit in question.  Can you please check that?
> 
> Also, can you check the previous version of the patch in question:
> 
> https://lore.kernel.org/linux-pm/2745114.mvXUDI8C0e@rjwysocki.net/
> 
> and see if it has the same problem (just apply it instead of the $subject one).

I reverted this one and applied v1, and the message disappeared completely.

Neil

> 
> Thanks!

Re: [PATCH v2] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
Posted by Daniel Lezcano 1 year, 5 months ago
On 28/06/2024 14:10, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
> if zone temperature is invalid") caused __thermal_zone_device_update()
> to return early if the current thermal zone temperature was invalid.
> 
> This was done to avoid running handle_thermal_trip() and governor
> callbacks in that case which led to confusion.  However, it went too
> far because monitor_thermal_zone() still needs to be called even when
> the zone temperature is invalid to ensure that it will be updated
> eventually in case thermal polling is enabled and the driver has no
> other means to notify the core of zone temperature changes (for example,
> it does not register an interrupt handler or ACPI notifier).
> 
> Also if the .set_trips() zone callback is expected to set up monitoring
> interrupts for a thermal zone, it has to be provided with valid
> boundaries and that can only happen if the zone temperature is known.
> 
> Accordingly, to ensure that __thermal_zone_device_update() will
> run again after a failing zone temperature check, make it call
> monitor_thermal_zone() regardless of whether or not the zone
> temperature is valid and make the latter schedule a thermal zone
> temperature update if the zone temperature is invalid even if
> polling is not enabled for the thermal zone.
> 
> Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
> Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Tested-by: Daniel Lezcano <daniel.lezcano@linaro.org>
