[PATCH v4] PM: EM: Fix potential division-by-zero error in em_compute_costs()

Yaxiong Tian posted 1 patch 8 months ago
There is a newer version of this series
kernel/power/energy_model.c | 4 ++++
1 file changed, 4 insertions(+)
[PATCH v4] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Yaxiong Tian 8 months ago
From: Yaxiong Tian <tianyaxiong@kylinos.cn>

When the device is of a non-CPU type, table[i].performance won't be
initialized in the previous em_init_performance(), resulting in division
by zero when calculating costs in em_compute_costs().

Since the 'cost' algorithm is only used for EAS energy efficiency
calculations and is currently not utilized by other device drivers, we
should add the _is_cpu_device(dev) check to prevent this division-by-zero
issue.

Fixes: 1b600da51073 ("PM: EM: Optimize em_cpu_energy() and remove division")
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
---
 kernel/power/energy_model.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index d9b7e2b38c7a..41606247c277 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -233,6 +233,10 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 	unsigned long prev_cost = ULONG_MAX;
 	int i, ret;
 
+	/* This is needed only for CPUs and EAS skip other devices */
+	if (!_is_cpu_device(dev))
+		return 0;
+
 	/* Compute the cost of each performance state. */
 	for (i = nr_states - 1; i >= 0; i--) {
 		unsigned long power_res, cost;
-- 
2.25.1
Re: [PATCH v4] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Lukasz Luba 8 months ago

On 4/17/25 02:07, Yaxiong Tian wrote:
> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
> 
> When the device is of a non-CPU type, table[i].performance won't be
> initialized in the previous em_init_performance(), resulting in division
> by zero when calculating costs in em_compute_costs().
> 
> Since the 'cost' algorithm is only used for EAS energy efficiency
> calculations and is currently not utilized by other device drivers, we
> should add the _is_cpu_device(dev) check to prevent this division-by-zero
> issue.
> 
> Fixes: 1b600da51073 ("PM: EM: Optimize em_cpu_energy() and remove division")
> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
> ---
>   kernel/power/energy_model.c | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index d9b7e2b38c7a..41606247c277 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -233,6 +233,10 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>   	unsigned long prev_cost = ULONG_MAX;
>   	int i, ret;
>   
> +	/* This is needed only for CPUs and EAS skip other devices */
> +	if (!_is_cpu_device(dev))
> +		return 0;
> +
>   	/* Compute the cost of each performance state. */
>   	for (i = nr_states - 1; i >= 0; i--) {
>   		unsigned long power_res, cost;


Please stop for a while. I have to check what happened that you
faced the issue in the first place. I have been testing the GPU
EMs and there was no issues...

Let me debug that today.
Re: [PATCH v4] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Yaxiong Tian 8 months ago

在 2025/4/17 13:57, Lukasz Luba 写道:
> 
> 
> On 4/17/25 02:07, Yaxiong Tian wrote:
>> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>
>> When the device is of a non-CPU type, table[i].performance won't be
>> initialized in the previous em_init_performance(), resulting in division
>> by zero when calculating costs in em_compute_costs().
>>
>> Since the 'cost' algorithm is only used for EAS energy efficiency
>> calculations and is currently not utilized by other device drivers, we
>> should add the _is_cpu_device(dev) check to prevent this division-by-zero
>> issue.
>>
>> Fixes: 1b600da51073 ("PM: EM: Optimize em_cpu_energy() and remove 
>> division")
>> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
>> ---
>>   kernel/power/energy_model.c | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index d9b7e2b38c7a..41606247c277 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -233,6 +233,10 @@ static int em_compute_costs(struct device *dev, 
>> struct em_perf_state *table,
>>       unsigned long prev_cost = ULONG_MAX;
>>       int i, ret;
>> +    /* This is needed only for CPUs and EAS skip other devices */
>> +    if (!_is_cpu_device(dev))
>> +        return 0;
>> +
>>       /* Compute the cost of each performance state. */
>>       for (i = nr_states - 1; i >= 0; i--) {
>>           unsigned long power_res, cost;
> 
> 
> Please stop for a while. I have to check what happened that you
> faced the issue in the first place. I have been testing the GPU
> EMs and there was no issues...
> 
> Let me debug that today.

Of course. Since I don't have actual hardware, I can only logically
deduce that this issue might exist.

Re: [PATCH v4] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Lukasz Luba 8 months ago

On 4/17/25 08:43, Yaxiong Tian wrote:
> 
> 
> 在 2025/4/17 13:57, Lukasz Luba 写道:
>>
>>
>> On 4/17/25 02:07, Yaxiong Tian wrote:
>>> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>>
>>> When the device is of a non-CPU type, table[i].performance won't be
>>> initialized in the previous em_init_performance(), resulting in division
>>> by zero when calculating costs in em_compute_costs().
>>>
>>> Since the 'cost' algorithm is only used for EAS energy efficiency
>>> calculations and is currently not utilized by other device drivers, we
>>> should add the _is_cpu_device(dev) check to prevent this 
>>> division-by-zero
>>> issue.
>>>
>>> Fixes: 1b600da51073 ("PM: EM: Optimize em_cpu_energy() and remove 
>>> division")
>>> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>> ---
>>>   kernel/power/energy_model.c | 4 ++++
>>>   1 file changed, 4 insertions(+)
>>>
>>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>>> index d9b7e2b38c7a..41606247c277 100644
>>> --- a/kernel/power/energy_model.c
>>> +++ b/kernel/power/energy_model.c
>>> @@ -233,6 +233,10 @@ static int em_compute_costs(struct device *dev, 
>>> struct em_perf_state *table,
>>>       unsigned long prev_cost = ULONG_MAX;
>>>       int i, ret;
>>> +    /* This is needed only for CPUs and EAS skip other devices */
>>> +    if (!_is_cpu_device(dev))
>>> +        return 0;
>>> +
>>>       /* Compute the cost of each performance state. */
>>>       for (i = nr_states - 1; i >= 0; i--) {
>>>           unsigned long power_res, cost;
>>
>>
>> Please stop for a while. I have to check what happened that you
>> faced the issue in the first place. I have been testing the GPU
>> EMs and there was no issues...
>>
>> Let me debug that today.
> 
> Of course. Since I don't have actual hardware, I can only logically
> deduce that this issue might exist.
> 
> 

I have run with the GPU EM registered in the boot:

-------------------------------------------------------
[    2.753333] panfrost ff9a0000.gpu: EM: created perf domain
[    2.759863] panfrost ff9a0000.gpu: mali-t860 id 0x860 major 0x2 minor 
0x0 status 0x0
[    2.768530] panfrost ff9a0000.gpu: features: 00000000,00000407, 
issues: 00000000,24040400
[    2.777678] panfrost ff9a0000.gpu: Features: L2:0x07120206 
Shader:0x00000000 Tiler:0x00000809 Mem:0x1 MMU:0x00002830 AS:0xff JS:0x7
[    2.780746] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 
150000000Hz, actual 148500000HZ div = 0)
[    2.790905] panfrost ff9a0000.gpu: shader_present=0xf l2_present=0x1

root@arm:~# cat /sys/kernel/debug/energy_model/ff9a0000.gpu/flags 

0x1
root@arm:~# grep . /sys/kernel/debug/energy_model/ff9a0000.gpu/ps*/*
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/cost:0
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/frequency:200000
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/inefficient:1
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/performance:0
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/power:404250
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/cost:0
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/frequency:300000
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/inefficient:1
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/performance:0
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/power:606375
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/cost:0
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/frequency:400000
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/inefficient:1
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/performance:0
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/power:808500
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/cost:0
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/frequency:600000
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/inefficient:0
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/performance:0
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/power:1505790

--------------------------------------------------------

The EM for the GPU is not modified during the boot like the CPUs'
EM are, thus this code is not triggered. Although, the API is
open and in theory the GPU EM can be modified at runtime
as well and it will reach that em_compute_costs() issue
with 'performance' field having value 0.

So this v4 patch would be needed in this case.

Please re-send this v4 patch as a completely new message.

Thanks for looking at that code path and the fix for potential
issue.

You can also add my:

Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>

Regrds,
Lukasz

Re: [PATCH v4] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Yaxiong Tian 8 months ago

在 2025/4/17 21:27, Lukasz Luba 写道:
> 
> 
> On 4/17/25 08:43, Yaxiong Tian wrote:
>>
>>
>> 在 2025/4/17 13:57, Lukasz Luba 写道:
>>>
>>>
>>> On 4/17/25 02:07, Yaxiong Tian wrote:
>>>> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>>>
>>>> When the device is of a non-CPU type, table[i].performance won't be
>>>> initialized in the previous em_init_performance(), resulting in 
>>>> division
>>>> by zero when calculating costs in em_compute_costs().
>>>>
>>>> Since the 'cost' algorithm is only used for EAS energy efficiency
>>>> calculations and is currently not utilized by other device drivers, we
>>>> should add the _is_cpu_device(dev) check to prevent this 
>>>> division-by-zero
>>>> issue.
>>>>
>>>> Fixes: 1b600da51073 ("PM: EM: Optimize em_cpu_energy() and remove 
>>>> division")
>>>> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>>> ---
>>>>   kernel/power/energy_model.c | 4 ++++
>>>>   1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>>>> index d9b7e2b38c7a..41606247c277 100644
>>>> --- a/kernel/power/energy_model.c
>>>> +++ b/kernel/power/energy_model.c
>>>> @@ -233,6 +233,10 @@ static int em_compute_costs(struct device *dev, 
>>>> struct em_perf_state *table,
>>>>       unsigned long prev_cost = ULONG_MAX;
>>>>       int i, ret;
>>>> +    /* This is needed only for CPUs and EAS skip other devices */
>>>> +    if (!_is_cpu_device(dev))
>>>> +        return 0;
>>>> +
>>>>       /* Compute the cost of each performance state. */
>>>>       for (i = nr_states - 1; i >= 0; i--) {
>>>>           unsigned long power_res, cost;
>>>
>>>
>>> Please stop for a while. I have to check what happened that you
>>> faced the issue in the first place. I have been testing the GPU
>>> EMs and there was no issues...
>>>
>>> Let me debug that today.
>>
>> Of course. Since I don't have actual hardware, I can only logically
>> deduce that this issue might exist.
>>
>>
> 
> I have run with the GPU EM registered in the boot:
> 
> -------------------------------------------------------
> [    2.753333] panfrost ff9a0000.gpu: EM: created perf domain
> [    2.759863] panfrost ff9a0000.gpu: mali-t860 id 0x860 major 0x2 minor 
> 0x0 status 0x0
> [    2.768530] panfrost ff9a0000.gpu: features: 00000000,00000407, 
> issues: 00000000,24040400
> [    2.777678] panfrost ff9a0000.gpu: Features: L2:0x07120206 
> Shader:0x00000000 Tiler:0x00000809 Mem:0x1 MMU:0x00002830 AS:0xff JS:0x7
> [    2.780746] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 
> 150000000Hz, actual 148500000HZ div = 0)
> [    2.790905] panfrost ff9a0000.gpu: shader_present=0xf l2_present=0x1
> 
> root@arm:~# cat /sys/kernel/debug/energy_model/ff9a0000.gpu/flags
> 0x1
> root@arm:~# grep . /sys/kernel/debug/energy_model/ff9a0000.gpu/ps*/*
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/cost:0
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/frequency:200000
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/inefficient:1
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/performance:0
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/power:404250
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/cost:0
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/frequency:300000
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/inefficient:1
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/performance:0
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/power:606375
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/cost:0
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/frequency:400000
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/inefficient:1
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/performance:0
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/power:808500
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/cost:0
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/frequency:600000
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/inefficient:0
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/performance:0
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/power:1505790
> 
> --------------------------------------------------------
> 
> The EM for the GPU is not modified during the boot like the CPUs'
> EM are, thus this code is not triggered. Although, the API is
> open and in theory the GPU EM can be modified at runtime
> as well and it will reach that em_compute_costs() issue
> with 'performance' field having value 0.
> 
> So this v4 patch would be needed in this case.
> 
> Please re-send this v4 patch as a completely new message.
> 
> Thanks for looking at that code path and the fix for potential
> issue.
> 
> You can also add my:
> 
> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
> 
> Regrds,
> Lukasz

Got it - patch resent with Reviewed-by.

https://lore.kernel.org/all/tencent_7F99ED4767C1AF7889D0D8AD50F34859CE06@qq.com/

Re: [PATCH v4] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Yaxiong Tian 7 months, 2 weeks ago
在 2025/4/18 09:26, Yaxiong Tian 写道:

> 在 2025/4/17 21:27, Lukasz Luba 写道:
>> I have run with the GPU EM registered in the boot:
>>
>> -------------------------------------------------------
>> [    2.753333] panfrost ff9a0000.gpu: EM: created perf domain
>> [    2.759863] panfrost ff9a0000.gpu: mali-t860 id 0x860 major 0x2 
>> minor 0x0 status 0x0
>> [    2.768530] panfrost ff9a0000.gpu: features: 00000000,00000407, 
>> issues: 00000000,24040400
>> [    2.777678] panfrost ff9a0000.gpu: Features: L2:0x07120206 
>> Shader:0x00000000 Tiler:0x00000809 Mem:0x1 MMU:0x00002830 AS:0xff JS:0x7
>> [    2.780746] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot 
>> req 150000000Hz, actual 148500000HZ div = 0)
>> [    2.790905] panfrost ff9a0000.gpu: shader_present=0xf l2_present=0x1
>>
>> root@arm:~# cat /sys/kernel/debug/energy_model/ff9a0000.gpu/flags
>> 0x1
>> root@arm:~# grep . /sys/kernel/debug/energy_model/ff9a0000.gpu/ps*/*
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/cost:0
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/frequency:200000
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/inefficient:1
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/performance:0
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/power:404250
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/cost:0
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/frequency:300000
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/inefficient:1
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/performance:0
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/power:606375
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/cost:0
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/frequency:400000
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/inefficient:1
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/performance:0
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/power:808500
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/cost:0
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/frequency:600000
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/inefficient:0
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/performance:0
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/power:1505790
>>
>> --------------------------------------------------------
>>
>> The EM for the GPU is not modified during the boot like the CPUs'
>> EM are, thus this code is not triggered. Although, the API is
>> open and in theory the GPU EM can be modified at runtime
>> as well and it will reach that em_compute_costs() issue
>> with 'performance' field having value 0.
>>
>> So this v4 patch would be needed in this case.
>>
>> Please re-send this v4 patch as a completely new message.
>>
>> Thanks for looking at that code path and the fix for potential
>> issue.
>>
>> You can also add my:
>>
>> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
>>
>> Regrds,
>> Lukasz
>
> Got it - patch resent with Reviewed-by.
>
> https://lore.kernel.org/all/tencent_7F99ED4767C1AF7889D0D8AD50F34859CE06@qq.com/ 
>

Hi Rafael:

This bug appears to be stalled. Are there any alternative fixes we could 
consider?