[PATCH v2] PM: EM: Fix potential division-by-zero error in em_compute_costs()

Yaxiong Tian posted 1 patch 4 weeks ago
There is a newer version of this series
kernel/power/energy_model.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
[PATCH v2] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Yaxiong Tian 4 weeks ago
From: Yaxiong Tian <tianyaxiong@kylinos.cn>

When the device is of a non-CPU type, table[i].performance won't be
initialized in the previous em_init_performance(), resulting in division
by zero when calculating costs in em_compute_costs().

Since the 'cost' algorithm is only used for EAS energy efficiency
calculations and is currently not utilized by other device drivers, we
should add the _is_cpu_device(dev) check to prevent this division-by-zero
issue.

Fixes: <1b600da51073> ("PM: EM: Optimize em_cpu_energy() and remove division")
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
---
 kernel/power/energy_model.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index d9b7e2b38c7a..d1fa7e8787b5 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -244,7 +244,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 					cost, ret);
 				return -EINVAL;
 			}
-		} else {
+		} else if (_is_cpu_device(dev)) {
 			/* increase resolution of 'cost' precision */
 			power_res = table[i].power * 10;
 			cost = power_res / table[i].performance;
-- 
2.25.1
Re: [PATCH v2] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Lukasz Luba 3 weeks, 3 days ago
Hi Yaxiong,

On 4/11/25 02:28, Yaxiong Tian wrote:
> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
> 
> When the device is of a non-CPU type, table[i].performance won't be
> initialized in the previous em_init_performance(), resulting in division
> by zero when calculating costs in em_compute_costs().
> 
> Since the 'cost' algorithm is only used for EAS energy efficiency
> calculations and is currently not utilized by other device drivers, we
> should add the _is_cpu_device(dev) check to prevent this division-by-zero
> issue.
> 
> Fixes: <1b600da51073> ("PM: EM: Optimize em_cpu_energy() and remove division")
> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
> ---
>   kernel/power/energy_model.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index d9b7e2b38c7a..d1fa7e8787b5 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -244,7 +244,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>   					cost, ret);
>   				return -EINVAL;
>   			}
> -		} else {
> +		} else if (_is_cpu_device(dev)) {
>   			/* increase resolution of 'cost' precision */
>   			power_res = table[i].power * 10;
>   			cost = power_res / table[i].performance;


As the test robot pointed out, please set the 'cost' to 0
where it's declared.

The rest should be fine.

Regards,
Lukasz
Re: [PATCH v2] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Yaxiong Tian 3 weeks, 3 days ago
在 2025/4/14 16:08, Lukasz Luba 写道:
> Hi Yaxiong,
> 
> On 4/11/25 02:28, Yaxiong Tian wrote:
>> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>
>> When the device is of a non-CPU type, table[i].performance won't be
>> initialized in the previous em_init_performance(), resulting in division
>> by zero when calculating costs in em_compute_costs().
>>
>> Since the 'cost' algorithm is only used for EAS energy efficiency
>> calculations and is currently not utilized by other device drivers, we
>> should add the _is_cpu_device(dev) check to prevent this division-by-zero
>> issue.
>>
>> Fixes: <1b600da51073> ("PM: EM: Optimize em_cpu_energy() and remove 
>> division")
>> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
>> ---
>>   kernel/power/energy_model.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index d9b7e2b38c7a..d1fa7e8787b5 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -244,7 +244,7 @@ static int em_compute_costs(struct device *dev, 
>> struct em_perf_state *table,
>>                       cost, ret);
>>                   return -EINVAL;
>>               }
>> -        } else {
>> +        } else if (_is_cpu_device(dev)) {
>>               /* increase resolution of 'cost' precision */
>>               power_res = table[i].power * 10;
>>               cost = power_res / table[i].performance;
> 
> 
> As the test robot pointed out, please set the 'cost' to 0
> where it's declared.
> 
> The rest should be fine.
> 
> Regards,
> Lukasz

Sorry, the V3 version with cost=0 still has issues.

I noticed that if the cost is set to 0, the condition "if (table[i].cost
  >= prev_cost)" in the following code will always evaluate to true. This
  will incorrectly set the flags to EM_PERF_STATE_INEFFICIENT.

Should we change ">=" to ">"?






Re: [PATCH v2] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Yaxiong Tian 3 weeks, 3 days ago
在 2025/4/15 09:12, Yaxiong Tian 写道:
> 
> 
> 在 2025/4/14 16:08, Lukasz Luba 写道:
>> Hi Yaxiong,
>>
>> On 4/11/25 02:28, Yaxiong Tian wrote:
>>> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>>
>>> When the device is of a non-CPU type, table[i].performance won't be
>>> initialized in the previous em_init_performance(), resulting in division
>>> by zero when calculating costs in em_compute_costs().
>>>
>>> Since the 'cost' algorithm is only used for EAS energy efficiency
>>> calculations and is currently not utilized by other device drivers, we
>>> should add the _is_cpu_device(dev) check to prevent this 
>>> division-by-zero
>>> issue.
>>>
>>> Fixes: <1b600da51073> ("PM: EM: Optimize em_cpu_energy() and remove 
>>> division")
>>> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>> ---
>>>   kernel/power/energy_model.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>>> index d9b7e2b38c7a..d1fa7e8787b5 100644
>>> --- a/kernel/power/energy_model.c
>>> +++ b/kernel/power/energy_model.c
>>> @@ -244,7 +244,7 @@ static int em_compute_costs(struct device *dev, 
>>> struct em_perf_state *table,
>>>                       cost, ret);
>>>                   return -EINVAL;
>>>               }
>>> -        } else {
>>> +        } else if (_is_cpu_device(dev)) {
>>>               /* increase resolution of 'cost' precision */
>>>               power_res = table[i].power * 10;
>>>               cost = power_res / table[i].performance;
>>
>>
>> As the test robot pointed out, please set the 'cost' to 0
>> where it's declared.
>>
>> The rest should be fine.
>>
>> Regards,
>> Lukasz
> 
> Sorry, the V3 version with cost=0 still has issues.
> 
> I noticed that if the cost is set to 0, the condition "if (table[i].cost
>   >= prev_cost)" in the following code will always evaluate to true. This
>   will incorrectly set the flags to EM_PERF_STATE_INEFFICIENT.
> 
> Should we change ">=" to ">"?
> 

Sorry Again, Setting EM_PERF_STATE_INEFFICIENT in this case is correct.
Earlier, I misunderstood the definition/usage of EM_PERF_STATE_INEFFICIENT.


Re: [PATCH v2] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Rafael J. Wysocki 3 weeks, 2 days ago
On Tue, Apr 15, 2025 at 4:03 AM Yaxiong Tian <iambestgod@qq.com> wrote:
>
>
>
> 在 2025/4/15 09:12, Yaxiong Tian 写道:
> >
> >
> > 在 2025/4/14 16:08, Lukasz Luba 写道:
> >> Hi Yaxiong,
> >>
> >> On 4/11/25 02:28, Yaxiong Tian wrote:
> >>> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
> >>>
> >>> When the device is of a non-CPU type, table[i].performance won't be
> >>> initialized in the previous em_init_performance(), resulting in division
> >>> by zero when calculating costs in em_compute_costs().
> >>>
> >>> Since the 'cost' algorithm is only used for EAS energy efficiency
> >>> calculations and is currently not utilized by other device drivers, we
> >>> should add the _is_cpu_device(dev) check to prevent this
> >>> division-by-zero
> >>> issue.
> >>>
> >>> Fixes: <1b600da51073> ("PM: EM: Optimize em_cpu_energy() and remove
> >>> division")
> >>> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
> >>> ---
> >>>   kernel/power/energy_model.c | 2 +-
> >>>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> >>> index d9b7e2b38c7a..d1fa7e8787b5 100644
> >>> --- a/kernel/power/energy_model.c
> >>> +++ b/kernel/power/energy_model.c
> >>> @@ -244,7 +244,7 @@ static int em_compute_costs(struct device *dev,
> >>> struct em_perf_state *table,
> >>>                       cost, ret);
> >>>                   return -EINVAL;
> >>>               }
> >>> -        } else {
> >>> +        } else if (_is_cpu_device(dev)) {
> >>>               /* increase resolution of 'cost' precision */
> >>>               power_res = table[i].power * 10;
> >>>               cost = power_res / table[i].performance;
> >>
> >>
> >> As the test robot pointed out, please set the 'cost' to 0
> >> where it's declared.
> >>
> >> The rest should be fine.
> >>
> >> Regards,
> >> Lukasz
> >
> > Sorry, the V3 version with cost=0 still has issues.
> >
> > I noticed that if the cost is set to 0, the condition "if (table[i].cost
> >   >= prev_cost)" in the following code will always evaluate to true. This
> >   will incorrectly set the flags to EM_PERF_STATE_INEFFICIENT.
> >
> > Should we change ">=" to ">"?
> >
>
> Sorry Again, Setting EM_PERF_STATE_INEFFICIENT in this case is correct.
> Earlier, I misunderstood the definition/usage of EM_PERF_STATE_INEFFICIENT.

Well, EM_PERF_STATE_INEFFICIENT is only looked at in CPU energy
models, so setting it in a non-CPU one is redundant.
Re: [PATCH v2] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Yaxiong Tian 3 weeks, 3 days ago
在 2025/4/14 16:08, Lukasz Luba 写道:
> Hi Yaxiong,
> 
> On 4/11/25 02:28, Yaxiong Tian wrote:
>> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>
>> When the device is of a non-CPU type, table[i].performance won't be
>> initialized in the previous em_init_performance(), resulting in division
>> by zero when calculating costs in em_compute_costs().
>>
>> Since the 'cost' algorithm is only used for EAS energy efficiency
>> calculations and is currently not utilized by other device drivers, we
>> should add the _is_cpu_device(dev) check to prevent this division-by-zero
>> issue.
>>
>> Fixes: <1b600da51073> ("PM: EM: Optimize em_cpu_energy() and remove 
>> division")
>> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
>> ---
>>   kernel/power/energy_model.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index d9b7e2b38c7a..d1fa7e8787b5 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -244,7 +244,7 @@ static int em_compute_costs(struct device *dev, 
>> struct em_perf_state *table,
>>                       cost, ret);
>>                   return -EINVAL;
>>               }
>> -        } else {
>> +        } else if (_is_cpu_device(dev)) {
>>               /* increase resolution of 'cost' precision */
>>               power_res = table[i].power * 10;
>>               cost = power_res / table[i].performance;
> 
> 
> As the test robot pointed out, please set the 'cost' to 0
> where it's declared.
> 
> The rest should be fine.
> 
> Regards,
> Lukasz

Okay.
I wasn’t sure whether the new patch should reuse the current Message-ID
or create a new one, so I checked with ChatGPT and kept the original
Message-ID for v2. If a new Message-ID is required, I can resend the
email.

Re: [PATCH v2] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Geert Uytterhoeven 3 weeks ago
Hi Yaxiong,

On Mon, 14 Apr 2025 at 11:14, Yaxiong Tian <iambestgod@qq.com> wrote:
> I wasn’t sure whether the new patch should reuse the current Message-ID
> or create a new one, so I checked with ChatGPT and kept the original
> Message-ID for v2. If a new Message-ID is required, I can resend the
> email.

Different (revisions of) patches must have different Message-IDs,
else they can no longer be distinguished by our tooling.
Fortunately it seems like you didn't succeed (which is good ;-),
as all four revisions do have different Message-IDs:

https://lore.kernel.org/all/tencent_8478BF8F2549630842D323E7394CB6F49D08@qq.com/
https://lore.kernel.org/all/tencent_EE27C7D1D6BDB3EE57A2C467CC59A866C405@qq.com/
https://lore.kernel.org/all/tencent_6D2374392DB66C9D23BF6E2546638A42EC08@qq.com/
https://lore.kernel.org/all/tencent_2256A7C02F7849F1D89390E488704E826D06@qq.com/

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
Re: [PATCH v2] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Yaxiong Tian 3 weeks ago
在 2025/4/17 15:20, Geert Uytterhoeven 写道:
> Hi Yaxiong,
> 
> On Mon, 14 Apr 2025 at 11:14, Yaxiong Tian <iambestgod@qq.com> wrote:
>> I wasn’t sure whether the new patch should reuse the current Message-ID
>> or create a new one, so I checked with ChatGPT and kept the original
>> Message-ID for v2. If a new Message-ID is required, I can resend the
>> email.
> 
> Different (revisions of) patches must have different Message-IDs,
> else they can no longer be distinguished by our tooling.
> Fortunately it seems like you didn't succeed (which is good ;-),
> as all four revisions do have different Message-IDs:
> 
> https://lore.kernel.org/all/tencent_8478BF8F2549630842D323E7394CB6F49D08@qq.com/
> https://lore.kernel.org/all/tencent_EE27C7D1D6BDB3EE57A2C467CC59A866C405@qq.com/
> https://lore.kernel.org/all/tencent_6D2374392DB66C9D23BF6E2546638A42EC08@qq.com/
> https://lore.kernel.org/all/tencent_2256A7C02F7849F1D89390E488704E826D06@qq.com/
> 
> Gr{oetje,eeting}s,
> 
>                          Geert
> 
Thank you for your explanation. My previous description might have been
a bit unclear. Actually, what I meant was whether it should be --in-
reply-to=original Message-ID. I think if it's a patch, replying to the
  original Message-ID would be better.

Re: [PATCH v2] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by kernel test robot 3 weeks, 6 days ago
Hi Yaxiong,

kernel test robot noticed the following build warnings:

[auto build test WARNING on amd-pstate/linux-next]
[also build test WARNING on amd-pstate/bleeding-edge linus/master v6.15-rc1 next-20250411]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Yaxiong-Tian/PM-EM-Fix-potential-division-by-zero-error-in-em_compute_costs/20250411-093452
base:   https://git.kernel.org/pub/scm/linux/kernel/git/superm1/linux.git linux-next
patch link:    https://lore.kernel.org/r/tencent_EE27C7D1D6BDB3EE57A2C467CC59A866C405%40qq.com
patch subject: [PATCH v2] PM: EM: Fix potential division-by-zero error in em_compute_costs()
config: i386-buildonly-randconfig-003-20250412 (https://download.01.org/0day-ci/archive/20250412/202504120921.Wqmipx6e-lkp@intel.com/config)
compiler: clang version 20.1.2 (https://github.com/llvm/llvm-project 58df0ef89dd64126512e4ee27b4ac3fd8ddf6247)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250412/202504120921.Wqmipx6e-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202504120921.Wqmipx6e-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/power/energy_model.c:247:14: warning: variable 'cost' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
     247 |                 } else if (_is_cpu_device(dev)) {
         |                            ^~~~~~~~~~~~~~~~~~~
   kernel/power/energy_model.c:253:19: note: uninitialized use occurs here
     253 |                 table[i].cost = cost;
         |                                 ^~~~
   kernel/power/energy_model.c:247:10: note: remove the 'if' if its condition is always true
     247 |                 } else if (_is_cpu_device(dev)) {
         |                        ^~~~~~~~~~~~~~~~~~~~~~~~
   kernel/power/energy_model.c:238:32: note: initialize the variable 'cost' to silence this warning
     238 |                 unsigned long power_res, cost;
         |                                              ^
         |                                               = 0
   1 warning generated.


vim +247 kernel/power/energy_model.c

   228	
   229	static int em_compute_costs(struct device *dev, struct em_perf_state *table,
   230				    const struct em_data_callback *cb, int nr_states,
   231				    unsigned long flags)
   232	{
   233		unsigned long prev_cost = ULONG_MAX;
   234		int i, ret;
   235	
   236		/* Compute the cost of each performance state. */
   237		for (i = nr_states - 1; i >= 0; i--) {
   238			unsigned long power_res, cost;
   239	
   240			if ((flags & EM_PERF_DOMAIN_ARTIFICIAL) && cb->get_cost) {
   241				ret = cb->get_cost(dev, table[i].frequency, &cost);
   242				if (ret || !cost || cost > EM_MAX_POWER) {
   243					dev_err(dev, "EM: invalid cost %lu %d\n",
   244						cost, ret);
   245					return -EINVAL;
   246				}
 > 247			} else if (_is_cpu_device(dev)) {
   248				/* increase resolution of 'cost' precision */
   249				power_res = table[i].power * 10;
   250				cost = power_res / table[i].performance;
   251			}
   252	
   253			table[i].cost = cost;
   254	
   255			if (table[i].cost >= prev_cost) {
   256				table[i].flags = EM_PERF_STATE_INEFFICIENT;
   257				dev_dbg(dev, "EM: OPP:%lu is inefficient\n",
   258					table[i].frequency);
   259			} else {
   260				prev_cost = table[i].cost;
   261			}
   262		}
   263	
   264		return 0;
   265	}
   266	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
[PATCH v4] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Yaxiong Tian 3 weeks, 1 day ago
From: Yaxiong Tian <tianyaxiong@kylinos.cn>

When the device is of a non-CPU type, table[i].performance won't be
initialized in the previous em_init_performance(), resulting in division
by zero when calculating costs in em_compute_costs().

Since the 'cost' algorithm is only used for EAS energy efficiency
calculations and is currently not utilized by other device drivers, we
should add the _is_cpu_device(dev) check to prevent this division-by-zero
issue.

Fixes: 1b600da51073 ("PM: EM: Optimize em_cpu_energy() and remove division")
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
---
 kernel/power/energy_model.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index d9b7e2b38c7a..41606247c277 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -233,6 +233,10 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 	unsigned long prev_cost = ULONG_MAX;
 	int i, ret;
 
+	/* This is needed only for CPUs and EAS skip other devices */
+	if (!_is_cpu_device(dev))
+		return 0;
+
 	/* Compute the cost of each performance state. */
 	for (i = nr_states - 1; i >= 0; i--) {
 		unsigned long power_res, cost;
-- 
2.25.1
Re: [PATCH v4] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Lukasz Luba 3 weeks ago
On 4/17/25 02:07, Yaxiong Tian wrote:
> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
> 
> When the device is of a non-CPU type, table[i].performance won't be
> initialized in the previous em_init_performance(), resulting in division
> by zero when calculating costs in em_compute_costs().
> 
> Since the 'cost' algorithm is only used for EAS energy efficiency
> calculations and is currently not utilized by other device drivers, we
> should add the _is_cpu_device(dev) check to prevent this division-by-zero
> issue.
> 
> Fixes: 1b600da51073 ("PM: EM: Optimize em_cpu_energy() and remove division")
> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
> ---
>   kernel/power/energy_model.c | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index d9b7e2b38c7a..41606247c277 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -233,6 +233,10 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>   	unsigned long prev_cost = ULONG_MAX;
>   	int i, ret;
>   
> +	/* This is needed only for CPUs and EAS skip other devices */
> +	if (!_is_cpu_device(dev))
> +		return 0;
> +
>   	/* Compute the cost of each performance state. */
>   	for (i = nr_states - 1; i >= 0; i--) {
>   		unsigned long power_res, cost;


Please stop for a while. I have to check what happened that you
faced the issue in the first place. I have been testing the GPU
EMs and there was no issues...

Let me debug that today.
Re: [PATCH v4] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Yaxiong Tian 3 weeks ago
在 2025/4/17 13:57, Lukasz Luba 写道:
> 
> 
> On 4/17/25 02:07, Yaxiong Tian wrote:
>> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>
>> When the device is of a non-CPU type, table[i].performance won't be
>> initialized in the previous em_init_performance(), resulting in division
>> by zero when calculating costs in em_compute_costs().
>>
>> Since the 'cost' algorithm is only used for EAS energy efficiency
>> calculations and is currently not utilized by other device drivers, we
>> should add the _is_cpu_device(dev) check to prevent this division-by-zero
>> issue.
>>
>> Fixes: 1b600da51073 ("PM: EM: Optimize em_cpu_energy() and remove 
>> division")
>> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
>> ---
>>   kernel/power/energy_model.c | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index d9b7e2b38c7a..41606247c277 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -233,6 +233,10 @@ static int em_compute_costs(struct device *dev, 
>> struct em_perf_state *table,
>>       unsigned long prev_cost = ULONG_MAX;
>>       int i, ret;
>> +    /* This is needed only for CPUs and EAS skip other devices */
>> +    if (!_is_cpu_device(dev))
>> +        return 0;
>> +
>>       /* Compute the cost of each performance state. */
>>       for (i = nr_states - 1; i >= 0; i--) {
>>           unsigned long power_res, cost;
> 
> 
> Please stop for a while. I have to check what happened that you
> faced the issue in the first place. I have been testing the GPU
> EMs and there was no issues...
> 
> Let me debug that today.

Of course. Since I don't have actual hardware, I can only logically
deduce that this issue might exist.

Re: [PATCH v4] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Lukasz Luba 3 weeks ago
On 4/17/25 08:43, Yaxiong Tian wrote:
> 
> 
> 在 2025/4/17 13:57, Lukasz Luba 写道:
>>
>>
>> On 4/17/25 02:07, Yaxiong Tian wrote:
>>> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>>
>>> When the device is of a non-CPU type, table[i].performance won't be
>>> initialized in the previous em_init_performance(), resulting in division
>>> by zero when calculating costs in em_compute_costs().
>>>
>>> Since the 'cost' algorithm is only used for EAS energy efficiency
>>> calculations and is currently not utilized by other device drivers, we
>>> should add the _is_cpu_device(dev) check to prevent this 
>>> division-by-zero
>>> issue.
>>>
>>> Fixes: 1b600da51073 ("PM: EM: Optimize em_cpu_energy() and remove 
>>> division")
>>> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>> ---
>>>   kernel/power/energy_model.c | 4 ++++
>>>   1 file changed, 4 insertions(+)
>>>
>>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>>> index d9b7e2b38c7a..41606247c277 100644
>>> --- a/kernel/power/energy_model.c
>>> +++ b/kernel/power/energy_model.c
>>> @@ -233,6 +233,10 @@ static int em_compute_costs(struct device *dev, 
>>> struct em_perf_state *table,
>>>       unsigned long prev_cost = ULONG_MAX;
>>>       int i, ret;
>>> +    /* This is needed only for CPUs and EAS skip other devices */
>>> +    if (!_is_cpu_device(dev))
>>> +        return 0;
>>> +
>>>       /* Compute the cost of each performance state. */
>>>       for (i = nr_states - 1; i >= 0; i--) {
>>>           unsigned long power_res, cost;
>>
>>
>> Please stop for a while. I have to check what happened that you
>> faced the issue in the first place. I have been testing the GPU
>> EMs and there was no issues...
>>
>> Let me debug that today.
> 
> Of course. Since I don't have actual hardware, I can only logically
> deduce that this issue might exist.
> 
> 

I have run with the GPU EM registered in the boot:

-------------------------------------------------------
[    2.753333] panfrost ff9a0000.gpu: EM: created perf domain
[    2.759863] panfrost ff9a0000.gpu: mali-t860 id 0x860 major 0x2 minor 
0x0 status 0x0
[    2.768530] panfrost ff9a0000.gpu: features: 00000000,00000407, 
issues: 00000000,24040400
[    2.777678] panfrost ff9a0000.gpu: Features: L2:0x07120206 
Shader:0x00000000 Tiler:0x00000809 Mem:0x1 MMU:0x00002830 AS:0xff JS:0x7
[    2.780746] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 
150000000Hz, actual 148500000HZ div = 0)
[    2.790905] panfrost ff9a0000.gpu: shader_present=0xf l2_present=0x1

root@arm:~# cat /sys/kernel/debug/energy_model/ff9a0000.gpu/flags 

0x1
root@arm:~# grep . /sys/kernel/debug/energy_model/ff9a0000.gpu/ps*/*
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/cost:0
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/frequency:200000
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/inefficient:1
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/performance:0
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/power:404250
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/cost:0
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/frequency:300000
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/inefficient:1
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/performance:0
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/power:606375
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/cost:0
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/frequency:400000
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/inefficient:1
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/performance:0
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/power:808500
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/cost:0
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/frequency:600000
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/inefficient:0
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/performance:0
/sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/power:1505790

--------------------------------------------------------

The EM for the GPU is not modified during the boot like the CPUs'
EM are, thus this code is not triggered. Although, the API is
open and in theory the GPU EM can be modified at runtime
as well and it will reach that em_compute_costs() issue
with 'performance' field having value 0.

So this v4 patch would be needed in this case.

Please re-send this v4 patch as a completely new message.

Thanks for looking at that code path and the fix for potential
issue.

You can also add my:

Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>

Regrds,
Lukasz

Re: [PATCH v4] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Yaxiong Tian 3 weeks ago
在 2025/4/17 21:27, Lukasz Luba 写道:
> 
> 
> On 4/17/25 08:43, Yaxiong Tian wrote:
>>
>>
>> 在 2025/4/17 13:57, Lukasz Luba 写道:
>>>
>>>
>>> On 4/17/25 02:07, Yaxiong Tian wrote:
>>>> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>>>
>>>> When the device is of a non-CPU type, table[i].performance won't be
>>>> initialized in the previous em_init_performance(), resulting in 
>>>> division
>>>> by zero when calculating costs in em_compute_costs().
>>>>
>>>> Since the 'cost' algorithm is only used for EAS energy efficiency
>>>> calculations and is currently not utilized by other device drivers, we
>>>> should add the _is_cpu_device(dev) check to prevent this 
>>>> division-by-zero
>>>> issue.
>>>>
>>>> Fixes: 1b600da51073 ("PM: EM: Optimize em_cpu_energy() and remove 
>>>> division")
>>>> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>>> ---
>>>>   kernel/power/energy_model.c | 4 ++++
>>>>   1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>>>> index d9b7e2b38c7a..41606247c277 100644
>>>> --- a/kernel/power/energy_model.c
>>>> +++ b/kernel/power/energy_model.c
>>>> @@ -233,6 +233,10 @@ static int em_compute_costs(struct device *dev, 
>>>> struct em_perf_state *table,
>>>>       unsigned long prev_cost = ULONG_MAX;
>>>>       int i, ret;
>>>> +    /* This is needed only for CPUs and EAS skip other devices */
>>>> +    if (!_is_cpu_device(dev))
>>>> +        return 0;
>>>> +
>>>>       /* Compute the cost of each performance state. */
>>>>       for (i = nr_states - 1; i >= 0; i--) {
>>>>           unsigned long power_res, cost;
>>>
>>>
>>> Please stop for a while. I have to check what happened that you
>>> faced the issue in the first place. I have been testing the GPU
>>> EMs and there was no issues...
>>>
>>> Let me debug that today.
>>
>> Of course. Since I don't have actual hardware, I can only logically
>> deduce that this issue might exist.
>>
>>
> 
> I have run with the GPU EM registered in the boot:
> 
> -------------------------------------------------------
> [    2.753333] panfrost ff9a0000.gpu: EM: created perf domain
> [    2.759863] panfrost ff9a0000.gpu: mali-t860 id 0x860 major 0x2 minor 
> 0x0 status 0x0
> [    2.768530] panfrost ff9a0000.gpu: features: 00000000,00000407, 
> issues: 00000000,24040400
> [    2.777678] panfrost ff9a0000.gpu: Features: L2:0x07120206 
> Shader:0x00000000 Tiler:0x00000809 Mem:0x1 MMU:0x00002830 AS:0xff JS:0x7
> [    2.780746] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 
> 150000000Hz, actual 148500000HZ div = 0)
> [    2.790905] panfrost ff9a0000.gpu: shader_present=0xf l2_present=0x1
> 
> root@arm:~# cat /sys/kernel/debug/energy_model/ff9a0000.gpu/flags
> 0x1
> root@arm:~# grep . /sys/kernel/debug/energy_model/ff9a0000.gpu/ps*/*
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/cost:0
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/frequency:200000
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/inefficient:1
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/performance:0
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/power:404250
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/cost:0
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/frequency:300000
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/inefficient:1
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/performance:0
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/power:606375
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/cost:0
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/frequency:400000
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/inefficient:1
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/performance:0
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/power:808500
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/cost:0
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/frequency:600000
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/inefficient:0
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/performance:0
> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/power:1505790
> 
> --------------------------------------------------------
> 
> The EM for the GPU is not modified during the boot like the CPUs'
> EM are, thus this code is not triggered. Although, the API is
> open and in theory the GPU EM can be modified at runtime
> as well and it will reach that em_compute_costs() issue
> with 'performance' field having value 0.
> 
> So this v4 patch would be needed in this case.
> 
> Please re-send this v4 patch as a completely new message.
> 
> Thanks for looking at that code path and the fix for potential
> issue.
> 
> You can also add my:
> 
> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
> 
> Regrds,
> Lukasz

Got it - patch resent with Reviewed-by.

https://lore.kernel.org/all/tencent_7F99ED4767C1AF7889D0D8AD50F34859CE06@qq.com/

Re: [PATCH v4] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Yaxiong Tian 1 week, 4 days ago
在 2025/4/18 09:26, Yaxiong Tian 写道:

> 在 2025/4/17 21:27, Lukasz Luba 写道:
>> I have run with the GPU EM registered in the boot:
>>
>> -------------------------------------------------------
>> [    2.753333] panfrost ff9a0000.gpu: EM: created perf domain
>> [    2.759863] panfrost ff9a0000.gpu: mali-t860 id 0x860 major 0x2 
>> minor 0x0 status 0x0
>> [    2.768530] panfrost ff9a0000.gpu: features: 00000000,00000407, 
>> issues: 00000000,24040400
>> [    2.777678] panfrost ff9a0000.gpu: Features: L2:0x07120206 
>> Shader:0x00000000 Tiler:0x00000809 Mem:0x1 MMU:0x00002830 AS:0xff JS:0x7
>> [    2.780746] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot 
>> req 150000000Hz, actual 148500000HZ div = 0)
>> [    2.790905] panfrost ff9a0000.gpu: shader_present=0xf l2_present=0x1
>>
>> root@arm:~# cat /sys/kernel/debug/energy_model/ff9a0000.gpu/flags
>> 0x1
>> root@arm:~# grep . /sys/kernel/debug/energy_model/ff9a0000.gpu/ps*/*
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/cost:0
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/frequency:200000
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/inefficient:1
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/performance:0
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:200000/power:404250
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/cost:0
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/frequency:300000
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/inefficient:1
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/performance:0
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:300000/power:606375
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/cost:0
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/frequency:400000
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/inefficient:1
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/performance:0
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:400000/power:808500
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/cost:0
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/frequency:600000
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/inefficient:0
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/performance:0
>> /sys/kernel/debug/energy_model/ff9a0000.gpu/ps:600000/power:1505790
>>
>> --------------------------------------------------------
>>
>> The EM for the GPU is not modified during the boot like the CPUs'
>> EM are, thus this code is not triggered. Although, the API is
>> open and in theory the GPU EM can be modified at runtime
>> as well and it will reach that em_compute_costs() issue
>> with 'performance' field having value 0.
>>
>> So this v4 patch would be needed in this case.
>>
>> Please re-send this v4 patch as a completely new message.
>>
>> Thanks for looking at that code path and the fix for potential
>> issue.
>>
>> You can also add my:
>>
>> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
>>
>> Regrds,
>> Lukasz
>
> Got it - patch resent with Reviewed-by.
>
> https://lore.kernel.org/all/tencent_7F99ED4767C1AF7889D0D8AD50F34859CE06@qq.com/ 
>

Hi Rafael:

This bug appears to be stalled. Are there any alternative fixes we could 
consider?
[PATCH v3] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Yaxiong Tian 3 weeks, 3 days ago
From: Yaxiong Tian <tianyaxiong@kylinos.cn>

When the device is of a non-CPU type, table[i].performance won't be
initialized in the previous em_init_performance(), resulting in division
by zero when calculating costs in em_compute_costs().

Since the 'cost' algorithm is only used for EAS energy efficiency
calculations and is currently not utilized by other device drivers, we
should add the _is_cpu_device(dev) check to prevent this division-by-zero
issue.

Fixes: <1b600da51073> ("PM: EM: Optimize em_cpu_energy() and remove division")
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
---
 kernel/power/energy_model.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index d9b7e2b38c7a..fc972cc1fc12 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -235,7 +235,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 
 	/* Compute the cost of each performance state. */
 	for (i = nr_states - 1; i >= 0; i--) {
-		unsigned long power_res, cost;
+		unsigned long power_res, cost = 0;
 
 		if ((flags & EM_PERF_DOMAIN_ARTIFICIAL) && cb->get_cost) {
 			ret = cb->get_cost(dev, table[i].frequency, &cost);
@@ -244,7 +244,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 					cost, ret);
 				return -EINVAL;
 			}
-		} else {
+		} else if (_is_cpu_device(dev)) {
 			/* increase resolution of 'cost' precision */
 			power_res = table[i].power * 10;
 			cost = power_res / table[i].performance;
-- 
2.25.1
Re: [PATCH v3] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Rafael J. Wysocki 3 weeks, 2 days ago
On Mon, Apr 14, 2025 at 11:09 AM Yaxiong Tian <iambestgod@qq.com> wrote:
>
> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
>
> When the device is of a non-CPU type, table[i].performance won't be
> initialized in the previous em_init_performance(), resulting in division
> by zero when calculating costs in em_compute_costs().
>
> Since the 'cost' algorithm is only used for EAS energy efficiency
> calculations and is currently not utilized by other device drivers, we
> should add the _is_cpu_device(dev) check to prevent this division-by-zero
> issue.
>
> Fixes: <1b600da51073> ("PM: EM: Optimize em_cpu_energy() and remove division")

Please look at the Fixes: tags in the kernel git history.  They don't
look like the one above.

> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
> ---
>  kernel/power/energy_model.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index d9b7e2b38c7a..fc972cc1fc12 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -235,7 +235,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>
>         /* Compute the cost of each performance state. */
>         for (i = nr_states - 1; i >= 0; i--) {
> -               unsigned long power_res, cost;
> +               unsigned long power_res, cost = 0;
>
>                 if ((flags & EM_PERF_DOMAIN_ARTIFICIAL) && cb->get_cost) {
>                         ret = cb->get_cost(dev, table[i].frequency, &cost);
> @@ -244,7 +244,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>                                         cost, ret);
>                                 return -EINVAL;
>                         }
> -               } else {
> +               } else if (_is_cpu_device(dev)) {

Can't you just check this upfront at the beginning of the function and
make it bail out if dev is not a CPU device?

>                         /* increase resolution of 'cost' precision */
>                         power_res = table[i].power * 10;
>                         cost = power_res / table[i].performance;
> --
Re: [PATCH v3] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Yaxiong Tian 3 weeks, 1 day ago
在 2025/4/16 01:17, Rafael J. Wysocki 写道:
> On Mon, Apr 14, 2025 at 11:09 AM Yaxiong Tian <iambestgod@qq.com> wrote:
>>
>> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>
>> When the device is of a non-CPU type, table[i].performance won't be
>> initialized in the previous em_init_performance(), resulting in division
>> by zero when calculating costs in em_compute_costs().
>>
>> Since the 'cost' algorithm is only used for EAS energy efficiency
>> calculations and is currently not utilized by other device drivers, we
>> should add the _is_cpu_device(dev) check to prevent this division-by-zero
>> issue.
>>
>> Fixes: <1b600da51073> ("PM: EM: Optimize em_cpu_energy() and remove division")
> 
> Please look at the Fixes: tags in the kernel git history.  They don't
> look like the one above.
> 
Yes, there's an extra '<>' here.

>> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
>> ---
>>   kernel/power/energy_model.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index d9b7e2b38c7a..fc972cc1fc12 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -235,7 +235,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>>
>>          /* Compute the cost of each performance state. */
>>          for (i = nr_states - 1; i >= 0; i--) {
>> -               unsigned long power_res, cost;
>> +               unsigned long power_res, cost = 0;
>>
>>                  if ((flags & EM_PERF_DOMAIN_ARTIFICIAL) && cb->get_cost) {
>>                          ret = cb->get_cost(dev, table[i].frequency, &cost);
>> @@ -244,7 +244,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>>                                          cost, ret);
>>                                  return -EINVAL;
>>                          }
>> -               } else {
>> +               } else if (_is_cpu_device(dev)) {
> 
> Can't you just check this upfront at the beginning of the function and
> make it bail out if dev is not a CPU device?
> 
Sure, But the current implementation applies em_compute_costs() to both 
non-CPU devices and CPU devices. After carefully reviewing the latest code,
I've found this issue has expanded in scope.

There are currently three call paths for invoking em_compute_costs():

1) Registering performance domains (for both non-CPU and CPU devices)
em_dev_register_perf_domain() → em_create_pd() →
em_create_perf_table() → em_compute_costs()

2)EM update paths (CPU devices only)

Periodic 1000ms update check via em_update_work work item:
em_check_capacity_update() → em_adjust_new_capacity() → 
em_recalc_and_update() → em_compute_costs()

Exynos-chip initialization:
em_dev_update_chip_binning() → em_recalc_and_update() → em_compute_costs()

3) Device cost computation (non-CPU devices only - currently unused)
em_dev_compute_costs() → em_compute_costs()

Note: In em_dev_compute_costs(), when calling em_compute_costs(),
neither the callback (cb) nor flags are set.In fact, it either does
nothing at all or performs incorrect operations.

Therefore, should we mandate that non-CPU devices must provide a
get_cost callback?

So Should we add a check at the beginning of the em_compute_costs() to:

	if (!_is_cpu_device(dev) && !cb->get_cost) {
		dev_dbg(dev, "EM: No get_cost provided, cost unset.\n");
		return 0;
	}
And Modify em_dev_compute_costs() to require callers to provide the cb
callback function,Also need to update its corresponding comments.


>>                          /* increase resolution of 'cost' precision */
>>                          power_res = table[i].power * 10;
>>                          cost = power_res / table[i].performance;
>> --

Re: [PATCH v3] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Rafael J. Wysocki 3 weeks, 1 day ago
On Wed, Apr 16, 2025 at 4:57 AM Yaxiong Tian <iambestgod@qq.com> wrote:
>
> 在 2025/4/16 01:17, Rafael J. Wysocki 写道:
> > On Mon, Apr 14, 2025 at 11:09 AM Yaxiong Tian <iambestgod@qq.com> wrote:
> >>
> >> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
> >>
> >> When the device is of a non-CPU type, table[i].performance won't be
> >> initialized in the previous em_init_performance(), resulting in division
> >> by zero when calculating costs in em_compute_costs().
> >>
> >> Since the 'cost' algorithm is only used for EAS energy efficiency
> >> calculations and is currently not utilized by other device drivers, we
> >> should add the _is_cpu_device(dev) check to prevent this division-by-zero
> >> issue.
> >>
> >> Fixes: <1b600da51073> ("PM: EM: Optimize em_cpu_energy() and remove division")
> >
> > Please look at the Fixes: tags in the kernel git history.  They don't
> > look like the one above.
> >
> Yes, there's an extra '<>' here.
>
> >> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
> >> ---
> >>   kernel/power/energy_model.c | 4 ++--
> >>   1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> >> index d9b7e2b38c7a..fc972cc1fc12 100644
> >> --- a/kernel/power/energy_model.c
> >> +++ b/kernel/power/energy_model.c
> >> @@ -235,7 +235,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
> >>
> >>          /* Compute the cost of each performance state. */
> >>          for (i = nr_states - 1; i >= 0; i--) {
> >> -               unsigned long power_res, cost;
> >> +               unsigned long power_res, cost = 0;
> >>
> >>                  if ((flags & EM_PERF_DOMAIN_ARTIFICIAL) && cb->get_cost) {
> >>                          ret = cb->get_cost(dev, table[i].frequency, &cost);
> >> @@ -244,7 +244,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
> >>                                          cost, ret);
> >>                                  return -EINVAL;
> >>                          }
> >> -               } else {
> >> +               } else if (_is_cpu_device(dev)) {
> >
> > Can't you just check this upfront at the beginning of the function and
> > make it bail out if dev is not a CPU device?
> >
> Sure, But the current implementation applies em_compute_costs() to both
> non-CPU devices and CPU devices.

Maybe it shouldn't do that for non-CPU ones?

> After carefully reviewing the latest code,
> I've found this issue has expanded in scope.
>
> There are currently three call paths for invoking em_compute_costs():
>
> 1) Registering performance domains (for both non-CPU and CPU devices)
> em_dev_register_perf_domain() → em_create_pd() →
> em_create_perf_table() → em_compute_costs()
>
> 2)EM update paths (CPU devices only)
>
> Periodic 1000ms update check via em_update_work work item:
> em_check_capacity_update() → em_adjust_new_capacity() →
> em_recalc_and_update() → em_compute_costs()
>
> Exynos-chip initialization:
> em_dev_update_chip_binning() → em_recalc_and_update() → em_compute_costs()
>
> 3) Device cost computation (non-CPU devices only - currently unused)
> em_dev_compute_costs() → em_compute_costs()

So because this one is unused and AFAICS the cost values are never
used for non-CPU devices, it's better to just avoid computing them at
all.

> Note: In em_dev_compute_costs(), when calling em_compute_costs(),
> neither the callback (cb) nor flags are set.In fact, it either does
> nothing at all or performs incorrect operations.
>
> Therefore, should we mandate that non-CPU devices must provide a
> get_cost callback?

Why would that be an improvement?

> So Should we add a check at the beginning of the em_compute_costs() to:
>
>         if (!_is_cpu_device(dev) && !cb->get_cost) {
>                 dev_dbg(dev, "EM: No get_cost provided, cost unset.\n");
>                 return 0;
>         }
> And Modify em_dev_compute_costs() to require callers to provide the cb
> callback function,Also need to update its corresponding comments.
>
>
> >>                          /* increase resolution of 'cost' precision */
> >>                          power_res = table[i].power * 10;
> >>                          cost = power_res / table[i].performance;
> >> --

I think until there is a user of em_dev_compute_costs() this is all
moot and hard to figure out.

I would drop em_dev_compute_costs() altogether for now and put a
_is_cpu_device(dev) upfront check into em_compute_costs().
Re: [PATCH v3] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Yaxiong Tian 3 weeks, 1 day ago
在 2025/4/16 19:58, Rafael J. Wysocki 写道:
> On Wed, Apr 16, 2025 at 4:57 AM Yaxiong Tian <iambestgod@qq.com> wrote:
>>
>> 在 2025/4/16 01:17, Rafael J. Wysocki 写道:
>>> On Mon, Apr 14, 2025 at 11:09 AM Yaxiong Tian <iambestgod@qq.com> wrote:
>>>>
>>>> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>>>
>>>> When the device is of a non-CPU type, table[i].performance won't be
>>>> initialized in the previous em_init_performance(), resulting in division
>>>> by zero when calculating costs in em_compute_costs().
>>>>
>>>> Since the 'cost' algorithm is only used for EAS energy efficiency
>>>> calculations and is currently not utilized by other device drivers, we
>>>> should add the _is_cpu_device(dev) check to prevent this division-by-zero
>>>> issue.
>>>>
>>>> Fixes: <1b600da51073> ("PM: EM: Optimize em_cpu_energy() and remove division")
>>>
>>> Please look at the Fixes: tags in the kernel git history.  They don't
>>> look like the one above.
>>>
>> Yes, there's an extra '<>' here.
>>
>>>> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>>> ---
>>>>    kernel/power/energy_model.c | 4 ++--
>>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>>>> index d9b7e2b38c7a..fc972cc1fc12 100644
>>>> --- a/kernel/power/energy_model.c
>>>> +++ b/kernel/power/energy_model.c
>>>> @@ -235,7 +235,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>>>>
>>>>           /* Compute the cost of each performance state. */
>>>>           for (i = nr_states - 1; i >= 0; i--) {
>>>> -               unsigned long power_res, cost;
>>>> +               unsigned long power_res, cost = 0;
>>>>
>>>>                   if ((flags & EM_PERF_DOMAIN_ARTIFICIAL) && cb->get_cost) {
>>>>                           ret = cb->get_cost(dev, table[i].frequency, &cost);
>>>> @@ -244,7 +244,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>>>>                                           cost, ret);
>>>>                                   return -EINVAL;
>>>>                           }
>>>> -               } else {
>>>> +               } else if (_is_cpu_device(dev)) {
>>>
>>> Can't you just check this upfront at the beginning of the function and
>>> make it bail out if dev is not a CPU device?
>>>
>> Sure, But the current implementation applies em_compute_costs() to both
>> non-CPU devices and CPU devices.
> 
> Maybe it shouldn't do that for non-CPU ones?
> 
>> After carefully reviewing the latest code,
>> I've found this issue has expanded in scope.
>>
>> There are currently three call paths for invoking em_compute_costs():
>>
>> 1) Registering performance domains (for both non-CPU and CPU devices)
>> em_dev_register_perf_domain() → em_create_pd() →
>> em_create_perf_table() → em_compute_costs()
>>
>> 2)EM update paths (CPU devices only)
>>
>> Periodic 1000ms update check via em_update_work work item:
>> em_check_capacity_update() → em_adjust_new_capacity() →
>> em_recalc_and_update() → em_compute_costs()
>>
>> Exynos-chip initialization:
>> em_dev_update_chip_binning() → em_recalc_and_update() → em_compute_costs()
>>
>> 3) Device cost computation (non-CPU devices only - currently unused)
>> em_dev_compute_costs() → em_compute_costs()
> 
> So because this one is unused and AFAICS the cost values are never
> used for non-CPU devices, it's better to just avoid computing them at
> all.
> 
>> Note: In em_dev_compute_costs(), when calling em_compute_costs(),
>> neither the callback (cb) nor flags are set.In fact, it either does
>> nothing at all or performs incorrect operations.
>>
>> Therefore, should we mandate that non-CPU devices must provide a
>> get_cost callback?
> 
> Why would that be an improvement?
> 
>> So Should we add a check at the beginning of the em_compute_costs() to:
>>
>>          if (!_is_cpu_device(dev) && !cb->get_cost) {
>>                  dev_dbg(dev, "EM: No get_cost provided, cost unset.\n");
>>                  return 0;
>>          }
>> And Modify em_dev_compute_costs() to require callers to provide the cb
>> callback function,Also need to update its corresponding comments.
>>
>>
>>>>                           /* increase resolution of 'cost' precision */
>>>>                           power_res = table[i].power * 10;
>>>>                           cost = power_res / table[i].performance;
>>>> --
> 
> I think until there is a user of em_dev_compute_costs() this is all
> moot and hard to figure out.
> 
> I would drop em_dev_compute_costs() altogether for now and put a
> _is_cpu_device(dev) upfront check into em_compute_costs().

Yes, I agree with your point. Currently no non-CPU devices are using
'cost'. The best approach would be to just add the _is_cpu_device check. 
I'll update it in V4.

By the way, em_dev_compute_costs should only apply to CPU devices. I was
mistaken earlier—it’s really hard to tell just from the function name.

Re: [PATCH v3] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Lukasz Luba 3 weeks, 1 day ago
On 4/16/25 12:58, Rafael J. Wysocki wrote:
> On Wed, Apr 16, 2025 at 4:57 AM Yaxiong Tian <iambestgod@qq.com> wrote:
>>
>> 在 2025/4/16 01:17, Rafael J. Wysocki 写道:
>>> On Mon, Apr 14, 2025 at 11:09 AM Yaxiong Tian <iambestgod@qq.com> wrote:
>>>>
>>>> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>>>
>>>> When the device is of a non-CPU type, table[i].performance won't be
>>>> initialized in the previous em_init_performance(), resulting in division
>>>> by zero when calculating costs in em_compute_costs().
>>>>
>>>> Since the 'cost' algorithm is only used for EAS energy efficiency
>>>> calculations and is currently not utilized by other device drivers, we
>>>> should add the _is_cpu_device(dev) check to prevent this division-by-zero
>>>> issue.
>>>>
>>>> Fixes: <1b600da51073> ("PM: EM: Optimize em_cpu_energy() and remove division")
>>>
>>> Please look at the Fixes: tags in the kernel git history.  They don't
>>> look like the one above.
>>>
>> Yes, there's an extra '<>' here.
>>
>>>> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>>> ---
>>>>    kernel/power/energy_model.c | 4 ++--
>>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>>>> index d9b7e2b38c7a..fc972cc1fc12 100644
>>>> --- a/kernel/power/energy_model.c
>>>> +++ b/kernel/power/energy_model.c
>>>> @@ -235,7 +235,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>>>>
>>>>           /* Compute the cost of each performance state. */
>>>>           for (i = nr_states - 1; i >= 0; i--) {
>>>> -               unsigned long power_res, cost;
>>>> +               unsigned long power_res, cost = 0;
>>>>
>>>>                   if ((flags & EM_PERF_DOMAIN_ARTIFICIAL) && cb->get_cost) {
>>>>                           ret = cb->get_cost(dev, table[i].frequency, &cost);
>>>> @@ -244,7 +244,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>>>>                                           cost, ret);
>>>>                                   return -EINVAL;
>>>>                           }
>>>> -               } else {
>>>> +               } else if (_is_cpu_device(dev)) {
>>>
>>> Can't you just check this upfront at the beginning of the function and
>>> make it bail out if dev is not a CPU device?
>>>
>> Sure, But the current implementation applies em_compute_costs() to both
>> non-CPU devices and CPU devices.
> 
> Maybe it shouldn't do that for non-CPU ones?

It shouldn't call this cost computation for non-CPU devices.
Let me check that.
Re: [PATCH v3] PM: EM: Fix potential division-by-zero error in em_compute_costs()
Posted by Lukasz Luba 3 weeks, 2 days ago
On 4/14/25 10:04, Yaxiong Tian wrote:
> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
> 
> When the device is of a non-CPU type, table[i].performance won't be
> initialized in the previous em_init_performance(), resulting in division
> by zero when calculating costs in em_compute_costs().
> 
> Since the 'cost' algorithm is only used for EAS energy efficiency
> calculations and is currently not utilized by other device drivers, we
> should add the _is_cpu_device(dev) check to prevent this division-by-zero
> issue.
> 
> Fixes: <1b600da51073> ("PM: EM: Optimize em_cpu_energy() and remove division")
> Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
> ---
>   kernel/power/energy_model.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index d9b7e2b38c7a..fc972cc1fc12 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -235,7 +235,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>   
>   	/* Compute the cost of each performance state. */
>   	for (i = nr_states - 1; i >= 0; i--) {
> -		unsigned long power_res, cost;
> +		unsigned long power_res, cost = 0;
>   
>   		if ((flags & EM_PERF_DOMAIN_ARTIFICIAL) && cb->get_cost) {
>   			ret = cb->get_cost(dev, table[i].frequency, &cost);
> @@ -244,7 +244,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>   					cost, ret);
>   				return -EINVAL;
>   			}
> -		} else {
> +		} else if (_is_cpu_device(dev)) {
>   			/* increase resolution of 'cost' precision */
>   			power_res = table[i].power * 10;
>   			cost = power_res / table[i].performance;

LGTM,

Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>