From: Jie Zhan <zhanjie9@hisilicon.com>
Perf counters could be 0 if the cpu is in a low-power idle state. Just try
it again next time and update the frequency scale when the cpu is active
and perf counters successfully return.
Also, remove the FIE source on an actual failure.
Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
---
drivers/cpufreq/cppc_cpufreq.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index 904006027df2..e95844d3d366 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -78,12 +78,23 @@ static void cppc_scale_freq_workfn(struct kthread_work *work)
struct cppc_cpudata *cpu_data;
unsigned long local_freq_scale;
u64 perf;
+ int ret;
cppc_fi = container_of(work, struct cppc_freq_invariance, work);
cpu_data = cppc_fi->cpu_data;
- if (cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs)) {
+ ret = cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs);
+ /*
+ * Perf counters could be 0 if the cpu is in a low-power idle state.
+ * Just try it again next time.
+ */
+ if (ret == -EFAULT)
+ return;
+
+ if (ret) {
pr_warn("%s: failed to read perf counters\n", __func__);
+ topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC,
+ cpu_data->shared_cpu_map);
return;
}
--
2.33.0
Hi Bowen, Jie On Wed, Jul 30, 2025 at 11:23:12AM +0800, Bowen Yu wrote: > From: Jie Zhan <zhanjie9@hisilicon.com> > > Perf counters could be 0 if the cpu is in a low-power idle state. Just try > it again next time and update the frequency scale when the cpu is active > and perf counters successfully return. > > Also, remove the FIE source on an actual failure. > > Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com> > --- > drivers/cpufreq/cppc_cpufreq.c | 13 ++++++++++++- > 1 file changed, 12 insertions(+), 1 deletion(-) > > diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c > index 904006027df2..e95844d3d366 100644 > --- a/drivers/cpufreq/cppc_cpufreq.c > +++ b/drivers/cpufreq/cppc_cpufreq.c > @@ -78,12 +78,23 @@ static void cppc_scale_freq_workfn(struct kthread_work *work) > struct cppc_cpudata *cpu_data; > unsigned long local_freq_scale; > u64 perf; > + int ret; > > cppc_fi = container_of(work, struct cppc_freq_invariance, work); > cpu_data = cppc_fi->cpu_data; > > - if (cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs)) { > + ret = cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs); > + /* > + * Perf counters could be 0 if the cpu is in a low-power idle state. > + * Just try it again next time. > + */ > + if (ret == -EFAULT) > + return; Which counters are we actually talking about here ? > + > + if (ret) { > pr_warn("%s: failed to read perf counters\n", __func__); > + topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, > + cpu_data->shared_cpu_map); > return; > } And the real error here would be ... ? That makes me wonder why this has been registered as the source of the freq scale in the first place if we are to hit some serious issue. Would you be able to give an example of any? --- BR Beata > > -- > 2.33.0 > >
On 31/07/2025 16:19, Beata Michalska wrote: > Hi Bowen, Jie > On Wed, Jul 30, 2025 at 11:23:12AM +0800, Bowen Yu wrote: >> From: Jie Zhan <zhanjie9@hisilicon.com> >> >> Perf counters could be 0 if the cpu is in a low-power idle state. Just try >> it again next time and update the frequency scale when the cpu is active >> and perf counters successfully return. >> >> Also, remove the FIE source on an actual failure. >> >> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com> >> --- >> drivers/cpufreq/cppc_cpufreq.c | 13 ++++++++++++- >> 1 file changed, 12 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c >> index 904006027df2..e95844d3d366 100644 >> --- a/drivers/cpufreq/cppc_cpufreq.c >> +++ b/drivers/cpufreq/cppc_cpufreq.c >> @@ -78,12 +78,23 @@ static void cppc_scale_freq_workfn(struct kthread_work *work) >> struct cppc_cpudata *cpu_data; >> unsigned long local_freq_scale; >> u64 perf; >> + int ret; >> >> cppc_fi = container_of(work, struct cppc_freq_invariance, work); >> cpu_data = cppc_fi->cpu_data; >> >> - if (cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs)) { >> + ret = cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs); >> + /* >> + * Perf counters could be 0 if the cpu is in a low-power idle state. >> + * Just try it again next time. >> + */ >> + if (ret == -EFAULT) >> + return; > Which counters are we actually talking about here ? Delivered performance counter and reference performance counter. They are actually AMU CPU_CYCLES and CNT_CYCLES event counters. >> + >> + if (ret) { >> pr_warn("%s: failed to read perf counters\n", __func__); >> + topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, >> + cpu_data->shared_cpu_map); >> return; >> } > And the real error here would be ... ? > That makes me wonder why this has been registered as the source of the freq > scale in the first place if we are to hit some serious issue. Would you be able > to give an example of any? If it gets here, that would be -ENODEV or -EIO from cppc_get_perf_ctrs(), which could possibly come from data corruption (no CPC descriptor) or a PCC failure. I can't easily fake an error here, but the above -EFAULT path could happen when it luckily passes the FIE init. Jie > > --- > BR > Beata >> >> -- >> 2.33.0 >> >> >
On Thu, Jul 31, 2025 at 04:52:05PM +0800, Jie Zhan wrote: > > > On 31/07/2025 16:19, Beata Michalska wrote: > > Hi Bowen, Jie > > On Wed, Jul 30, 2025 at 11:23:12AM +0800, Bowen Yu wrote: > >> From: Jie Zhan <zhanjie9@hisilicon.com> > >> > >> Perf counters could be 0 if the cpu is in a low-power idle state. Just try > >> it again next time and update the frequency scale when the cpu is active > >> and perf counters successfully return. > >> > >> Also, remove the FIE source on an actual failure. > >> > >> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com> > >> --- > >> drivers/cpufreq/cppc_cpufreq.c | 13 ++++++++++++- > >> 1 file changed, 12 insertions(+), 1 deletion(-) > >> > >> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c > >> index 904006027df2..e95844d3d366 100644 > >> --- a/drivers/cpufreq/cppc_cpufreq.c > >> +++ b/drivers/cpufreq/cppc_cpufreq.c > >> @@ -78,12 +78,23 @@ static void cppc_scale_freq_workfn(struct kthread_work *work) > >> struct cppc_cpudata *cpu_data; > >> unsigned long local_freq_scale; > >> u64 perf; > >> + int ret; > >> > >> cppc_fi = container_of(work, struct cppc_freq_invariance, work); > >> cpu_data = cppc_fi->cpu_data; > >> > >> - if (cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs)) { > >> + ret = cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs); > >> + /* > >> + * Perf counters could be 0 if the cpu is in a low-power idle state. > >> + * Just try it again next time. > >> + */ > >> + if (ret == -EFAULT) > >> + return; > > Which counters are we actually talking about here ? > > Delivered performance counter and reference performance counter. > They are actually AMU CPU_CYCLES and CNT_CYCLES event counters. That does track then. > > >> + > >> + if (ret) { > >> pr_warn("%s: failed to read perf counters\n", __func__); > >> + topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, > >> + cpu_data->shared_cpu_map); > >> return; > >> } > > And the real error here would be ... ? > > That makes me wonder why this has been registered as the source of the freq > > scale in the first place if we are to hit some serious issue. Would you be able > > to give an example of any? > If it gets here, that would be -ENODEV or -EIO from cppc_get_perf_ctrs(), > which could possibly come from data corruption (no CPC descriptor) or a PCC > failure. > > I can't easily fake an error here, but the above -EFAULT path could > happen when it luckily passes the FIE init. > The change seems reasonable. Though I am wondering if some other errors might be rather transient as well ? Like -EIO ? Note, I'm not an expert here. --- BR Beata > Jie > > > > --- > > BR > > Beata > >> > >> -- > >> 2.33.0 > >> > >> > >
On 31/07/2025 17:42, Beata Michalska wrote: > On Thu, Jul 31, 2025 at 04:52:05PM +0800, Jie Zhan wrote: >> >> >> On 31/07/2025 16:19, Beata Michalska wrote: >>> Hi Bowen, Jie >>> On Wed, Jul 30, 2025 at 11:23:12AM +0800, Bowen Yu wrote: >>>> From: Jie Zhan <zhanjie9@hisilicon.com> >>>> >>>> Perf counters could be 0 if the cpu is in a low-power idle state. Just try >>>> it again next time and update the frequency scale when the cpu is active >>>> and perf counters successfully return. >>>> >>>> Also, remove the FIE source on an actual failure. >>>> >>>> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com> >>>> --- >>>> drivers/cpufreq/cppc_cpufreq.c | 13 ++++++++++++- >>>> 1 file changed, 12 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c >>>> index 904006027df2..e95844d3d366 100644 >>>> --- a/drivers/cpufreq/cppc_cpufreq.c >>>> +++ b/drivers/cpufreq/cppc_cpufreq.c >>>> @@ -78,12 +78,23 @@ static void cppc_scale_freq_workfn(struct kthread_work *work) >>>> struct cppc_cpudata *cpu_data; >>>> unsigned long local_freq_scale; >>>> u64 perf; >>>> + int ret; >>>> >>>> cppc_fi = container_of(work, struct cppc_freq_invariance, work); >>>> cpu_data = cppc_fi->cpu_data; >>>> >>>> - if (cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs)) { >>>> + ret = cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs); >>>> + /* >>>> + * Perf counters could be 0 if the cpu is in a low-power idle state. >>>> + * Just try it again next time. >>>> + */ >>>> + if (ret == -EFAULT) >>>> + return; >>> Which counters are we actually talking about here ? >> >> Delivered performance counter and reference performance counter. >> They are actually AMU CPU_CYCLES and CNT_CYCLES event counters. > That does track then. >> >>>> + >>>> + if (ret) { >>>> pr_warn("%s: failed to read perf counters\n", __func__); >>>> + topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, >>>> + cpu_data->shared_cpu_map); >>>> return; >>>> } >>> And the real error here would be ... ? >>> That makes me wonder why this has been registered as the source of the freq >>> scale in the first place if we are to hit some serious issue. Would you be able >>> to give an example of any? >> If it gets here, that would be -ENODEV or -EIO from cppc_get_perf_ctrs(), >> which could possibly come from data corruption (no CPC descriptor) or a PCC >> failure. >> >> I can't easily fake an error here, but the above -EFAULT path could >> happen when it luckily passes the FIE init. >> > The change seems reasonable. Though I am wondering if some other errors might be > rather transient as well ? Like -EIO ? > Note, I'm not an expert here. The -EIO from PCC contains much more error cases than this.
> Perf counters could be 0 if the cpu is in a low-power idle state. Just try > it again next time and update the frequency scale when the cpu is active > and perf counters successfully return. > > Also, remove the FIE source on an actual failure. How do you think about to add any tags (like “Fixes” and “Cc”) accordingly? https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.16#n145 Regards, Markus
On 31/07/2025 02:38, Markus Elfring wrote: >> Perf counters could be 0 if the cpu is in a low-power idle state. Just try >> it again next time and update the frequency scale when the cpu is active >> and perf counters successfully return. >> >> Also, remove the FIE source on an actual failure. > > How do you think about to add any tags (like “Fixes” and “Cc”) accordingly? > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.16#n145 > > Regards, > Markus Hi Markus, We don't think it's necessarily a bugfix that should be backported to old/stable kernels. This is mainly an extra error handling for cases that may happen on our platform when a CPU is in a deep idle state. Similar patches were merged before. https://lore.kernel.org/linux-pm/20240929033214.1039485-2-zhanjie9@hisilicon.com/ Perhaps the 'Fix' in the patch title is a bit misleading. I'll write a more appropriate patch description in the following version. Thanks, Jie
> We don't think it's necessarily a bugfix that should be backported to > old/stable kernels. This is mainly an extra error handling for cases that > may happen on our platform when a CPU is in a deep idle state. I find mentioned tags relevant also according to proposed completion of error handling. Regards, Markus
+ Prashant/Beata/Ionela On 30-07-25, 11:23, Bowen Yu wrote: > From: Jie Zhan <zhanjie9@hisilicon.com> > > Perf counters could be 0 if the cpu is in a low-power idle state. Just try > it again next time and update the frequency scale when the cpu is active > and perf counters successfully return. > > Also, remove the FIE source on an actual failure. > > Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com> > --- > drivers/cpufreq/cppc_cpufreq.c | 13 ++++++++++++- > 1 file changed, 12 insertions(+), 1 deletion(-) > > diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c > index 904006027df2..e95844d3d366 100644 > --- a/drivers/cpufreq/cppc_cpufreq.c > +++ b/drivers/cpufreq/cppc_cpufreq.c > @@ -78,12 +78,23 @@ static void cppc_scale_freq_workfn(struct kthread_work *work) > struct cppc_cpudata *cpu_data; > unsigned long local_freq_scale; > u64 perf; > + int ret; > > cppc_fi = container_of(work, struct cppc_freq_invariance, work); > cpu_data = cppc_fi->cpu_data; > > - if (cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs)) { > + ret = cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs); > + /* > + * Perf counters could be 0 if the cpu is in a low-power idle state. > + * Just try it again next time. > + */ > + if (ret == -EFAULT) > + return; > + > + if (ret) { > pr_warn("%s: failed to read perf counters\n", __func__); > + topology_clear_scale_freq_source(SCALE_FREQ_SOURCE_CPPC, > + cpu_data->shared_cpu_map); > return; > } > > -- > 2.33.0 -- viresh
Thanks for adding me, Viresh. On Tue, 29 Jul 2025 at 23:39, Viresh Kumar <viresh.kumar@linaro.org> wrote: > > + Prashant/Beata/Ionela > > On 30-07-25, 11:23, Bowen Yu wrote: > > From: Jie Zhan <zhanjie9@hisilicon.com> > > > > Perf counters could be 0 if the cpu is in a low-power idle state. Just try > > it again next time and update the frequency scale when the cpu is active > > and perf counters successfully return. > > > > Also, remove the FIE source on an actual failure. > > > > Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com> > > --- > > drivers/cpufreq/cppc_cpufreq.c | 13 ++++++++++++- > > 1 file changed, 12 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c > > index 904006027df2..e95844d3d366 100644 > > --- a/drivers/cpufreq/cppc_cpufreq.c > > +++ b/drivers/cpufreq/cppc_cpufreq.c > > @@ -78,12 +78,23 @@ static void cppc_scale_freq_workfn(struct kthread_work *work) > > struct cppc_cpudata *cpu_data; > > unsigned long local_freq_scale; > > u64 perf; > > + int ret; > > > > cppc_fi = container_of(work, struct cppc_freq_invariance, work); > > cpu_data = cppc_fi->cpu_data; > > > > - if (cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs)) { > > + ret = cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs); > > + /* > > + * Perf counters could be 0 if the cpu is in a low-power idle state. > > + * Just try it again next time. > > + */ FWIU the performance counters shouldn't be returning 0 in an idle state. Per the UEFI spec [1], they increment any time the CPU is active, so they should just return their last counter value before they went into idle (of course in the FFH case an IPI is performed on the target CPU, so even if the CPU was idle, it will get woken up). As such it is better to either : - Check for idle_cpu() directly and return (see [2] for the function) or - Always clear the source on encountering an error return value. [1] https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html?highlight=cppc#performance-counters [2] https://lore.kernel.org/linux-pm/20250619000925.415528-2-pmalani@google.com/ HTH, -Prashant
On 31/07/2025 06:34, Prashant Malani wrote: > Thanks for adding me, Viresh. > > On Tue, 29 Jul 2025 at 23:39, Viresh Kumar <viresh.kumar@linaro.org> wrote: >> >> + Prashant/Beata/Ionela >> >> On 30-07-25, 11:23, Bowen Yu wrote: >>> From: Jie Zhan <zhanjie9@hisilicon.com> >>> >>> Perf counters could be 0 if the cpu is in a low-power idle state. Just try >>> it again next time and update the frequency scale when the cpu is active >>> and perf counters successfully return. >>> >>> Also, remove the FIE source on an actual failure. >>> >>> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com> >>> --- >>> drivers/cpufreq/cppc_cpufreq.c | 13 ++++++++++++- >>> 1 file changed, 12 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c >>> index 904006027df2..e95844d3d366 100644 >>> --- a/drivers/cpufreq/cppc_cpufreq.c >>> +++ b/drivers/cpufreq/cppc_cpufreq.c >>> @@ -78,12 +78,23 @@ static void cppc_scale_freq_workfn(struct kthread_work *work) >>> struct cppc_cpudata *cpu_data; >>> unsigned long local_freq_scale; >>> u64 perf; >>> + int ret; >>> >>> cppc_fi = container_of(work, struct cppc_freq_invariance, work); >>> cpu_data = cppc_fi->cpu_data; >>> >>> - if (cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs)) { >>> + ret = cppc_get_perf_ctrs(cppc_fi->cpu, &fb_ctrs); >>> + /* >>> + * Perf counters could be 0 if the cpu is in a low-power idle state. >>> + * Just try it again next time. >>> + */ > > FWIU the performance counters shouldn't be returning 0 in an idle state. > Per the UEFI spec [1], they increment any time the CPU is active, > so they should just return their last counter value before they went into idle > (of course in the FFH case an IPI is performed on the target CPU, so even > if the CPU was idle, it will get woken up). Hi Prashant, The perf counters could return 0 when a CPU is enters a low-power idle state, e.g. reset or powered down, and the perf counters are in the system memory space (the target CPU is not woken up unfortunately). On our platform, and I suppose so on most ARM64 platforms, perf counters are mapped to AMU counters. Per ARM spec, AMEVCNTR0 is 0 on reset. BTW, that's also why ARM Trusted Firmware needs to save and restore AMU counters before and after powering down. https://github.com/ARM-software/arm-trusted-firmware/blob/master/lib/extensions/amu/aarch64/amu.c I hope this explains your confusion. Then we can carry on discussion if we reach on this consensus. Thanks for taking a look! Jie > > As such it is better to either : > - Check for idle_cpu() directly and return (see [2] for the function) > or > - Always clear the source on encountering an error return value. > > [1] https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html?highlight=cppc#performance-counters > [2] https://lore.kernel.org/linux-pm/20250619000925.415528-2-pmalani@google.com/ > > HTH, > > -Prashant
Hi Jie, On Thu, 31 Jul 2025 at 01:32, Jie Zhan <zhanjie9@hisilicon.com> wrote: > > > > On 31/07/2025 06:34, Prashant Malani wrote: > > Hi Prashant, > > The perf counters could return 0 when a CPU is enters a low-power idle > state, e.g. reset or powered down, and the perf counters are in the system > memory space (the target CPU is not woken up unfortunately). > Thanks for the clarification. Reset and powered down are not typically considered "low-power idle states". Please re-word your commit message to specifically call out the "reset and powered-down" CPU states. This begs the question: why is this work function being scheduled for CPUs that are in reset or offline/powered-down at all? IANAE but it sounds like it would be better to add logic to ensure this work function doesn't get scheduled/executed for CPUs that are truly offline/powered-down or in reset. BR, -- -Prashant
On 01/08/2025 16:58, Prashant Malani wrote: > Hi Jie, > > On Thu, 31 Jul 2025 at 01:32, Jie Zhan <zhanjie9@hisilicon.com> wrote: >> >> >> >> On 31/07/2025 06:34, Prashant Malani wrote: >> >> Hi Prashant, >> >> The perf counters could return 0 when a CPU is enters a low-power idle >> state, e.g. reset or powered down, and the perf counters are in the system >> memory space (the target CPU is not woken up unfortunately). >> > > Thanks for the clarification. Reset and powered down are not typically > considered "low-power idle states". Actually, power down and reset is a common low-power idle state defined in ACPI spec and implemented on Intel chips. Some quick references: [1] ACPI spec 6.5, 8 PROCESSOR CONFIGURATION AND CONTROL https://uefi.org/sites/default/files/resources/ACPI_Spec_6_5_Aug29.pdf Also, the C1, C2, C3 states are refered as Clock Gated, Retention, and Power Down respectively in 8.4.3 Lower Power Idle States. [2] Intel 3rd gen Xeon Scalable, the C6 state. https://www.intel.com/content/www/us/en/content-details/637748/power-management-technology-overview-technology-guide.html?wapkw=c-state Nevertheless, it's still not widely supported in practice, so I understand what you said. > Please re-word your commit message to specifically call out the "reset and > powered-down" CPU states. Sure. I'll try to make it a bit more straightforward in V2. > > This begs the question: why is this work function being scheduled > for CPUs that are in reset or offline/powered-down at all? > IANAE but it sounds like it would be better to add logic to ensure this > work function doesn't get scheduled/executed for CPUs that > are truly offline/powered-down or in reset. Yeah good question. We may discuss that on your thread. > > BR, > Thanks! Jie
On Sun, 3 Aug 2025 at 23:21, Jie Zhan <zhanjie9@hisilicon.com> wrote > On 01/08/2025 16:58, Prashant Malani wrote: > > This begs the question: why is this work function being scheduled > > for CPUs that are in reset or offline/powered-down at all? > > IANAE but it sounds like it would be better to add logic to ensure this > > work function doesn't get scheduled/executed for CPUs that > > are truly offline/powered-down or in reset. > Yeah good question. We may discuss that on your thread. OK. Quickly looking around, it sounds having in the CPPC tick function [1] might be a better option (one probably doesn't want to lift it beyond the CPPC layer, since other drivers might have different behaviour). One can add a cpu_online/cpu_enabled check there. [1] https://elixir.bootlin.com/linux/v6.16/source/include/linux/cpumask.h#L1233 -- -Prashant
On Mon, 4 Aug 2025 at 18:12, Prashant Malani <pmalani@google.com> wrote: > > On Sun, 3 Aug 2025 at 23:21, Jie Zhan <zhanjie9@hisilicon.com> wrote > > On 01/08/2025 16:58, Prashant Malani wrote: > > > This begs the question: why is this work function being scheduled > > > for CPUs that are in reset or offline/powered-down at all? > > > IANAE but it sounds like it would be better to add logic to ensure this > > > work function doesn't get scheduled/executed for CPUs that > > > are truly offline/powered-down or in reset. > > Yeah good question. We may discuss that on your thread. > > OK. > Quickly looking around, it sounds having in the CPPC tick function [1] > might be a better option (one probably doesn't want to lift it beyond the > CPPC layer, since other drivers might have different behaviour). > One can add a cpu_online/cpu_enabled check there. Fixed link: [1] https://elixir.bootlin.com/linux/v6.13/source/drivers/cpufreq/cppc_cpufreq.c#L125
On 05/08/2025 12:58, Prashant Malani wrote: > On Mon, 4 Aug 2025 at 18:12, Prashant Malani <pmalani@google.com> wrote: >> >> On Sun, 3 Aug 2025 at 23:21, Jie Zhan <zhanjie9@hisilicon.com> wrote >>> On 01/08/2025 16:58, Prashant Malani wrote: >>>> This begs the question: why is this work function being scheduled >>>> for CPUs that are in reset or offline/powered-down at all? >>>> IANAE but it sounds like it would be better to add logic to ensure this >>>> work function doesn't get scheduled/executed for CPUs that >>>> are truly offline/powered-down or in reset. >>> Yeah good question. We may discuss that on your thread. >> >> OK. >> Quickly looking around, it sounds having in the CPPC tick function [1] >> might be a better option (one probably doesn't want to lift it beyond the >> CPPC layer, since other drivers might have different behaviour). >> One can add a cpu_online/cpu_enabled check there. > > Fixed link: > [1] https://elixir.bootlin.com/linux/v6.13/source/drivers/cpufreq/cppc_cpufreq.c#L125 I don't think a cpu_online/cpu_enabled check there would help. Offlined CPUs don't make cppc_scale_freq_workfn() fail because they won't have FIE triggered. It fails from accessing perf counters on powered-down CPUs. Perhaps the CPPC FIE needs a bit rework. AFAICS, FIE is meant to run in ticks, but currently the CPPC FIE eventually runs in a thread due to the possible PCC path when reading CPC regs I guess.
On Wed, Aug 13, 2025 at 03:15:12PM +0800, Jie Zhan wrote: > > > On 05/08/2025 12:58, Prashant Malani wrote: > > On Mon, 4 Aug 2025 at 18:12, Prashant Malani <pmalani@google.com> wrote: > >> > >> On Sun, 3 Aug 2025 at 23:21, Jie Zhan <zhanjie9@hisilicon.com> wrote > >>> On 01/08/2025 16:58, Prashant Malani wrote: > >>>> This begs the question: why is this work function being scheduled > >>>> for CPUs that are in reset or offline/powered-down at all? > >>>> IANAE but it sounds like it would be better to add logic to ensure this > >>>> work function doesn't get scheduled/executed for CPUs that > >>>> are truly offline/powered-down or in reset. > >>> Yeah good question. We may discuss that on your thread. > >> > >> OK. > >> Quickly looking around, it sounds having in the CPPC tick function [1] > >> might be a better option (one probably doesn't want to lift it beyond the > >> CPPC layer, since other drivers might have different behaviour). > >> One can add a cpu_online/cpu_enabled check there. > > > > Fixed link: > > [1] https://elixir.bootlin.com/linux/v6.13/source/drivers/cpufreq/cppc_cpufreq.c#L125 > I don't think a cpu_online/cpu_enabled check there would help. > > Offlined CPUs don't make cppc_scale_freq_workfn() fail because they won't > have FIE triggered. It fails from accessing perf counters on powered-down > CPUs. > > Perhaps the CPPC FIE needs a bit rework. AFAICS, FIE is meant to run in > ticks, but currently the CPPC FIE eventually runs in a thread due to the > possible PCC path when reading CPC regs I guess. Just for my benefit: the tick is being fired on a given CPU which is when an irq_work is being queued. Then before this goes through the kworker and finally ends up in 'cppc_scale_freq_workfn' that CPU is entering a deeper idle state ? Could the cppc driver register for pm notifications and cancel any pending work for a CPU that is actually going down, directly or by setting some flag or smth so that the final worker function is either not triggered or knows it has to bail out early ? (Note this is a rough idea and needs verification) --- BR Beata
On 13/08/2025 17:30, Beata Michalska wrote: > On Wed, Aug 13, 2025 at 03:15:12PM +0800, Jie Zhan wrote: >> >> >> On 05/08/2025 12:58, Prashant Malani wrote: >>> On Mon, 4 Aug 2025 at 18:12, Prashant Malani <pmalani@google.com> wrote: >>>> >>>> On Sun, 3 Aug 2025 at 23:21, Jie Zhan <zhanjie9@hisilicon.com> wrote >>>>> On 01/08/2025 16:58, Prashant Malani wrote: >>>>>> This begs the question: why is this work function being scheduled >>>>>> for CPUs that are in reset or offline/powered-down at all? >>>>>> IANAE but it sounds like it would be better to add logic to ensure this >>>>>> work function doesn't get scheduled/executed for CPUs that >>>>>> are truly offline/powered-down or in reset. >>>>> Yeah good question. We may discuss that on your thread. >>>> >>>> OK. >>>> Quickly looking around, it sounds having in the CPPC tick function [1] >>>> might be a better option (one probably doesn't want to lift it beyond the >>>> CPPC layer, since other drivers might have different behaviour). >>>> One can add a cpu_online/cpu_enabled check there. >>> >>> Fixed link: >>> [1] https://elixir.bootlin.com/linux/v6.13/source/drivers/cpufreq/cppc_cpufreq.c#L125 >> I don't think a cpu_online/cpu_enabled check there would help. >> >> Offlined CPUs don't make cppc_scale_freq_workfn() fail because they won't >> have FIE triggered. It fails from accessing perf counters on powered-down >> CPUs. >> >> Perhaps the CPPC FIE needs a bit rework. AFAICS, FIE is meant to run in >> ticks, but currently the CPPC FIE eventually runs in a thread due to the >> possible PCC path when reading CPC regs I guess. > Just for my benefit: the tick is being fired on a given CPU which is when an > irq_work is being queued. Then before this goes through the kworker and finally > ends up in 'cppc_scale_freq_workfn' that CPU is entering a deeper idle state ? Yeah. > Could the cppc driver register for pm notifications and cancel any pending work > for a CPU that is actually going down, directly or by setting some flag or smth > so that the final worker function is either not triggered or knows it has to > bail out early ? > (Note this is a rough idea and needs verification) That could be a feasible workaround, but I prefer to rework the CPPC FIE rather than try to fix it, i.e. FIE can run in ticks for non-PCC regs I suppose. > > --- > BR > Beata >
© 2016 - 2025 Red Hat, Inc.