The latency of cpc_read() differs significantly between normal conditions
and conditions of heavy memory access pressure. If this latency is ignored,
the get-rate interface occasionally returns absurd values.
The problem can be reproduced with a memory-intensive stress-ng test. My
local test platform has 160 CPUs, the CPC registers are accessed through
MMIO, and cpuidle is disabled (so the AMU counters are always running):
~~~
./stress-ng --memrate 160 --timeout 180
~~~
The following data comes from ftrace output for cppc_get_perf_ctrs():
Regular scenarios || High memory access pressure scenarios
104) | cppc_get_perf_ctrs() { || 133) | cppc_get_perf_ctrs() {
104) 0.800 us | cpc_read.isra.0(); || 133) 4.580 us | cpc_read.isra.0();
104) 0.640 us | cpc_read.isra.0(); || 133) 7.780 us | cpc_read.isra.0();
104) 0.450 us | cpc_read.isra.0(); || 133) 2.550 us | cpc_read.isra.0();
104) 0.430 us | cpc_read.isra.0(); || 133) 0.570 us | cpc_read.isra.0();
104) 4.610 us | } || 133) ! 157.610 us | }
104) | cppc_get_perf_ctrs() { || 133) | cppc_get_perf_ctrs() {
104) 0.720 us | cpc_read.isra.0(); || 133) 0.760 us | cpc_read.isra.0();
104) 0.720 us | cpc_read.isra.0(); || 133) 4.480 us | cpc_read.isra.0();
104) 0.510 us | cpc_read.isra.0(); || 133) 0.520 us | cpc_read.isra.0();
104) 0.500 us | cpc_read.isra.0(); || 133) + 10.100 us | cpc_read.isra.0();
104) 3.460 us | } || 133) ! 120.850 us | }
108) | cppc_get_perf_ctrs() { || 87) | cppc_get_perf_ctrs() {
108) 0.820 us | cpc_read.isra.0(); || 87) ! 255.200 us | cpc_read.isra.0();
108) 0.850 us | cpc_read.isra.0(); || 87) 2.910 us | cpc_read.isra.0();
108) 0.590 us | cpc_read.isra.0(); || 87) 5.160 us | cpc_read.isra.0();
108) 0.610 us | cpc_read.isra.0(); || 87) 4.340 us | cpc_read.isra.0();
108) 5.080 us | } || 87) ! 315.790 us | }
108) | cppc_get_perf_ctrs() { || 87) | cppc_get_perf_ctrs() {
108) 0.630 us | cpc_read.isra.0(); || 87) 0.800 us | cpc_read.isra.0();
108) 0.630 us | cpc_read.isra.0(); || 87) 6.310 us | cpc_read.isra.0();
108) 0.420 us | cpc_read.isra.0(); || 87) 1.190 us | cpc_read.isra.0();
108) 0.430 us | cpc_read.isra.0(); || 87) + 11.620 us | cpc_read.isra.0();
108) 3.780 us | } || 87) ! 207.010 us | }
My local test platform runs at 3000000 kHz (3 GHz), but under this load the
cpuinfo_cur_freq interface returns values that are not even close to the
actual frequency:
[root@localhost ~]# cd /sys/devices/system/cpu
[root@localhost cpu]# for i in {0..159}; do cat cpu$i/cpufreq/cpuinfo_cur_freq; done
5127812
2952127
3069001
3496183
922989768
2419194
3427042
2331869
3594611
8238499
...
The reason is that under heavy memory access pressure, the execution time
of cpc_read() increases from under a microsecond to several hundred
microseconds. Wrapping the cpc_read() calls in a critical section with IRQs
disabled makes little difference to the result.
cppc_get_perf_ctrs()[0]                      cppc_get_perf_ctrs()[1]
       /       \                                    /       \
cpc_read      cpc_read                    cpc_read            cpc_read
 ref[0]     delivered[0]                   ref[1]           delivered[1]
    |             |                           |                   |
    v             v                           v                   v
------------------------------------------------------------------------> time
    <--delta[0]--> <------sample_period------> <-----delta[1]----->
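The rate is computed as
freq = ref_freq * (delivered[1] - delivered[0]) / (ref[1] - ref[0])
while the counters actually advance by
delivered[1] - delivered[0] = freq * (delta[1] + sample_period),
ref[1] - ref[0] = ref_freq * (delta[0] + sample_period)
so the reported value is the actual frequency scaled by
(delta[1] + sample_period) / (delta[0] + sample_period), which is far from 1
whenever the deltas are comparable to or larger than the sampling period.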
A sampling period of 2us is far too short to mask the memory access
latency. We therefore suggest that cppc_cpufreq_get_rate() only be called
in process context, and that it use a longer sampling period to dampen the
effect of random latency.
The wait calls cond_resched() rather than a sleeping function, because
sleeping could schedule the idle task and stop the AMU counters. This
ensures that `taskset -c $i cat cpu$i/cpufreq/cpuinfo_cur_freq` still works
when the cpuidle feature is enabled.
Reported-by: Yang Shi <yang@os.amperecomputing.com>
Link: https://lore.kernel.org/all/20230328193846.8757-1-yang@os.amperecomputing.com/
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
drivers/cpufreq/cppc_cpufreq.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index 321a9dc9484d..a7c5418bcda7 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -851,12 +851,26 @@ static int cppc_get_perf_ctrs_pair(void *val)
 	struct fb_ctr_pair *fb_ctrs = val;
 	int cpu = fb_ctrs->cpu;
 	int ret;
+	unsigned long timeout;
 
 	ret = cppc_get_perf_ctrs(cpu, &fb_ctrs->fb_ctrs_t0);
 	if (ret)
 		return ret;
 
-	udelay(2); /* 2usec delay between sampling */
+	if (likely(!irqs_disabled())) {
+		/*
+		 * Set 1ms as sampling interval, but never schedule
+		 * to the idle task to prevent the AMU counters from
+		 * stopping working.
+		 */
+		timeout = jiffies + msecs_to_jiffies(1);
+		while (!time_after(jiffies, timeout))
+			cond_resched();
+
+	} else {
+		pr_warn_once("CPU%d: Get rate in atomic context", cpu);
+		udelay(2); /* 2usec delay between sampling */
+	}
 
 	return cppc_get_perf_ctrs(cpu, &fb_ctrs->fb_ctrs_t1);
 }
--
2.25.1
On Wed, Oct 25, 2023 at 05:38:47PM +0800, Zeng Heng wrote:
> We have found significant differences in the latency of cpc_read() between
> regular scenarios and scenarios with high memory access pressure. Ignoring
> this error can result in getting rate interface occasionally returning
> absurd values.
>
> Here provides a high memory access sample test by stress-ng. My local
> testing platform includes 160 CPUs, the CPC registers is accessed by mmio
> method, and the cpuidle feature is disabled (the AMU always works online):
>
> ~~~
> ./stress-ng --memrate 160 --timeout 180
> ~~~
>
> The following data is sourced from ftrace statistics towards
> cppc_get_perf_ctrs():
>
> Regular scenarios || High memory access pressure scenarios
> 104) | cppc_get_perf_ctrs() { || 133) | cppc_get_perf_ctrs() {
> 104) 0.800 us | cpc_read.isra.0(); || 133) 4.580 us | cpc_read.isra.0();
> 104) 0.640 us | cpc_read.isra.0(); || 133) 7.780 us | cpc_read.isra.0();
> 104) 0.450 us | cpc_read.isra.0(); || 133) 2.550 us | cpc_read.isra.0();
> 104) 0.430 us | cpc_read.isra.0(); || 133) 0.570 us | cpc_read.isra.0();
> 104) 4.610 us | } || 133) ! 157.610 us | }
> 104) | cppc_get_perf_ctrs() { || 133) | cppc_get_perf_ctrs() {
> 104) 0.720 us | cpc_read.isra.0(); || 133) 0.760 us | cpc_read.isra.0();
> 104) 0.720 us | cpc_read.isra.0(); || 133) 4.480 us | cpc_read.isra.0();
> 104) 0.510 us | cpc_read.isra.0(); || 133) 0.520 us | cpc_read.isra.0();
> 104) 0.500 us | cpc_read.isra.0(); || 133) + 10.100 us | cpc_read.isra.0();
> 104) 3.460 us | } || 133) ! 120.850 us | }
> 108) | cppc_get_perf_ctrs() { || 87) | cppc_get_perf_ctrs() {
> 108) 0.820 us | cpc_read.isra.0(); || 87) ! 255.200 us | cpc_read.isra.0();
> 108) 0.850 us | cpc_read.isra.0(); || 87) 2.910 us | cpc_read.isra.0();
> 108) 0.590 us | cpc_read.isra.0(); || 87) 5.160 us | cpc_read.isra.0();
> 108) 0.610 us | cpc_read.isra.0(); || 87) 4.340 us | cpc_read.isra.0();
> 108) 5.080 us | } || 87) ! 315.790 us | }
> 108) | cppc_get_perf_ctrs() { || 87) | cppc_get_perf_ctrs() {
> 108) 0.630 us | cpc_read.isra.0(); || 87) 0.800 us | cpc_read.isra.0();
> 108) 0.630 us | cpc_read.isra.0(); || 87) 6.310 us | cpc_read.isra.0();
> 108) 0.420 us | cpc_read.isra.0(); || 87) 1.190 us | cpc_read.isra.0();
> 108) 0.430 us | cpc_read.isra.0(); || 87) + 11.620 us | cpc_read.isra.0();
> 108) 3.780 us | } || 87) ! 207.010 us | }
>
> My local testing platform works under 3000000hz, but the cpuinfo_cur_freq
> interface returns values that are not even close to the actual frequency:
>
> [root@localhost ~]# cd /sys/devices/system/cpu
> [root@localhost cpu]# for i in {0..159}; do cat cpu$i/cpufreq/cpuinfo_cur_freq; done
> 5127812
> 2952127
> 3069001
> 3496183
> 922989768
> 2419194
> 3427042
> 2331869
> 3594611
> 8238499
> ...
>
> The reason is when under heavy memory access pressure, the execution of
> cpc_read() delay has increased from sub-microsecond to several hundred
> microseconds. Moving the cpc_read function into a critical section by irq
> disable/enable has minimal impact on the result.
>
> cppc_get_perf_ctrs()[0] cppc_get_perf_ctrs()[1]
> / \ / \
> cpc_read cpc_read cpc_read cpc_read
> ref[0] delivered[0] ref[1] delivered[1]
> | | | |
> v v v v
> -----------------------------------------------------------------------> time
> <--delta[0]--> <------sample_period------> <-----delta[1]----->
>
> Since that,
> freq = ref_freq * (delivered[1] - delivered[0]) / (ref[1] - ref[0])
> and
> delivered[1] - delivered[0] = freq * (delta[1] + sample_period),
> ref[1] - ref[0] = ref_freq * (delta[0] + sample_period)
>
> To eliminate the impact of system memory access latency, setting a
> sampling period of 2us is far from sufficient. Consequently, we suggest
> cppc_cpufreq_get_rate() only can be called in the process context, and
> adopt a longer sampling period to neutralize the impact of random latency.
>
> Here we call the cond_resched() function instead of sleep-like functions
> to ensure that `taskset -c $i cat cpu$i/cpufreq/cpuinfo_cur_freq` could
> work when cpuidle feature is enabled.
>
> Reported-by: Yang Shi <yang@os.amperecomputing.com>
> Link: https://lore.kernel.org/all/20230328193846.8757-1-yang@os.amperecomputing.com/
> Signed-off-by: Zeng Heng <zengheng4@huawei.com>
> ---
> drivers/cpufreq/cppc_cpufreq.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> index 321a9dc9484d..a7c5418bcda7 100644
> --- a/drivers/cpufreq/cppc_cpufreq.c
> +++ b/drivers/cpufreq/cppc_cpufreq.c
> @@ -851,12 +851,26 @@ static int cppc_get_perf_ctrs_pair(void *val)
The previous patch added this function, and calls it with smp_call_on_cpu(),
where it'll run in IRQ context with IRQs disabled...
> struct fb_ctr_pair *fb_ctrs = val;
> int cpu = fb_ctrs->cpu;
> int ret;
> + unsigned long timeout;
>
> ret = cppc_get_perf_ctrs(cpu, &fb_ctrs->fb_ctrs_t0);
> if (ret)
> return ret;
>
> - udelay(2); /* 2usec delay between sampling */
> + if (likely(!irqs_disabled())) {
> + /*
> + * Set 1ms as sampling interval, but never schedule
> + * to the idle task to prevent the AMU counters from
> + * stopping working.
> + */
> + timeout = jiffies + msecs_to_jiffies(1);
> + while (!time_after(jiffies, timeout))
> + cond_resched();
> +
> + } else {
... so we'll enter this branch of the if-else ...
> + pr_warn_once("CPU%d: Get rate in atomic context", cpu);
... and pr_warn_once() for something that's apparently normal and outside of
the user's control?
That doesn't make much sense to me.
Mark.
> + udelay(2); /* 2usec delay between sampling */
> + }
>
> return cppc_get_perf_ctrs(cpu, &fb_ctrs->fb_ctrs_t1);
> }
> --
> 2.25.1
>
On 2023/10/25 19:01, Mark Rutland wrote:
> On Wed, Oct 25, 2023 at 05:38:47PM +0800, Zeng Heng wrote:
>
> The previous patch added this function, and calls it with smp_call_on_cpu(),
> where it'll run in IRQ context with IRQs disabled...
smp_call_on_cpu() queues the work on a worker bound to the target CPU, so
this function is called in task context, and IRQs are certainly enabled.
Zeng Heng
>> struct fb_ctr_pair *fb_ctrs = val;
>> int cpu = fb_ctrs->cpu;
>> int ret;
>> + unsigned long timeout;
>>
>> ret = cppc_get_perf_ctrs(cpu, &fb_ctrs->fb_ctrs_t0);
>> if (ret)
>> return ret;
>>
>> - udelay(2); /* 2usec delay between sampling */
>> + if (likely(!irqs_disabled())) {
>> + /*
>> + * Set 1ms as sampling interval, but never schedule
>> + * to the idle task to prevent the AMU counters from
>> + * stopping working.
>> + */
>> + timeout = jiffies + msecs_to_jiffies(1);
>> + while (!time_after(jiffies, timeout))
>> + cond_resched();
>> +
>> + } else {
> ... so we'll enter this branch of the if-else ...
>
>> + pr_warn_once("CPU%d: Get rate in atomic context", cpu);
> ... and pr_warn_once() for something that's apparently normal and outside of
> the user's control?
>
> That doesn't make much sense to me.
>
> Mark.
>
>> + udelay(2); /* 2usec delay between sampling */
>> + }
>>
>> return cppc_get_perf_ctrs(cpu, &fb_ctrs->fb_ctrs_t1);
>> }
>> --
>> 2.25.1
>>
On Thu, Oct 26, 2023 at 09:55:39AM +0800, Zeng Heng wrote:
>
> On 2023/10/25 19:01, Mark Rutland wrote:
> > On Wed, Oct 25, 2023 at 05:38:47PM +0800, Zeng Heng wrote:
> >
> > The previous patch added this function, and calls it with smp_call_on_cpu(),
> > where it'll run in IRQ context with IRQs disabled...
>
> smp_call_on_cpu() puts the work to the bind-cpu worker.
Ah, sorry -- I had confused this with the smp_call_function*() family, which do
this in IRQ context.
> And this function will be called in task context, and IRQs is certainly enabled.
Understood; given that, please ignore my comments below.
Mark.
>
>
> Zeng Heng
>
> > > struct fb_ctr_pair *fb_ctrs = val;
> > > int cpu = fb_ctrs->cpu;
> > > int ret;
> > > + unsigned long timeout;
> > > ret = cppc_get_perf_ctrs(cpu, &fb_ctrs->fb_ctrs_t0);
> > > if (ret)
> > > return ret;
> > > - udelay(2); /* 2usec delay between sampling */
> > > + if (likely(!irqs_disabled())) {
> > > + /*
> > > + * Set 1ms as sampling interval, but never schedule
> > > + * to the idle task to prevent the AMU counters from
> > > + * stopping working.
> > > + */
> > > + timeout = jiffies + msecs_to_jiffies(1);
> > > + while (!time_after(jiffies, timeout))
> > > + cond_resched();
> > > +
> > > + } else {
> > ... so we'll enter this branch of the if-else ...
> >
> > > + pr_warn_once("CPU%d: Get rate in atomic context", cpu);
> > ... and pr_warn_once() for something that's apparently normal and outside of
> > the user's control?
> >
> > That doesn't make much sense to me.
> >
> > Mark.
> >
> > > + udelay(2); /* 2usec delay between sampling */
> > > + }
> > > return cppc_get_perf_ctrs(cpu, &fb_ctrs->fb_ctrs_t1);
> > > }
> > > --
> > > 2.25.1
> > >