[PATCH] x86/aperfmperf: Don't disable scheduler APERF/MPERF on bad samples
Posted by Andi Kleen 2 weeks ago
The APERF and MPERF MSRs get read together and the ratio
between the two is used to scale the scheduler capacity with frequency.
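
Schematically, the computation (as done in scale_freq_tick() below) is

	freq_scale = min(SCHED_CAPACITY_SCALE,
			 (acnt_delta << 2*SCHED_CAPACITY_SHIFT) /
			 (mcnt_delta * freq_ratio));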

Since commit e2b0d619b400, whenever the APERF/MPERF computation
over/underflows, the sampling gets completely disabled, under the
assumption that there is a problem with the hardware.

However this can happen without any malfunction when there is
a long enough interruption between the two MSR reads, for
example due to an unlucky NMI or SMI or another system event
causing delays. We saw it when a delay resulted in
acnt_delta << mcnt_delta (about 4k for acnt_delta versus
2M for mcnt_delta).

In this case the ratio computation underflows, which is detected,
but then APERF/MPERF usage gets incorrectly disabled forever.
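
For illustration, numbers of that magnitude plugged into the
scale_freq_tick() arithmetic below (the freq_ratio value of
2*SCHED_CAPACITY_SCALE is an assumed example, e.g. turbo = 2x base):

	u64 acnt = 4000;			/* ~4k APERF delta */
	u64 mcnt = 2 * 1024 * 1024;		/* ~2M MPERF delta */
	u64 freq_ratio = 2 * SCHED_CAPACITY_SCALE;	/* 2048 */

	acnt <<= 2 * SCHED_CAPACITY_SHIFT;	/* 4000 << 20 ~= 4.19e9 */
	mcnt *= freq_ratio;			/* 2^21 * 2^11  = 2^32  */

	/* div64_u64(acnt, mcnt) == 0 -> the !freq_scale check trips */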

Remove the code that completely disables APERF/MPERF on
a bad sample. Instead, when any over/underflow happens,
return the fallback full capacity.

In theory there could be a threshold for disabling, but since
delays can happen randomly it's unclear what a good threshold
would be. If the hardware is truly broken this will result in
spending a few more cycles reading the bogus samples, but they
will all still be rejected.

Cc: ggherdovich@suse.cz
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: rafael.j.wysocki@intel.com
Fixes: e2b0d619b400 ("x86, sched: check for counters overflow ...")
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/aperfmperf.c | 36 ++++++++++----------------------
 1 file changed, 11 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index a315b0627dfb..7f4210e1082b 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -330,23 +330,6 @@ static void __init bp_init_freq_invariance(void)
 	}
 }
 
-static void disable_freq_invariance_workfn(struct work_struct *work)
-{
-	int cpu;
-
-	static_branch_disable(&arch_scale_freq_key);
-
-	/*
-	 * Set arch_freq_scale to a default value on all cpus
-	 * This negates the effect of scaling
-	 */
-	for_each_possible_cpu(cpu)
-		per_cpu(arch_freq_scale, cpu) = SCHED_CAPACITY_SCALE;
-}
-
-static DECLARE_WORK(disable_freq_invariance_work,
-		    disable_freq_invariance_workfn);
-
 DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
 EXPORT_PER_CPU_SYMBOL_GPL(arch_freq_scale);
 
@@ -437,30 +420,33 @@ static void scale_freq_tick(u64 acnt, u64 mcnt)
 	if (!arch_scale_freq_invariant())
 		return;
 
+	/*
+	 * On any over/underflow just ignore the sample. It could
+	 * be due to an unlucky NMI or similar between the
+	 * APERF and MPERF reads.
+	 */
 	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
-		goto error;
+		goto out;
 
 	if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
 		freq_ratio = READ_ONCE(this_cpu_ptr(arch_cpu_scale)->freq_ratio);
 	else
 		freq_ratio = arch_max_freq_ratio;
 
+	freq_scale = SCHED_CAPACITY_SCALE;
+
 	if (check_mul_overflow(mcnt, freq_ratio, &mcnt) || !mcnt)
-		goto error;
+		goto out;
 
 	freq_scale = div64_u64(acnt, mcnt);
 	if (!freq_scale)
-		goto error;
+		goto out;
 
 	if (freq_scale > SCHED_CAPACITY_SCALE)
 		freq_scale = SCHED_CAPACITY_SCALE;
 
+out:
 	this_cpu_write(arch_freq_scale, freq_scale);
-	return;
-
-error:
-	pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
-	schedule_work(&disable_freq_invariance_work);
 }
 #else
 static inline void bp_init_freq_invariance(void) { }
-- 
2.51.1
Re: [PATCH] x86/aperfmperf: Don't disable scheduler APERF/MPERF on bad samples
Posted by Peter Zijlstra 1 week, 6 days ago
On Thu, Dec 04, 2025 at 10:09:14AM -0800, Andi Kleen wrote:
> The APERF and MPERF MSRs get read together and the ratio
> between the two is used to scale the scheduler capacity with frequency.
> 
> Since commit e2b0d619b400, whenever the APERF/MPERF computation
> over/underflows, the sampling gets completely disabled, under the
> assumption that there is a problem with the hardware.
> 
> However this can happen without any malfunction when there is
> a long enough interruption between the two MSR reads, for
> example due to an unlucky NMI or SMI or another system event
> causing delays. We saw it when a delay resulted in
> acnt_delta << mcnt_delta (about 4k for acnt_delta versus
> 2M for mcnt_delta).
> 
> In this case the ratio computation underflows, which is detected,
> but then APERF/MPERF usage gets incorrectly disabled forever.
> 
> Remove the code that completely disables APERF/MPERF on
> a bad sample. Instead, when any over/underflow happens,
> return the fallback full capacity.

So what systems are actually showing this bad behaviour and what are we
doing to cure the problem rather than fight the symptom?

Also, a system where this is systematically buggered would really be
better off disabling it, no?
Re: [PATCH] x86/aperfmperf: Don't disable scheduler APERF/MPERF on bad samples
Posted by Andi Kleen 1 week, 4 days ago
On Fri, Dec 05, 2025 at 05:10:52PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 04, 2025 at 10:09:14AM -0800, Andi Kleen wrote:
> > The APERF and MPERF MSRs get read together and the ratio
> > between the two is used to scale the scheduler capacity with frequency.
> > 
> > Since commit e2b0d619b400, whenever the APERF/MPERF computation
> > over/underflows, the sampling gets completely disabled, under the
> > assumption that there is a problem with the hardware.
> > 
> > However this can happen without any malfunction when there is
> > a long enough interruption between the two MSR reads, for
> > example due to an unlucky NMI or SMI or another system event
> > causing delays. We saw it when a delay resulted in
> > acnt_delta << mcnt_delta (about 4k for acnt_delta versus
> > 2M for mcnt_delta).
> > 
> > In this case the ratio computation underflows, which is detected,
> > but then APERF/MPERF usage gets incorrectly disabled forever.
> > 
> > Remove the code that completely disables APERF/MPERF on
> > a bad sample. Instead, when any over/underflow happens,
> > return the fallback full capacity.
> 
> So what systems are actually showing this bad behaviour and what are we
> doing to cure the problem rather than fight the symptom?

We saw it with an artificial stress test on an Intel internal system,
but as I (and Andrew) explained, it is unavoidable and general:
delays can always happen for many reasons on any system: NMIs, SMIs,
virtualization, other random system issues.

> Also, a system where this is systematically buggered would really be
> better off disabling it, no?

The particular failure case here, if it were common (lots of very long
execution delays), would make the system fairly unusable anyway.
The scheduler doing a slightly worse job would be the least of your
troubles in such a case.

For other failures, I'm not aware of a system (perhaps short of a
hypervisor that doesn't save/restore the counters when switching
underlying CPUs) that actually has broken APERF/MPERF. So breaking
good systems just for a hypothetical bad case doesn't seem like a
good trade-off.

The main difference from the old strategy is that in a really bad
case, samples that happen to not under/overflow would still be used,
while the old code would stop on any bad sample. But any attempt to
handle this without impacting good cases would need either extra
complexity or magic threshold numbers.

So it seems better to not even try.

-Andi
Re: [PATCH] x86/aperfmperf: Don't disable scheduler APERF/MPERF on bad samples
Posted by kernel test robot 2 weeks ago
Hi Andi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/x86/core]
[also build test WARNING on tip/master peterz-queue/sched/core linus/master v6.18 next-20251204]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Andi-Kleen/x86-aperfmperf-Don-t-disable-scheduler-APERF-MPERF-on-bad-samples/20251205-021657
base:   tip/x86/core
patch link:    https://lore.kernel.org/r/20251204180914.1855553-1-ak%40linux.intel.com
patch subject: [PATCH] x86/aperfmperf: Don't disable scheduler APERF/MPERF on bad samples
config: x86_64-rhel-9.4-rust (https://download.01.org/0day-ci/archive/20251205/202512050606.C8pVgbHT-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
rustc: rustc 1.88.0 (6b00bc388 2025-06-23)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251205/202512050606.C8pVgbHT-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512050606.C8pVgbHT-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> arch/x86/kernel/cpu/aperfmperf.c:428:6: warning: variable 'freq_scale' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
     428 |         if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
         |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/overflow.h:198:37: note: expanded from macro 'check_shl_overflow'
     198 | #define check_shl_overflow(a, s, d) __must_check_overflow(({            \
         |                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     199 |         typeof(a) _a = a;                                               \
         |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     200 |         typeof(s) _s = s;                                               \
         |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     201 |         typeof(d) _d = d;                                               \
         |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     202 |         unsigned long long _a_full = _a;                                \
         |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     203 |         unsigned int _to_shift =                                        \
         |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     204 |                 is_non_negative(_s) && _s < 8 * sizeof(*d) ? _s : 0;    \
         |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     205 |         *_d = (_a_full << _to_shift);                                   \
         |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     206 |         (_to_shift != _s || is_negative(*_d) || is_negative(_a) ||      \
         |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     207 |         (*_d >> _to_shift) != _a);                                      \
         |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     208 | }))
         | ~~~
   arch/x86/kernel/cpu/aperfmperf.c:449:34: note: uninitialized use occurs here
     449 |         this_cpu_write(arch_freq_scale, freq_scale);
         |                                         ^~~~~~~~~~
   include/linux/percpu-defs.h:500:73: note: expanded from macro 'this_cpu_write'
     500 | #define this_cpu_write(pcp, val)        __pcpu_size_call(this_cpu_write_, pcp, val)
         |                                                                                ^~~
   include/linux/percpu-defs.h:372:29: note: expanded from macro '__pcpu_size_call'
     372 |                 case 8: stem##8(variable, __VA_ARGS__);break;           \
         |                                           ^~~~~~~~~~~
   arch/x86/include/asm/percpu.h:528:72: note: expanded from macro 'this_cpu_write_8'
     528 | #define this_cpu_write_8(pcp, val)                      __raw_cpu_write(8, volatile, pcp, val)
         |                                                                                           ^~~
   arch/x86/include/asm/percpu.h:165:52: note: expanded from macro '__raw_cpu_write'
     165 |         __pcpu_type_##size pto_val__ = __pcpu_cast_##size(_val);        \
         |                                                           ^~~~
   arch/x86/include/asm/percpu.h:118:35: note: expanded from macro '__pcpu_cast_8'
     118 | #define __pcpu_cast_8(val)      ((u64)(val))
         |                                        ^~~
   arch/x86/kernel/cpu/aperfmperf.c:428:2: note: remove the 'if' if its condition is always false
     428 |         if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     429 |                 goto out;
         |                 ~~~~~~~~
   arch/x86/kernel/cpu/aperfmperf.c:418:16: note: initialize the variable 'freq_scale' to silence this warning
     418 |         u64 freq_scale, freq_ratio;
         |                       ^
         |                        = 0
   1 warning generated.


vim +428 arch/x86/kernel/cpu/aperfmperf.c

5a9d10145a54f7a Rafael J. Wysocki 2024-08-28  415  
73a5fa7d51366a5 Thomas Gleixner   2022-04-15  416  static void scale_freq_tick(u64 acnt, u64 mcnt)
55cb0b70749361d Thomas Gleixner   2022-04-15  417  {
5a9d10145a54f7a Rafael J. Wysocki 2024-08-28  418  	u64 freq_scale, freq_ratio;
55cb0b70749361d Thomas Gleixner   2022-04-15  419  
bb6e89df9028b2f Thomas Gleixner   2022-04-15  420  	if (!arch_scale_freq_invariant())
bb6e89df9028b2f Thomas Gleixner   2022-04-15  421  		return;
bb6e89df9028b2f Thomas Gleixner   2022-04-15  422  
b4366dd91793d58 Andi Kleen        2025-12-04  423  	/*
b4366dd91793d58 Andi Kleen        2025-12-04  424  	 * On any over/underflow just ignore the sample. It could
b4366dd91793d58 Andi Kleen        2025-12-04  425  	 * be due to an unlucky NMI or similar between the
b4366dd91793d58 Andi Kleen        2025-12-04  426  	 * APERF and MPERF reads.
b4366dd91793d58 Andi Kleen        2025-12-04  427  	 */
55cb0b70749361d Thomas Gleixner   2022-04-15 @428  	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
b4366dd91793d58 Andi Kleen        2025-12-04  429  		goto out;
55cb0b70749361d Thomas Gleixner   2022-04-15  430  
5a9d10145a54f7a Rafael J. Wysocki 2024-08-28  431  	if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
5a9d10145a54f7a Rafael J. Wysocki 2024-08-28  432  		freq_ratio = READ_ONCE(this_cpu_ptr(arch_cpu_scale)->freq_ratio);
5a9d10145a54f7a Rafael J. Wysocki 2024-08-28  433  	else
5a9d10145a54f7a Rafael J. Wysocki 2024-08-28  434  		freq_ratio = arch_max_freq_ratio;
5a9d10145a54f7a Rafael J. Wysocki 2024-08-28  435  
b4366dd91793d58 Andi Kleen        2025-12-04  436  	freq_scale = SCHED_CAPACITY_SCALE;
b4366dd91793d58 Andi Kleen        2025-12-04  437  
5a9d10145a54f7a Rafael J. Wysocki 2024-08-28  438  	if (check_mul_overflow(mcnt, freq_ratio, &mcnt) || !mcnt)
b4366dd91793d58 Andi Kleen        2025-12-04  439  		goto out;
55cb0b70749361d Thomas Gleixner   2022-04-15  440  
55cb0b70749361d Thomas Gleixner   2022-04-15  441  	freq_scale = div64_u64(acnt, mcnt);
55cb0b70749361d Thomas Gleixner   2022-04-15  442  	if (!freq_scale)
b4366dd91793d58 Andi Kleen        2025-12-04  443  		goto out;
55cb0b70749361d Thomas Gleixner   2022-04-15  444  
55cb0b70749361d Thomas Gleixner   2022-04-15  445  	if (freq_scale > SCHED_CAPACITY_SCALE)
55cb0b70749361d Thomas Gleixner   2022-04-15  446  		freq_scale = SCHED_CAPACITY_SCALE;
55cb0b70749361d Thomas Gleixner   2022-04-15  447  
b4366dd91793d58 Andi Kleen        2025-12-04  448  out:
55cb0b70749361d Thomas Gleixner   2022-04-15  449  	this_cpu_write(arch_freq_scale, freq_scale);
55cb0b70749361d Thomas Gleixner   2022-04-15  450  }
bb6e89df9028b2f Thomas Gleixner   2022-04-15  451  #else
bb6e89df9028b2f Thomas Gleixner   2022-04-15  452  static inline void bp_init_freq_invariance(void) { }
bb6e89df9028b2f Thomas Gleixner   2022-04-15  453  static inline void scale_freq_tick(u64 acnt, u64 mcnt) { }
bb6e89df9028b2f Thomas Gleixner   2022-04-15  454  #endif /* CONFIG_X86_64 && CONFIG_SMP */
73a5fa7d51366a5 Thomas Gleixner   2022-04-15  455  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
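
The warning is legitimate: the first "goto out" is taken before
freq_scale has been assigned, because the patch places the fallback
store only after the check_shl_overflow() check. A minimal fixup
keeping the patch's fallback semantics would be (a sketch, not a
posted follow-up):

	--- a/arch/x86/kernel/cpu/aperfmperf.c
	+++ b/arch/x86/kernel/cpu/aperfmperf.c
	@@ static void scale_freq_tick(u64 acnt, u64 mcnt)
	-	u64 freq_scale, freq_ratio;
	+	/* Start from the fallback so every "goto out" stores a defined value */
	+	u64 freq_scale = SCHED_CAPACITY_SCALE;
	+	u64 freq_ratio;
	@@
	-	freq_scale = SCHED_CAPACITY_SCALE;
	-
	 	if (check_mul_overflow(mcnt, freq_ratio, &mcnt) || !mcnt)
	 		goto out;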
Re: [PATCH] x86/aperfmperf: Don't disable scheduler APERF/MPERF on bad samples
Posted by Andrew Cooper 2 weeks ago
> However this can happen without any malfunction when there is
> a long enough interruption between the two MSR reads, for
> example due to an unlucky NMI or SMI or another system event
> causing delays.

The list of problems is even longer under virt.  The vCPU can lose its
timeslice (-> arbitrary delay), or be scheduled onto a different CPU
between the A and M reads (-> calculate the wrong ratio, and not in a
way you could exclude it as a bad sample).

Despite this, some people insist on having A/MPERF inside VMs.

~Andrew
Re: [PATCH] x86/aperfmperf: Don't disable scheduler APERF/MPERF on bad samples
Posted by Andi Kleen 2 weeks ago
On Thu, Dec 04, 2025 at 08:02:09PM +0000, Andrew Cooper wrote:
> > However this can happen without any malfunction when there is
> > a long enough interruption between the two MSR reads, for
> > example due to an unlucky NMI or SMI or another system event
> > causing delays.
> 
> The list of problems is even longer under virt.  The vCPU can lose its
> timeslice (-> arbitrary delay), or be scheduled onto a different CPU
> between the A and M reads (-> calculate the wrong ratio, and not in a
> way you could exclude it as a bad sample).

You could avoid that in the hypervisor by saving/restoring the counts (the
MSRs are writable), but it's likely rare and harmless enough that 
it's not worth the performance overhead.
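
Roughly, such a hypervisor hook could look like the sketch below
(the structure and function names are made up; MSR_IA32_APERF and
MSR_IA32_MPERF are the architectural MSRs):

	struct vcpu_amperf {
		u64 aperf, mperf;
	};

	/* When descheduling the vCPU from a physical CPU. */
	static void vcpu_save_amperf(struct vcpu_amperf *v)
	{
		rdmsrl(MSR_IA32_APERF, v->aperf);
		rdmsrl(MSR_IA32_MPERF, v->mperf);
	}

	/* When rescheduling it, possibly on another physical CPU. */
	static void vcpu_restore_amperf(struct vcpu_amperf *v)
	{
		/* Both MSRs are writable, so replay the guest counts. */
		wrmsrl(MSR_IA32_APERF, v->aperf);
		wrmsrl(MSR_IA32_MPERF, v->mperf);
	}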

-Andi