From nobody Thu Apr  9 19:16:16 2026
Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.14])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8BC4C38424E
	for <linux-kernel@vger.kernel.org>; Tue,  3 Mar 2026 20:22:14 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=192.198.163.14
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1772569335; cv=none;
 b=O2CLtHL9SA0UcaXQwPpM3WEs8cHahFvQsJRk8mjVWtKqbapjc3xPha1QdCNb0l6DA/kwGRseMtu+GcB8TfRempdt7zAt3Dgcu/y/L68333phxGl04ajf5aMiE8KYDYIng+lcHuIGZ06cWibtRKb1PixFOsw4YlfTZMRzhVnnvfE=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1772569335; c=relaxed/simple;
	bh=kdNuhrYw+u24/2ELoowNFHAB+iDyWFyGZrCLDLO2C98=;
	h=From:To:Cc:Subject:Date:Message-ID:MIME-Version;
 b=s6oDkMn4Salh4D5HeMNj1ySDmUngtSiov1OgLUVMW7ShsQcciRyhMW0CqxI2buwcPI5H0vpRWAoqZKNwcUCJFgPaaj2w9hECkUgyFn/Wl7j2v79qbDNYXspeKLTeSDlCdaiZZ1B2sHzE/Uw96356naAAjxQfu+bjbyuOE6UppoY=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.intel.com;
 spf=pass smtp.mailfrom=linux.intel.com;
 dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b=f0EyNftF; arc=none smtp.client-ip=192.198.163.14
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.intel.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=linux.intel.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com
 header.b="f0EyNftF"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1772569334; x=1804105334;
  h=from:to:cc:subject:date:message-id:mime-version:
   content-transfer-encoding;
  bh=kdNuhrYw+u24/2ELoowNFHAB+iDyWFyGZrCLDLO2C98=;
  b=f0EyNftFDN6JMDrZPkwQG8zkZ5HjSSRhAWDK2ABl6yEnSKJussF2DqI8
   xftkqCtu/3TStcXRnTo0ZSStNgVWZWDnWurIlQjq2KiMiksoQm7paR8lM
   kaA6Wd0wdE98C2qQEZGo4vZM+RHz4Z3MKJAz61Y9m5J5aSkhvQaOPJRE8
   l17cBR6X2r/9PAyGsbu5jV4oXC6Ew/WDhncv/IgfPcA4+jyz4Q34js3/4
   XPbK6Gtd5Aj7tnMFFeWGTRr7tCLhhWOe/cs2Zw2Bmusj5vyU4zya6d3O3
   XmyZYc+J7lCqATm66z9I1Y90t0P88u6ovrVyj/rNKvAUfTbrZkY7rmeSw
   w==;
X-CSE-ConnectionGUID: QpLUdjzmQhir7jgUw8RHeQ==
X-CSE-MsgGUID: 0a9+SAxSROKo1lusSmUS2A==
X-IronPort-AV: E=McAfee;i="6800,10657,11718"; a="73684535"
X-IronPort-AV: E=Sophos;i="6.21,322,1763452800";
   d="scan'208";a="73684535"
Received: from orviesa004.jf.intel.com ([10.64.159.144])
  by fmvoesa108.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 03 Mar 2026 12:22:14 -0800
X-CSE-ConnectionGUID: CArgTwp9TJqlfB/nS7S2dw==
X-CSE-MsgGUID: u6BE4F+cRgC+VrF0IUebeA==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.21,322,1763452800";
   d="scan'208";a="222588669"
Received: from tassilo.jf.intel.com ([10.54.38.190])
  by orviesa004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 03 Mar 2026 12:22:14 -0800
From: Andi Kleen <ak@linux.intel.com>
To: tglx@kernel.org
Cc: mingo@redhat.com,
	dave.hansen@linux.intel.com,
	x86@kernel.org,
	peterz@infradead.org,
	hpa@zytor.com,
	rafael@kernel.org,
	linux-kernel@vger.kernel.org,
	Andi Kleen <ak@linux.intel.com>,
	ggherdovich@suse.cz,
	rafael.j.wysocki@intel.com
Subject: [PATCH v3] x86/aperfmperf: Don't disable scheduler APERF/MPERF on bad
 samples
Date: Tue,  3 Mar 2026 12:22:04 -0800
Message-ID: <20260303202204.108321-1-ak@linux.intel.com>
X-Mailer: git-send-email 2.53.0
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

The APERF and MPERF MSRs get read together and the ratio
between the two is used to scale the scheduler capacity with frequency.

Since commit e2b0d619b400 ("x86, sched: check for counters
overflow ..."), any over/underflow in the APERF/MPERF
computation disables the sampling completely, on the
assumption that there is a problem with the hardware.

However, this can happen without any malfunction when there is
a long enough interruption between the two MSR reads, for
example due to an unlucky NMI, SMI, or other system event
causing delays. We saw it when a delay resulted in
acnt_delta << mcnt_delta (about 4k for acnt_delta and
2M for mcnt_delta).

In this case the ratio computation underflows, which is detected,
but then APERF/MPERF usage gets incorrectly disabled forever.

Remove the code that completely disables APERF/MPERF on
a bad sample. Instead, when any over/underflow happens,
fall back to full capacity for that sample.

In theory there could be a threshold for disabling, but since
delays can happen randomly it's unclear what a good
threshold would be. If the hardware is truly broken
this will result in using a few more cycles to read
the bogus samples, but they will all still be rejected.

Cc: ggherdovich@suse.cz
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: rafael.j.wysocki@intel.com
Fixes: e2b0d619b400 ("x86, sched: check for counters overflow ...")
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---

v2: Move freq_scale initialization to cover all cases (thanks 0day bot!)
v3: Rebased/reposted.
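
For reviewers, a minimal user-space sketch of the arithmetic after
this patch. It is an illustration only, not the kernel code:
sample_freq_scale() is a hypothetical name, the overflow checks are
plain C/compiler-builtin stand-ins for check_shl_overflow() and
check_mul_overflow(), and the constants assume SCHED_CAPACITY_SHIFT
is 10.

```c
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1ULL << SCHED_CAPACITY_SHIFT)

/*
 * User-space model of the patched scale_freq_tick() arithmetic.
 * Returns the value that would be written to arch_freq_scale.
 * sample_freq_scale() is a hypothetical name, not a kernel function.
 */
static uint64_t sample_freq_scale(uint64_t acnt, uint64_t mcnt,
				  uint64_t freq_ratio)
{
	const unsigned int shift = 2 * SCHED_CAPACITY_SHIFT;
	uint64_t scaled_mcnt, freq_scale;

	/* Stand-in for check_shl_overflow(): would the shift lose bits? */
	if (acnt > (UINT64_MAX >> shift))
		return SCHED_CAPACITY_SCALE;
	acnt <<= shift;

	/* Stand-in for check_mul_overflow(); a zero divisor is rejected too. */
	if (__builtin_mul_overflow(mcnt, freq_ratio, &scaled_mcnt) ||
	    !scaled_mcnt)
		return SCHED_CAPACITY_SCALE;

	freq_scale = acnt / scaled_mcnt;

	/* A quotient of 0 (underflow) falls back to full capacity;
	 * anything above the maximum is clamped to it. */
	if (!freq_scale || freq_scale > SCHED_CAPACITY_SCALE)
		return SCHED_CAPACITY_SCALE;

	return freq_scale;
}
```

With assumed inputs acnt=1000, mcnt=2000000 and freq_ratio=1024 the
quotient rounds down to 0, so the sketch returns full capacity for
that one sample, and the next sample is evaluated normally.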
---
 arch/x86/kernel/cpu/aperfmperf.c | 35 +++++++++-----------------------
 1 file changed, 10 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmp=
erf.c
index 7ffc78d5ebf2..99ebbda53d10 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -334,23 +334,6 @@ static void __init bp_init_freq_invariance(void)
 	}
 }
=20
-static void disable_freq_invariance_workfn(struct work_struct *work)
-{
-	int cpu;
-
-	static_branch_disable(&arch_scale_freq_key);
-
-	/*
-	 * Set arch_freq_scale to a default value on all cpus
-	 * This negates the effect of scaling
-	 */
-	for_each_possible_cpu(cpu)
-		per_cpu(arch_freq_scale, cpu) =3D SCHED_CAPACITY_SCALE;
-}
-
-static DECLARE_WORK(disable_freq_invariance_work,
-		    disable_freq_invariance_workfn);
-
 DEFINE_PER_CPU(unsigned long, arch_freq_scale) =3D SCHED_CAPACITY_SCALE;
 EXPORT_PER_CPU_SYMBOL_GPL(arch_freq_scale);
=20
@@ -441,8 +424,14 @@ static void scale_freq_tick(u64 acnt, u64 mcnt)
 	if (!arch_scale_freq_invariant())
 		return;
=20
+	/*
+	 * On any over/underflow just ignore the sample. It could
+	 * be due to an unlucky NMI or similar between the
+	 * APERF and MPERF reads.
+	 */
+	freq_scale =3D SCHED_CAPACITY_SCALE;
 	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
-		goto error;
+		goto out;
=20
 	if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
 		freq_ratio =3D READ_ONCE(this_cpu_ptr(arch_cpu_scale)->freq_ratio);
@@ -450,21 +439,17 @@ static void scale_freq_tick(u64 acnt, u64 mcnt)
 		freq_ratio =3D arch_max_freq_ratio;
=20
 	if (check_mul_overflow(mcnt, freq_ratio, &mcnt) || !mcnt)
-		goto error;
+		goto out;
=20
 	freq_scale =3D div64_u64(acnt, mcnt);
 	if (!freq_scale)
-		goto error;
+		goto out;
=20
 	if (freq_scale > SCHED_CAPACITY_SCALE)
 		freq_scale =3D SCHED_CAPACITY_SCALE;
=20
+out:
 	this_cpu_write(arch_freq_scale, freq_scale);
-	return;
-
-error:
-	pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
-	schedule_work(&disable_freq_invariance_work);
 }
 #else
 static inline void bp_init_freq_invariance(void) { }
--=20
2.53.0