From: Andi Kleen
To: linux-kernel@vger.kernel.org
Cc: x86@kernel.org, Andi Kleen, ggherdovich@suse.cz, Peter Zijlstra, rafael.j.wysocki@intel.com
Subject: [PATCH] x86/aperfmperf: Don't disable scheduler APERF/MPERF on bad samples
Date: Thu, 4 Dec 2025 10:09:14 -0800
Message-ID: <20251204180914.1855553-1-ak@linux.intel.com>

The APERF and MPERF MSRs are read together, and the ratio between the
two is used to scale the scheduler capacity with frequency.

Since e2b0d619b400, any overflow or underflow in the APERF/MPERF
computation disables the sampling completely, on the assumption that
there is a problem with the hardware.

However, this can happen without any malfunction when there is a long
enough interruption between the two MSR reads, for example due to an
unlucky NMI, SMI, or other system event that causes a delay. We saw it
when a delay resulted in acnt_delta << mcnt_delta (about 4k for
acnt_delta and 2M for mcnt_delta).

In this case the ratio computation underflows, which is detected, but
APERF/MPERF usage is then incorrectly disabled forever.

Remove the code that completely disables APERF/MPERF on a bad sample.
Instead, when any over/underflow happens, use the fallback full
capacity.

In theory we could have a threshold for disabling, but since delays can
happen randomly it is unclear what a good threshold would be. If the
hardware is truly broken, this costs a few more cycles to read the
bogus samples, but they will all still be rejected.

Cc: ggherdovich@suse.cz
Cc: Peter Zijlstra
Cc: rafael.j.wysocki@intel.com
Fixes: e2b0d619b400 ("x86, sched: check for counters overflow ...")
Signed-off-by: Andi Kleen
---
 arch/x86/kernel/cpu/aperfmperf.c | 36 ++++++++++----------------------
 1 file changed, 11 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index a315b0627dfb..7f4210e1082b 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -330,23 +330,6 @@ static void __init bp_init_freq_invariance(void)
 	}
 }
 
-static void disable_freq_invariance_workfn(struct work_struct *work)
-{
-	int cpu;
-
-	static_branch_disable(&arch_scale_freq_key);
-
-	/*
-	 * Set arch_freq_scale to a default value on all cpus
-	 * This negates the effect of scaling
-	 */
-	for_each_possible_cpu(cpu)
-		per_cpu(arch_freq_scale, cpu) = SCHED_CAPACITY_SCALE;
-}
-
-static DECLARE_WORK(disable_freq_invariance_work,
-		    disable_freq_invariance_workfn);
-
 DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
 EXPORT_PER_CPU_SYMBOL_GPL(arch_freq_scale);
 
@@ -437,30 +420,33 @@ static void scale_freq_tick(u64 acnt, u64 mcnt)
 	if (!arch_scale_freq_invariant())
 		return;
 
+	/*
+	 * On any over/underflow just ignore the sample. It could
+	 * be due to an unlucky NMI or similar between the
+	 * APERF and MPERF reads.
+	 */
 	if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
-		goto error;
+		goto out;
 
 	if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
 		freq_ratio = READ_ONCE(this_cpu_ptr(arch_cpu_scale)->freq_ratio);
 	else
 		freq_ratio = arch_max_freq_ratio;
 
+	freq_scale = SCHED_CAPACITY_SCALE;
+
 	if (check_mul_overflow(mcnt, freq_ratio, &mcnt) || !mcnt)
-		goto error;
+		goto out;
 
 	freq_scale = div64_u64(acnt, mcnt);
 	if (!freq_scale)
-		goto error;
+		goto out;
 
 	if (freq_scale > SCHED_CAPACITY_SCALE)
 		freq_scale = SCHED_CAPACITY_SCALE;
 
+out:
 	this_cpu_write(arch_freq_scale, freq_scale);
-	return;
-
-error:
-	pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
-	schedule_work(&disable_freq_invariance_work);
 }
 #else
 static inline void bp_init_freq_invariance(void) { }
-- 
2.51.1
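
For anyone who wants to try the arithmetic outside the kernel, here is a
minimal userspace sketch of the ratio computation with the "fall back to
full capacity" behaviour described in the commit message. It is not the
kernel code: scale_freq_sample() and shl_overflows() are hypothetical
names, the SCHED_CAPACITY_SHIFT value of 10 is assumed, freq_ratio is
passed in as a plain parameter, and __builtin_mul_overflow() (GCC/Clang)
stands in for check_mul_overflow(). The sketch sets the fallback value
up front so that every early exit, including the shift overflow and the
zero quotient, yields SCHED_CAPACITY_SCALE.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for this sketch; the kernel defines its own. */
#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1ULL << SCHED_CAPACITY_SHIFT)

/* Hypothetical stand-in for check_shl_overflow(): true if bits are lost. */
static bool shl_overflows(uint64_t v, unsigned int shift, uint64_t *out)
{
	*out = v << shift;
	return (*out >> shift) != v;
}

/*
 * Hypothetical helper: compute the frequency scale for one
 * (acnt, mcnt) delta pair. Any bad sample yields the fallback
 * SCHED_CAPACITY_SCALE instead of disabling anything.
 */
static uint64_t scale_freq_sample(uint64_t acnt, uint64_t mcnt,
				  uint64_t freq_ratio)
{
	const uint64_t fallback = SCHED_CAPACITY_SCALE;
	uint64_t num, den, scale;

	/* Numerator: acnt << (2 * SHIFT); reject if high bits are lost. */
	if (shl_overflows(acnt, 2 * SCHED_CAPACITY_SHIFT, &num))
		return fallback;

	/* Denominator: mcnt * freq_ratio; reject on overflow or zero. */
	if (__builtin_mul_overflow(mcnt, freq_ratio, &den) || den == 0)
		return fallback;

	scale = num / den;

	/* Underflow, e.g. acnt_delta much smaller than mcnt_delta. */
	if (scale == 0)
		return fallback;

	/* Clamp to full capacity. */
	return scale > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : scale;
}
```

With an assumed freq_ratio of 1024, scale_freq_sample(500000, 1000000, 1024)
gives 512 (half frequency), while a delayed sample such as
scale_freq_sample(1000, 2000000, 1024) underflows to zero and is replaced
by the fallback 1024.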