From: Rik van Riel
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@linux.intel.com, peterz@infradead.org, kernel-team@meta.com, bp@alien8.de, Yu-cheng Yu, Rik van Riel
Subject: [RFC v5 1/8] x86/mm: Introduce Remote Action Request MSRs
Date: Fri, 21 Nov 2025 13:54:22 -0500
Message-ID: <20251121185530.21876-2-riel@surriel.com>
In-Reply-To: <20251121185530.21876-1-riel@surriel.com>
References: <20251121185530.21876-1-riel@surriel.com>

From: Yu-cheng Yu

Remote Action Request (RAR) is a model-specific feature to speed up
inter-processor operations by moving parts of those operations from
software to hardware. The current RAR implementation handles TLB
flushes and MSR writes.

This patch introduces the RAR MSRs. RAR itself is introduced in later
patches.
There are five RAR MSRs:

 MSR_CORE_CAPABILITIES
 MSR_IA32_RAR_CTRL
 MSR_IA32_RAR_ACT_VEC
 MSR_IA32_RAR_PAYLOAD_BASE
 MSR_IA32_RAR_INFO

Signed-off-by: Yu-cheng Yu
Signed-off-by: Rik van Riel
---
 arch/x86/include/asm/msr-index.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3d0a0950d20a..69d9e96e8324 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -110,6 +110,8 @@
 
 /* Abbreviated from Intel SDM name IA32_CORE_CAPABILITIES */
 #define MSR_IA32_CORE_CAPS			0x000000cf
+#define MSR_IA32_CORE_CAPS_RAR_BIT		1
+#define MSR_IA32_CORE_CAPS_RAR			BIT(MSR_IA32_CORE_CAPS_RAR_BIT)
 #define MSR_IA32_CORE_CAPS_INTEGRITY_CAPS_BIT	2
 #define MSR_IA32_CORE_CAPS_INTEGRITY_CAPS	BIT(MSR_IA32_CORE_CAPS_INTEGRITY_CAPS_BIT)
 #define MSR_IA32_CORE_CAPS_SPLIT_LOCK_DETECT_BIT	5
@@ -122,6 +124,17 @@
 #define SNB_C3_AUTO_UNDEMOTE		(1UL << 27)
 #define SNB_C1_AUTO_UNDEMOTE		(1UL << 28)
 
+/*
+ * Remote Action Requests (RAR) MSRs
+ */
+#define MSR_IA32_RAR_CTRL		0x000000ed
+#define MSR_IA32_RAR_ACT_VEC		0x000000ee
+#define MSR_IA32_RAR_PAYLOAD_BASE	0x000000ef
+#define MSR_IA32_RAR_INFO		0x000000f0
+
+#define RAR_CTRL_ENABLE			BIT(31)
+#define RAR_CTRL_IGNORE_IF		BIT(30)
+
 #define MSR_MTRRcap			0x000000fe
 
 #define MSR_IA32_ARCH_CAPABILITIES	0x0000010a
-- 
2.51.1
From: Rik van Riel
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@linux.intel.com, peterz@infradead.org, kernel-team@meta.com, bp@alien8.de, Rik van Riel
Subject: [RFC v5 2/8] x86/mm: enable BROADCAST_TLB_FLUSH on Intel, too
Date: Fri, 21 Nov 2025 13:54:23 -0500
Message-ID: <20251121185530.21876-3-riel@surriel.com>
In-Reply-To: <20251121185530.21876-1-riel@surriel.com>
References: <20251121185530.21876-1-riel@surriel.com>

Much of the code for Intel RAR and AMD INVLPGB is shared. Place both
under the same config option.
Signed-off-by: Rik van Riel
---
 arch/x86/Kconfig.cpu | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index f928cf6e3252..ab763f69f54d 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -360,7 +360,7 @@ menuconfig PROCESSOR_SELECT
 
 config BROADCAST_TLB_FLUSH
 	def_bool y
-	depends on CPU_SUP_AMD && 64BIT
+	depends on (CPU_SUP_AMD || CPU_SUP_INTEL) && 64BIT && SMP
 
 config CPU_SUP_INTEL
 	default y
-- 
2.51.1
From: Rik van Riel
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@linux.intel.com, peterz@infradead.org, kernel-team@meta.com, bp@alien8.de, Yu-cheng Yu, Rik van Riel
Subject: [RFC v5 3/8] x86/mm: Introduce X86_FEATURE_RAR
Date: Fri, 21 Nov 2025 13:54:24 -0500
Message-ID: <20251121185530.21876-4-riel@surriel.com>
In-Reply-To: <20251121185530.21876-1-riel@surriel.com>
References: <20251121185530.21876-1-riel@surriel.com>

From: Yu-cheng Yu

Introduce X86_FEATURE_RAR and enumeration of the feature.
[riel: moved initialization to intel.c and disabling to Kconfig.cpufeatures]
Signed-off-by: Yu-cheng Yu
Signed-off-by: Rik van Riel
---
 arch/x86/Kconfig.cpufeatures       | 4 ++++
 arch/x86/include/asm/cpufeatures.h | 2 +-
 arch/x86/kernel/cpu/intel.c        | 9 +++++++++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig.cpufeatures b/arch/x86/Kconfig.cpufeatures
index 733d5aff2456..6389f0a617c8 100644
--- a/arch/x86/Kconfig.cpufeatures
+++ b/arch/x86/Kconfig.cpufeatures
@@ -199,3 +199,7 @@ config X86_DISABLED_FEATURE_SEV_SNP
 config X86_DISABLED_FEATURE_INVLPGB
 	def_bool y
 	depends on !BROADCAST_TLB_FLUSH
+
+config X86_DISABLED_FEATURE_RAR
+	def_bool y
+	depends on !BROADCAST_TLB_FLUSH
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index b6472e252491..5cb57a820198 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -76,7 +76,7 @@
 #define X86_FEATURE_K8			( 3*32+ 4) /* Opteron, Athlon64 */
 #define X86_FEATURE_ZEN5		( 3*32+ 5) /* CPU based on Zen5 microarchitecture */
 #define X86_FEATURE_ZEN6		( 3*32+ 6) /* CPU based on Zen6 microarchitecture */
-/* Free					( 3*32+ 7) */
+#define X86_FEATURE_RAR			( 3*32+ 7) /* Intel Remote Action Request */
 #define X86_FEATURE_CONSTANT_TSC	( 3*32+ 8) /* "constant_tsc" TSC ticks at a constant rate */
 #define X86_FEATURE_UP			( 3*32+ 9) /* "up" SMP kernel running on UP */
 #define X86_FEATURE_ART			( 3*32+10) /* "art" Always running timer (ART) */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 98ae4c37c93e..b53bf3452d6a 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -719,6 +719,15 @@ static void intel_detect_tlb(struct cpuinfo_x86 *c)
 	cpuid_leaf_0x2(&regs);
 	for_each_cpuid_0x2_desc(regs, ptr, desc)
 		intel_tlb_lookup(desc);
+
+	if (cpu_has(c, X86_FEATURE_CORE_CAPABILITIES)) {
+		u64 msr;
+
+		rdmsrl(MSR_IA32_CORE_CAPS, msr);
+
+		if (msr & MSR_IA32_CORE_CAPS_RAR)
+			setup_force_cpu_cap(X86_FEATURE_RAR);
+	}
 }
 
 static const struct cpu_dev intel_cpu_dev = {
-- 
2.51.1
From: Rik van Riel
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@linux.intel.com, peterz@infradead.org, kernel-team@meta.com, bp@alien8.de, Yu-cheng Yu, Rik van Riel
Subject: [RFC v5 4/8] x86/apic: Introduce Remote Action Request Operations
Date: Fri, 21 Nov 2025 13:54:25 -0500
Message-ID: <20251121185530.21876-5-riel@surriel.com>
In-Reply-To: <20251121185530.21876-1-riel@surriel.com>
References: <20251121185530.21876-1-riel@surriel.com>

From: Yu-cheng Yu

RAR TLB flushing is started by sending a command to the APIC.
This patch adds the Remote Action Request commands.

Because RAR_VECTOR is hardcoded at 0xe0, POSTED_MSI_NOTIFICATION_VECTOR
has to be lowered from 0xeb to 0xdf, reducing the number of available
device vectors by 12.
[riel: refactor after 6 years of changes, lower POSTED_MSI_NOTIFICATION_VECTOR]
Signed-off-by: Yu-cheng Yu
Signed-off-by: Rik van Riel
---
 arch/x86/include/asm/apicdef.h     | 1 +
 arch/x86/include/asm/irq_vectors.h | 7 ++++++-
 arch/x86/include/asm/smp.h         | 1 +
 arch/x86/kernel/apic/ipi.c         | 5 +++++
 arch/x86/kernel/apic/local.h       | 3 +++
 5 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/apicdef.h b/arch/x86/include/asm/apicdef.h
index be39a543fbe5..1b0bf20f0c7d 100644
--- a/arch/x86/include/asm/apicdef.h
+++ b/arch/x86/include/asm/apicdef.h
@@ -92,6 +92,7 @@
 #define	APIC_DM_LOWEST		0x00100
 #define	APIC_DM_SMI		0x00200
 #define	APIC_DM_REMRD		0x00300
+#define	APIC_DM_RAR		0x00300
 #define	APIC_DM_NMI		0x00400
 #define	APIC_DM_INIT		0x00500
 #define	APIC_DM_STARTUP		0x00600
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 47051871b436..52a0cf56562a 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -97,11 +97,16 @@
 
 #define LOCAL_TIMER_VECTOR	0xec
 
+/*
+ * RAR (remote action request) TLB flush
+ */
+#define RAR_VECTOR		0xe0
+
 /*
  * Posted interrupt notification vector for all device MSIs delivered to
  * the host kernel.
 */
-#define POSTED_MSI_NOTIFICATION_VECTOR	0xeb
+#define POSTED_MSI_NOTIFICATION_VECTOR	0xdf
 
 #define NR_VECTORS			256
 
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 84951572ab81..d3ef57e60360 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -123,6 +123,7 @@ void __noreturn mwait_play_dead(unsigned int eax_hint);
 void native_smp_send_reschedule(int cpu);
 void native_send_call_func_ipi(const struct cpumask *mask);
 void native_send_call_func_single_ipi(int cpu);
+void native_send_rar_ipi(const struct cpumask *mask);
 
 asmlinkage __visible void smp_reboot_interrupt(void);
 __visible void smp_reschedule_interrupt(struct pt_regs *regs);
diff --git a/arch/x86/kernel/apic/ipi.c b/arch/x86/kernel/apic/ipi.c
index 98a57cb4aa86..9983c42619ef 100644
--- a/arch/x86/kernel/apic/ipi.c
+++ b/arch/x86/kernel/apic/ipi.c
@@ -106,6 +106,11 @@ void apic_send_nmi_to_offline_cpu(unsigned int cpu)
 		return;
 	apic->send_IPI(cpu, NMI_VECTOR);
 }
+
+void native_send_rar_ipi(const struct cpumask *mask)
+{
+	__apic_send_IPI_mask(mask, RAR_VECTOR);
+}
 #endif /* CONFIG_SMP */
 
 static inline int __prepare_ICR2(unsigned int mask)
diff --git a/arch/x86/kernel/apic/local.h b/arch/x86/kernel/apic/local.h
index bdcf609eb283..833669174267 100644
--- a/arch/x86/kernel/apic/local.h
+++ b/arch/x86/kernel/apic/local.h
@@ -38,6 +38,9 @@ static inline unsigned int __prepare_ICR(unsigned int shortcut, int vector,
 	case NMI_VECTOR:
 		icr |= APIC_DM_NMI;
 		break;
+	case RAR_VECTOR:
+		icr |= APIC_DM_RAR;
+		break;
 	}
 	return icr;
 }
-- 
2.51.1
From: Rik van Riel
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@linux.intel.com, peterz@infradead.org, kernel-team@meta.com, bp@alien8.de, Yu-cheng Yu, Rik van Riel
Subject: [RFC v5 5/8] x86/mm: Introduce Remote Action Request
Date: Fri, 21 Nov 2025 13:54:26 -0500
Message-ID: <20251121185530.21876-6-riel@surriel.com>
In-Reply-To: <20251121185530.21876-1-riel@surriel.com>
References: <20251121185530.21876-1-riel@surriel.com>

From: Yu-cheng Yu

Remote Action Request (RAR) is a TLB flushing broadcast facility.
To start a TLB flush, the initiator CPU creates a RAR payload and
sends a command to the APIC. The receiving CPUs automatically flush
TLBs as specified in the payload, without the kernel's involvement.
[ riel: add pcid parameter to smp_call_rar_many so other mms can be flushed ]
Signed-off-by: Yu-cheng Yu
Signed-off-by: Rik van Riel
---
 arch/x86/include/asm/rar.h  |  76 ++++++++++++
 arch/x86/kernel/cpu/intel.c |   8 +-
 arch/x86/mm/Makefile        |   1 +
 arch/x86/mm/rar.c           | 236 ++++++++++++++++++++++++++++++++++++
 4 files changed, 320 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/include/asm/rar.h
 create mode 100644 arch/x86/mm/rar.c

diff --git a/arch/x86/include/asm/rar.h b/arch/x86/include/asm/rar.h
new file mode 100644
index 000000000000..c875b9e9c509
--- /dev/null
+++ b/arch/x86/include/asm/rar.h
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_RAR_H
+#define _ASM_X86_RAR_H
+
+/*
+ * RAR payload types
+ */
+#define RAR_TYPE_INVPG			0
+#define RAR_TYPE_INVPG_NO_CR3		1
+#define RAR_TYPE_INVPCID		2
+#define RAR_TYPE_INVEPT			3
+#define RAR_TYPE_INVVPID		4
+#define RAR_TYPE_WRMSR			5
+
+/*
+ * Subtypes for RAR_TYPE_INVLPG
+ */
+#define RAR_INVPG_ADDR			0 /* address specific */
+#define RAR_INVPG_ALL			2 /* all, include global */
+#define RAR_INVPG_ALL_NO_GLOBAL		3 /* all, exclude global */
+
+/*
+ * Subtypes for RAR_TYPE_INVPCID
+ */
+#define RAR_INVPCID_ADDR		0 /* address specific */
+#define RAR_INVPCID_PCID		1 /* all of PCID */
+#define RAR_INVPCID_ALL			2 /* all, include global */
+#define RAR_INVPCID_ALL_NO_GLOBAL	3 /* all, exclude global */
+
+/*
+ * Page size for RAR_TYPE_INVLPG
+ */
+#define RAR_INVLPG_PAGE_SIZE_4K		0
+#define RAR_INVLPG_PAGE_SIZE_2M		1
+#define RAR_INVLPG_PAGE_SIZE_1G		2
+
+/*
+ * Max number of pages per payload
+ */
+#define RAR_INVLPG_MAX_PAGES 63
+
+struct rar_payload {
+	u64 for_sw		: 8;
+	u64 type		: 8;
+	u64 must_be_zero_1	: 16;
+	u64 subtype		: 3;
+	u64 page_size		: 2;
+	u64 num_pages		: 6;
+	u64 must_be_zero_2	: 21;
+
+	u64 must_be_zero_3;
+
+	/*
+	 * Starting address
+	 */
+	union {
+		u64 initiator_cr3;
+		struct {
+			u64 pcid	: 12;
+			u64 ignored	: 52;
+		};
+	};
+	u64 linear_address;
+
+	/*
+	 * Padding
+	 */
+	u64 padding[4];
+};
+
+void rar_cpu_init(void);
+void rar_boot_cpu_init(void);
+void smp_call_rar_many(const struct cpumask *mask, u16 pcid,
+		       unsigned long start, unsigned long end);
+
+#endif /* _ASM_X86_RAR_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index b53bf3452d6a..032e0c840537 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include <asm/rar.h>
 #include
 #include
 #include
@@ -624,6 +625,9 @@ static void init_intel(struct cpuinfo_x86 *c)
 	split_lock_init();
 
 	intel_init_thermal(c);
+
+	if (cpu_feature_enabled(X86_FEATURE_RAR))
+		rar_cpu_init();
 }
 
 #ifdef CONFIG_X86_32
@@ -725,8 +729,10 @@ static void intel_detect_tlb(struct cpuinfo_x86 *c)
 
 		rdmsrl(MSR_IA32_CORE_CAPS, msr);
 
-		if (msr & MSR_IA32_CORE_CAPS_RAR)
+		if (msr & MSR_IA32_CORE_CAPS_RAR) {
 			setup_force_cpu_cap(X86_FEATURE_RAR);
+			rar_boot_cpu_init();
+		}
 	}
 }
 
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5b9908f13dcf..f36fc99e8b10 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,7 @@ obj-$(CONFIG_ACPI_NUMA)	+= srat.o
 obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
 obj-$(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION)	+= pti.o
+obj-$(CONFIG_BROADCAST_TLB_FLUSH)		+= rar.o
 
 obj-$(CONFIG_X86_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_amd.o
diff --git a/arch/x86/mm/rar.c b/arch/x86/mm/rar.c
new file mode 100644
index 000000000000..76959782fb03
--- /dev/null
+++ b/arch/x86/mm/rar.c
@@ -0,0 +1,236 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * RAR TLB shootdown
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+static DEFINE_PER_CPU(struct cpumask, rar_cpu_mask);
+
+#define RAR_SUCCESS 0x00
+#define RAR_PENDING 0x01
+#define RAR_FAILURE 0x80
+
+#define RAR_MAX_PAYLOADS 64UL
+
+/* How many RAR payloads are supported by this CPU */
+static int rar_max_payloads __ro_after_init = RAR_MAX_PAYLOADS;
+
+/*
+ * RAR payloads telling CPUs what to do. This table is shared between
+ * all CPUs; it is possible to have multiple payload tables shared between
+ * different subsets of CPUs, but that adds a lot of complexity.
+ */
+static struct rar_payload rar_payload[RAR_MAX_PAYLOADS] __page_aligned_bss;
+
+/*
+ * Reduce contention for the RAR payloads by having a small number of
+ * CPUs share a RAR payload entry, instead of a free for all with all CPUs.
+ */
+struct rar_lock {
+	union {
+		raw_spinlock_t lock;
+		char __padding[SMP_CACHE_BYTES];
+	};
+};
+
+static struct rar_lock rar_locks[RAR_MAX_PAYLOADS] __cacheline_aligned;
+
+/*
+ * The action vector tells each CPU which payload table entries
+ * have work for that CPU.
+ */
+static DEFINE_PER_CPU_ALIGNED(u8[RAR_MAX_PAYLOADS], rar_action);
+
+/*
+ * TODO: group CPUs together based on locality in the system instead
+ * of CPU number, to further reduce the cost of contention.
+ */
+static int cpu_rar_payload_number(void)
+{
+	int cpu = raw_smp_processor_id();
+	return cpu % rar_max_payloads;
+}
+
+static int get_payload_slot(void)
+{
+	int payload_nr = cpu_rar_payload_number();
+	raw_spin_lock(&rar_locks[payload_nr].lock);
+	return payload_nr;
+}
+
+static void free_payload_slot(unsigned long payload_nr)
+{
+	raw_spin_unlock(&rar_locks[payload_nr].lock);
+}
+
+static void set_payload(struct rar_payload *p, u16 pcid, unsigned long start,
+			long pages)
+{
+	p->must_be_zero_1 = 0;
+	p->must_be_zero_2 = 0;
+	p->must_be_zero_3 = 0;
+	p->page_size = RAR_INVLPG_PAGE_SIZE_4K;
+	p->type = RAR_TYPE_INVPCID;
+	p->pcid = pcid;
+	p->linear_address = start;
+
+	if (pcid) {
+		/* RAR invalidation of the mapping of a specific process. */
+		if (pages < RAR_INVLPG_MAX_PAGES) {
+			p->num_pages = pages;
+			p->subtype = RAR_INVPCID_ADDR;
+		} else {
+			p->subtype = RAR_INVPCID_PCID;
+		}
+	} else {
+		/*
+		 * Unfortunately RAR_INVPCID_ADDR excludes global translations.
+ * Always do a full flush for kernel invalidations. + */ + p->subtype =3D RAR_INVPCID_ALL; + } + + /* Ensure all writes are visible before the action entry is set. */ + smp_wmb(); +} + +static void set_action_entry(unsigned long payload_nr, int target_cpu) +{ + u8 *bitmap =3D per_cpu(rar_action, target_cpu); + + /* + * Given a remote CPU, "arm" its action vector to ensure it handles + * the request at payload_nr when it receives a RAR signal. + * The remote CPU will overwrite RAR_PENDING when it handles + * the request. + */ + WRITE_ONCE(bitmap[payload_nr], RAR_PENDING); +} + +static void wait_for_action_done(unsigned long payload_nr, int target_cpu) +{ + u8 status; + u8 *rar_actions =3D per_cpu(rar_action, target_cpu); + + status =3D READ_ONCE(rar_actions[payload_nr]); + + while (status =3D=3D RAR_PENDING) { + cpu_relax(); + status =3D READ_ONCE(rar_actions[payload_nr]); + } + + WARN_ON_ONCE(rar_actions[payload_nr] !=3D RAR_SUCCESS); +} + +void rar_cpu_init(void) +{ + u8 *bitmap; + u64 r; + + /* Check if this CPU was already initialized. */ + rdmsrl(MSR_IA32_RAR_PAYLOAD_BASE, r); + if (r =3D=3D (u64)virt_to_phys(rar_payload)) + return; + + bitmap =3D this_cpu_ptr(rar_action); + memset(bitmap, 0, RAR_MAX_PAYLOADS); + wrmsrl(MSR_IA32_RAR_ACT_VEC, (u64)virt_to_phys(bitmap)); + wrmsrl(MSR_IA32_RAR_PAYLOAD_BASE, (u64)virt_to_phys(rar_payload)); + + /* + * Allow RAR events to be processed while interrupts are disabled on + * a target CPU. This prevents "pileups" where many CPUs are waiting + * on one CPU that has IRQs blocked for too long, and should reduce + * contention on the rar_payload table. + */ + wrmsrl(MSR_IA32_RAR_CTRL, RAR_CTRL_ENABLE | RAR_CTRL_IGNORE_IF); +} + +void rar_boot_cpu_init(void) +{ + int max_payloads; + u64 r; + + /* The MSR contains N defining the max [0-N] rar payload slots. */ + rdmsrl(MSR_IA32_RAR_INFO, r); + max_payloads =3D (r >> 32) + 1; + + /* If this CPU supports less than RAR_MAX_PAYLOADS, lower our limit. 
+	 */
+	if (max_payloads < rar_max_payloads)
+		rar_max_payloads = max_payloads;
+
+	pr_info("RAR: support %d payloads\n", max_payloads);
+
+	for (r = 0; r < rar_max_payloads; r++)
+		rar_locks[r].lock = __RAW_SPIN_LOCK_UNLOCKED(rar_lock);
+
+	/* Initialize the boot CPU early to handle early boot flushes. */
+	rar_cpu_init();
+}
+
+/*
+ * Inspired by smp_call_function_many(), but RAR requires a global payload
+ * table rather than per-CPU payloads in the CSD table, because the action
+ * handler is microcode rather than software.
+ */
+void smp_call_rar_many(const struct cpumask *mask, u16 pcid,
+		       unsigned long start, unsigned long end)
+{
+	unsigned long pages = (end - start + PAGE_SIZE) / PAGE_SIZE;
+	int cpu, this_cpu = smp_processor_id();
+	cpumask_t *dest_mask;
+	unsigned long payload_nr;
+
+	/* Catch the "end - start + PAGE_SIZE" overflow above. */
+	if (end == TLB_FLUSH_ALL)
+		pages = RAR_INVLPG_MAX_PAGES + 1;
+
+	/*
+	 * Can deadlock when called with interrupts disabled.
+	 * Allow CPUs that are not yet online though, as no one else can
+	 * send an smp call function interrupt to this CPU, and as such
+	 * deadlocks can't happen.
+	 */
+	if (cpu_online(this_cpu) && !oops_in_progress && !early_boot_irqs_disabled) {
+		lockdep_assert_irqs_enabled();
+		lockdep_assert_preemption_disabled();
+	}
+
+	/*
+	 * A CPU needs to be initialized in order to process RARs.
+	 * Skip offline CPUs.
+	 *
+	 * TODO:
+	 * - Skip RAR to CPUs that are in a deeper C-state, with an empty TLB
+	 *
+	 * This code cannot use the should_flush_tlb() logic here because
+	 * RAR flushes do not update the tlb_gen, resulting in unnecessary
+	 * flushes at context switch time.
+	 */
+	dest_mask = this_cpu_ptr(&rar_cpu_mask);
+	cpumask_and(dest_mask, mask, cpu_online_mask);
+
+	/* Some callers race with other CPUs changing the passed mask */
+	if (unlikely(!cpumask_weight(dest_mask)))
+		return;
+
+	payload_nr = get_payload_slot();
+	set_payload(&rar_payload[payload_nr], pcid, start, pages);
+
+	for_each_cpu(cpu, dest_mask)
+		set_action_entry(payload_nr, cpu);
+
+	/* Send a message to all CPUs in the map */
+	native_send_rar_ipi(dest_mask);
+
+	for_each_cpu(cpu, dest_mask)
+		wait_for_action_done(payload_nr, cpu);
+
+	free_payload_slot(payload_nr);
+}
+EXPORT_SYMBOL(smp_call_rar_many);
-- 
2.51.1

From nobody Tue Dec 2 01:04:41 2025
From: Rik van Riel
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@linux.intel.com,
	peterz@infradead.org, kernel-team@meta.com, bp@alien8.de,
	Rik van Riel, Rik van Riel
Subject: [RFC v5 6/8] x86/mm: use RAR for kernel TLB flushes
Date: Fri, 21 Nov 2025 13:54:27 -0500
Message-ID: <20251121185530.21876-7-riel@surriel.com>
In-Reply-To: <20251121185530.21876-1-riel@surriel.com>
References: <20251121185530.21876-1-riel@surriel.com>
MIME-Version: 1.0
From: Rik van Riel

Use Intel RAR for kernel TLB flushes, when enabled.

Pass in PCID 0 to smp_call_rar_many() to flush the specified addresses,
regardless of which PCID they might be cached under on any destination CPU.

Signed-off-by: Rik van Riel
---
 arch/x86/mm/tlb.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index f5b93e01e347..19c28386d8de 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include "mm_internal.h"
@@ -1489,6 +1490,17 @@ static void do_flush_tlb_all(void *info)
 	__flush_tlb_all();
 }
 
+static void rar_full_flush(const cpumask_t *cpumask)
+{
+	guard(preempt)();
+	smp_call_rar_many(cpumask, 0, 0, TLB_FLUSH_ALL);
+}
+
+static void rar_flush_all(void)
+{
+	rar_full_flush(cpu_online_mask);
+}
+
 void flush_tlb_all(void)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
@@ -1496,6 +1508,8 @@ void flush_tlb_all(void)
 	/* First try (faster) hardware-assisted TLB invalidation. */
 	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
 		invlpgb_flush_all();
+	else if (cpu_feature_enabled(X86_FEATURE_RAR))
+		rar_flush_all();
 	else
 		/* Fall back to the IPI-based invalidation. */
 		on_each_cpu(do_flush_tlb_all, NULL, 1);
@@ -1525,15 +1539,35 @@ static void do_kernel_range_flush(void *info)
 	struct flush_tlb_info *f = info;
 	unsigned long addr;
 
+	/*
+	 * With PTI, kernel TLB entries in all PCIDs need to be flushed.
+	 * With RAR the PCID space becomes so large, we might as well flush it all.
+	 *
+	 * Either of the two by itself works with targeted flushes.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_RAR) &&
+	    cpu_feature_enabled(X86_FEATURE_PTI)) {
+		invpcid_flush_all();
+		return;
+	}
+
 	/* Flush the range one 'invlpg' at a time */
 	for (addr = f->start; addr < f->end; addr += PAGE_SIZE)
 		flush_tlb_one_kernel(addr);
 }
 
+static void rar_kernel_range_flush(struct flush_tlb_info *info)
+{
+	guard(preempt)();
+	smp_call_rar_many(cpu_online_mask, 0, info->start, info->end);
+}
+
 static void kernel_tlb_flush_all(struct flush_tlb_info *info)
 {
 	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
 		invlpgb_flush_all();
+	else if (cpu_feature_enabled(X86_FEATURE_RAR))
+		rar_flush_all();
 	else
 		on_each_cpu(do_flush_tlb_all, NULL, 1);
 }
@@ -1542,6 +1576,8 @@ static void kernel_tlb_flush_range(struct flush_tlb_info *info)
 {
 	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
 		invlpgb_kernel_range_flush(info);
+	else if (cpu_feature_enabled(X86_FEATURE_RAR))
+		rar_kernel_range_flush(info);
 	else
 		on_each_cpu(do_kernel_range_flush, info, 1);
 }
-- 
2.51.1

From nobody Tue Dec 2 01:04:41 2025
From: Rik van Riel
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@linux.intel.com,
	peterz@infradead.org, kernel-team@meta.com, bp@alien8.de,
	Rik van Riel, Rik van Riel
Subject: [RFC v5 7/8] x86/mm: userspace & pageout flushing using Intel RAR
Date: Fri, 21 Nov 2025 13:54:28 -0500
Message-ID: <20251121185530.21876-8-riel@surriel.com>
In-Reply-To: <20251121185530.21876-1-riel@surriel.com>
References: <20251121185530.21876-1-riel@surriel.com>
MIME-Version: 1.0

From: Rik van Riel

Use Intel RAR to flush userspace mappings.

Because RAR flushes are targeted using a CPU bitmap, the rules are a
little different than for true broadcast TLB invalidation.

With true broadcast TLB invalidation, as done with AMD INVLPGB, a
global ASID always has up-to-date TLB entries on every CPU. The
context switch code never has to flush the TLB when switching to a
global ASID on any CPU with INVLPGB.

With RAR, the TLB mappings for a global ASID are kept up to date only
on CPUs within the mm_cpumask, which lazily follows the threads
around the system. The context switch code does not need to flush the
TLB if the CPU is in the mm_cpumask and the PCID used stays the same.

However, a CPU that falls outside of the mm_cpumask can have out of
date TLB mappings for this task. When switching to that task on a CPU
not in the mm_cpumask, the TLB does need to be flushed.
Signed-off-by: Rik van Riel
---
 arch/x86/include/asm/tlbflush.h |   9 +-
 arch/x86/mm/tlb.c               | 216 ++++++++++++++++++++++++++------
 2 files changed, 181 insertions(+), 44 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 00daedfefc1b..561a38f90588 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -250,7 +250,8 @@ static inline u16 mm_global_asid(struct mm_struct *mm)
 {
 	u16 asid;
 
-	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
+	    !cpu_feature_enabled(X86_FEATURE_RAR))
 		return 0;
 
 	asid = smp_load_acquire(&mm->context.global_asid);
@@ -263,7 +264,8 @@ static inline u16 mm_global_asid(struct mm_struct *mm)
 
 static inline void mm_init_global_asid(struct mm_struct *mm)
 {
-	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB) ||
+	    cpu_feature_enabled(X86_FEATURE_RAR)) {
 		mm->context.global_asid = 0;
 		mm->context.asid_transition = false;
 	}
@@ -287,7 +289,8 @@ static inline void mm_clear_asid_transition(struct mm_struct *mm)
 
 static inline bool mm_in_asid_transition(struct mm_struct *mm)
 {
-	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
+	    !cpu_feature_enabled(X86_FEATURE_RAR))
 		return false;
 
 	return mm && READ_ONCE(mm->context.asid_transition);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 19c28386d8de..f59140f87982 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -223,7 +223,8 @@ struct new_asid {
 	unsigned int need_flush : 1;
 };
 
-static struct new_asid choose_new_asid(struct mm_struct *next, u64 next_tlb_gen)
+static struct new_asid choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
+				       bool new_cpu)
 {
 	struct new_asid ns;
 	u16 asid;
@@ -236,14 +237,22 @@ static struct new_asid choose_new_asid(struct mm_struct *next, u64 next_tlb_gen)
 
 	/*
 	 * TLB consistency for global ASIDs is maintained with
	 * hardware assisted
-	 * remote TLB flushing. Global ASIDs are always up to date.
+	 * remote TLB flushing. Global ASIDs are always up to date with INVLPGB,
+	 * and up to date for CPUs in the mm_cpumask with RAR.
 	 */
-	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB) ||
+	    cpu_feature_enabled(X86_FEATURE_RAR)) {
 		u16 global_asid = mm_global_asid(next);
 
 		if (global_asid) {
 			ns.asid = global_asid;
 			ns.need_flush = 0;
+			/*
+			 * If the CPU fell out of the cpumask, it can be
+			 * out of date with RAR, and should be flushed.
+			 */
+			if (cpu_feature_enabled(X86_FEATURE_RAR))
+				ns.need_flush = new_cpu;
 			return ns;
 		}
 	}
@@ -301,7 +310,14 @@ static void reset_global_asid_space(void)
 {
 	lockdep_assert_held(&global_asid_lock);
 
-	invlpgb_flush_all_nonglobals();
+	/*
+	 * The global flush ensures that a freshly allocated global ASID
+	 * has no entries in any TLB, and can be used immediately.
+	 * With Intel RAR, the TLB may still need to be flushed at context
+	 * switch time when dealing with a CPU that was not in the mm_cpumask
+	 * for the process, and may have missed flushes along the way.
+	 */
+	flush_tlb_all();
 
 	/*
 	 * The TLB flush above makes it safe to re-use the previously
@@ -378,7 +394,7 @@ static void use_global_asid(struct mm_struct *mm)
 {
 	u16 asid;
 
-	guard(raw_spinlock_irqsave)(&global_asid_lock);
+	guard(raw_spinlock)(&global_asid_lock);
 
 	/* This process is already using broadcast TLB invalidation. */
 	if (mm_global_asid(mm))
@@ -404,13 +420,14 @@ static void use_global_asid(struct mm_struct *mm)
 
 void mm_free_global_asid(struct mm_struct *mm)
 {
-	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
+	    !cpu_feature_enabled(X86_FEATURE_RAR))
 		return;
 
 	if (!mm_global_asid(mm))
 		return;
 
-	guard(raw_spinlock_irqsave)(&global_asid_lock);
+	guard(raw_spinlock)(&global_asid_lock);
 
 	/* The global ASID can be re-used only after flush at wrap-around.
	 */
#ifdef CONFIG_BROADCAST_TLB_FLUSH
@@ -428,7 +445,8 @@ static bool mm_needs_global_asid(struct mm_struct *mm, u16 asid)
 {
 	u16 global_asid = mm_global_asid(mm);
 
-	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
+	    !cpu_feature_enabled(X86_FEATURE_RAR))
 		return false;
 
 	/* Process is transitioning to a global ASID */
@@ -446,7 +464,8 @@ static bool mm_needs_global_asid(struct mm_struct *mm, u16 asid)
  */
 static void consider_global_asid(struct mm_struct *mm)
 {
-	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
+	    !cpu_feature_enabled(X86_FEATURE_RAR))
 		return;
 
 	/* Check every once in a while. */
@@ -491,6 +510,7 @@ static void finish_asid_transition(struct flush_tlb_info *info)
 	 * that results in a (harmless) extra IPI.
 	 */
 	if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
+		info->trim_cpumask = true;
 		flush_tlb_multi(mm_cpumask(info->mm), info);
 		return;
 	}
@@ -500,7 +520,7 @@ static void finish_asid_transition(struct flush_tlb_info *info)
 	mm_clear_asid_transition(mm);
 }
 
-static void broadcast_tlb_flush(struct flush_tlb_info *info)
+static void invlpgb_tlb_flush(struct flush_tlb_info *info)
 {
 	bool pmd = info->stride_shift == PMD_SHIFT;
 	unsigned long asid = mm_global_asid(info->mm);
@@ -531,8 +551,6 @@ static void broadcast_tlb_flush(struct flush_tlb_info *info)
 		addr += nr << info->stride_shift;
 	} while (addr < info->end);
 
-	finish_asid_transition(info);
-
 	/* Wait for the INVLPGBs kicked off above to finish.
	 */
 	__tlbsync();
 }
@@ -863,7 +881,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 	/* Check if the current mm is transitioning to a global ASID */
 	if (mm_needs_global_asid(next, prev_asid)) {
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
-		ns = choose_new_asid(next, next_tlb_gen);
+		ns = choose_new_asid(next, next_tlb_gen, true);
 		goto reload_tlb;
 	}
 
@@ -901,6 +919,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		ns.asid = prev_asid;
 		ns.need_flush = true;
 	} else {
+		bool new_cpu = false;
 		/*
 		 * Apply process to process speculation vulnerability
 		 * mitigations if applicable.
@@ -933,22 +952,26 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		 * This way switch_mm() must see the new tlb_gen or
 		 * flush_tlb_mm_range() must see the new loaded_mm, or both.
 		 */
-		if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))
+		if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next))) {
 			cpumask_set_cpu(cpu, mm_cpumask(next));
-		else
+			if (cpu_feature_enabled(X86_FEATURE_RAR))
+				new_cpu = true;
+		} else {
 			smp_mb();
+		}
 
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 
-		ns = choose_new_asid(next, next_tlb_gen);
+		ns = choose_new_asid(next, next_tlb_gen, new_cpu);
 	}
 
reload_tlb:
 	new_lam = mm_lam_cr3_mask(next);
 	if (ns.need_flush) {
-		VM_WARN_ON_ONCE(is_global_asid(ns.asid));
-		this_cpu_write(cpu_tlbstate.ctxs[ns.asid].ctx_id, next->context.ctx_id);
-		this_cpu_write(cpu_tlbstate.ctxs[ns.asid].tlb_gen, next_tlb_gen);
+		if (is_dyn_asid(ns.asid)) {
+			this_cpu_write(cpu_tlbstate.ctxs[ns.asid].ctx_id, next->context.ctx_id);
+			this_cpu_write(cpu_tlbstate.ctxs[ns.asid].tlb_gen, next_tlb_gen);
+		}
 		load_new_mm_cr3(next->pgd, ns.asid, new_lam, true);
 
 		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
@@ -1136,7 +1159,7 @@ static void flush_tlb_func(void *info)
 	const struct flush_tlb_info *f = info;
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-	u64 local_tlb_gen;
+	u64 local_tlb_gen = 0;
 	bool local = smp_processor_id() == f->initiating_cpu;
 	unsigned long nr_invalidate = 0;
 	u64 mm_tlb_gen;
@@ -1159,19 +1182,6 @@ static void flush_tlb_func(void *info)
 	if (unlikely(loaded_mm == &init_mm))
 		return;
 
-	/* Reload the ASID if transitioning into or out of a global ASID */
-	if (mm_needs_global_asid(loaded_mm, loaded_mm_asid)) {
-		switch_mm_irqs_off(NULL, loaded_mm, NULL);
-		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-	}
-
-	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
-	if (is_global_asid(loaded_mm_asid))
-		return;
-
-	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
-		   loaded_mm->context.ctx_id);
-
 	if (this_cpu_read(cpu_tlbstate_shared.is_lazy)) {
 		/*
 		 * We're in lazy mode. We need to at least flush our
@@ -1182,11 +1192,31 @@ static void flush_tlb_func(void *info)
 		 * This should be rare, with native_flush_tlb_multi() skipping
 		 * IPIs to lazy TLB mode CPUs.
 		 */
+		cpumask_clear_cpu(raw_smp_processor_id(), mm_cpumask(loaded_mm));
 		switch_mm_irqs_off(NULL, &init_mm, NULL);
 		return;
 	}
 
-	local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+	/* Reload the ASID if transitioning into or out of a global ASID */
+	if (mm_needs_global_asid(loaded_mm, loaded_mm_asid)) {
+		switch_mm_irqs_off(NULL, loaded_mm, NULL);
+		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	}
+
+	/*
+	 * Broadcast ASIDs are always kept up to date with INVLPGB; with
+	 * Intel RAR, IPI based flushes are used periodically to trim the
+	 * mm_cpumask, and flushes that get here should be processed.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
+	    is_global_asid(loaded_mm_asid))
+		return;
+
+	VM_WARN_ON(is_dyn_asid(loaded_mm_asid) && loaded_mm->context.ctx_id !=
+		   this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id));
+
+	if (is_dyn_asid(loaded_mm_asid))
+		local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
 
 	if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
 		     f->new_tlb_gen <= local_tlb_gen)) {
@@ -1285,7 +1315,8 @@ static void flush_tlb_func(void *info)
 	}
 
 	/* Both paths above update our state to mm_tlb_gen. */
-	this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
+	if (is_dyn_asid(loaded_mm_asid))
+		this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
 
 	/* Tracing is done in a unified manner to reduce the code size */
  done:
@@ -1326,15 +1357,15 @@ static bool should_flush_tlb(int cpu, void *data)
 	if (loaded_mm == info->mm)
 		return true;
 
-	/* In cpumask, but not the loaded mm? Periodically remove by flushing. */
-	if (info->trim_cpumask)
-		return true;
-
 	return false;
 }
 
 static bool should_trim_cpumask(struct mm_struct *mm)
 {
+	/* INVLPGB always goes to all CPUs. No need to trim the mask. */
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB) && mm_global_asid(mm))
+		return false;
+
 	if (time_after(jiffies, READ_ONCE(mm->context.next_trim_cpumask))) {
 		WRITE_ONCE(mm->context.next_trim_cpumask, jiffies + HZ);
 		return true;
@@ -1345,6 +1376,27 @@ static bool should_trim_cpumask(struct mm_struct *mm)
 DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
 EXPORT_PER_CPU_SYMBOL(cpu_tlbstate_shared);
 
+static bool should_flush_all(const struct flush_tlb_info *info)
+{
+	if (info->freed_tables)
+		return true;
+
+	if (info->trim_cpumask)
+		return true;
+
+	/*
+	 * INVLPGB and RAR do not use this code path normally.
+	 * This call cleans up the cpumask or ASID transition.
+	 */
+	if (mm_global_asid(info->mm))
+		return true;
+
+	if (mm_in_asid_transition(info->mm))
+		return true;
+
+	return false;
+}
+
 STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
 					const struct flush_tlb_info *info)
 {
@@ -1370,7 +1422,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
 	 * up on the new contents of what used to be page tables, while
 	 * doing a speculative memory access.
 	 */
-	if (info->freed_tables || mm_in_asid_transition(info->mm))
+	if (should_flush_all(info))
 		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
 	else
 		on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
@@ -1401,6 +1453,74 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct flush_tlb_info, flush_tlb_info);
 static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx);
 #endif
 
+static void trim_cpumask_func(void *data)
+{
+	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+	const struct flush_tlb_info *f = data;
+
+	/*
+	 * Clearing this bit from an IRQ handler synchronizes against
+	 * the bit being set in switch_mm_irqs_off, with IRQs disabled.
+	 */
+	if (f->mm != loaded_mm)
+		cpumask_clear_cpu(raw_smp_processor_id(), mm_cpumask(f->mm));
+}
+
+static bool should_remove_cpu_from_mask(int cpu, void *data)
+{
+	struct mm_struct *loaded_mm = per_cpu(cpu_tlbstate.loaded_mm, cpu);
+	struct flush_tlb_info *info = data;
+
+	if (loaded_mm != info->mm)
+		return true;
+
+	return false;
+}
+
+/* Remove CPUs from the mm_cpumask that are running another mm.
+ */
+static void trim_cpumask(struct flush_tlb_info *info)
+{
+	cpumask_t *cpumask = mm_cpumask(info->mm);
+
+	on_each_cpu_cond_mask(should_remove_cpu_from_mask, trim_cpumask_func,
+			      (void *)info, 1, cpumask);
+}
+
+static void rar_tlb_flush(struct flush_tlb_info *info)
+{
+	unsigned long asid = mm_global_asid(info->mm);
+	cpumask_t *cpumask = mm_cpumask(info->mm);
+	u16 pcid = kern_pcid(asid);
+
+	if (info->trim_cpumask)
+		trim_cpumask(info);
+
+	/* Only the local CPU needs to be flushed? */
+	if (cpumask_equal(cpumask, cpumask_of(raw_smp_processor_id()))) {
+		lockdep_assert_irqs_enabled();
+		local_irq_disable();
+		flush_tlb_func(info);
+		local_irq_enable();
+		return;
+	}
+
+	/* Flush all the CPUs at once with RAR. */
+	if (cpumask_weight(cpumask)) {
+		smp_call_rar_many(mm_cpumask(info->mm), pcid, info->start, info->end);
+		if (cpu_feature_enabled(X86_FEATURE_PTI))
+			smp_call_rar_many(mm_cpumask(info->mm), user_pcid(asid), info->start, info->end);
+	}
+}
+
+static void broadcast_tlb_flush(struct flush_tlb_info *info)
+{
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		invlpgb_tlb_flush(info);
+	else /* Intel RAR */
+		rar_tlb_flush(info);
+
+	finish_asid_transition(info);
+}
+
 static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
 			unsigned long start, unsigned long end,
 			unsigned int stride_shift, bool freed_tables,
@@ -1461,6 +1581,13 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
 				  new_tlb_gen);
 
+	/*
+	 * IPIs and RAR can be targeted to a cpumask. Periodically trim that
+	 * mm_cpumask by sending TLB flush IPIs, even when most TLB flushes
+	 * are done with RAR.
+	 */
+	info->trim_cpumask = should_trim_cpumask(mm);
+
 	/*
 	 * flush_tlb_multi() is not optimized for the common case in which only
 	 * a local TLB flush is needed. Optimize this use-case by calling
Optimize this use-case by calling @@ -1469,7 +1596,6 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigne= d long start, if (mm_global_asid(mm)) { broadcast_tlb_flush(info); } else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) { - info->trim_cpumask =3D should_trim_cpumask(mm); flush_tlb_multi(mm_cpumask(mm), info); consider_global_asid(mm); } else if (mm =3D=3D this_cpu_read(cpu_tlbstate.loaded_mm)) { @@ -1778,6 +1904,14 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_= batch *batch) if (cpu_feature_enabled(X86_FEATURE_INVLPGB) && batch->unmapped_pages) { invlpgb_flush_all_nonglobals(); batch->unmapped_pages =3D false; + } else if (cpu_feature_enabled(X86_FEATURE_RAR) && cpumask_any(&batch->cp= umask) < nr_cpu_ids) { + rar_full_flush(&batch->cpumask); + if (cpumask_test_cpu(cpu, &batch->cpumask)) { + lockdep_assert_irqs_enabled(); + local_irq_disable(); + invpcid_flush_all_nonglobals(); + local_irq_enable(); + } } else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) { flush_tlb_multi(&batch->cpumask, info); } else if (cpumask_test_cpu(cpu, &batch->cpumask)) { --=20 2.51.1 From nobody Tue Dec 2 01:04:41 2025 Received: from shelob.surriel.com (shelob.surriel.com [96.67.55.147]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 83BE62F3600 for ; Fri, 21 Nov 2025 18:55:40 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=96.67.55.147 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1763751344; cv=none; b=f5qUWgtAyBQAvGwYr9KsOrTIwcKcNvzUX2diiEH/GYibjVOjR2eBMZPd+KcTlcxUg5A3NG4+55HO1pYLerLdl3fMkDZA+1RCqyTyZ9mw7VLBwJ2BqI98nenr5svng4ItHFKUpCowwao2wo1Q+sq0bwJEM1d9IRnuCa0ryMDzgvk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1763751344; c=relaxed/simple; bh=a41uQWToJ/q47rPMKsAd8cDSXXtkp9Qj8qZyZehtOdY=; 
From: Rik van Riel
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, x86@kernel.org,
	dave.hansen@linux.intel.com, peterz@infradead.org,
	kernel-team@meta.com, bp@alien8.de, Rik van Riel, Rik van Riel
Subject: [RFC v5 8/8] x86/mm: make RAR invalidation scalable by skipping
 duplicate APIC pokes
Date: Fri, 21 Nov 2025 13:54:29 -0500
Message-ID: <20251121185530.21876-9-riel@surriel.com>
In-Reply-To: <20251121185530.21876-1-riel@surriel.com>
References: <20251121185530.21876-1-riel@surriel.com>
MIME-Version: 1.0

From: Rik van Riel

The naive RAR implementation suffers from heavy contention in
apic_mem_wait_icr_idle() when multiple CPUs send out RAR interrupts
simultaneously.

When a CPU receives a RAR, it scans its action vector, and processes
all the rar_payload entries whose corresponding action vector entry is
set to RAR_PENDING. After processing each payload, it sets the
corresponding action vector entry to RAR_SUCCESS.

That means sending one single RAR to a CPU is enough for that CPU to
process all the pending RAR payloads, and other CPUs do not usually
need to send additional RARs to that CPU.

Optimistically avoid sending RAR interrupts to CPUs that are already
processing a RAR, looping back only if our request went unprocessed
but the remote CPU is no longer processing any RARs.

This change alters the will-it-scale tlb_flush2_threads numbers as
follows (loops/sec):

  threads    IPI flush    naive RAR    optimized RAR
        1        175k         174k             170k
        5        337k         345k             321k
       10        530k         469k             497k
       20        752k         363k             616k
       30        922k         259k             754k
       40       1005k         205k             779k
       50       1073k         164k             883k
       60       1040k         141k             813k

The numbers above are from a 30 core / 60 thread, single socket
Sapphire Rapids system, averaged over 4 runs. This exact same code
reached up to 1200k loops/second on a -tip kernel from a few weeks
ago, and did so reliably across several reboots. I have no good
explanation for the difference.
Signed-off-by: Rik van Riel
---
 arch/x86/mm/rar.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 54 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/rar.c b/arch/x86/mm/rar.c
index 76959782fb03..fd89eaaf4fc1 100644
--- a/arch/x86/mm/rar.c
+++ b/arch/x86/mm/rar.c
@@ -11,6 +11,7 @@
 #include
 
 static DEFINE_PER_CPU(struct cpumask, rar_cpu_mask);
+static DEFINE_PER_CPU(struct cpumask, apic_cpu_mask);
 
 #define RAR_SUCCESS 0x00
 #define RAR_PENDING 0x01
@@ -47,6 +48,32 @@ static struct rar_lock rar_locks[RAR_MAX_PAYLOADS] __cacheline_aligned;
  */
 static DEFINE_PER_CPU_ALIGNED(u8[RAR_MAX_PAYLOADS], rar_action);
 
+/*
+ * Tracks whether a RAR is in flight to this CPU. This is used
+ * to avoid sending another RAR (waiting on the APIC) when the
+ * target CPU is already handling RARs.
+ */
+static DEFINE_PER_CPU(int, rar_pending) = -1;
+
+static bool get_rar_pending(int target_cpu, int this_cpu)
+{
+	int *this_rar_pending = &per_cpu(rar_pending, target_cpu);
+
+	/* Another CPU is flushing this CPU already. */
+	if (*this_rar_pending != -1)
+		return false;
+
+	/* Is this_cpu the one that needs to send a RAR to target_cpu? */
+	return cmpxchg(this_rar_pending, -1, this_cpu) == -1;
+}
+
+static void release_rar_pending(int target_cpu, int this_cpu)
+{
+	/* If this_cpu sent the RAR to target_cpu, clear rar_pending */
+	if (READ_ONCE(per_cpu(rar_pending, target_cpu)) == this_cpu)
+		WRITE_ONCE(per_cpu(rar_pending, target_cpu), -1);
+}
+
 /*
  * TODO: group CPUs together based on locality in the system instead
  * of CPU number, to further reduce the cost of contention.
@@ -113,7 +140,7 @@ static void set_action_entry(unsigned long payload_nr, int target_cpu)
 	WRITE_ONCE(bitmap[payload_nr], RAR_PENDING);
 }
 
-static void wait_for_action_done(unsigned long payload_nr, int target_cpu)
+static u8 wait_for_action_done(unsigned long payload_nr, int target_cpu)
 {
 	u8 status;
 	u8 *rar_actions = per_cpu(rar_action, target_cpu);
@@ -123,9 +150,14 @@ static void wait_for_action_done(unsigned long payload_nr, int target_cpu)
 	while (status == RAR_PENDING) {
 		cpu_relax();
 		status = READ_ONCE(rar_actions[payload_nr]);
+		/* Target CPU is not processing RARs right now. */
+		if (READ_ONCE(per_cpu(rar_pending, target_cpu)) == -1)
+			return status;
 	}
 
 	WARN_ON_ONCE(rar_actions[payload_nr] != RAR_SUCCESS);
+
+	return status;
 }
 
 void rar_cpu_init(void)
@@ -183,7 +215,7 @@ void smp_call_rar_many(const struct cpumask *mask, u16 pcid,
 {
 	unsigned long pages = (end - start + PAGE_SIZE) / PAGE_SIZE;
 	int cpu, this_cpu = smp_processor_id();
-	cpumask_t *dest_mask;
+	cpumask_t *dest_mask, *apic_mask;
 	unsigned long payload_nr;
 
 	/* Catch the "end - start + PAGE_SIZE" overflow above. */
@@ -213,7 +245,9 @@ void smp_call_rar_many(const struct cpumask *mask, u16 pcid,
 	 * flushes at context switch time.
 	 */
 	dest_mask = this_cpu_ptr(&rar_cpu_mask);
+	apic_mask = this_cpu_ptr(&apic_cpu_mask);
 	cpumask_and(dest_mask, mask, cpu_online_mask);
+	cpumask_clear(apic_mask);
 
 	/* Some callers race with other CPUs changing the passed mask */
 	if (unlikely(!cpumask_weight(dest_mask)))
@@ -225,11 +259,25 @@ void smp_call_rar_many(const struct cpumask *mask, u16 pcid,
 	for_each_cpu(cpu, dest_mask)
 		set_action_entry(payload_nr, cpu);
 
-	/* Send a message to all CPUs in the map */
-	native_send_rar_ipi(dest_mask);
+	do {
+		for_each_cpu(cpu, dest_mask) {
+			/* Track the CPUs that have no RAR pending (yet). */
+			if (get_rar_pending(cpu, this_cpu))
+				__cpumask_set_cpu(cpu, apic_mask);
+		}
 
-	for_each_cpu(cpu, dest_mask)
-		wait_for_action_done(payload_nr, cpu);
+		/* Send a message to the CPUs not processing RARs yet */
+		native_send_rar_ipi(apic_mask);
+
+		for_each_cpu(cpu, dest_mask) {
+			u8 status = wait_for_action_done(payload_nr, cpu);
+			if (status == RAR_SUCCESS) {
+				release_rar_pending(cpu, this_cpu);
+				__cpumask_clear_cpu(cpu, dest_mask);
+				__cpumask_clear_cpu(cpu, apic_mask);
+			}
+		}
+	} while (unlikely(cpumask_weight(dest_mask)));
 
 	free_payload_slot(payload_nr);
 }
-- 
2.51.1