From nobody Sat Feb 7 22:34:18 2026
Date: Thu, 19 Sep 2024 14:52:37 -0700
From: Pawan Gupta
To: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
 Dave Hansen, David Kaplan, Daniel Sneddon, x86@kernel.org,
 "H. Peter Anvin", Peter Zijlstra, Josh Poimboeuf, Steven Rostedt
Peter Anvin" , Peter Zijlstra , Josh Poimboeuf , Steven Rostedt Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org Subject: [PATCH RFC 1/2] x86/entry_64: Add a separate unmitigated entry/exit path Message-ID: <20240919-selective-mitigation-v1-1-1846cf41895e@linux.intel.com> X-Mailer: b4 0.14.1 References: <20240919-selective-mitigation-v1-0-1846cf41895e@linux.intel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20240919-selective-mitigation-v1-0-1846cf41895e@linux.intel.com> Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" CPU mitigations are deployed system-wide, but usually not all of the userspace is malicious. Yet, they suffer from the performance impact of the mitigations. This all or nothing approach is due to lack of a way for kernel to know which userspace can be trusted and which cannot. For scenarios where an admin can decide which processes to trust, an interface to tell the kernel to possibly skip the mitigation would be useful. In preparation for kernel to be able to selectively apply mitigation per-process add a separate kernel entry/exit path that skips the mitigations. Originally-by: Josh Poimboeuf Signed-off-by: Pawan Gupta --- arch/x86/entry/entry_64.S | 66 +++++++++++++++++++++++++++++++++++----= ---- arch/x86/include/asm/proto.h | 15 +++++++--- arch/x86/include/asm/ptrace.h | 15 +++++++--- arch/x86/kernel/cpu/common.c | 2 +- 4 files changed, 78 insertions(+), 20 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 1b5be07f8669..eeaf4226d09c 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -84,7 +84,7 @@ * with them due to bugs in both AMD and Intel CPUs. */ =20 -SYM_CODE_START(entry_SYSCALL_64) +.macro __entry_SYSCALL_64 mitigated=3D0 UNWIND_HINT_ENTRY ENDBR =20 @@ -94,7 +94,12 @@ SYM_CODE_START(entry_SYSCALL_64) SWITCH_TO_KERNEL_CR3 scratch_reg=3D%rsp movq PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp =20 -SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL) +.if \mitigated +SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack_mitigated, SYM_L_GLOBAL) +.else +SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack_unmitigated, SYM_L_GLOBAL) +.endif + ANNOTATE_NOENDBR =20 /* Construct struct pt_regs on stack */ @@ -103,7 +108,11 @@ SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLO= BAL) pushq %r11 /* pt_regs->flags */ pushq $__USER_CS /* pt_regs->cs */ pushq %rcx /* pt_regs->ip */ + +.if \mitigated SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL) +.endif + pushq %rax /* pt_regs->orig_ax */ =20 PUSH_AND_CLEAR_REGS rax=3D$-ENOSYS @@ -113,10 +122,12 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L= _GLOBAL) /* Sign extend the lower 32bit as syscall numbers are treated as int */ movslq %eax, %rsi =20 +.if \mitigated /* clobbers %rax, make sure it is after saving the syscall nr */ IBRS_ENTER UNTRAIN_RET CLEAR_BRANCH_HISTORY +.endif =20 call do_syscall_64 /* returns with IRQs disabled */ =20 @@ -127,15 +138,26 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L= _GLOBAL) * In the Xen PV case we must use iret anyway. 
*/ =20 - ALTERNATIVE "testb %al, %al; jz swapgs_restore_regs_and_return_to_usermod= e", \ - "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV +.if \mitigated + push %rax + IBRS_EXIT + CLEAR_CPU_BUFFERS + pop %rax +.endif + + ALTERNATIVE "testb %al, %al; jz swapgs_restore_regs_and_return_to_usermod= e_from_syscall", \ + "jmp swapgs_restore_regs_and_return_to_usermode_from_syscall", X86_FEATU= RE_XENPV =20 /* * We win! This label is here just for ease of understanding * perf profiles. Nothing jumps here. */ -syscall_return_via_sysret: - IBRS_EXIT +.if \mitigated +syscall_return_via_sysret_mitigated: +.else +syscall_return_via_sysret_unmitigated: +.endif + POP_REGS pop_rdi=3D0 =20 /* @@ -159,15 +181,36 @@ syscall_return_via_sysret: =20 popq %rdi popq %rsp -SYM_INNER_LABEL(entry_SYSRETQ_unsafe_stack, SYM_L_GLOBAL) + +.if \mitigated +SYM_INNER_LABEL(entry_SYSRETQ_unsafe_stack_mitigated, SYM_L_GLOBAL) +.else +SYM_INNER_LABEL(entry_SYSRETQ_unsafe_stack_unmitigated, SYM_L_GLOBAL) +.endif + ANNOTATE_NOENDBR swapgs - CLEAR_CPU_BUFFERS + +.if \mitigated +SYM_INNER_LABEL(entry_SYSRETQ_end_mitigated, SYM_L_GLOBAL) +.else +SYM_INNER_LABEL(entry_SYSRETQ_end_unmitigated, SYM_L_GLOBAL) +.endif sysretq -SYM_INNER_LABEL(entry_SYSRETQ_end, SYM_L_GLOBAL) + +.endm /* __entry_SYSCALL_64 */ + +SYM_CODE_START(entry_SYSCALL_64_unmitigated) + __entry_SYSCALL_64 mitigated=3D0 ANNOTATE_NOENDBR int3 -SYM_CODE_END(entry_SYSCALL_64) +SYM_CODE_END(entry_SYSCALL_64_unmitigated) + +SYM_CODE_START(entry_SYSCALL_64_mitigated) + __entry_SYSCALL_64 mitigated=3D1 + ANNOTATE_NOENDBR + int3 +SYM_CODE_END(entry_SYSCALL_64_mitigated) =20 /* * %rdi: prev task @@ -559,6 +602,8 @@ __irqentry_text_end: SYM_CODE_START_LOCAL(common_interrupt_return) SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) IBRS_EXIT + CLEAR_CPU_BUFFERS +SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode_from_syscall, S= YM_L_GLOBAL) #ifdef CONFIG_XEN_PV ALTERNATIVE "", "jmp xenpv_restore_regs_and_return_to_usermode", X86_FEAT= URE_XENPV #endif @@ -573,7 +618,6 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_userm= ode, SYM_L_GLOBAL) =20 .Lswapgs_and_iret: swapgs - CLEAR_CPU_BUFFERS /* Assert that the IRET frame indicates user mode. 
diff --git a/arch/x86/include/asm/proto.h b/arch/x86/include/asm/proto.h
index 484f4f0131a5..0936e0e70659 100644
--- a/arch/x86/include/asm/proto.h
+++ b/arch/x86/include/asm/proto.h
@@ -11,10 +11,17 @@ struct task_struct;
 void syscall_init(void);
 
 #ifdef CONFIG_X86_64
-void entry_SYSCALL_64(void);
-void entry_SYSCALL_64_safe_stack(void);
-void entry_SYSRETQ_unsafe_stack(void);
-void entry_SYSRETQ_end(void);
+
+void entry_SYSCALL_64_unmitigated(void);
+void entry_SYSCALL_64_safe_stack_unmitigated(void);
+void entry_SYSRETQ_unsafe_stack_unmitigated(void);
+void entry_SYSRETQ_end_unmitigated(void);
+
+void entry_SYSCALL_64_mitigated(void);
+void entry_SYSCALL_64_safe_stack_mitigated(void);
+void entry_SYSRETQ_unsafe_stack_mitigated(void);
+void entry_SYSRETQ_end_mitigated(void);
+
 long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2);
 #endif
 
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 5a83fbd9bc0b..74a13c76d241 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -261,11 +261,18 @@ static inline bool any_64bit_mode(struct pt_regs *regs)
 
 static __always_inline bool ip_within_syscall_gap(struct pt_regs *regs)
 {
-	bool ret = (regs->ip >= (unsigned long)entry_SYSCALL_64 &&
-		    regs->ip <  (unsigned long)entry_SYSCALL_64_safe_stack);
+	bool ret = (regs->ip >= (unsigned long)entry_SYSCALL_64_unmitigated &&
+		    regs->ip <  (unsigned long)entry_SYSCALL_64_safe_stack_unmitigated);
+
+	ret = ret || (regs->ip >= (unsigned long)entry_SYSRETQ_unsafe_stack_unmitigated &&
+		      regs->ip <  (unsigned long)entry_SYSRETQ_end_unmitigated);
+
+	ret = ret || (regs->ip >= (unsigned long)entry_SYSCALL_64_mitigated &&
+		      regs->ip <  (unsigned long)entry_SYSCALL_64_safe_stack_mitigated);
+
+	ret = ret || (regs->ip >= (unsigned long)entry_SYSRETQ_unsafe_stack_mitigated &&
+		      regs->ip <  (unsigned long)entry_SYSRETQ_end_mitigated);
 
-	ret = ret || (regs->ip >= (unsigned long)entry_SYSRETQ_unsafe_stack &&
-		      regs->ip <  (unsigned long)entry_SYSRETQ_end);
 #ifdef CONFIG_IA32_EMULATION
 	ret = ret || (regs->ip >= (unsigned long)entry_SYSCALL_compat &&
 		      regs->ip <  (unsigned long)entry_SYSCALL_compat_safe_stack);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index d4e539d4e158..e72c37f3a437 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2026,7 +2026,7 @@ static void wrmsrl_cstar(unsigned long val)
 
 static inline void idt_syscall_init(void)
 {
-	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64_unmitigated);
 
 	if (ia32_enabled()) {
 		wrmsrl_cstar((unsigned long)entry_SYSCALL_compat);

-- 
2.34.1
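[Note between the two patches, illustration only, not part of the series:
after patch 1 boots with MSR_LSTAR pointing at one of the two entry stubs,
and patch 2 flips it at context switch. One hypothetical way to observe
which stub is currently installed is to read MSR_LSTAR (0xc0000082)
through the msr driver and compare the value against the
entry_SYSCALL_64_* symbols in /proc/kallsyms. Requires root and
CONFIG_X86_MSR ("modprobe msr"):]

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		uint64_t lstar;
		int fd = open("/dev/cpu/0/msr", O_RDONLY);

		if (fd < 0) {
			perror("open /dev/cpu/0/msr");
			return 1;
		}
		/* The msr device reads 8 bytes at offset == MSR index. */
		if (pread(fd, &lstar, sizeof(lstar), 0xc0000082) != sizeof(lstar)) {
			perror("pread MSR_LSTAR");
			return 1;
		}
		printf("MSR_LSTAR = %#llx\n", (unsigned long long)lstar);
		close(fd);
		return 0;
	}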
From nobody Sat Feb 7 22:34:18 2026
Date: Thu, 19 Sep 2024 14:52:43 -0700
From: Pawan Gupta
To: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
 Dave Hansen, David Kaplan, Daniel Sneddon, x86@kernel.org,
 "H. Peter Anvin", Peter Zijlstra, Josh Poimboeuf, Steven Rostedt
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: [PATCH RFC 2/2] cpu/bugs: cgroup: Add a cgroup knob to bypass CPU mitigations
Message-ID: <20240919-selective-mitigation-v1-2-1846cf41895e@linux.intel.com>
X-Mailer: b4 0.14.1
References: <20240919-selective-mitigation-v1-0-1846cf41895e@linux.intel.com>
In-Reply-To: <20240919-selective-mitigation-v1-0-1846cf41895e@linux.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

There are cases where an admin wants to bypass CPU mitigations for a
specific workload that they trust. Add a cgroup attribute
"cpu.skip_mitigation" that can only be set by a privileged user. When set,
CPU mitigations are bypassed for all tasks in that cgroup.

Before setting this knob, the admin should be aware of the possible
security risks, such as confused-deputy attacks on trusted interpreters,
JIT engines, etc.

Signed-off-by: Pawan Gupta
---
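Note (illustration only, not part of the patch): exercising the knob from
userspace. This sketch assumes cgroup v2 is mounted at /sys/fs/cgroup and
that the admin has already created a child group, e.g.
"mkdir /sys/fs/cgroup/trusted" (the name is arbitrary; the file is
CFTYPE_NOT_ON_ROOT, so it only exists in non-root groups). Per the write
handler below, the writer must have CAP_SYS_ADMIN in the init cgroup
namespace (else -EPERM), and only "0" or "1" are accepted (else -EINVAL):

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	static int write_file(const char *path, const char *val)
	{
		int fd = open(path, O_WRONLY);

		if (fd < 0 || write(fd, val, strlen(val)) < 0) {
			perror(path);
			return -1;
		}
		close(fd);
		return 0;
	}

	int main(void)
	{
		char pid[32];

		snprintf(pid, sizeof(pid), "%d", getpid());
		/* Move ourselves into the trusted group, then flip the knob. */
		if (write_file("/sys/fs/cgroup/trusted/cgroup.procs", pid))
			return 1;
		return write_file("/sys/fs/cgroup/trusted/cpu.skip_mitigation", "1") ? 1 : 0;
	}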
 arch/x86/include/asm/switch_to.h | 10 ++++++++++
 arch/x86/kernel/cpu/bugs.c       | 21 ++++++++++++++++++++
 include/linux/cgroup-defs.h      |  3 +++
 kernel/cgroup/cgroup.c           | 42 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c              |  2 +-
 5 files changed, 77 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index c3bd0c0758c9..7f32fd139644 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -46,6 +46,16 @@ struct fork_frame {
 	struct pt_regs regs;
 };
 
+extern inline void cpu_mitigation_skip(struct task_struct *prev, struct task_struct *next);
+
+#define prepare_arch_switch prepare_arch_switch
+
+static inline void prepare_arch_switch(struct task_struct *prev,
+				       struct task_struct *next)
+{
+	cpu_mitigation_skip(prev, next);
+}
+
 #define switch_to(prev, next, last)					\
 do {									\
 	((last) = __switch_to_asm((prev), (next)));			\
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 45675da354f3..77eb4f6dc5c9 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -128,6 +128,27 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
 DEFINE_STATIC_KEY_FALSE(mmio_stale_data_clear);
 EXPORT_SYMBOL_GPL(mmio_stale_data_clear);
 
+inline void cpu_mitigation_skip(struct task_struct *prev,
+				struct task_struct *next)
+{
+	bool prev_skip = false, next_skip = false;
+
+	if (prev->mm)
+		prev_skip = task_dfl_cgroup(prev)->cpu_skip_mitigation;
+	if (next->mm)
+		next_skip = task_dfl_cgroup(next)->cpu_skip_mitigation;
+
+	if (!prev_skip && !next_skip)
+		return;
+	if (prev_skip == next_skip)
+		return;
+
+	if (next_skip)
+		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64_unmitigated);
+	else
+		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64_mitigated);
+}
+
 void __init cpu_select_mitigations(void)
 {
 	/*
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ae04035b6cbe..6a131a62f43c 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -546,6 +546,9 @@ struct cgroup {
 	struct bpf_local_storage __rcu *bpf_cgrp_storage;
 #endif
 
+	/* Used to bypass the CPU mitigations for tasks in a cgroup */
+	bool cpu_skip_mitigation;
+
 	/* All ancestors including self */
 	struct cgroup *ancestors[];
 };
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c8e4b62b436a..b745dbcb153e 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2045,6 +2045,7 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	cgrp->dom_cgrp = cgrp;
 	cgrp->max_descendants = INT_MAX;
 	cgrp->max_depth = INT_MAX;
+	cgrp->cpu_skip_mitigation = 0;
 	INIT_LIST_HEAD(&cgrp->rstat_css_list);
 	prev_cputime_init(&cgrp->prev_cputime);
 
@@ -3751,6 +3752,41 @@ static int cpu_stat_show(struct seq_file *seq, void *v)
 	return ret;
 }
 
+static int cpu_skip_mitigation_show(struct seq_file *seq, void *v)
+{
+	struct cgroup *cgrp = seq_css(seq)->cgroup;
+	int ret = 0;
+
+	seq_printf(seq, "%d\n", cgrp->cpu_skip_mitigation);
+
+	return ret;
+}
+
+static ssize_t cgroup_skip_mitigation_write(struct kernfs_open_file *of,
+					    char *buf, size_t nbytes,
+					    loff_t off)
+{
+	struct cgroup *cgrp = of->kn->parent->priv;
+	struct cgroup_file_ctx *ctx = of->priv;
+	u64 skip_mitigation;
+	int ret;
+
+	/* Only privileged user in init namespace is allowed to set skip_mitigation */
+	if ((ctx->ns != &init_cgroup_ns) || !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	ret = kstrtoull(buf, 0, &skip_mitigation);
+	if (ret)
+		return -EINVAL;
+
+	if (skip_mitigation > 1)
+		return -EINVAL;
+
+	cgrp->cpu_skip_mitigation = skip_mitigation;
+
+	return nbytes;
+}
+
 static int cpu_local_stat_show(struct seq_file *seq, void *v)
 {
 	struct cgroup __maybe_unused *cgrp = seq_css(seq)->cgroup;
@@ -5290,6 +5326,12 @@ static struct cftype cgroup_base_files[] = {
 		.name = "cpu.stat.local",
 		.seq_show = cpu_local_stat_show,
 	},
+	{
+		.name = "cpu.skip_mitigation",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_skip_mitigation_show,
+		.write = cgroup_skip_mitigation_write,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f3951e4a55e5..4b4109afbf7c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4994,7 +4994,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 	fire_sched_out_preempt_notifiers(prev, next);
 	kmap_local_sched_out();
 	prepare_task(next);
-	prepare_arch_switch(next);
+	prepare_arch_switch(prev, next);
 }
 
 /**

-- 
2.34.1