[RFC PATCH v2 2/3] x86: cpu/bugs: add support for AMD ERAPS feature

Amit Shah posted 3 patches 1 week, 5 days ago
From: Amit Shah <amit.shah@amd.com>

Remove explicit RET stuffing / filling on VMEXITs and context
switches on AMD CPUs with the ERAPS feature (Zen5+).

With the Enhanced Return Address Prediction Security feature, any of
the following conditions triggers a flush of the RSB (aka RAP in AMD
manuals) in hardware:
* context switch (e.g., move to CR3)
* TLB flush
* some writes to CR4

The feature also explicitly tags host and guest addresses in the RSB -
eliminating the need for explicit flushing of the RSB on VMEXIT.

[RFC note: We'll wait for the APM to be updated with the real wording,
but assuming it's going to say the ERAPS feature works the way described
above, let's continue the discussion re: when the kernel currently calls
FILL_RETURN_BUFFER, and what dropping it entirely means.

Dave Hansen pointed out __switch_to_asm fills the RSB each time it's
called, so let's address the cases there:

1. user->kernel switch: Switching from userspace to kernelspace, and
   then using user-stuffed RSB entries in the kernel is not possible
   thanks to SMEP.  We can safely drop the FILL_RETURN_BUFFER call for
   this case.  In fact, this is how the code originally looked when
   dwmw2 added it in c995efd5a.  So while this case currently triggers
   an RSB flush (and will not after this ERAPS patch), the current
   flush isn't necessary for AMD systems with SMEP anyway.

2. user->user or kernel->kernel: If a user->user switch does not result
   in a CR3 change, it's a different thread in the same process context.
   The same goes for a kernel->kernel switch.  In this case, the RSB
   entries are still valid in that context, just not the correct ones in
   the new thread's context.  It's difficult to imagine this being a
   security risk.  The current code clearing it, and this patch not
   doing so for AMD-with-ERAPS, isn't a concern as far as I can see.
]

Feature mentioned in AMD PPR 57238.  Will be resubmitted once APM is
public - which I'm told is imminent.

Signed-off-by: Amit Shah <amit.shah@amd.com>
---
 Documentation/admin-guide/hw-vuln/spectre.rst |  5 ++--
 arch/x86/include/asm/cpufeatures.h            |  1 +
 arch/x86/include/asm/nospec-branch.h          | 12 ++++++++
 arch/x86/kernel/cpu/bugs.c                    | 29 ++++++++++++++-----
 4 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst b/Documentation/admin-guide/hw-vuln/spectre.rst
index 132e0bc6007e..647c10c0307a 100644
--- a/Documentation/admin-guide/hw-vuln/spectre.rst
+++ b/Documentation/admin-guide/hw-vuln/spectre.rst
@@ -417,9 +417,10 @@ The possible values in this file are:
 
   - Return stack buffer (RSB) protection status:
 
-  =============   ===========================================
+  =============   ========================================================
   'RSB filling'   Protection of RSB on context switch enabled
-  =============   ===========================================
+  'ERAPS'         Hardware RSB flush on context switches + guest/host tags
+  =============   ========================================================
 
   - EIBRS Post-barrier Return Stack Buffer (PBRSB) protection status:
 
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 913fd3a7bac6..665032b12871 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -458,6 +458,7 @@
 #define X86_FEATURE_AUTOIBRS		(20*32+ 8) /* Automatic IBRS */
 #define X86_FEATURE_NO_SMM_CTL_MSR	(20*32+ 9) /* SMM_CTL MSR is not present */
 
+#define X86_FEATURE_ERAPS		(20*32+24) /* Enhanced RAP / RSB / RAS Security */
 #define X86_FEATURE_SBPB		(20*32+27) /* Selective Branch Prediction Barrier */
 #define X86_FEATURE_IBPB_BRTYPE		(20*32+28) /* MSR_PRED_CMD[IBPB] flushes all branch type predictions */
 #define X86_FEATURE_SRSO_NO		(20*32+29) /* CPU is not affected by SRSO */
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 96b410b1d4e8..f5ee7fc71db5 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -117,6 +117,18 @@
  * We define a CPP macro such that it can be used from both .S files and
  * inline assembly. It's possible to do a .macro and then include that
  * from C via asm(".include <asm/nospec-branch.h>") but let's not go there.
+ *
+ * AMD CPUs with the ERAPS feature may have a larger default RSB.  These CPUs
+ * use the default number of entries on a host, and can optionally (based on
+ * hypervisor setup) use 32 (old) or the new default in a guest.  The number
+ * of default entries is reflected in CPUID 8000_0021:EBX[23:16].
+ *
+ * With the ERAPS feature, RSB filling is not necessary anymore: the RSB is
+ * auto-cleared by hardware on context switches, TLB flushes, or some CR4
+ * writes.  Adapting the value of RSB_CLEAR_LOOPS below for ERAPS would change
+ * it to a runtime variable instead of the current compile-time constant, so
+ * leave it as-is, as this works for both older CPUs, as well as newer ones
+ * with ERAPS.
  */
 
 #define RETPOLINE_THUNK_SIZE	32
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 0aa629b5537d..02446815b0de 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1818,9 +1818,12 @@ static void __init spectre_v2_select_mitigation(void)
 	pr_info("%s\n", spectre_v2_strings[mode]);
 
 	/*
-	 * If Spectre v2 protection has been enabled, fill the RSB during a
-	 * context switch.  In general there are two types of RSB attacks
-	 * across context switches, for which the CALLs/RETs may be unbalanced.
+	 * If Spectre v2 protection has been enabled, the RSB needs to be
+	 * cleared during a context switch.  Either do it in software by
+	 * filling the RSB, or in hardware via ERAPS.
+	 *
+	 * In general there are two types of RSB attacks across context
+	 * switches, for which the CALLs/RETs may be unbalanced.
 	 *
 	 * 1) RSB underflow
 	 *
@@ -1848,12 +1851,21 @@ static void __init spectre_v2_select_mitigation(void)
 	 *    RSB clearing.
 	 *
 	 * So to mitigate all cases, unconditionally fill RSB on context
-	 * switches.
+	 * switches when ERAPS is not present.
 	 */
-	setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
-	pr_info("Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch\n");
+	if (!boot_cpu_has(X86_FEATURE_ERAPS)) {
+		setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
+		pr_info("Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch\n");
 
-	spectre_v2_determine_rsb_fill_type_at_vmexit(mode);
+		/*
+		 * For guest -> host (or vice versa) RSB poisoning scenarios,
+		 * determine the mitigation mode here.  With ERAPS, RSB
+		 * entries are tagged as host or guest - ensuring that neither
+		 * the host nor the guest have to clear or fill RSB entries to
+		 * avoid poisoning: skip RSB filling at VMEXIT in that case.
+		 */
+		spectre_v2_determine_rsb_fill_type_at_vmexit(mode);
+	}
 
 	/*
 	 * Retpoline protects the kernel, but doesn't protect firmware.  IBRS
@@ -2866,7 +2878,7 @@ static ssize_t spectre_v2_show_state(char *buf)
 	    spectre_v2_enabled == SPECTRE_V2_EIBRS_LFENCE)
 		return sysfs_emit(buf, "Vulnerable: eIBRS+LFENCE with unprivileged eBPF and SMT\n");
 
-	return sysfs_emit(buf, "%s%s%s%s%s%s%s%s\n",
+	return sysfs_emit(buf, "%s%s%s%s%s%s%s%s%s\n",
 			  spectre_v2_strings[spectre_v2_enabled],
 			  ibpb_state(),
 			  boot_cpu_has(X86_FEATURE_USE_IBRS_FW) ? "; IBRS_FW" : "",
@@ -2874,6 +2886,7 @@ static ssize_t spectre_v2_show_state(char *buf)
 			  boot_cpu_has(X86_FEATURE_RSB_CTXSW) ? "; RSB filling" : "",
 			  pbrsb_eibrs_state(),
 			  spectre_bhi_state(),
+			  boot_cpu_has(X86_FEATURE_ERAPS) ? "; ERAPS hardware RSB flush" : "",
 			  /* this should always be at the end */
 			  spectre_v2_module_string());
 }
-- 
2.47.0
Re: [RFC PATCH v2 2/3] x86: cpu/bugs: add support for AMD ERAPS feature
Posted by Dave Hansen 1 week, 5 days ago
On 11/11/24 08:39, Amit Shah wrote:
> Remove explicit RET stuffing / filling on VMEXITs and context
> switches on AMD CPUs with the ERAPS feature (Zen5+).

I'm not a super big fan of how this changelog is constructed.  I
personally really like the "background, problem, solution" form.  This
starts with the solution before even telling us what it is.

> With the Enhanced Return Address Prediction Security feature, any of
> the following conditions triggers a flush of the RSB (aka RAP in AMD
> manuals) in hardware:
> * context switch (e.g., move to CR3)

IMNHO, context switching is a broader topic than process-to-process
switches.  A user=>kernel thread switch is definitely a context switch,
but doesn't involve a CR3 write.  A user=>user switch in the same mm is
also highly arguable to be a context switch and doesn't involve a CR3
write.  There are userspace threading libraries that do "context switches".

This is further muddled by the very comments you are editing in this
patch like: "unconditionally fill RSB on context switches".

Please be very, very precise in the language that's chosen here.

> * TLB flush
> * some writes to CR4

... and only one of those is relevant for this patch.  Aren't the others
just noise?

> The feature also explicitly tags host and guest addresses in the RSB -
> eliminating the need for explicit flushing of the RSB on VMEXIT.
> 
> [RFC note: We'll wait for the APM to be updated with the real wording,
> but assuming it's going to say the ERAPS feature works the way described
> above, let's continue the discussion re: when the kernel currently calls
> FILL_RETURN_BUFFER, and what dropping it entirely means.
> 
> Dave Hansen pointed out __switch_to_asm fills the RSB each time it's
> called, so let's address the cases there:
> 
> 1. user->kernel switch: Switching from userspace to kernelspace, and
>    then using user-stuffed RSB entries in the kernel is not possible
>    thanks to SMEP.  We can safely drop the FILL_RETURN_BUFFER call for
>    this case.  In fact, this is how the code originally looked when
>    dwmw2 added it in c995efd5a.  So while this case currently triggers
>    an RSB flush (and will not after this ERAPS patch), the current
>    flush isn't necessary for AMD systems with SMEP anyway.

This description makes me uneasy.  It basically boils down to, "SMEP
guarantees that the kernel can't consume user-placed RSB entries."

It makes me uneasy because I recall that not holding true on some Intel
CPUs. I believe some CPUs have a partial-width RSB.  Kernel consumption
of a user-placed entry would induce the kernel to speculate to a
*kernel* address so SMEP is rather ineffective.

So, instead of just saying, "SMEP magically fixes everything, trust us",
could we please step through a few of the properties of the hardware and
software that make SMEP effective?

First, I think that you're saying that AMD hardware has a full-width,
precise RSB.  That ensures that speculation always happens back
precisely to the original address, not some weird alias.  Second, ERAPS
guarantees that the address still has the same stuff mapped there.
Any change to the address space involves either a CR3 write or a TLB
flush, both of which would have flushed any user-placed content in the RSB.

	Aside: Even the kernel-only text poking mm or EFI mm would "just
	work" in this scenario since they have their own mm_structs,
	page tables and root CR3 values.

> 2. user->user or kernel->kernel: If a user->user switch does not result
>    in a CR3 change, it's a different thread in the same process context.
>    The same goes for a kernel->kernel switch.  In this case, the RSB
>    entries are still valid in that context, just not the correct ones in
>    the new thread's context.  It's difficult to imagine this being a
>    security risk.  The current code clearing it, and this patch not
>    doing so for AMD-with-ERAPS, isn't a concern as far as I can see.
> ]

The current rules are dirt simple: if the kernel stack gets switched,
the RSB is flushed.  It's kinda hard to have a mismatched stack if it's
never switched in the first place.  The problem (stack switching) is
dirt simple to correlate 1:1 with the fix (RSB stuffing).

This patch proposes separating the problem (stack switching) and the
mitigations (implicit new microcode actions).  That's a hell of a lot
more complicated and harder to audit than the existing rules.  Agreed?

So, how are the rules relaxed?

First, it opens up the case where user threads consume RSB entries from
other threads in the same process. Threads are usually at the same
privilege level.  Instead of using a speculative attack, an attacker
could just read the data directly.  The only place I can imagine this
even remotely being a problem would be if threads were relying on
protection keys to keep secrets from each other.

The kernel consuming RSB entries from another kernel thread seems less
clear.  I disagree with the idea that a valid entry in a given context
is a _safe_ entry though.  Spectre v2 in _general_ involves nasty
speculation to otherwise perfectly safe code. A problematic scenario
would be where kswapd wakes up after waiting for I/O and starts
speculating back along the return path of the userspace thread that
kswapd interrupted. Userspace has some level of control over both stacks
and it doesn't seem super far fetched to think that there could be some
disclosure gadgets in there.

In short: the user-consumes-user-RSB-entry attacks seem fundamentally
improbable because user threads are unlikely to have secrets from each
other.

The kernel-consumes-kernel-RSB-entry attacks seem hard rather than
fundamentally improbable.  I can't even count how many times our "oh
that attack would be too hard to pull off" optimism has gone wrong.

What does that all _mean_?  ERAPS sucks?  Nah.  Maybe we just make sure
that the existing spectre_v2=whatever controls can be used to stop
relying on it if asked.
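
Roughly something like this, as a sketch only (reusing the 'cmd' local
and SPECTRE_V2_CMD_AUTO that spectre_v2_select_mitigation() already
has; the exact condition is debatable):

	/*
	 * Sketch: only rely on the hardware RSB flush when no specific
	 * mitigation was asked for; any explicit spectre_v2= selection
	 * keeps the software RSB stuffing.
	 */
	if (boot_cpu_has(X86_FEATURE_ERAPS) && cmd == SPECTRE_V2_CMD_AUTO) {
		pr_info("Spectre v2 / SpectreRSB: relying on ERAPS hardware RSB flush\n");
	} else {
		setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
		pr_info("Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch\n");
	}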

> diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> index 96b410b1d4e8..f5ee7fc71db5 100644
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -117,6 +117,18 @@
>   * We define a CPP macro such that it can be used from both .S files and
>   * inline assembly. It's possible to do a .macro and then include that
>   * from C via asm(".include <asm/nospec-branch.h>") but let's not go there.
> + *
> + * AMD CPUs with the ERAPS feature may have a larger default RSB.  These CPUs
> + * use the default number of entries on a host, and can optionally (based on
> + * hypervisor setup) use 32 (old) or the new default in a guest.  The number
> + * of default entries is reflected in CPUID 8000_0021:EBX[23:16].
> + *
> + * With the ERAPS feature, RSB filling is not necessary anymore: the RSB is
> + * auto-cleared by hardware on context switches, TLB flushes, or some CR4
> + * writes.  Adapting the value of RSB_CLEAR_LOOPS below for ERAPS would change
> + * it to a runtime variable instead of the current compile-time constant, so
> + * leave it as-is, as this works for both older CPUs, as well as newer ones
> + * with ERAPS.
>   */

This comment feels out of place and noisy to me.  Why the heck would we
need to document the RSB size enumeration here and not use it?

Then this goes on to explain why this patch *didn't* bother to change
RSB_CLEAR_LOOPS.  That seems much more like changelog material than a
code comment to me.

To me, RSB_CLEAR_LOOPS is for, well, clearing the RSB.  The existing
comments say that some CPUs don't need to clear the RSB.  I don't think
we need further explanation of why one specific CPU doesn't need to
clear the RSB.  The new CPU isn't special.

Something slightly more useful would be to actually read CPUID
8000_0021:EBX[23:16] and compare it to RSB_CLEAR_LOOPS:

void amd_check_rsb_clearing(void)
{
	/*
	 * Systems with both these features off
	 * do not perform RSB clearing:
	 */
	if (!boot_cpu_has(X86_FEATURE_RSB_CTXSW) &&
	    !boot_cpu_has(X86_FEATURE_RSB_VMEXIT))
		return;

	// with actual helpers and mask generation, not this mess:
	rsb_size = (cpuid_ebx(0x80000021) >> 16) & 0xff;

	WARN_ON_ONCE(rsb_size > RSB_CLEAR_LOOPS);
}

That's almost shorter than the proposed comment.

>  #define RETPOLINE_THUNK_SIZE	32
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index 0aa629b5537d..02446815b0de 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -1818,9 +1818,12 @@ static void __init spectre_v2_select_mitigation(void)
>  	pr_info("%s\n", spectre_v2_strings[mode]);
>  
>  	/*
> -	 * If Spectre v2 protection has been enabled, fill the RSB during a
> -	 * context switch.  In general there are two types of RSB attacks
> -	 * across context switches, for which the CALLs/RETs may be unbalanced.
> +	 * If Spectre v2 protection has been enabled, the RSB needs to be
> +	 * cleared during a context switch.  Either do it in software by
> +	 * filling the RSB, or in hardware via ERAPS.

I'd move this comment about using ERAPS down to the ERAPS check itself.

> +	 * In general there are two types of RSB attacks across context
> +	 * switches, for which the CALLs/RETs may be unbalanced.
>  	 *
>  	 * 1) RSB underflow
>  	 *
> @@ -1848,12 +1851,21 @@ static void __init spectre_v2_select_mitigation(void)
>  	 *    RSB clearing.
>  	 *
>  	 * So to mitigate all cases, unconditionally fill RSB on context
> -	 * switches.
> +	 * switches when ERAPS is not present.
>  	 */
> -	setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
> -	pr_info("Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch\n");
> +	if (!boot_cpu_has(X86_FEATURE_ERAPS)) {
> +		setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
> +		pr_info("Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch\n");
>  
> -	spectre_v2_determine_rsb_fill_type_at_vmexit(mode);
> +		/*
> +		 * For guest -> host (or vice versa) RSB poisoning scenarios,
> +		 * determine the mitigation mode here.  With ERAPS, RSB
> +		 * entries are tagged as host or guest - ensuring that neither
> +		 * the host nor the guest have to clear or fill RSB entries to
> +		 * avoid poisoning: skip RSB filling at VMEXIT in that case.
> +		 */
> +		spectre_v2_determine_rsb_fill_type_at_vmexit(mode);
> +	}

This seems to be following a pattern of putting excessive amounts of
very AMD-specific commenting right in the middle of common code.
There's a *HUGE* comment in
spectre_v2_determine_rsb_fill_type_at_vmexit().  Wouldn't this be more
at home there?

>  	/*
>  	 * Retpoline protects the kernel, but doesn't protect firmware.  IBRS
> @@ -2866,7 +2878,7 @@ static ssize_t spectre_v2_show_state(char *buf)
>  	    spectre_v2_enabled == SPECTRE_V2_EIBRS_LFENCE)
>  		return sysfs_emit(buf, "Vulnerable: eIBRS+LFENCE with unprivileged eBPF and SMT\n");
>  
> -	return sysfs_emit(buf, "%s%s%s%s%s%s%s%s\n",
> +	return sysfs_emit(buf, "%s%s%s%s%s%s%s%s%s\n",
>  			  spectre_v2_strings[spectre_v2_enabled],
>  			  ibpb_state(),
>  			  boot_cpu_has(X86_FEATURE_USE_IBRS_FW) ? "; IBRS_FW" : "",
> @@ -2874,6 +2886,7 @@ static ssize_t spectre_v2_show_state(char *buf)
>  			  boot_cpu_has(X86_FEATURE_RSB_CTXSW) ? "; RSB filling" : "",
>  			  pbrsb_eibrs_state(),
>  			  spectre_bhi_state(),
> +			  boot_cpu_has(X86_FEATURE_ERAPS) ? "; ERAPS hardware RSB flush" : "",
>  			  /* this should always be at the end */
>  			  spectre_v2_module_string());
>  }
Re: [RFC PATCH v2 2/3] x86: cpu/bugs: add support for AMD ERAPS feature
Posted by Shah, Amit 1 week, 3 days ago
On Mon, 2024-11-11 at 10:57 -0800, Dave Hansen wrote:
> On 11/11/24 08:39, Amit Shah wrote:
> > Remove explicit RET stuffing / filling on VMEXITs and context
> > switches on AMD CPUs with the ERAPS feature (Zen5+).
> 
> I'm not a super big fan of how this changelog is constructed.  I
> personally really like the "background, problem, solution" form.
> This starts with the solution before even telling us what it is.

The subject line is usually the one with the impact; but in this case I
couldn't fit the impact (add new feature + highlight we're getting rid
of existing mitigation mechanisms).  If you have ideas on a better
subject line, let me know.

> > With the Enhanced Return Address Prediction Security feature, any of
> > the following conditions triggers a flush of the RSB (aka RAP in AMD
> > manuals) in hardware:
> > * context switch (e.g., move to CR3)
> 
> IMNHO, context switching is a broader topic than process-to-process
> switches.  A user=>kernel thread switch is definitely a context switch,
> but doesn't involve a CR3 write.  A user=>user switch in the same mm is
> also highly arguable to be a context switch and doesn't involve a CR3
> write.  There are userspace threading libraries that do "context
> switches".
> 
> This is further muddled by the very comments you are editing in this
> patch like: "unconditionally fill RSB on context switches".
> 
> Please be very, very precise in the language that's chosen here.

I'm not disagreeing - I'm saying the same thing: a context switch is
broader than just a CR3 update.  But I take the point you're really
making - limit this description to "CR3 update", because that really is
the case we're targeting with this feature in this patch, along with
the arguments why the other cases (like a user->kernel switch, or a
user->user switch in the same mm) are not affected by the series.

> > * TLB flush
> > * some writes to CR4
> 
> ... and only one of those is relevant for this patch.  Aren't the
> others just noise?

I wouldn't say noise - for the purposes of this patch, yes, only the
CR3 write matters.  But if some other condition requires Linux to flush
the RSB on a non-CR3-write event in the future, they'll find this
commit.  That leads to fewer "this is unknown, let's assume the worst"
situations, somewhat like in the comments in patch 1.

> > The feature also explicitly tags host and guest addresses in the
> > RSB - eliminating the need for explicit flushing of the RSB on
> > VMEXIT.
> > 
> > [RFC note: We'll wait for the APM to be updated with the real
> > wording, but assuming it's going to say the ERAPS feature works the
> > way described above, let's continue the discussion re: when the
> > kernel currently calls FILL_RETURN_BUFFER, and what dropping it
> > entirely means.
> > 
> > Dave Hansen pointed out __switch_to_asm fills the RSB each time it's
> > called, so let's address the cases there:
> > 
> > 1. user->kernel switch: Switching from userspace to kernelspace, and
> >    then using user-stuffed RSB entries in the kernel is not possible
> >    thanks to SMEP.  We can safely drop the FILL_RETURN_BUFFER call
> >    for this case.  In fact, this is how the code originally looked
> >    when dwmw2 added it in c995efd5a.  So while this case currently
> >    triggers an RSB flush (and will not after this ERAPS patch), the
> >    current flush isn't necessary for AMD systems with SMEP anyway.
> 
> This description makes me uneasy.  It basically boils down to, "SMEP
> guarantees that the kernel can't consume user-placed RSB entries."
> 
> It makes me uneasy because I recall that not holding true on some
> Intel CPUs.  I believe some CPUs have a partial-width RSB.  Kernel
> consumption of a user-placed entry would induce the kernel to
> speculate to a *kernel* address so SMEP is rather ineffective.
> 
> So, instead of just saying, "SMEP magically fixes everything, trust
> us", could we please step through a few of the properties of the
> hardware and software that make SMEP effective?

That's fair.  And that's really the reason why I highlighted the first
line in the commit message: "we're really getting rid of software
mitigations here - focus on that please".  Though I'll note that none
of the questions raised in this series have come as a surprise to me -
I've had those exact questions asked of hardware engineers internally,
and answered to my satisfaction, and hence this patchset.


> First, I think that you're saying that AMD hardware has a full-width,
> precise RSB.  That ensures that speculation always happens back
> precisely to the original address, not some weird alias.  Second, ERAPS
> guarantees that the address still has the same stuff mapped there.  Any
> change to the address space involves either a CR3 write or a TLB flush,
> both of which would have flushed any user-placed content in the RSB.
> 
> 	Aside: Even the kernel-only text poking mm or EFI mm would "just
> 	work" in this scenario since they have their own mm_structs,
> 	page tables and root CR3 values.

All true.  David (Kaplan), please say if that's not right.

> > 2. user->user or kernel->kernel: If a user->user switch does not
> >    result in a CR3 change, it's a different thread in the same
> >    process context.  The same goes for a kernel->kernel switch.  In
> >    this case, the RSB entries are still valid in that context, just
> >    not the correct ones in the new thread's context.  It's difficult
> >    to imagine this being a security risk.  The current code clearing
> >    it, and this patch not doing so for AMD-with-ERAPS, isn't a
> >    concern as far as I can see.
> > ]
> 
> The current rules are dirt simple: if the kernel stack gets switched,
> the RSB is flushed.  It's kinda hard to have a mismatched stack if it's
> never switched in the first place.  The problem (stack switching) is
> dirt simple to correlate 1:1 with the fix (RSB stuffing).

... and it could be argued that's overkill.  These are security
mitigations - and clearing the RSB on stack switches isn't, by itself,
what buys us the security.  Perhaps we're inviting performance
penalties by doing more than what's required, and working against the
hardware rather than with it.

> This patch proposes separating the problem (stack switching) and the
> mitigations (implicit new microcode actions).  That's a hell of a lot
> more complicated and harder to audit than the existing rules.  Agreed?

Partially.  We've got to trust the hardware and tools we've been given
- if the hardware says it behaves in a particular way on a given input
or event, we've got to trust that.  I'm building on top of what's been
given.

But: I like this thought experiment.  Let's continue.

> So, how are the rules relaxed?
> 
> First, it opens up the case where user threads consume RSB entries from
> other threads in the same process.  Threads are usually at the same
> privilege level.  Instead of using a speculative attack, an attacker
> could just read the data directly.  The only place I can imagine this
> even remotely being a problem would be if threads were relying on
> protection keys to keep secrets from each other.

It's a valid scenario, albeit one involving really badly designed
security/crypto software.  I would imagine it's difficult to exploit
even then.

> The kernel consuming RSB entries from another kernel thread seems less
> clear.  I disagree with the idea that a valid entry in a given context
> is a _safe_ entry though.  Spectre v2 in _general_ involves nasty
> speculation to otherwise perfectly safe code.  A problematic scenario
> would be where kswapd wakes up after waiting for I/O and starts
> speculating back along the return path of the userspace thread that
> kswapd interrupted.  Userspace has some level of control over both
> stacks and it doesn't seem super far fetched to think that there could
> be some disclosure gadgets in there.
> 
> In short: the user-consumes-user-RSB-entry attacks seem fundamentally
> improbable because user threads are unlikely to have secrets from each
> other.
> 
> The kernel-consumes-kernel-RSB-entry attacks seem hard rather than
> fundamentally improbable.  I can't even count how many times our "oh
> that attack would be too hard to pull off" optimism has gone wrong.
> 
> What does that all _mean_?  ERAPS sucks?  Nah.  Maybe we just make sure
> that the existing spectre_v2=whatever controls can be used to stop
> relying on it if asked.

Hm, not yet convinced - but documenting this seems useful.  I can stick
this email verbatim in Documentation/ for the future, if you don't
mind?

Beyond that - there's the "mov cr3, cr3" we could plug into
__switch_to_asm when ERAPS is active.  I don't like it - but at least
that's 1:1.
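
As a rough C-side sketch of that fallback (assuming a plain rewrite of
CR3 with its current value is enough to trigger the ERAPS flush, and
ignoring PCID / NOFLUSH details):

/* Force the hardware RSB flush by rewriting CR3 with its own value. */
static inline void eraps_flush_rsb(void)
{
	if (boot_cpu_has(X86_FEATURE_ERAPS))
		native_write_cr3(__native_read_cr3());
}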

> > diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> > index 96b410b1d4e8..f5ee7fc71db5 100644
> > --- a/arch/x86/include/asm/nospec-branch.h
> > +++ b/arch/x86/include/asm/nospec-branch.h
> > @@ -117,6 +117,18 @@
> >   * We define a CPP macro such that it can be used from both .S files and
> >   * inline assembly. It's possible to do a .macro and then include that
> >   * from C via asm(".include <asm/nospec-branch.h>") but let's not go there.
> > + *
> > + * AMD CPUs with the ERAPS feature may have a larger default RSB.  These CPUs
> > + * use the default number of entries on a host, and can optionally (based on
> > + * hypervisor setup) use 32 (old) or the new default in a guest.  The number
> > + * of default entries is reflected in CPUID 8000_0021:EBX[23:16].
> > + *
> > + * With the ERAPS feature, RSB filling is not necessary anymore: the RSB is
> > + * auto-cleared by hardware on context switches, TLB flushes, or some CR4
> > + * writes.  Adapting the value of RSB_CLEAR_LOOPS below for ERAPS would change
> > + * it to a runtime variable instead of the current compile-time constant, so
> > + * leave it as-is, as this works for both older CPUs, as well as newer ones
> > + * with ERAPS.
> >   */
> 
> This comment feels out of place and noisy to me.  Why the heck would we
> need to document the RSB size enumeration here and not use it?
> 
> Then this goes on to explain why this patch *didn't* bother to change
> RSB_CLEAR_LOOPS.  That seems much more like changelog material than a
> code comment to me.
> 
> To me, RSB_CLEAR_LOOPS is for, well, clearing the RSB.  The existing
> comments say that some CPUs don't need to clear the RSB.  I don't think
> we need further explanation of why one specific CPU doesn't need to
> clear the RSB.  The new CPU isn't special.
> 
> Something slightly more useful would be to actually read CPUID
> 8000_0021:EBX[23:16] and compare it to RSB_CLEAR_LOOPS:
> 
> void amd_check_rsb_clearing(void)
> {
> 	/*
> 	 * Systems with both these features off
> 	 * do not perform RSB clearing:
> 	 */
> 	if (!boot_cpu_has(X86_FEATURE_RSB_CTXSW) &&
> 	    !boot_cpu_has(X86_FEATURE_RSB_VMEXIT))
> 		return;
> 
> 	// with actual helpers and mask generation, not this mess:
> 	rsb_size = (cpuid_ebx(0x80000021) >> 16) & 0xff;
> 
> 	WARN_ON_ONCE(rsb_size > RSB_CLEAR_LOOPS);
> }
> 
> That's almost shorter than the proposed comment.

That's a good idea.
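
With a trivial helper for the CPUID field, that check could look
something like this (a sketch only; eraps_rsb_size() is a made-up
name):

/* Hypothetical helper for the RSB size in CPUID 8000_0021:EBX[23:16] */
static unsigned int eraps_rsb_size(void)
{
	return (cpuid_ebx(0x80000021) >> 16) & 0xff;
}

void amd_check_rsb_clearing(void)
{
	/* Nothing to check when no software RSB clearing is done: */
	if (!boot_cpu_has(X86_FEATURE_RSB_CTXSW) &&
	    !boot_cpu_has(X86_FEATURE_RSB_VMEXIT))
		return;

	/* Warn if the RSB is deeper than the clearing loop covers: */
	WARN_ON_ONCE(eraps_rsb_size() > RSB_CLEAR_LOOPS);
}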

The comment-vs-commit-log choice is because the commit is otherwise in
a different file than the #define, and people reading the code can miss
it entirely.  I just find grep+read+git workflows much better for my
archaeology.

> >  #define RETPOLINE_THUNK_SIZE	32
> > diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> > index 0aa629b5537d..02446815b0de 100644
> > --- a/arch/x86/kernel/cpu/bugs.c
> > +++ b/arch/x86/kernel/cpu/bugs.c
> > @@ -1818,9 +1818,12 @@ static void __init spectre_v2_select_mitigation(void)
> >  	pr_info("%s\n", spectre_v2_strings[mode]);
> >  
> >  	/*
> > -	 * If Spectre v2 protection has been enabled, fill the RSB during a
> > -	 * context switch.  In general there are two types of RSB attacks
> > -	 * across context switches, for which the CALLs/RETs may be unbalanced.
> > +	 * If Spectre v2 protection has been enabled, the RSB needs to be
> > +	 * cleared during a context switch.  Either do it in software by
> > +	 * filling the RSB, or in hardware via ERAPS.
> 
> I'd move this comment about using ERAPS down to the ERAPS check itself.
> 
> > +	 * In general there are two types of RSB attacks across context
> > +	 * switches, for which the CALLs/RETs may be unbalanced.
> >  	 *
> >  	 * 1) RSB underflow
> >  	 *
> > @@ -1848,12 +1851,21 @@ static void __init spectre_v2_select_mitigation(void)
> >  	 *    RSB clearing.
> >  	 *
> >  	 * So to mitigate all cases, unconditionally fill RSB on context
> > -	 * switches.
> > +	 * switches when ERAPS is not present.
> >  	 */
> > -	setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
> > -	pr_info("Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch\n");
> > +	if (!boot_cpu_has(X86_FEATURE_ERAPS)) {
> > +		setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
> > +		pr_info("Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch\n");
> >  
> > -	spectre_v2_determine_rsb_fill_type_at_vmexit(mode);
> > +		/*
> > +		 * For guest -> host (or vice versa) RSB poisoning scenarios,
> > +		 * determine the mitigation mode here.  With ERAPS, RSB
> > +		 * entries are tagged as host or guest - ensuring that neither
> > +		 * the host nor the guest have to clear or fill RSB entries to
> > +		 * avoid poisoning: skip RSB filling at VMEXIT in that case.
> > +		 */
> > +		spectre_v2_determine_rsb_fill_type_at_vmexit(mode);
> > +	}
> 
> This seems to be following a pattern of putting excessive amounts of
> very AMD-specific commenting right in the middle of common code.
> There's a *HUGE* comment in
> spectre_v2_determine_rsb_fill_type_at_vmexit().  Wouldn't this be more
> at home there?

I think with Josh's reworded comments, and with the comments stripped
out of spectre_v2_determine_...(), this flows much better - so yeah,
will update.

Thank you for the thorough review, David!

		Amit