[PATCH v2 0/2] x86/bugs: RSB tweaks
Posted by Josh Poimboeuf 2 months, 1 week ago
v2:
- remove stray whitespace in printk
- clarify patch 2 comment message about prev/next tasks
- move error path to default switch case in spectre_v2_mitigate_rsb()
- add Reviewed-bys

Some RSB filling tweaks as discussed in the following thread:

  [RFC PATCH v2 0/3] Add support for the ERAPS feature
  https://lore.kernel.org/20241111163913.36139-1-amit@kernel.org

Josh Poimboeuf (2):
  x86/bugs: Don't fill RSB on VMEXIT with eIBRS+retpoline
  x86/bugs: Don't fill RSB on context switch with eIBRS

 arch/x86/kernel/cpu/bugs.c | 115 +++++++++++++++----------------------
 arch/x86/mm/tlb.c          |   2 +-
 2 files changed, 48 insertions(+), 69 deletions(-)

-- 
2.47.0
[RFC PATCH v3 0/2] Add support for the ERAPS feature
Posted by Amit Shah 2 months ago
Newer AMD CPUs (Zen5+) have the ERAPS feature bit that allows us to remove the
RSB filling loops required during context switches and VM exits.

This patchset implements the feature to:
* remove the need for RSB filling on context switches and VMEXITs in host and
  guests
* allow KVM guests to use the full default RSB stack

The feature isn't yet part of an APM update that details how it works, so
this is still tagged as RFC.  The notes at

https://amitshah.net/2024/11/eraps-reduces-software-tax-for-hardware-bugs/

may help readers follow along until the APM is public.

v3:
* rebase on top of Josh's RSB tweaks series
  * with that rebase, only the non-AutoIBRS case needs special ERAPS support.
    AutoIBRS is currently disabled when SEV-SNP is active (commit acaa4b5c4c8)

* remove comment about RSB_CLEAR_LOOPS and the size of the RSB -- it's not
  necessary anymore with the rework

* remove comment from patch 2 in svm.c in favour of the commit message

v2:
* reword comments to highlight context switch as the main trigger for RSB
  flushes in hardware (Dave Hansen)
* Split out outdated comment updates in (v1) patch1 to be a standalone
  patch1 in this series, to reinforce RSB filling is only required for RSB
  poisoning cases for AMD
  * Remove mentions of BTC/BTC_NO (Andrew Cooper)
* Add braces in case stmt (kernel test robot)
* s/boot_cpu_has/cpu_feature_enabled (Boris Petkov)

Amit Shah (2):
  x86: cpu/bugs: add AMD ERAPS support; hardware flushes RSB
  x86: kvm: svm: advertise ERAPS (larger RSB) support to guests

 Documentation/admin-guide/hw-vuln/spectre.rst |  5 ++--
 arch/x86/include/asm/cpufeatures.h            |  1 +
 arch/x86/include/asm/svm.h                    |  6 +++-
 arch/x86/kernel/cpu/bugs.c                    |  6 +++-
 arch/x86/kvm/cpuid.c                          | 18 ++++++++++--
 arch/x86/kvm/svm/svm.c                        | 29 +++++++++++++++++++
 arch/x86/kvm/svm/svm.h                        | 15 ++++++++++
 7 files changed, 74 insertions(+), 6 deletions(-)

-- 
2.47.0
[RFC PATCH v3 1/2] x86: cpu/bugs: add AMD ERAPS support; hardware flushes RSB
Posted by Amit Shah 2 months ago
From: Amit Shah <amit.shah@amd.com>

When Automatic IBRS is disabled, Linux flushes the RSB on every context
switch.  This RSB flush is not necessary in software on Zen5+ CPUs with
the ERAPS feature, which flushes the RSB in hardware on a context
switch (triggered by mov-to-CR3).

Additionally, the ERAPS feature tags host and guest addresses in the
RSB, eliminating the need for software flushing of the RSB on VMEXIT.

Disable all RSB flushing by Linux when the CPU has ERAPS.

Feature mentioned in AMD PPR 57238.  Will be resubmitted once APM is
public - which I'm told is imminent.

Signed-off-by: Amit Shah <amit.shah@amd.com>
---
 Documentation/admin-guide/hw-vuln/spectre.rst | 5 +++--
 arch/x86/include/asm/cpufeatures.h            | 1 +
 arch/x86/kernel/cpu/bugs.c                    | 6 +++++-
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst b/Documentation/admin-guide/hw-vuln/spectre.rst
index 132e0bc6007e..647c10c0307a 100644
--- a/Documentation/admin-guide/hw-vuln/spectre.rst
+++ b/Documentation/admin-guide/hw-vuln/spectre.rst
@@ -417,9 +417,10 @@ The possible values in this file are:
 
   - Return stack buffer (RSB) protection status:
 
-  =============   ===========================================
+  =============   ========================================================
   'RSB filling'   Protection of RSB on context switch enabled
-  =============   ===========================================
+  'ERAPS'         Hardware RSB flush on context switches + guest/host tags
+  =============   ========================================================
 
   - EIBRS Post-barrier Return Stack Buffer (PBRSB) protection status:
 
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 17b6590748c0..79a1373050f7 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -461,6 +461,7 @@
 #define X86_FEATURE_AUTOIBRS		(20*32+ 8) /* Automatic IBRS */
 #define X86_FEATURE_NO_SMM_CTL_MSR	(20*32+ 9) /* SMM_CTL MSR is not present */
 
+#define X86_FEATURE_ERAPS		(20*32+24) /* Enhanced RAP / RSB / RAS Security */
 #define X86_FEATURE_SBPB		(20*32+27) /* Selective Branch Prediction Barrier */
 #define X86_FEATURE_IBPB_BRTYPE		(20*32+28) /* MSR_PRED_CMD[IBPB] flushes all branch type predictions */
 #define X86_FEATURE_SRSO_NO		(20*32+29) /* CPU is not affected by SRSO */
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index d5102b72f74d..d7af5f811776 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1634,6 +1634,9 @@ static void __init spectre_v2_mitigate_rsb(enum spectre_v2_mitigation mode)
 	case SPECTRE_V2_RETPOLINE:
 	case SPECTRE_V2_LFENCE:
 	case SPECTRE_V2_IBRS:
+		if (boot_cpu_has(X86_FEATURE_ERAPS))
+			break;
+
 		pr_info("Spectre v2 / SpectreRSB: Filling RSB on context switch and VMEXIT\n");
 		setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
 		setup_force_cpu_cap(X86_FEATURE_RSB_VMEXIT);
@@ -2850,7 +2853,7 @@ static ssize_t spectre_v2_show_state(char *buf)
 	    spectre_v2_enabled == SPECTRE_V2_EIBRS_LFENCE)
 		return sysfs_emit(buf, "Vulnerable: eIBRS+LFENCE with unprivileged eBPF and SMT\n");
 
-	return sysfs_emit(buf, "%s%s%s%s%s%s%s%s\n",
+	return sysfs_emit(buf, "%s%s%s%s%s%s%s%s%s\n",
 			  spectre_v2_strings[spectre_v2_enabled],
 			  ibpb_state(),
 			  boot_cpu_has(X86_FEATURE_USE_IBRS_FW) ? "; IBRS_FW" : "",
@@ -2858,6 +2861,7 @@ static ssize_t spectre_v2_show_state(char *buf)
 			  boot_cpu_has(X86_FEATURE_RSB_CTXSW) ? "; RSB filling" : "",
 			  pbrsb_eibrs_state(),
 			  spectre_bhi_state(),
+			  boot_cpu_has(X86_FEATURE_ERAPS) ? "; ERAPS hardware RSB flush" : "",
 			  /* this should always be at the end */
 			  spectre_v2_module_string());
 }
-- 
2.47.0
Re: [RFC PATCH v3 1/2] x86: cpu/bugs: add AMD ERAPS support; hardware flushes RSB
Posted by Dave Hansen 2 months ago
On 11/28/24 05:28, Amit Shah wrote:
> From: Amit Shah <amit.shah@amd.com>
> 
> When Automatic IBRS is disabled, Linux flushes the RSB on every context
> switch.  This RSB flush is not necessary in software on Zen5+ CPUs with
> the ERAPS feature, which flushes the RSB in hardware on a context
> switch (triggered by mov-to-CR3).
> 
> Additionally, the ERAPS feature tags host and guest addresses in the
> RSB, eliminating the need for software flushing of the RSB on VMEXIT.
> 
> Disable all RSB flushing by Linux when the CPU has ERAPS.
> 
> Feature mentioned in AMD PPR 57238.  Will be resubmitted once APM is
> public - which I'm told is imminent.

There was a _lot_ of discussion about this. But all of that discussion
seems to have been trimmed out and it seems like we're basically back
to: "this is new hardware supposed to mitigate SpectreRSB, thus it
mitigates SpectreRSB."

Could we please summarize the previous discussions in the changelog?
Otherwise, I fear it will be lost.
Re: [RFC PATCH v3 1/2] x86: cpu/bugs: add AMD ERAPS support; hardware flushes RSB
Posted by Amit Shah 2 months ago
On Mon, 2024-12-02 at 09:26 -0800, Dave Hansen wrote:
> On 11/28/24 05:28, Amit Shah wrote:
> > From: Amit Shah <amit.shah@amd.com>
> > 
> > When Automatic IBRS is disabled, Linux flushes the RSB on every
> > context switch.  This RSB flush is not necessary in software on
> > Zen5+ CPUs with the ERAPS feature, which flushes the RSB in
> > hardware on a context switch (triggered by mov-to-CR3).
> > 
> > Additionally, the ERAPS feature tags host and guest addresses in
> > the RSB, eliminating the need for software flushing of the RSB on
> > VMEXIT.
> > 
> > Disable all RSB flushing by Linux when the CPU has ERAPS.
> > 
> > Feature mentioned in AMD PPR 57238.  Will be resubmitted once APM is
> > public - which I'm told is imminent.
> 
> There was a _lot_ of discussion about this. But all of that
> discussion
> seems to have been trimmed out and it seems like we're basically back
> to: "this is new hardware supposed to mitigate SpectreRSB, thus it
> mitigates SpectreRSB."

Absolutely, I don't want that to get lost -- but I think that got
captured in Josh's rework patchset.  With that rework, I don't even
need this patchset for the hardware feature to work, because we now
rely on AutoIBRS to do the RSB clearing; and the hardware takes care of
AutoIBRS and ERAPS interaction in Zen5.

The only thing this patch now does is to handle the AutoIBRS-disabled
case -- which happens when SEV-SNP is turned on (i.e. let the hw clear
the RSB instead of stuffing it in Linux).
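
For reference, with the rework that's the entirety of what patch 1 adds
to spectre_v2_mitigate_rsb() -- an early break in the non-eIBRS cases
(excerpted from patch 1, not new code):

	case SPECTRE_V2_RETPOLINE:
	case SPECTRE_V2_LFENCE:
	case SPECTRE_V2_IBRS:
		if (boot_cpu_has(X86_FEATURE_ERAPS))
			break;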

I can still include the summary of the discussion in this patch - I
just feel it isn't necessary with the rework.

		Amit
Re: [RFC PATCH v3 1/2] x86: cpu/bugs: add AMD ERAPS support; hardware flushes RSB
Posted by Dave Hansen 2 months ago
On 12/2/24 10:09, Amit Shah wrote:
> I can still include the summary of the discussion in this patch - I
> just feel it isn't necessary with the rework.

It's necessary.

There's a new hardware feature. You want it to replace a software
sequence in certain situations. You have to make an actual, coherent
argument as to why the hardware feature is a suitable replacement.

For instance (and I'm pretty sure we've gone over this more than once),
the changelog here still makes the claim that a "context switch" and a
"mov-to-CR3" are the same thing.  They are not; switching between two
threads of the same process, for example, doesn't write CR3 at all.
Re: [RFC PATCH v3 1/2] x86: cpu/bugs: add AMD ERAPS support; hardware flushes RSB
Posted by Sean Christopherson 2 months ago
On Mon, Dec 02, 2024, Dave Hansen wrote:
> On 12/2/24 10:09, Amit Shah wrote:
> > I can still include the summary of the discussion in this patch - I
> > just feel it isn't necessary with the rework.
> 
> It's necessary.
> 
> There's a new hardware feature. You want it to replace a software
> sequence in certain situations. You have to make an actual, coherent
> argument argument as to why the hardware feature is a suitable replacement.
> 
> For instance (and I'm pretty sure we've gone over this more than once),
> the changelog here still makes the claim that a "context switch" and a
> "mov-to-CR3" are the same thing.

+1000.  I want a crisp, precise description of the actual hardware behavior, so
that KVM can do the right thing when virtualizing ERAPS.
[RFC PATCH v3 2/2] x86: kvm: svm: advertise ERAPS (larger RSB) support to guests
Posted by Amit Shah 2 months ago
From: Amit Shah <amit.shah@amd.com>

AMD CPUs with the ERAPS feature (Zen5+) have a larger RSB (aka RAP).
While the new default RSB size is used on the host without any software
modification necessary, the RSB size for guests is limited to the older
value (32 entries) for backwards compatibility.  With this patch, KVM
enables guest mode to also use the default number of entries by setting
the new ALLOW_LARGER_RAP bit in the VMCB.

The two cases for backward compatibility that need special handling are
nested guests, and guests using shadow paging (or when NPT is disabled):

For nested guests: the ERAPS feature adds host/guest tagging to entries
in the RSB, but does not distinguish between ASIDs.  On a nested exit,
the L0 hypervisor instructs the hardware (via another new VMCB bit,
FLUSH_RAP_ON_VMRUN) to flush the RSB on the next VMRUN to prevent RSB
poisoning attacks from an L2 guest to an L1 guest.  With that in place,
this feature can be exposed to guests.

For shadow paging guests: do not expose this feature to guests; only
expose if nested paging is enabled, to ensure a context switch within
a guest triggers a context switch on the CPU -- thereby ensuring guest
context switches flush guest RSB entries.  For shadow paging, the CPU's
CR3 is not used for guest processes, and hence cannot benefit from this
feature.

Signed-off-by: Amit Shah <amit.shah@amd.com>
---
 arch/x86/include/asm/svm.h |  6 +++++-
 arch/x86/kvm/cpuid.c       | 18 ++++++++++++++++--
 arch/x86/kvm/svm/svm.c     | 29 +++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.h     | 15 +++++++++++++++
 4 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 2b59b9951c90..f8584a63c859 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -129,7 +129,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 	u64 tsc_offset;
 	u32 asid;
 	u8 tlb_ctl;
-	u8 reserved_2[3];
+	u8 erap_ctl;
+	u8 reserved_2[2];
 	u32 int_ctl;
 	u32 int_vector;
 	u32 int_state;
@@ -175,6 +176,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 #define TLB_CONTROL_FLUSH_ASID 3
 #define TLB_CONTROL_FLUSH_ASID_LOCAL 7
 
+#define ERAP_CONTROL_ALLOW_LARGER_RAP 0
+#define ERAP_CONTROL_FLUSH_RAP 1
+
 #define V_TPR_MASK 0x0f
 
 #define V_IRQ_SHIFT 8
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 097bdc022d0f..dd589670a716 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -803,6 +803,8 @@ void kvm_set_cpu_caps(void)
 		F(WRMSR_XX_BASE_NS)
 	);
 
+	if (tdp_enabled)
+		kvm_cpu_cap_check_and_set(X86_FEATURE_ERAPS);
 	kvm_cpu_cap_check_and_set(X86_FEATURE_SBPB);
 	kvm_cpu_cap_check_and_set(X86_FEATURE_IBPB_BRTYPE);
 	kvm_cpu_cap_check_and_set(X86_FEATURE_SRSO_NO);
@@ -1362,10 +1364,22 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 	case 0x80000020:
 		entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
 		break;
-	case 0x80000021:
-		entry->ebx = entry->ecx = entry->edx = 0;
+	case 0x80000021: {
+		unsigned int ebx_mask = 0;
+
+		entry->ecx = entry->edx = 0;
 		cpuid_entry_override(entry, CPUID_8000_0021_EAX);
+
+		/*
+		 * Bits 23:16 in EBX indicate the size of the RSB.
+		 * Expose the value in the hardware to the guest.
+		 */
+		if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
+			ebx_mask |= GENMASK(23, 16);
+
+		entry->ebx &= ebx_mask;
 		break;
+	}
 	/* AMD Extended Performance Monitoring and Debug */
 	case 0x80000022: {
 		union cpuid_0x80000022_ebx ebx;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index dd15cc635655..9b055de079cb 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1360,6 +1360,13 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
 	if (boot_cpu_has(X86_FEATURE_V_SPEC_CTRL))
 		set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
 
+	/*
+	 * If the hardware has a larger RSB, use it in the guest context as
+	 * well.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_ERAPS) && npt_enabled)
+		vmcb_set_larger_rap(svm->vmcb);
+
 	if (kvm_vcpu_apicv_active(vcpu))
 		avic_init_vmcb(svm, vmcb);
 
@@ -3395,6 +3402,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
 	pr_err("%-20s%016llx\n", "tsc_offset:", control->tsc_offset);
 	pr_err("%-20s%d\n", "asid:", control->asid);
 	pr_err("%-20s%d\n", "tlb_ctl:", control->tlb_ctl);
+	pr_err("%-20s%d\n", "erap_ctl:", control->erap_ctl);
 	pr_err("%-20s%08x\n", "int_ctl:", control->int_ctl);
 	pr_err("%-20s%08x\n", "int_vector:", control->int_vector);
 	pr_err("%-20s%08x\n", "int_state:", control->int_state);
@@ -3561,6 +3569,27 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 
 		trace_kvm_nested_vmexit(vcpu, KVM_ISA_SVM);
 
+		if (boot_cpu_has(X86_FEATURE_ERAPS)
+		    && vmcb_is_larger_rap(svm->vmcb01.ptr)) {
+			/*
+			 * XXX a few further optimizations can be made:
+			 *
+			 * 1. In pre_svm_run() we can reset this bit when a hw
+			 * TLB flush has happened - any context switch on a
+			 * CPU (which causes a TLB flush) auto-flushes the RSB
+			 * - eg when this vCPU is scheduled on a different
+			 * pCPU.
+			 *
+			 * 2. This is also not needed in the case where the
+			 * vCPU is being scheduled on the same pCPU, but there
+			 * was a context switch between the #VMEXIT and VMRUN.
+			 *
+			 * 3. If the guest returns to L2 again after this
+			 * #VMEXIT, there's no need to flush the RSB.
+			 */
+			vmcb_set_flush_rap(svm->vmcb01.ptr);
+		}
+
 		vmexit = nested_svm_exit_special(svm);
 
 		if (vmexit == NESTED_EXIT_CONTINUE)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 43fa6a16eb19..8a7877f46dc5 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -500,6 +500,21 @@ static inline bool svm_is_intercept(struct vcpu_svm *svm, int bit)
 	return vmcb_is_intercept(&svm->vmcb->control, bit);
 }
 
+static inline void vmcb_set_flush_rap(struct vmcb *vmcb)
+{
+	__set_bit(ERAP_CONTROL_FLUSH_RAP, (unsigned long *)&vmcb->control.erap_ctl);
+}
+
+static inline void vmcb_set_larger_rap(struct vmcb *vmcb)
+{
+	__set_bit(ERAP_CONTROL_ALLOW_LARGER_RAP, (unsigned long *)&vmcb->control.erap_ctl);
+}
+
+static inline bool vmcb_is_larger_rap(struct vmcb *vmcb)
+{
+	return test_bit(ERAP_CONTROL_ALLOW_LARGER_RAP, (unsigned long *)&vmcb->control.erap_ctl);
+}
+
 static inline bool nested_vgif_enabled(struct vcpu_svm *svm)
 {
 	return guest_can_use(&svm->vcpu, X86_FEATURE_VGIF) &&
-- 
2.47.0
Re: [RFC PATCH v3 2/2] x86: kvm: svm: advertise ERAPS (larger RSB) support to guests
Posted by Sean Christopherson 2 months ago
KVM: SVM:

Please read through the relevant maintainer handbooks; there are warts all over:

  Documentation/process/maintainer-kvm-x86.rst
  Documentation/process/maintainer-tip.rst

And the shortlog is wrong.  The patch itself is also broken.  KVM should (a) add
support for virtualizing ERAPS and (b) advertise support to *userspace*.  The
userspace VMM ultimately decides what to expose/enable for the guest.

On Thu, Nov 28, 2024, Amit Shah wrote:
> From: Amit Shah <amit.shah@amd.com>
> 
> AMD CPUs with the ERAPS feature (Zen5+) have a larger RSB (aka RAP).

Please spell out ERAPS at least once (I assume it's "Enhanced RAPs"?) and very
briefly document what it does.

> While the new default RSB size is used on the host without any software
> modification necessary, the RSB size for guests is limited to the older

Please describe hardware behavior, and make it abundantly clear that the changelog
is talking about hardware behavior.  One of my pet peeves (understatement) with
the APM is that it does a shit job of explaining the actual architectural behavior.

> value (32 entries) for backwards compatibility.

Backwards compatibility with what?  And how far back?  E.g. have CPUs with a RAP
always had 32 entries?

> With this patch, KVM

No "this patch"

> enables guest mode 

Use imperative mood.

> to also use the default number of entries by setting

"default" is clearly wrong, since the *default* behavior is to use the
smaller, backwards-compatible size.

> the new ALLOW_LARGER_RAP bit in the VMCB.

I detest the "ALLOW_LARGER" name.  "Allow" implies the guest somehow has a choice.
And "Larger" implies there's an even larger size

And again, please explicitly describe what this bit does.

> The two cases for backward compatibility that need special handling are
> nested guests, and guests using shadow paging

Guests don't use shadow paging; *KVM* uses shadow paging to run the guest.

> (or when NPT is disabled):

"i.e.", not "or".  "Or" makes it sound like "NPT is disabled" is a
separate case from shadow paging.

> For nested guests: the ERAPS feature adds host/guest tagging to entries
> in the RSB, but does not distinguish between ASIDs.  On a nested exit,
> the L0 hypervisor instructs the hardware (via another new VMCB bit,

I strongly suspect this was copied from the APM.  Please don't do that.  State
what change is being made for *KVM*, not for "the L0 hypervisor".  This verbiage mixes
hardware behavior with software behavior, which again is why I hate much of the
APM's wording.

> FLUSH_RAP_ON_VMRUN) to flush the RSB on the next VMRUN to prevent RSB
> poisoning attacks from an L2 guest to an L1 guest.  With that in place,
> this feature can be exposed to guests.

ERAPS can also be advertised if nested virtualization is disabled, no?  I think
it makes sense to first add support for ERAPS if "!nested", and then in a separate
patch, add support for ERAPS when nested virtualization is enabled.  Partly so that
it's easier for readers to understand why nested VMs are special, but mainly because
the nested virtualization support is sorely lacking.
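
For the "!nested" step, perhaps something like this (a sketch, not
posted code; it assumes the svm "nested" module param is reachable from
where the ERAPS cap is set):

	/* Sketch: advertise ERAPS only when nested virt is off, for now. */
	if (tdp_enabled && !nested)
		kvm_cpu_cap_check_and_set(X86_FEATURE_ERAPS);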

> For shadow paging guests: do not expose this feature to guests; only
> expose if nested paging is enabled, to ensure a context switch within
> a guest triggers a context switch on the CPU -- thereby ensuring guest
> context switches flush guest RSB entries.

Huh?

> For shadow paging, the CPU's CR3 is not used for guest processes, and hence
> cannot benefit from this feature.

What does that have to do with anything?

> Signed-off-by: Amit Shah <amit.shah@amd.com>
> ---
>  arch/x86/include/asm/svm.h |  6 +++++-
>  arch/x86/kvm/cpuid.c       | 18 ++++++++++++++++--
>  arch/x86/kvm/svm/svm.c     | 29 +++++++++++++++++++++++++++++
>  arch/x86/kvm/svm/svm.h     | 15 +++++++++++++++
>  4 files changed, 65 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
> index 2b59b9951c90..f8584a63c859 100644
> --- a/arch/x86/include/asm/svm.h
> +++ b/arch/x86/include/asm/svm.h
> @@ -129,7 +129,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
>  	u64 tsc_offset;
>  	u32 asid;
>  	u8 tlb_ctl;
> -	u8 reserved_2[3];
> +	u8 erap_ctl;
> +	u8 reserved_2[2];
>  	u32 int_ctl;
>  	u32 int_vector;
>  	u32 int_state;
> @@ -175,6 +176,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
>  #define TLB_CONTROL_FLUSH_ASID 3
>  #define TLB_CONTROL_FLUSH_ASID_LOCAL 7
>  
> +#define ERAP_CONTROL_ALLOW_LARGER_RAP 0
> +#define ERAP_CONTROL_FLUSH_RAP 1

Assuming the control enables using the full RAP size, these should be something
like:

#define ERAP_CONTROL_ENABLE_FULL_RAP_MASK	BIT(0)
#define ERAP_CONTROL_FLUSH_RAP_ON_VMRUN		BIT(1)

>  #define V_TPR_MASK 0x0f
>  
>  #define V_IRQ_SHIFT 8
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 097bdc022d0f..dd589670a716 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -803,6 +803,8 @@ void kvm_set_cpu_caps(void)
>  		F(WRMSR_XX_BASE_NS)
>  	);
>  
> +	if (tdp_enabled)
> +		kvm_cpu_cap_check_and_set(X86_FEATURE_ERAPS);
>  	kvm_cpu_cap_check_and_set(X86_FEATURE_SBPB);
>  	kvm_cpu_cap_check_and_set(X86_FEATURE_IBPB_BRTYPE);
>  	kvm_cpu_cap_check_and_set(X86_FEATURE_SRSO_NO);
> @@ -1362,10 +1364,22 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>  	case 0x80000020:
>  		entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
>  		break;
> -	case 0x80000021:
> -		entry->ebx = entry->ecx = entry->edx = 0;
> +	case 0x80000021: {
> +		unsigned int ebx_mask = 0;
> +
> +		entry->ecx = entry->edx = 0;
>  		cpuid_entry_override(entry, CPUID_8000_0021_EAX);
> +
> +		/*
> +		 * Bits 23:16 in EBX indicate the size of the RSB.

Is this enumeration explicitly tied to ERAPS?

> +		 * Expose the value in the hardware to the guest.

__do_cpuid_func() is used to advertise KVM's supported CPUID to host userspace,
not to the guest.

Side topic, what happens when Zen6 adds EVEN_LARGER_RAP?  Enumerating the size of
the RAP suggests it's likely to change in the future.

> +		 */
> +		if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
> +			ebx_mask |= GENMASK(23, 16);
> +
> +		entry->ebx &= ebx_mask;

This is a waste of code and makes it unnecessarily difficult to read.  Just do:

		if (kvm_cpu_cap_has(X86_FEATURE_ERAPS))
			entry->ebx &= GENMASK(23, 16);
		else
			entry->ebx = 0;

>  		break;
> +	}
>  	/* AMD Extended Performance Monitoring and Debug */
>  	case 0x80000022: {
>  		union cpuid_0x80000022_ebx ebx;
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index dd15cc635655..9b055de079cb 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1360,6 +1360,13 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
>  	if (boot_cpu_has(X86_FEATURE_V_SPEC_CTRL))
>  		set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
>  
> +	/*
> +	 * If the hardware has a larger RSB, use it in the guest context as
> +	 * well.
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_ERAPS) && npt_enabled)

This is wrong.  Userspace owns the vCPU model, not KVM.  If userspace wants to
disable ERAPS for the guest, say because of a hardware vulnerability, then KVM
needs to honor that.

And this should be kvm_cpu_cap_has(), not copy+paste of the code that enables
the KVM capability.

> +		vmcb_set_larger_rap(svm->vmcb);

s/set/enable.  "set" implies a value in this context.

> +
>  	if (kvm_vcpu_apicv_active(vcpu))
>  		avic_init_vmcb(svm, vmcb);
>  
> @@ -3395,6 +3402,7 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
>  	pr_err("%-20s%016llx\n", "tsc_offset:", control->tsc_offset);
>  	pr_err("%-20s%d\n", "asid:", control->asid);
>  	pr_err("%-20s%d\n", "tlb_ctl:", control->tlb_ctl);
> +	pr_err("%-20s%d\n", "erap_ctl:", control->erap_ctl);
>  	pr_err("%-20s%08x\n", "int_ctl:", control->int_ctl);
>  	pr_err("%-20s%08x\n", "int_vector:", control->int_vector);
>  	pr_err("%-20s%08x\n", "int_state:", control->int_state);
> @@ -3561,6 +3569,27 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>  
>  		trace_kvm_nested_vmexit(vcpu, KVM_ISA_SVM);
>  
> +		if (boot_cpu_has(X86_FEATURE_ERAPS)
> +		    && 

&& goes on the previous line.

> vmcb_is_larger_rap(svm->vmcb01.ptr)) {

This should be something like "vmcb_is_full_rap_size_enabled()".  "is_larger_rap()"
begs the question: is larger than what? 

> +			/*
> +			 * XXX a few further optimizations can be made:
> +			 *
> +			 * 1. In pre_svm_run() we can reset this bit when a hw
> +			 * TLB flush has happened - any context switch on a
> +			 * CPU (which causes a TLB flush) auto-flushes the RSB
> +			 * - eg when this vCPU is scheduled on a different
> +			 * pCPU.
> +			 *
> +			 * 2. This is also not needed in the case where the
> +			 * vCPU is being scheduled on the same pCPU, but there
> +			 * was a context switch between the #VMEXIT and VMRUN.

Either do the optimizations straightaway, or call them out as possible optimizations
in the changelog and then explain why it's not worth doing them.

The above also mixes hardware behavior and software behavior, to the point where
I honestly have no idea who is doing what.  "A context switch" tells me nothing
useful.

> +			 *
> +			 * 3. If the guest returns to L2 again after this
> +			 * #VMEXIT, there's no need to flush the RSB.

This one in particular is trivially easy to implement correctly.

This also highlights the fact that KVM completely fails to emulate FLUSH_RAP_ON_VMRUN
if it's set in vmcb12, though that's somewhat of a moot point because unless I'm
missing something, KVM is responsible for emulating host vs. guest hardware tagging.

From L1's perspective, the (virtual) CPU, a.k.a. KVM, is responsible for isolating
guest (L2) RAP entries from host (L1) RAP entries.  And so KVM must flush the RAP
on every nested VM-Exit *and* nested VM-Enter, not just on nested VM-Exit.
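
Concretely, something along these lines, reusing this patch's helper
and KVM's vmcb01/vmcb02 bookkeeping (a sketch; the placement is my
assumption, not tested code):

	/* Nested VM-Enter, e.g. in nested_svm_vmrun(): flush the RAP so
	 * L1's entries can't influence L2.
	 */
	vmcb_set_flush_rap(svm->nested.vmcb02.ptr);

	/* Nested VM-Exit, e.g. in nested_svm_vmexit(): flush the RAP so
	 * L2's entries can't influence L1.
	 */
	vmcb_set_flush_rap(svm->vmcb01.ptr);

Hooking the nested transition points instead of svm_handle_exit() would
also get optimization (3) above for free: an exit that resumes L2
directly never goes through nested_svm_vmexit().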

> +			 */
> +			vmcb_set_flush_rap(svm->vmcb01.ptr);

Eh, follow the TLB flush helpers and just go with vmcb_flush_rap().

> +		}
> +
>  		vmexit = nested_svm_exit_special(svm);
>  
>  		if (vmexit == NESTED_EXIT_CONTINUE)
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 43fa6a16eb19..8a7877f46dc5 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -500,6 +500,21 @@ static inline bool svm_is_intercept(struct vcpu_svm *svm, int bit)
>  	return vmcb_is_intercept(&svm->vmcb->control, bit);
>  }
>  
> +static inline void vmcb_set_flush_rap(struct vmcb *vmcb)
> +{
> +	__set_bit(ERAP_CONTROL_FLUSH_RAP, (unsigned long *)&vmcb->control.erap_ctl);

Eww.  Don't use the bitops helpers, casting a u8 to an unsigned long, and then
having to use the non-atomic helpers makes this way, way more complicated than
it actually is.

	vmcb->control.erap_ctl |= ERAP_CONTROL_FLUSH_RAP_ON_VMRUN;

> +}
> +
> +static inline void vmcb_set_larger_rap(struct vmcb *vmcb)
> +{
> +	__set_bit(ERAP_CONTROL_ALLOW_LARGER_RAP, (unsigned long *)&vmcb->control.erap_ctl);
> +}
> +
> +static inline bool vmcb_is_larger_rap(struct vmcb *vmcb)
> +{
> +	return test_bit(ERAP_CONTROL_ALLOW_LARGER_RAP, (unsigned long *)&vmcb->control.erap_ctl);
> +}
> +
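
Taken together with the mask-style defines suggested above, the helpers
could collapse to something like this (untested sketch; the names are my
reading of the suggestions in this review):

	/* Sketch only -- plain RMW on the u8, no bitops casting. */
	static inline void vmcb_flush_rap(struct vmcb *vmcb)
	{
		vmcb->control.erap_ctl |= ERAP_CONTROL_FLUSH_RAP_ON_VMRUN;
	}

	static inline void vmcb_enable_full_rap(struct vmcb *vmcb)
	{
		vmcb->control.erap_ctl |= ERAP_CONTROL_ENABLE_FULL_RAP_MASK;
	}

	static inline bool vmcb_is_full_rap_enabled(struct vmcb *vmcb)
	{
		return vmcb->control.erap_ctl & ERAP_CONTROL_ENABLE_FULL_RAP_MASK;
	}
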
>  static inline bool nested_vgif_enabled(struct vcpu_svm *svm)
>  {
>  	return guest_can_use(&svm->vcpu, X86_FEATURE_VGIF) &&
> -- 
> 2.47.0
>