Xen Security Advisory 452 v1 (CVE-2023-28746) - x86: Register File Data Sampling

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

            Xen Security Advisory CVE-2023-28746 / XSA-452

                   x86: Register File Data Sampling

ISSUE DESCRIPTION
=================

Intel have disclosed RFDS, Register File Data Sampling, affecting some
Atom cores.

This came from internal validation work.  There is no information
provided about how an attacker might go about inferring data from the
register files.

For more details, see:
  https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/register-file-data-sampling.html

IMPACT
======

An attacker might be able to infer the contents of data held previously
in floating point, vector and/or integer register files on the same
logical processor, including data from a more privileged context.

Note: None of the vulnerable processors support HyperThreading, so there
      is no instantaneous exposure of data from other threads.

VULNERABLE SYSTEMS
==================

Systems running all versions of Xen are affected.

RFDS is only known to affect certain Atom processors from Intel.  Other
Intel CPUs, and CPUs from other hardware vendors, are not known to be
affected.

RFDS affects Atom processors between the Goldmont and Gracemont
microarchitectures.  This includes Alder Lake and Raptor Lake hybrid
client systems which have a mix of Gracemont and other types of cores.

MITIGATION
==========

There is no mitigation.

RESOLUTION
==========

Intel are producing microcode updates to address the issue for in-support
CPUs.  This is done by extending the VERW instruction with more
scrubbing side effects.  Consult your dom0 OS vendor and/or hardware
vendor for updated microcode.

In addition to the microcode, changes are required in Xen to reposition
the VERW scrubbing and to activate it when necessary, as well as to
inform guest kernels of when the extra side effect is present and/or
when the system is believed to be not vulnerable.  The appropriate set
of patches does this.
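
To illustrate how the new enumeration might be consumed (a minimal,
hypothetical sketch, not Xen's actual logic; the MSR and bit positions
are taken from Intel's RFDS documentation):

    #include <stdbool.h>
    #include <stdint.h>

    #define MSR_ARCH_CAPABILITIES   0x0000010a
    #define ARCH_CAPS_RFDS_NO       (1ULL << 27) /* not vulnerable        */
    #define ARCH_CAPS_RFDS_CLEAR    (1ULL << 28) /* VERW scrubs reg files */

    uint64_t rdmsr(uint32_t idx); /* hypothetical MSR accessor */

    static bool want_rfds_verw(void)
    {
        uint64_t caps = rdmsr(MSR_ARCH_CAPABILITIES);

        if ( caps & ARCH_CAPS_RFDS_NO )
            return false;  /* hardware declares itself unaffected */

        /* Mitigate only once microcode provides the VERW side effect. */
        return caps & ARCH_CAPS_RFDS_CLEAR;
    }

Guest kernels make an equivalent decision, which is why the patches also
need to inform guests of the extra side effect and/or of the system not
being vulnerable.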

Note that patches for released versions are generally prepared to
apply to the stable branches, and may not apply cleanly to the most
recent release tarball.  Downstreams are encouraged to update to the
tip of the stable branch before applying these patches.

xsa452/xsa452-?.patch           xen-unstable
xsa452/xsa452-4.18-?.patch      Xen 4.18.x
xsa452/xsa452-4.17-?.patch      Xen 4.17.x
xsa452/xsa452-4.16-?.patch      Xen 4.16.x
xsa452/xsa452-4.15-?.patch      Xen 4.15.x

$ sha256sum xsa452*/*
9365456e85fc04947206075cdfe4a805c3d628d7c1f5b8020785d8fd84c93aa9  xsa452/xsa452-1.patch
89ce3001975352a1321dc1577d9d14273e6b383080900881603339e5a860e1fd  xsa452/xsa452-2.patch
775a2d57b7aa8e2522cce61b1ddebd267e36218ecdcc0f678db7ed0ed1f54c21  xsa452/xsa452-3.patch
0e56da437f3ea30b97f79fa1d247561815625e152c963dd504f11082863eaa32  xsa452/xsa452-4.15-1.patch
184f2fe90b614e3e5c7056669ea6c829242058f5c00407a3db1e34bcd4fb4aed  xsa452/xsa452-4.15-2.patch
237e9aa65122ef4a18f57e44f6841a80e967deac90e251ce629cba6ea2f66030  xsa452/xsa452-4.15-3.patch
59d5ec14b784b6c4f9ce2bb6258cb91ee6233fc01761f27c655f4582bdeb6830  xsa452/xsa452-4.15-4.patch
946a8d80f7c11a03a26a045eb2ba4e03be7e739f04df72e5e1f67279e374136e  xsa452/xsa452-4.15-5.patch
6eba7f56a67a101c39e2345b53530a4036b2fad50f4b745e39f8da1d0bffcbd5  xsa452/xsa452-4.15-6.patch
326571a214f358787bc4af8c71d96ae6455a9da80da4d43358af282eebb51e4d  xsa452/xsa452-4.15-7.patch
5aca7cf8ca97dd735769fc4c154dab576461da7ec1838ad152e90ceebb5af60d  xsa452/xsa452-4.16-1.patch
c7167c270a28cb639a9b94b898e656123767c21d0951fe48404bbbcf7d2be151  xsa452/xsa452-4.16-2.patch
55d61becc38663c6756baceb919645bc2cb4794b517cd067f9b452822fe11ecf  xsa452/xsa452-4.16-3.patch
6e7a93935d1a4df2dea5d9a6542127feb5d662b33cc766587a713746e4992841  xsa452/xsa452-4.16-4.patch
ee4bbf1988a05cc00c51512d5f258d310f3d5f21d23094d4a7b9ca3cf55ffcde  xsa452/xsa452-4.16-5.patch
f44dc3d957eca731834d13c1b7bf31cadfee5c4d354dbdb1e6aa317063c26420  xsa452/xsa452-4.16-6.patch
520188698c87ebfd42457b8f22d62e20e715d1bf28bfe43f93fbac4479485b15  xsa452/xsa452-4.16-7.patch
5ee4fffcb0418d34ec03605cf507d7c24d82355716fde250d3fd01308c40b29f  xsa452/xsa452-4.17-1.patch
a4081d6329c9ba7dd95b2f693ef6cfa61ef3a6148b0e4279f2cc8648be98b1ca  xsa452/xsa452-4.17-2.patch
bd6364569bb1d2841df6e9dad2d0c0d859b5cab5046141ba6c54a53ab7cbef76  xsa452/xsa452-4.17-3.patch
9c7aedd1a4f1e3dab344dd4ac0438de3ab25079f6aaa8d2f1b384b8f6f2df770  xsa452/xsa452-4.17-4.patch
7886b2da37de7c8bb0ed1bd9e8f001dbc46aa8802152c315fa1141f76e09dc77  xsa452/xsa452-4.17-5.patch
01dd485e5b2130b85905187ef6351d2fb6514cefb0096db3f710bff4345b8c29  xsa452/xsa452-4.17-6.patch
f64109a3e0a2237cc4fabc94f680c96a82e71d037e9d263ee7782fef0895fa32  xsa452/xsa452-4.17-7.patch
2fa4d889fc193e4ddd46e570e8c37d59e89fd667db52afb912d692d2775b25d6  xsa452/xsa452-4.18-1.patch
a4081d6329c9ba7dd95b2f693ef6cfa61ef3a6148b0e4279f2cc8648be98b1ca  xsa452/xsa452-4.18-2.patch
d4f61f50c9c6c17888ae6a371a2bde95cfad92d4e72c5e3ca54638fb4cc6fcfa  xsa452/xsa452-4.18-3.patch
7922255f39744c75fa2e84c3971a27432b1f1f177ddb40647bdc753eacea412c  xsa452/xsa452-4.18-4.patch
b262adff116cb00c371b45cffffb111c4ca359490a27a69ad7482a1ae92ac173  xsa452/xsa452-4.18-5.patch
0c0830b81f60b5a5b4d6bd339410ab6f512276491d30881587361b9c9fb7d0ff  xsa452/xsa452-4.18-6.patch
4ab5a0106c4ffdf713ebd3059eccd07ae8589e0d8348413685ecf0ff7d7b2a05  xsa452/xsa452-4.18-7.patch
51c1561026f32415cf69a362cca33a14aa361f34ddce3785667d99d25e922488  xsa452/xsa452-4.patch
518da7d12c295851a1ae3a03cc28b290bc0e9dee4c4446d20d341c88c9908961  xsa452/xsa452-5.patch
$
-----BEGIN PGP SIGNATURE-----

iQFABAEBCAAqFiEEI+MiLBRfRHX6gGCng/4UyVfoK9kFAmXwhmMMHHBncEB4ZW4u
b3JnAAoJEIP+FMlX6CvZfzUIALkcXm6t44EmYio/o6hUaxtx/V13QAANeTVss/V/
jRblCgWLw5hb39IToDmoDaX46fIxNDjAzT6GqOB/rnLHj9vNv15zVEsiAxgKPQXs
YQyYZQxKB/4kb24JG/KhPLBc1iQOXWmK9BmNdgHgOlC1fqXzYHInZsm69BZhs6Dk
nScFOeCaT/zvLybhehRioHFpNKkiFXSxZnIuj7IB9zkVrbS0YzZX9+H56Rs/VAuF
wTqoCdqSZ0F5KnWsXsnWCYfz3Sd/mTiT5qvFROPCqbfNClEnU7NzCd4Mz2/QVjJJ
LXhN/CrllJKWcpAcFW6Bx250uDC3/oSBfHNL/D+AsC/abcM=
=N4gH
-----END PGP SIGNATURE-----
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vmx: Perform VERW flushing later in the VMExit path

Broken out of the following patch because this change is subtle enough on its
own.  See it for the rationale of why we're moving VERW.

As for how, extend the trick already used to hold one condition in
flags (RESUME vs LAUNCH) through the POPing of GPRs.

Move the MOV CR earlier.  Intel specify flags to be undefined across it.

Encode the two conditions we want using SF and PF.  See the code comment for
exactly how.
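
As a minimal C sketch of that encoding (SCF_verw's exact value is an
assumption here; the trick only relies on it being a single bit in the
bottom byte, which the BUILD_BUG_ON enforces):

    #include <stdbool.h>
    #include <stdint.h>

    #define SCF_verw (1u << 3)          /* assumed single-bit value */

    struct flags { bool sf, pf; };

    static struct flags encode(bool launched, uint8_t scf)
    {
        uint32_t v = ((uint32_t)launched << 31) | (scf & SCF_verw);

        /* SF is bit 31 of the result; PF is set when the low byte has
         * an even number of bits set. */
        return (struct flags){ v >> 31, !__builtin_parity(v & 0xff) };
    }

Hence `jpe` (jump if PF set) skips the VERW exactly when SCF_verw was
clear, and `jns` (jump if SF clear) selects the LAUNCH path exactly when
vmx_launched was zero.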

Leave a comment to explain the lack of any content around
SPEC_CTRL_EXIT_TO_VMX, but leave the block in place.  Sod's law says if we
delete it, we'll need to reintroduce it.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S
index e3f60d5a82f7..1bead826caa3 100644
--- a/xen/arch/x86/hvm/vmx/entry.S
+++ b/xen/arch/x86/hvm/vmx/entry.S
@@ -87,17 +87,39 @@ UNLIKELY_END(realmode)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
         /* SPEC_CTRL_EXIT_TO_VMX   Req: %rsp=regs/cpuinfo              Clob:    */
-        DO_SPEC_CTRL_COND_VERW
+        /*
+         * All speculation safety work happens to be elsewhere.  VERW is after
+         * popping the GPRs, while restoring the guest MSR_SPEC_CTRL is left
+         * to the MSR load list.
+         */
 
         mov  VCPU_hvm_guest_cr2(%rbx),%rax
+        mov  %rax, %cr2
+
+        /*
+         * We need to perform two conditional actions (VERW, and Resume vs
+         * Launch) after popping GPRs.  With some cunning, we can encode both
+         * of these in eflags together.
+         *
+         * Parity is only calculated over the bottom byte of the answer, while
+         * Sign is simply the top bit.
+         *
+         * Therefore, the final OR instruction ends up producing:
+         *   SF = VCPU_vmx_launched
+         *   PF = !SCF_verw
+         */
+        BUILD_BUG_ON(SCF_verw & ~0xff)
+        movzbl VCPU_vmx_launched(%rbx), %ecx
+        shl  $31, %ecx
+        movzbl CPUINFO_spec_ctrl_flags(%rsp), %eax
+        and  $SCF_verw, %eax
+        or   %eax, %ecx
 
         pop  %r15
         pop  %r14
         pop  %r13
         pop  %r12
         pop  %rbp
-        mov  %rax,%cr2
-        cmpb $0,VCPU_vmx_launched(%rbx)
         pop  %rbx
         pop  %r11
         pop  %r10
@@ -108,7 +130,13 @@ UNLIKELY_END(realmode)
         pop  %rdx
         pop  %rsi
         pop  %rdi
-        je   .Lvmx_launch
+
+        jpe  .L_skip_verw
+        /* VERW clobbers ZF, but preserves all others, including SF. */
+        verw STK_REL(CPUINFO_verw_sel, CPUINFO_error_code)(%rsp)
+.L_skip_verw:
+
+        jns  .Lvmx_launch
 
 /*.Lvmx_resume:*/
         VMRESUME
diff --git a/xen/arch/x86/include/asm/spec_ctrl_asm.h b/xen/arch/x86/include/asm/spec_ctrl_asm.h
index aecb91d84899..ba8e0ae28b1a 100644
--- a/xen/arch/x86/include/asm/spec_ctrl_asm.h
+++ b/xen/arch/x86/include/asm/spec_ctrl_asm.h
@@ -152,6 +152,13 @@
 #endif
 .endm
 
+/*
+ * Helper to improve the readability of stack displacements with %rsp in
+ * unusual positions.  Both @field and @top_of_stk should be constants from
+ * the same object.  @top_of_stk should be where %rsp is currently pointing.
+ */
+#define STK_REL(field, top_of_stk) ((field) - (top_of_stk))
+
 .macro DO_SPEC_CTRL_COND_VERW
 /*
  * Requires %rsp=cpuinfo
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 02242a9b7363..8697404cd007 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -135,6 +135,7 @@ void __dummy__(void)
 #endif
 
     OFFSET(CPUINFO_guest_cpu_user_regs, struct cpu_info, guest_cpu_user_regs);
+    OFFSET(CPUINFO_error_code, struct cpu_info, guest_cpu_user_regs.error_code);
     OFFSET(CPUINFO_processor_id, struct cpu_info, processor_id);
     OFFSET(CPUINFO_verw_sel, struct cpu_info, verw_sel);
     OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Perform VERW flushing later in exit paths

On parts vulnerable to RFDS, VERW's side effects are extended to scrub all
non-architectural entries in various Physical Register Files.  To remove all
of Xen's values, the VERW must be after popping the GPRs.

Rework SPEC_CTRL_COND_VERW to default to a CPUINFO_error_code %rsp position,
but with overrides for other contexts.  Identify that it clobbers eflags; this
is particularly relevant for the SYSRET path.

For the IST exit return to Xen, have the main SPEC_CTRL_EXIT_TO_XEN put a
shadow copy of spec_ctrl_flags, as GPRs can't be used at the point we want to
issue the VERW.
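
A minimal C sketch of the frame geometry being relied upon (field sizes
illustrative; the real layout is struct cpu_user_regs):

    #include <stddef.h>
    #include <stdint.h>

    struct eframe {
        uint32_t error_code;   /* %rsp points here in most exit paths */
        uint32_t entry_vector;
        uint64_t rip;          /* ... and here once ev/ec are popped  */
        uint64_t cs;
        uint32_t eflags;       /* zero-extended to 64 bits by the CPU */
        uint16_t shadow_scf;   /* spare space at eflags + 4           */
        uint16_t shadow_sel;   /* spare space at eflags + 6           */
        uint64_t rsp;
        uint64_t ss;
    };

    /* Same shape as the STK_REL() helper introduced earlier. */
    #define STK_REL(field, top) \
        ((long)offsetof(struct eframe, field) - \
         (long)offsetof(struct eframe, top))

STK_REL(shadow_scf, rip) then gives the displacement usable after the
GPRs and the error code have been popped and only %rsp remains good.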

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/include/asm/spec_ctrl_asm.h b/xen/arch/x86/include/asm/spec_ctrl_asm.h
index ba8e0ae28b1a..629518cc6925 100644
--- a/xen/arch/x86/include/asm/spec_ctrl_asm.h
+++ b/xen/arch/x86/include/asm/spec_ctrl_asm.h
@@ -159,16 +159,23 @@
  */
 #define STK_REL(field, top_of_stk) ((field) - (top_of_stk))
 
-.macro DO_SPEC_CTRL_COND_VERW
+.macro SPEC_CTRL_COND_VERW \
+    scf=STK_REL(CPUINFO_spec_ctrl_flags, CPUINFO_error_code), \
+    sel=STK_REL(CPUINFO_verw_sel,        CPUINFO_error_code)
 /*
- * Requires %rsp=cpuinfo
+ * Requires \scf and \sel as %rsp-relative expressions
+ * Clobbers eflags
+ *
+ * VERW needs to run after guest GPRs have been restored, where only %rsp is
+ * good to use.  Default to expecting %rsp pointing at CPUINFO_error_code.
+ * Contexts where this is not true must provide an alternative \scf and \sel.
  *
  * Issue a VERW for its flushing side effect, if indicated.  This is a Spectre
  * v1 gadget, but the IRET/VMEntry is serialising.
  */
-    testb $SCF_verw, CPUINFO_spec_ctrl_flags(%rsp)
+    testb $SCF_verw, \scf(%rsp)
     jz .L\@_verw_skip
-    verw CPUINFO_verw_sel(%rsp)
+    verw \sel(%rsp)
 .L\@_verw_skip:
 .endm
 
@@ -286,8 +293,6 @@
  */
     ALTERNATIVE "", DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_PV
 
-    DO_SPEC_CTRL_COND_VERW
-
     ALTERNATIVE "", DO_SPEC_CTRL_DIV, X86_FEATURE_SC_DIV
 .endm
 
@@ -367,7 +372,7 @@ UNLIKELY_DISPATCH_LABEL(\@_serialise):
  */
 .macro SPEC_CTRL_EXIT_TO_XEN
 /*
- * Requires %r12=ist_exit, %r14=stack_end
+ * Requires %r12=ist_exit, %r14=stack_end, %rsp=regs
  * Clobbers %rax, %rbx, %rcx, %rdx
  */
     movzbl STACK_CPUINFO_FIELD(spec_ctrl_flags)(%r14), %ebx
@@ -395,11 +400,18 @@ UNLIKELY_DISPATCH_LABEL(\@_serialise):
     test %r12, %r12
     jz .L\@_skip_ist_exit
 
-    /* Logically DO_SPEC_CTRL_COND_VERW but without the %rsp=cpuinfo dependency */
-    testb $SCF_verw, %bl
-    jz .L\@_skip_verw
-    verw STACK_CPUINFO_FIELD(verw_sel)(%r14)
-.L\@_skip_verw:
+    /*
+     * Stash SCF and verw_sel above eflags in the case of an IST_exit.  The
+     * VERW logic needs to run after guest GPRs have been restored; i.e. where
+     * we cannot use %r12 or %r14 for the purposes they have here.
+     *
+     * When the CPU pushed this exception frame, it zero-extended eflags.
+     * Therefore it is safe for the VERW logic to look at the stashed SCF
+     * outside of the ist_exit condition.  Also, this stashing won't influence
+     * any other restore_all_guest() paths.
+     */
+    or $(__HYPERVISOR_DS32 << 16), %ebx
+    mov %ebx, UREGS_eflags + 4(%rsp) /* EFRAME_shadow_scf/sel */
 
     ALTERNATIVE "", DO_SPEC_CTRL_DIV, X86_FEATURE_SC_DIV
 
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 8697404cd007..d8903a3ce9c7 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -55,14 +55,22 @@ void __dummy__(void)
      * EFRAME_* is for the entry/exit logic where %rsp is pointing at
      * UREGS_error_code and GPRs are still/already guest values.
      */
-#define OFFSET_EF(sym, mem)                                             \
+#define OFFSET_EF(sym, mem, ...)                                        \
     DEFINE(sym, offsetof(struct cpu_user_regs, mem) -                   \
-                offsetof(struct cpu_user_regs, error_code))
+                offsetof(struct cpu_user_regs, error_code) __VA_ARGS__)
 
     OFFSET_EF(EFRAME_entry_vector,    entry_vector);
     OFFSET_EF(EFRAME_rip,             rip);
     OFFSET_EF(EFRAME_cs,              cs);
     OFFSET_EF(EFRAME_eflags,          eflags);
+
+    /*
+     * These aren't real fields.  They're spare space, used by the IST
+     * exit-to-xen path.
+     */
+    OFFSET_EF(EFRAME_shadow_scf,      eflags, +4);
+    OFFSET_EF(EFRAME_shadow_sel,      eflags, +6);
+
     OFFSET_EF(EFRAME_rsp,             rsp);
     BLANK();
 
@@ -136,6 +144,7 @@ void __dummy__(void)
 
     OFFSET(CPUINFO_guest_cpu_user_regs, struct cpu_info, guest_cpu_user_regs);
     OFFSET(CPUINFO_error_code, struct cpu_info, guest_cpu_user_regs.error_code);
+    OFFSET(CPUINFO_rip, struct cpu_info, guest_cpu_user_regs.rip);
     OFFSET(CPUINFO_processor_id, struct cpu_info, processor_id);
     OFFSET(CPUINFO_verw_sel, struct cpu_info, verw_sel);
     OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index 9951b3e328f0..631f4f272ac3 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -159,6 +159,12 @@ FUNC(compat_restore_all_guest)
         SPEC_CTRL_EXIT_TO_PV    /* Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */
 
         RESTORE_ALL adj=8, compat=1
+
+        /* Account for ev/ec having already been popped off the stack. */
+        SPEC_CTRL_COND_VERW \
+            scf=STK_REL(CPUINFO_spec_ctrl_flags, CPUINFO_rip), \
+            sel=STK_REL(CPUINFO_verw_sel,        CPUINFO_rip)
+
 .Lft0:  iretq
         _ASM_PRE_EXTABLE(.Lft0, handle_exception)
 END(compat_restore_all_guest)
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 1b846f3aaff0..7d686b762807 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -212,6 +212,9 @@ FUNC_LOCAL(restore_all_guest)
 #endif
 
         mov   EFRAME_rip(%rsp), %rcx
+
+        SPEC_CTRL_COND_VERW     /* Req: %rsp=eframe                    Clob: efl */
+
         cmpw  $FLAT_USER_CS32, EFRAME_cs(%rsp)
         mov   EFRAME_rsp(%rsp), %rsp
         je    1f
@@ -224,6 +227,9 @@ LABEL_LOCAL(.Lrestore_rcx_iret_exit_to_guest)
 iret_exit_to_guest:
         andl  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), EFRAME_eflags(%rsp)
         orl   $X86_EFLAGS_IF, EFRAME_eflags(%rsp)
+
+        SPEC_CTRL_COND_VERW     /* Req: %rsp=eframe                    Clob: efl */
+
         addq  $8,%rsp
 .Lft0:  iretq
         _ASM_PRE_EXTABLE(.Lft0, handle_exception)
@@ -688,9 +694,22 @@ UNLIKELY_START(ne, exit_cr3)
 UNLIKELY_END(exit_cr3)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
-        SPEC_CTRL_EXIT_TO_XEN     /* Req: %r12=ist_exit %r14=end, Clob: abcd */
+        SPEC_CTRL_EXIT_TO_XEN /* Req: %r12=ist_exit %r14=end %rsp=regs, Clob: abcd */
 
         RESTORE_ALL adj=8
+
+        /*
+         * When the CPU pushed this exception frame, it zero-extended eflags.
+         * For an IST exit, SPEC_CTRL_EXIT_TO_XEN stashed shadow copies of
+         * spec_ctrl_flags and verw_sel above eflags, as we can't use any GPRs,
+         * and we're at a random place on the stack, not in a CPUINFO block.
+         *
+         * Account for ev/ec having already been popped off the stack.
+         */
+        SPEC_CTRL_COND_VERW \
+            scf=STK_REL(EFRAME_shadow_scf, EFRAME_rip), \
+            sel=STK_REL(EFRAME_shadow_sel, EFRAME_rip)
+
         iretq
 END(restore_all_xen)
 
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Rename VERW related options

VERW is going to be used for a 3rd purpose, and the existing nomenclature
didn't survive the Stale MMIO issues terribly well.

Rename the command line option from `md-clear=` to `verw=`.  This is more
consistent with other options which tend to be named based on what they're
doing, not which feature enumeration they use behind the scenes.  Retain
`md-clear=` as a deprecated alias.
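
Illustrative invocations (behaviour inferred from the parsing logic in
the hunk below; the two spellings are intended to be interchangeable):

    spec-ctrl=verw=0           # disable VERW scrubbing for PV and HVM
    spec-ctrl=verw=no-hvm      # disable it for HVM guests only
    spec-ctrl=md-clear=no-hvm  # deprecated alias, same effect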

Rename opt_md_clear_{pv,hvm} and opt_fb_clear_mmio to opt_verw_{pv,hvm,mmio},
which has a side effect of making spec_ctrl_init_domain() rather clearer to
follow.

No functional change.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 02896598df6f..a54d77528ae9 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2363,7 +2363,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
 
 ### spec-ctrl (x86)
 > `= List of [ <bool>, xen=<bool>, {pv,hvm}=<bool>,
->              {msr-sc,rsb,md-clear,ibpb-entry}=<bool>|{pv,hvm}=<bool>,
+>              {msr-sc,rsb,verw,ibpb-entry}=<bool>|{pv,hvm}=<bool>,
 >              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,psfd,
 >              eager-fpu,l1d-flush,branch-harden,srb-lock,
 >              unpriv-mmio,gds-mit,div-scrub}=<bool> ]`
@@ -2388,7 +2388,7 @@ in place for guests to use.
 
 Use of a positive boolean value for either of these options is invalid.
 
-The `pv=`, `hvm=`, `msr-sc=`, `rsb=`, `md-clear=` and `ibpb-entry=` options
+The `pv=`, `hvm=`, `msr-sc=`, `rsb=`, `verw=` and `ibpb-entry=` options
 offer fine grained control over the primitives by Xen.  These impact Xen's
 ability to protect itself, and/or Xen's ability to virtualise support for
 guests to use.
@@ -2405,11 +2405,12 @@ guests to use.
   guests and if disabled, guests will be unable to use IBRS/STIBP/SSBD/etc.
 * `rsb=` offers control over whether to overwrite the Return Stack Buffer /
   Return Address Stack on entry to Xen and on idle.
-* `md-clear=` offers control over whether to use VERW to flush
-  microarchitectural buffers on idle and exit from Xen.  *Note: For
-  compatibility with development versions of this fix, `mds=` is also accepted
-  on Xen 4.12 and earlier as an alias.  Consult vendor documentation in
-  preference to here.*
+* `verw=` offers control over whether to use VERW for its scrubbing side
+  effects at appropriate privilege transitions.  The exact side effects are
+  microarchitecture and microcode specific.  *Note: `md-clear=` is accepted as
+  a deprecated alias.  For compatibility with development versions of XSA-297,
+  `mds=` is also accepted on Xen 4.12 and earlier as an alias.  Consult vendor
+  documentation in preference to here.*
 * `ibpb-entry=` offers control over whether IBPB (Indirect Branch Prediction
   Barrier) is used on entry to Xen.  This is used by default on hardware
   vulnerable to Branch Type Confusion, and hardware vulnerable to Speculative
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 752225faa627..9cede5361d14 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -25,8 +25,8 @@ static bool __initdata opt_msr_sc_pv = true;
 static bool __initdata opt_msr_sc_hvm = true;
 static int8_t __initdata opt_rsb_pv = -1;
 static bool __initdata opt_rsb_hvm = true;
-static int8_t __ro_after_init opt_md_clear_pv = -1;
-static int8_t __ro_after_init opt_md_clear_hvm = -1;
+static int8_t __ro_after_init opt_verw_pv = -1;
+static int8_t __ro_after_init opt_verw_hvm = -1;
 
 static int8_t __ro_after_init opt_ibpb_entry_pv = -1;
 static int8_t __ro_after_init opt_ibpb_entry_hvm = -1;
@@ -66,7 +66,7 @@ static bool __initdata cpu_has_bug_mds; /* Any other M{LP,SB,FB}DS combination.
 
 static int8_t __initdata opt_srb_lock = -1;
 static bool __initdata opt_unpriv_mmio;
-static bool __ro_after_init opt_fb_clear_mmio;
+static bool __ro_after_init opt_verw_mmio;
 static int8_t __initdata opt_gds_mit = -1;
 static int8_t __initdata opt_div_scrub = -1;
 
@@ -108,8 +108,8 @@ static int __init cf_check parse_spec_ctrl(const char *s)
         disable_common:
             opt_rsb_pv = false;
             opt_rsb_hvm = false;
-            opt_md_clear_pv = 0;
-            opt_md_clear_hvm = 0;
+            opt_verw_pv = 0;
+            opt_verw_hvm = 0;
             opt_ibpb_entry_pv = 0;
             opt_ibpb_entry_hvm = 0;
             opt_ibpb_entry_dom0 = false;
@@ -140,14 +140,14 @@ static int __init cf_check parse_spec_ctrl(const char *s)
         {
             opt_msr_sc_pv = val;
             opt_rsb_pv = val;
-            opt_md_clear_pv = val;
+            opt_verw_pv = val;
             opt_ibpb_entry_pv = val;
         }
         else if ( (val = parse_boolean("hvm", s, ss)) >= 0 )
         {
             opt_msr_sc_hvm = val;
             opt_rsb_hvm = val;
-            opt_md_clear_hvm = val;
+            opt_verw_hvm = val;
             opt_ibpb_entry_hvm = val;
         }
         else if ( (val = parse_boolean("msr-sc", s, ss)) != -1 )
@@ -192,21 +192,22 @@ static int __init cf_check parse_spec_ctrl(const char *s)
                 break;
             }
         }
-        else if ( (val = parse_boolean("md-clear", s, ss)) != -1 )
+        else if ( (val = parse_boolean("verw", s, ss)) != -1 ||
+                  (val = parse_boolean("md-clear", s, ss)) != -1 )
         {
             switch ( val )
             {
             case 0:
             case 1:
-                opt_md_clear_pv = opt_md_clear_hvm = val;
+                opt_verw_pv = opt_verw_hvm = val;
                 break;
 
             case -2:
-                s += strlen("md-clear=");
+                s += (*s == 'v') ? strlen("verw=") : strlen("md-clear=");
                 if ( (val = parse_boolean("pv", s, ss)) >= 0 )
-                    opt_md_clear_pv = val;
+                    opt_verw_pv = val;
                 else if ( (val = parse_boolean("hvm", s, ss)) >= 0 )
-                    opt_md_clear_hvm = val;
+                    opt_verw_hvm = val;
                 else
             default:
                     rc = -EINVAL;
@@ -528,8 +529,8 @@ static void __init print_details(enum ind_thunk thunk)
            opt_srb_lock                              ? " SRB_LOCK+" : " SRB_LOCK-",
            opt_ibpb_ctxt_switch                      ? " IBPB-ctxt" : "",
            opt_l1d_flush                             ? " L1D_FLUSH" : "",
-           opt_md_clear_pv || opt_md_clear_hvm ||
-           opt_fb_clear_mmio                         ? " VERW"  : "",
+           opt_verw_pv || opt_verw_hvm ||
+           opt_verw_mmio                             ? " VERW"  : "",
            opt_div_scrub                             ? " DIV" : "",
            opt_branch_harden                         ? " BRANCH_HARDEN" : "");
 
@@ -550,13 +551,13 @@ static void __init print_details(enum ind_thunk thunk)
             boot_cpu_has(X86_FEATURE_SC_RSB_HVM) ||
             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM) ||
             amd_virt_spec_ctrl ||
-            opt_eager_fpu || opt_md_clear_hvm)       ? ""               : " None",
+            opt_eager_fpu || opt_verw_hvm)           ? ""               : " None",
            boot_cpu_has(X86_FEATURE_SC_MSR_HVM)      ? " MSR_SPEC_CTRL" : "",
            (boot_cpu_has(X86_FEATURE_SC_MSR_HVM) ||
             amd_virt_spec_ctrl)                      ? " MSR_VIRT_SPEC_CTRL" : "",
            boot_cpu_has(X86_FEATURE_SC_RSB_HVM)      ? " RSB"           : "",
            opt_eager_fpu                             ? " EAGER_FPU"     : "",
-           opt_md_clear_hvm                          ? " MD_CLEAR"      : "",
+           opt_verw_hvm                              ? " VERW"          : "",
            boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM)  ? " IBPB-entry"    : "");
 
 #endif
@@ -565,11 +566,11 @@ static void __init print_details(enum ind_thunk thunk)
            (boot_cpu_has(X86_FEATURE_SC_MSR_PV) ||
             boot_cpu_has(X86_FEATURE_SC_RSB_PV) ||
             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) ||
-            opt_eager_fpu || opt_md_clear_pv)        ? ""               : " None",
+            opt_eager_fpu || opt_verw_pv)            ? ""               : " None",
            boot_cpu_has(X86_FEATURE_SC_MSR_PV)       ? " MSR_SPEC_CTRL" : "",
            boot_cpu_has(X86_FEATURE_SC_RSB_PV)       ? " RSB"           : "",
            opt_eager_fpu                             ? " EAGER_FPU"     : "",
-           opt_md_clear_pv                           ? " MD_CLEAR"      : "",
+           opt_verw_pv                               ? " VERW"          : "",
            boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV)   ? " IBPB-entry"    : "");
 
     printk("  XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n",
@@ -1502,8 +1503,8 @@ void spec_ctrl_init_domain(struct domain *d)
 {
     bool pv = is_pv_domain(d);
 
-    bool verw = ((pv ? opt_md_clear_pv : opt_md_clear_hvm) ||
-                 (opt_fb_clear_mmio && is_iommu_enabled(d)));
+    bool verw = ((pv ? opt_verw_pv : opt_verw_hvm) ||
+                 (opt_verw_mmio && is_iommu_enabled(d)));
 
     bool ibpb = ((pv ? opt_ibpb_entry_pv : opt_ibpb_entry_hvm) &&
                  (d->domain_id != 0 || opt_ibpb_entry_dom0));
@@ -1866,19 +1867,20 @@ void __init init_speculation_mitigations(void)
      * the return-to-guest path.
      */
     if ( opt_unpriv_mmio )
-        opt_fb_clear_mmio = cpu_has_fb_clear;
+        opt_verw_mmio = cpu_has_fb_clear;
 
     /*
      * By default, enable PV and HVM mitigations on MDS-vulnerable hardware.
      * This will only be a token effort for MLPDS/MFBDS when HT is enabled,
      * but it is somewhat better than nothing.
      */
-    if ( opt_md_clear_pv == -1 )
-        opt_md_clear_pv = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                           boot_cpu_has(X86_FEATURE_MD_CLEAR));
-    if ( opt_md_clear_hvm == -1 )
-        opt_md_clear_hvm = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                            boot_cpu_has(X86_FEATURE_MD_CLEAR));
+    if ( opt_verw_pv == -1 )
+        opt_verw_pv = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
+                       cpu_has_md_clear);
+
+    if ( opt_verw_hvm == -1 )
+        opt_verw_hvm = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
+                        cpu_has_md_clear);
 
     /*
      * Enable MDS/MMIO defences as applicable.  The Idle blocks need using if
@@ -1891,12 +1893,12 @@ void __init init_speculation_mitigations(void)
      * MDS mitigations.  L1D_FLUSH is not safe for MMIO mitigations.)
      *
      * After calculating the appropriate idle setting, simplify
-     * opt_md_clear_hvm to mean just "should we VERW on the way into HVM
+     * opt_verw_hvm to mean just "should we VERW on the way into HVM
      * guests", so spec_ctrl_init_domain() can calculate suitable settings.
      */
-    if ( opt_md_clear_pv || opt_md_clear_hvm || opt_fb_clear_mmio )
+    if ( opt_verw_pv || opt_verw_hvm || opt_verw_mmio )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
-    opt_md_clear_hvm &= !cpu_has_skip_l1dfl && !opt_l1d_flush;
+    opt_verw_hvm &= !cpu_has_skip_l1dfl && !opt_l1d_flush;
 
     /*
      * Warn the user if they are on MLPDS/MFBDS-vulnerable hardware with HT
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/entry: Introduce EFRAME_* constants

restore_all_guest() does a lot of manipulation of the stack after popping the
GPRs, and uses raw %rsp displacements to do so.  Also, almost all entrypaths
use raw %rsp displacements prior to pushing GPRs.

Provide better mnemonics, to aid readability and reduce the chance of errors
when editing.

No functional change.  The resulting binary is identical.
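
(As a concrete instance: EFRAME_entry_vector is defined below as the
offset of entry_vector minus the offset of error_code within struct
cpu_user_regs, which evaluates to 4, so the new form assembles to
exactly the same bytes as the old `movl $vec, 4(%rsp)`.)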

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 37541208f119a9c552c6c6c3246ea61be0d44035)

diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 6dbf3bf697fe..9d105161db58 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -51,6 +51,23 @@ void __dummy__(void)
     OFFSET(UREGS_kernel_sizeof, struct cpu_user_regs, es);
     BLANK();
 
+    /*
+     * EFRAME_* is for the entry/exit logic where %rsp is pointing at
+     * UREGS_error_code and GPRs are still/already guest values.
+     */
+#define OFFSET_EF(sym, mem)                                             \
+    DEFINE(sym, offsetof(struct cpu_user_regs, mem) -                   \
+                offsetof(struct cpu_user_regs, error_code))
+
+    OFFSET_EF(EFRAME_entry_vector,    entry_vector);
+    OFFSET_EF(EFRAME_rip,             rip);
+    OFFSET_EF(EFRAME_cs,              cs);
+    OFFSET_EF(EFRAME_eflags,          eflags);
+    OFFSET_EF(EFRAME_rsp,             rsp);
+    BLANK();
+
+#undef OFFSET_EF
+
     OFFSET(VCPU_processor, struct vcpu, processor);
     OFFSET(VCPU_domain, struct vcpu, domain);
     OFFSET(VCPU_vcpu_info, struct vcpu, vcpu_info);
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index cde7702b4c68..20603ec4afc7 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -17,7 +17,7 @@ ENTRY(entry_int82)
         ENDBR64
         ALTERNATIVE "", clac, X86_FEATURE_XEN_SMAP
         pushq $0
-        movl  $HYPERCALL_VECTOR, 4(%rsp)
+        movl  $HYPERCALL_VECTOR, EFRAME_entry_vector(%rsp)
         SAVE_ALL compat=1 /* DPL1 gate, restricted to 32bit PV guests only. */
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
@@ -214,7 +214,7 @@ ENTRY(cstar_enter)
         pushq $FLAT_USER_CS32
         pushq %rcx
         pushq $0
-        movl  $TRAP_syscall, 4(%rsp)
+        movl  $TRAP_syscall, EFRAME_entry_vector(%rsp)
         SAVE_ALL
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index a086349841cc..fe052d2792e9 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -190,15 +190,15 @@ restore_all_guest:
         SPEC_CTRL_EXIT_TO_PV    /* Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */
 
         RESTORE_ALL
-        testw $TRAP_syscall,4(%rsp)
+        testw $TRAP_syscall, EFRAME_entry_vector(%rsp)
         jz    iret_exit_to_guest
 
-        movq  24(%rsp),%r11           # RFLAGS
+        mov   EFRAME_eflags(%rsp), %r11
         andq  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), %r11
         orq   $X86_EFLAGS_IF,%r11
 
         /* Don't use SYSRET path if the return address is not canonical. */
-        movq  8(%rsp),%rcx
+        mov   EFRAME_rip(%rsp), %rcx
         sarq  $47,%rcx
         incl  %ecx
         cmpl  $1,%ecx
@@ -213,20 +213,20 @@ restore_all_guest:
         ALTERNATIVE "", rag_clrssbsy, X86_FEATURE_XEN_SHSTK
 #endif
 
-        movq  8(%rsp), %rcx           # RIP
-        cmpw  $FLAT_USER_CS32,16(%rsp)# CS
-        movq  32(%rsp),%rsp           # RSP
+        mov   EFRAME_rip(%rsp), %rcx
+        cmpw  $FLAT_USER_CS32, EFRAME_cs(%rsp)
+        mov   EFRAME_rsp(%rsp), %rsp
         je    1f
         sysretq
 1:      sysretl
 
         ALIGN
 .Lrestore_rcx_iret_exit_to_guest:
-        movq  8(%rsp), %rcx           # RIP
+        mov   EFRAME_rip(%rsp), %rcx
 /* No special register assumptions. */
 iret_exit_to_guest:
-        andl  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), 24(%rsp)
-        orl   $X86_EFLAGS_IF,24(%rsp)
+        andl  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), EFRAME_eflags(%rsp)
+        orl   $X86_EFLAGS_IF, EFRAME_eflags(%rsp)
         addq  $8,%rsp
 .Lft0:  iretq
         _ASM_PRE_EXTABLE(.Lft0, handle_exception)
@@ -257,7 +257,7 @@ ENTRY(lstar_enter)
         pushq $FLAT_KERNEL_CS64
         pushq %rcx
         pushq $0
-        movl  $TRAP_syscall, 4(%rsp)
+        movl  $TRAP_syscall, EFRAME_entry_vector(%rsp)
         SAVE_ALL
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
@@ -295,7 +295,7 @@ GLOBAL(sysenter_eflags_saved)
         pushq $3 /* ring 3 null cs */
         pushq $0 /* null rip */
         pushq $0
-        movl  $TRAP_syscall, 4(%rsp)
+        movl  $TRAP_syscall, EFRAME_entry_vector(%rsp)
         SAVE_ALL
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
@@ -347,7 +347,7 @@ ENTRY(int80_direct_trap)
         ENDBR64
         ALTERNATIVE "", clac, X86_FEATURE_XEN_SMAP
         pushq $0
-        movl  $0x80, 4(%rsp)
+        movl  $0x80, EFRAME_entry_vector(%rsp)
         SAVE_ALL
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
@@ -650,7 +650,7 @@ ENTRY(common_interrupt)
 
 ENTRY(page_fault)
         ENDBR64
-        movl  $TRAP_page_fault,4(%rsp)
+        movl  $TRAP_page_fault, EFRAME_entry_vector(%rsp)
 /* No special register assumptions. */
 GLOBAL(handle_exception)
         ALTERNATIVE "", clac, X86_FEATURE_XEN_SMAP
@@ -790,90 +790,90 @@ FATAL_exception_with_ints_disabled:
 ENTRY(divide_error)
         ENDBR64
         pushq $0
-        movl  $TRAP_divide_error,4(%rsp)
+        movl  $TRAP_divide_error, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(coprocessor_error)
         ENDBR64
         pushq $0
-        movl  $TRAP_copro_error,4(%rsp)
+        movl  $TRAP_copro_error, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(simd_coprocessor_error)
         ENDBR64
         pushq $0
-        movl  $TRAP_simd_error,4(%rsp)
+        movl  $TRAP_simd_error, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(device_not_available)
         ENDBR64
         pushq $0
-        movl  $TRAP_no_device,4(%rsp)
+        movl  $TRAP_no_device, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(debug)
         ENDBR64
         pushq $0
-        movl  $TRAP_debug,4(%rsp)
+        movl  $TRAP_debug, EFRAME_entry_vector(%rsp)
         jmp   handle_ist_exception
 
 ENTRY(int3)
         ENDBR64
         pushq $0
-        movl  $TRAP_int3,4(%rsp)
+        movl  $TRAP_int3, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(overflow)
         ENDBR64
         pushq $0
-        movl  $TRAP_overflow,4(%rsp)
+        movl  $TRAP_overflow, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(bounds)
         ENDBR64
         pushq $0
-        movl  $TRAP_bounds,4(%rsp)
+        movl  $TRAP_bounds, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(invalid_op)
         ENDBR64
         pushq $0
-        movl  $TRAP_invalid_op,4(%rsp)
+        movl  $TRAP_invalid_op, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(invalid_TSS)
         ENDBR64
-        movl  $TRAP_invalid_tss,4(%rsp)
+        movl  $TRAP_invalid_tss, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(segment_not_present)
         ENDBR64
-        movl  $TRAP_no_segment,4(%rsp)
+        movl  $TRAP_no_segment, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(stack_segment)
         ENDBR64
-        movl  $TRAP_stack_error,4(%rsp)
+        movl  $TRAP_stack_error, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(general_protection)
         ENDBR64
-        movl  $TRAP_gp_fault,4(%rsp)
+        movl  $TRAP_gp_fault, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(alignment_check)
         ENDBR64
-        movl  $TRAP_alignment_check,4(%rsp)
+        movl  $TRAP_alignment_check, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(entry_CP)
         ENDBR64
-        movl  $X86_EXC_CP, 4(%rsp)
+        movl  $X86_EXC_CP, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(double_fault)
         ENDBR64
-        movl  $TRAP_double_fault,4(%rsp)
+        movl  $TRAP_double_fault, EFRAME_entry_vector(%rsp)
         /* Set AC to reduce chance of further SMAP faults */
         ALTERNATIVE "", stac, X86_FEATURE_XEN_SMAP
         SAVE_ALL
@@ -899,7 +899,7 @@ ENTRY(double_fault)
         .pushsection .init.text, "ax", @progbits
 ENTRY(early_page_fault)
         ENDBR64
-        movl  $TRAP_page_fault,4(%rsp)
+        movl  $TRAP_page_fault, EFRAME_entry_vector(%rsp)
         SAVE_ALL
         movq  %rsp,%rdi
         call  do_early_page_fault
@@ -909,7 +909,7 @@ ENTRY(early_page_fault)
 ENTRY(nmi)
         ENDBR64
         pushq $0
-        movl  $TRAP_nmi,4(%rsp)
+        movl  $TRAP_nmi, EFRAME_entry_vector(%rsp)
 handle_ist_exception:
         ALTERNATIVE "", clac, X86_FEATURE_XEN_SMAP
         SAVE_ALL
@@ -1011,7 +1011,7 @@ handle_ist_exception:
 ENTRY(machine_check)
         ENDBR64
         pushq $0
-        movl  $TRAP_machine_check,4(%rsp)
+        movl  $TRAP_machine_check, EFRAME_entry_vector(%rsp)
         jmp   handle_ist_exception
 
 /* No op trap handler.  Required for kexec crash path. */
@@ -1048,7 +1048,7 @@ autogen_stubs: /* Automatically generated stubs. */
 1:
         ENDBR64
         pushq $0
-        movb  $vec,4(%rsp)
+        movb  $vec, EFRAME_entry_vector(%rsp)
         jmp   common_interrupt
 
         entrypoint 1b
@@ -1062,7 +1062,7 @@ autogen_stubs: /* Automatically generated stubs. */
         test  $8,%spl        /* 64bit exception frames are 16 byte aligned, but the word */
         jz    2f             /* size is 8 bytes.  Check whether the processor gave us an */
         pushq $0             /* error code, and insert an empty one if not.              */
-2:      movb  $vec,4(%rsp)
+2:      movb  $vec, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
         entrypoint 1b
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/cpu-policy: Allow for levelling of VERW side effects

MD_CLEAR and FB_CLEAR need OR-ing across a migrate pool.  Allow this by
having them unconditionally set in max, with the host values reflected in
default.  Annotate the bits as having special properties.
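
A minimal sketch of the levelling intent (not the actual policy code):

    #include <stdbool.h>

    /*
     * Feature bits are normally AND-ed when levelling a migration pool,
     * but a "VERW gained scrubbing side effects" bit must survive if
     * *any* host in the pool needs it, as a guest may migrate there.
     */
    static bool pool_wants_md_clear(const bool host_md_clear[], unsigned int n)
    {
        bool any = false;

        for ( unsigned int i = 0; i < n; ++i )
            any |= host_md_clear[i];

        return any; /* OR across hosts; hence set unconditionally in max */
    }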

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit de17162cafd27f2865a3102a2ec0f386a02ed03d)

diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 792724db6f4e..c6cd91db0a0f 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -434,6 +434,16 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
         __set_bit(X86_FEATURE_RSBA, fs);
         __set_bit(X86_FEATURE_RRSBA, fs);
 
+        /*
+         * These bits indicate that the VERW instruction may have gained
+         * scrubbing side effects.  With pooling, they mean "you might migrate
+         * somewhere where scrubbing is necessary", and may need exposing on
+         * unaffected hardware.  This is fine, because the VERW instruction
+         * has been around since the 286.
+         */
+        __set_bit(X86_FEATURE_MD_CLEAR, fs);
+        __set_bit(X86_FEATURE_FB_CLEAR, fs);
+
         /*
          * The Gather Data Sampling microcode mitigation (August 2023) has an
          * adverse performance impact on the CLWB instruction on SKX/CLX/CPX.
@@ -468,6 +478,20 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
              cpu_has_rdrand && !is_forced_cpu_cap(X86_FEATURE_RDRAND) )
             __clear_bit(X86_FEATURE_RDRAND, fs);
 
+        /*
+         * These bits indicate that the VERW instruction may have gained
+         * scrubbing side effects.  The max policy has them set for migration
+         * reasons, so reset the default policy back to the host values in
+         * case we're unaffected.
+         */
+        __clear_bit(X86_FEATURE_MD_CLEAR, fs);
+        if ( cpu_has_md_clear )
+            __set_bit(X86_FEATURE_MD_CLEAR, fs);
+
+        __clear_bit(X86_FEATURE_FB_CLEAR, fs);
+        if ( cpu_has_fb_clear )
+            __set_bit(X86_FEATURE_FB_CLEAR, fs);
+
         /*
          * The Gather Data Sampling microcode mitigation (August 2023) has an
          * adverse performance impact on the CLWB instruction on SKX/CLX/CPX.
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index ecdc7d9b7171..3b116c6dd364 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -137,6 +137,7 @@
 #define cpu_has_avx512_4fmaps   boot_cpu_has(X86_FEATURE_AVX512_4FMAPS)
 #define cpu_has_avx512_vp2intersect boot_cpu_has(X86_FEATURE_AVX512_VP2INTERSECT)
 #define cpu_has_srbds_ctrl      boot_cpu_has(X86_FEATURE_SRBDS_CTRL)
+#define cpu_has_md_clear        boot_cpu_has(X86_FEATURE_MD_CLEAR)
 #define cpu_has_rtm_always_abort boot_cpu_has(X86_FEATURE_RTM_ALWAYS_ABORT)
 #define cpu_has_tsx_force_abort boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)
 #define cpu_has_serialize       boot_cpu_has(X86_FEATURE_SERIALIZE)
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index e434fc9bae22..b4f68a59b859 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -273,7 +273,7 @@ XEN_CPUFEATURE(AVX512_4VNNIW, 9*32+ 2) /*A  AVX512 Neural Network Instructions *
 XEN_CPUFEATURE(AVX512_4FMAPS, 9*32+ 3) /*A  AVX512 Multiply Accumulation Single Precision */
 XEN_CPUFEATURE(AVX512_VP2INTERSECT, 9*32+8) /*a  VP2INTERSECT{D,Q} insns */
 XEN_CPUFEATURE(SRBDS_CTRL,    9*32+ 9) /*   MSR_MCU_OPT_CTRL and RNGDS_MITG_DIS. */
-XEN_CPUFEATURE(MD_CLEAR,      9*32+10) /*A  VERW clears microarchitectural buffers */
+XEN_CPUFEATURE(MD_CLEAR,      9*32+10) /*!A VERW clears microarchitectural buffers */
 XEN_CPUFEATURE(RTM_ALWAYS_ABORT, 9*32+11) /*! June 2021 TSX defeaturing in microcode. */
 XEN_CPUFEATURE(TSX_FORCE_ABORT, 9*32+13) /* MSR_TSX_FORCE_ABORT.RTM_ABORT */
 XEN_CPUFEATURE(SERIALIZE,     9*32+14) /*a  SERIALIZE insn */
@@ -322,7 +322,7 @@ XEN_CPUFEATURE(DOITM,              16*32+12) /*   Data Operand Invariant Timing
 XEN_CPUFEATURE(SBDR_SSDP_NO,       16*32+13) /*A  No Shared Buffer Data Read or Sideband Stale Data Propagation */
 XEN_CPUFEATURE(FBSDP_NO,           16*32+14) /*A  No Fill Buffer Stale Data Propagation */
 XEN_CPUFEATURE(PSDP_NO,            16*32+15) /*A  No Primary Stale Data Propagation */
-XEN_CPUFEATURE(FB_CLEAR,           16*32+17) /*A  Fill Buffers cleared by VERW */
+XEN_CPUFEATURE(FB_CLEAR,           16*32+17) /*!A Fill Buffers cleared by VERW */
 XEN_CPUFEATURE(FB_CLEAR_CTRL,      16*32+18) /*   MSR_OPT_CPU_CTRL.FB_CLEAR_DIS */
 XEN_CPUFEATURE(RRSBA,              16*32+19) /*!  Restricted RSB Alternative */
 XEN_CPUFEATURE(BHI_NO,             16*32+20) /*A  No Branch History Injection  */
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vmx: Perform VERW flushing later in the VMExit path

Broken out of the following patch because this change is subtle enough on its
own.  See it for the rationale of why we're moving VERW.

As for how, extend the trick already used to hold one condition in
flags (RESUME vs LAUNCH) through the POPing of GPRs.

Move the MOV CR earlier.  Intel specify flags to be undefined across it.

Encode the two conditions we want using SF and PF.  See the code comment for
exactly how.

Leave a comment to explain the lack of any content around
SPEC_CTRL_EXIT_TO_VMX, but leave the block in place.  Sod's law says if we
delete it, we'll need to reintroduce it.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 475fa20b7384464210f42bad7195f87bd6f1c63f)

diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S
index 5f5de45a1309..cdde76e13892 100644
--- a/xen/arch/x86/hvm/vmx/entry.S
+++ b/xen/arch/x86/hvm/vmx/entry.S
@@ -87,17 +87,39 @@ UNLIKELY_END(realmode)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
         /* SPEC_CTRL_EXIT_TO_VMX   Req: %rsp=regs/cpuinfo              Clob:    */
-        DO_SPEC_CTRL_COND_VERW
+        /*
+         * All speculation safety work happens to be elsewhere.  VERW is after
+         * popping the GPRs, while restoring the guest MSR_SPEC_CTRL is left
+         * to the MSR load list.
+         */
 
         mov  VCPU_hvm_guest_cr2(%rbx),%rax
+        mov  %rax, %cr2
+
+        /*
+         * We need to perform two conditional actions (VERW, and Resume vs
+         * Launch) after popping GPRs.  With some cunning, we can encode both
+         * of these in eflags together.
+         *
+         * Parity is only calculated over the bottom byte of the answer, while
+         * Sign is simply the top bit.
+         *
+         * Therefore, the final OR instruction ends up producing:
+         *   SF = VCPU_vmx_launched
+         *   PF = !SCF_verw
+         */
+        BUILD_BUG_ON(SCF_verw & ~0xff)
+        movzbl VCPU_vmx_launched(%rbx), %ecx
+        shl  $31, %ecx
+        movzbl CPUINFO_spec_ctrl_flags(%rsp), %eax
+        and  $SCF_verw, %eax
+        or   %eax, %ecx
 
         pop  %r15
         pop  %r14
         pop  %r13
         pop  %r12
         pop  %rbp
-        mov  %rax,%cr2
-        cmpb $0,VCPU_vmx_launched(%rbx)
         pop  %rbx
         pop  %r11
         pop  %r10
@@ -108,7 +130,13 @@ UNLIKELY_END(realmode)
         pop  %rdx
         pop  %rsi
         pop  %rdi
-        je   .Lvmx_launch
+
+        jpe  .L_skip_verw
+        /* VERW clobbers ZF, but preserves all others, including SF. */
+        verw STK_REL(CPUINFO_verw_sel, CPUINFO_error_code)(%rsp)
+.L_skip_verw:
+
+        jns  .Lvmx_launch
 
 /*.Lvmx_resume:*/
         VMRESUME
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 9d105161db58..eb4a619891cb 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -133,6 +133,7 @@ void __dummy__(void)
 #endif
 
     OFFSET(CPUINFO_guest_cpu_user_regs, struct cpu_info, guest_cpu_user_regs);
+    OFFSET(CPUINFO_error_code, struct cpu_info, guest_cpu_user_regs.error_code);
     OFFSET(CPUINFO_verw_sel, struct cpu_info, verw_sel);
     OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);
     OFFSET(CPUINFO_per_cpu_offset, struct cpu_info, per_cpu_offset);
diff --git a/xen/include/asm-x86/asm_defns.h b/xen/include/asm-x86/asm_defns.h
index d480a4461b84..4ab54f41cf96 100644
--- a/xen/include/asm-x86/asm_defns.h
+++ b/xen/include/asm-x86/asm_defns.h
@@ -81,6 +81,14 @@ register unsigned long current_stack_pointer asm("rsp");
 
 #ifdef __ASSEMBLY__
 
+.macro BUILD_BUG_ON condstr, cond:vararg
+        .if \cond
+        .error "Condition \"\condstr\" not satisfied"
+        .endif
+.endm
+/* preprocessor macro to make error message more user friendly */
+#define BUILD_BUG_ON(cond) BUILD_BUG_ON #cond, cond
+
 #ifdef HAVE_AS_QUOTED_SYM
 #define SUBSECTION_LBL(tag)                        \
         .ifndef .L.tag;                            \
diff --git a/xen/include/asm-x86/spec_ctrl_asm.h b/xen/include/asm-x86/spec_ctrl_asm.h
index c47399dbf261..b5e566943632 100644
--- a/xen/include/asm-x86/spec_ctrl_asm.h
+++ b/xen/include/asm-x86/spec_ctrl_asm.h
@@ -164,6 +164,13 @@
 #endif
 .endm
 
+/*
+ * Helper to improve the readability of stack displacements with %rsp in
+ * unusual positions.  Both @field and @top_of_stk should be constants from
+ * the same object.  @top_of_stk should be where %rsp is currently pointing.
+ */
+#define STK_REL(field, top_of_stk) ((field) - (top_of_stk))
+
 .macro DO_SPEC_CTRL_COND_VERW
 /*
  * Requires %rsp=cpuinfo
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Perform VERW flushing later in exit paths

On parts vulnerable to RFDS, VERW's side effects are extended to scrub all
non-architectural entries in various Physical Register Files.  To remove all
of Xen's values, the VERW must be after popping the GPRs.

Rework SPEC_CTRL_COND_VERW to default to a CPUINFO_error_code %rsp position,
but with overrides for other contexts.  Identify that it clobbers eflags; this
is particularly relevant for the SYSRET path.

For the IST exit return to Xen, have the main SPEC_CTRL_EXIT_TO_XEN put a
shadow copy of spec_ctrl_flags, as GPRs can't be used at the point we want to
issue the VERW.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 0a666cf2cd99df6faf3eebc81a1fc286e4eca4c7)

diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index eb4a619891cb..e3a38ae96743 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -55,14 +55,22 @@ void __dummy__(void)
      * EFRAME_* is for the entry/exit logic where %rsp is pointing at
      * UREGS_error_code and GPRs are still/already guest values.
      */
-#define OFFSET_EF(sym, mem)                                             \
+#define OFFSET_EF(sym, mem, ...)                                        \
     DEFINE(sym, offsetof(struct cpu_user_regs, mem) -                   \
-                offsetof(struct cpu_user_regs, error_code))
+                offsetof(struct cpu_user_regs, error_code) __VA_ARGS__)
 
     OFFSET_EF(EFRAME_entry_vector,    entry_vector);
     OFFSET_EF(EFRAME_rip,             rip);
     OFFSET_EF(EFRAME_cs,              cs);
     OFFSET_EF(EFRAME_eflags,          eflags);
+
+    /*
+     * These aren't real fields.  They're spare space, used by the IST
+     * exit-to-xen path.
+     */
+    OFFSET_EF(EFRAME_shadow_scf,      eflags, +4);
+    OFFSET_EF(EFRAME_shadow_sel,      eflags, +6);
+
     OFFSET_EF(EFRAME_rsp,             rsp);
     BLANK();
 
@@ -134,6 +142,7 @@ void __dummy__(void)
 
     OFFSET(CPUINFO_guest_cpu_user_regs, struct cpu_info, guest_cpu_user_regs);
     OFFSET(CPUINFO_error_code, struct cpu_info, guest_cpu_user_regs.error_code);
+    OFFSET(CPUINFO_rip, struct cpu_info, guest_cpu_user_regs.rip);
     OFFSET(CPUINFO_verw_sel, struct cpu_info, verw_sel);
     OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);
     OFFSET(CPUINFO_per_cpu_offset, struct cpu_info, per_cpu_offset);
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index 20603ec4afc7..becc7f1bb2f3 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -165,6 +165,12 @@ ENTRY(compat_restore_all_guest)
         SPEC_CTRL_EXIT_TO_PV    /* Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */
 
         RESTORE_ALL adj=8 compat=1
+
+        /* Account for ev/ec having already been popped off the stack. */
+        SPEC_CTRL_COND_VERW \
+            scf=STK_REL(CPUINFO_spec_ctrl_flags, CPUINFO_rip), \
+            sel=STK_REL(CPUINFO_verw_sel,        CPUINFO_rip)
+
 .Lft0:  iretq
         _ASM_PRE_EXTABLE(.Lft0, handle_exception)
 
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index fe052d2792e9..d855d4d1d78e 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -214,6 +214,9 @@ restore_all_guest:
 #endif
 
         mov   EFRAME_rip(%rsp), %rcx
+
+        SPEC_CTRL_COND_VERW     /* Req: %rsp=eframe                    Clob: efl */
+
         cmpw  $FLAT_USER_CS32, EFRAME_cs(%rsp)
         mov   EFRAME_rsp(%rsp), %rsp
         je    1f
@@ -227,6 +230,9 @@ restore_all_guest:
 iret_exit_to_guest:
         andl  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), EFRAME_eflags(%rsp)
         orl   $X86_EFLAGS_IF, EFRAME_eflags(%rsp)
+
+        SPEC_CTRL_COND_VERW     /* Req: %rsp=eframe                    Clob: efl */
+
         addq  $8,%rsp
 .Lft0:  iretq
         _ASM_PRE_EXTABLE(.Lft0, handle_exception)
@@ -613,9 +619,22 @@ UNLIKELY_START(ne, exit_cr3)
 UNLIKELY_END(exit_cr3)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
-        SPEC_CTRL_EXIT_TO_XEN     /* Req: %r12=ist_exit %r14=end, Clob: abcd */
+        SPEC_CTRL_EXIT_TO_XEN /* Req: %r12=ist_exit %r14=end %rsp=regs, Clob: abcd */
 
         RESTORE_ALL adj=8
+
+        /*
+         * When the CPU pushed this exception frame, it zero-extended eflags.
+         * For an IST exit, SPEC_CTRL_EXIT_TO_XEN stashed shadow copies of
+     * spec_ctrl_flags and verw_sel above eflags, as we can't use any GPRs,
+     * and we're at a random place on the stack, not in a CPUINFO block.
+         *
+         * Account for ev/ec having already been popped off the stack.
+         */
+        SPEC_CTRL_COND_VERW \
+            scf=STK_REL(EFRAME_shadow_scf, EFRAME_rip), \
+            sel=STK_REL(EFRAME_shadow_sel, EFRAME_rip)
+
         iretq
 
 ENTRY(common_interrupt)
diff --git a/xen/include/asm-x86/spec_ctrl_asm.h b/xen/include/asm-x86/spec_ctrl_asm.h
index b5e566943632..9650a3aca7c3 100644
--- a/xen/include/asm-x86/spec_ctrl_asm.h
+++ b/xen/include/asm-x86/spec_ctrl_asm.h
@@ -171,16 +171,23 @@
  */
 #define STK_REL(field, top_of_stk) ((field) - (top_of_stk))
 
-.macro DO_SPEC_CTRL_COND_VERW
+.macro SPEC_CTRL_COND_VERW \
+    scf=STK_REL(CPUINFO_spec_ctrl_flags, CPUINFO_error_code), \
+    sel=STK_REL(CPUINFO_verw_sel,        CPUINFO_error_code)
 /*
- * Requires %rsp=cpuinfo
+ * Requires \scf and \sel as %rsp-relative expressions
+ * Clobbers eflags
+ *
+ * VERW needs to run after guest GPRs have been restored, where only %rsp is
+ * good to use.  Default to expecting %rsp pointing at CPUINFO_error_code.
+ * Contexts where this is not true must provide an alternative \scf and \sel.
  *
  * Issue a VERW for its flushing side effect, if indicated.  This is a Spectre
  * v1 gadget, but the IRET/VMEntry is serialising.
  */
-    testb $SCF_verw, CPUINFO_spec_ctrl_flags(%rsp)
+    testb $SCF_verw, \scf(%rsp)
     jz .L\@_verw_skip
-    verw CPUINFO_verw_sel(%rsp)
+    verw \sel(%rsp)
 .L\@_verw_skip:
 .endm
 
@@ -298,8 +305,6 @@
  */
     ALTERNATIVE "", DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_PV
 
-    DO_SPEC_CTRL_COND_VERW
-
     ALTERNATIVE "", DO_SPEC_CTRL_DIV, X86_FEATURE_SC_DIV
 .endm
 
@@ -379,7 +384,7 @@ UNLIKELY_DISPATCH_LABEL(\@_serialise):
  */
 .macro SPEC_CTRL_EXIT_TO_XEN
 /*
- * Requires %r12=ist_exit, %r14=stack_end
+ * Requires %r12=ist_exit, %r14=stack_end, %rsp=regs
  * Clobbers %rax, %rbx, %rcx, %rdx
  */
     movzbl STACK_CPUINFO_FIELD(spec_ctrl_flags)(%r14), %ebx
@@ -407,11 +412,18 @@ UNLIKELY_DISPATCH_LABEL(\@_serialise):
     test %r12, %r12
     jz .L\@_skip_ist_exit
 
-    /* Logically DO_SPEC_CTRL_COND_VERW but without the %rsp=cpuinfo dependency */
-    testb $SCF_verw, %bl
-    jz .L\@_skip_verw
-    verw STACK_CPUINFO_FIELD(verw_sel)(%r14)
-.L\@_skip_verw:
+    /*
+     * Stash SCF and verw_sel above eflags in the case of an IST exit.  The
+     * VERW logic needs to run after guest GPRs have been restored; i.e. where
+     * we cannot use %r12 or %r14 for the purposes they have here.
+     *
+     * When the CPU pushed this exception frame, it zero-extended eflags.
+     * Therefore it is safe for the VERW logic to look at the stashed SCF
+     * outside of the ist_exit condition.  Also, this stashing won't influence
+     * any other restore_all_guest() paths.
+     */
+    or $(__HYPERVISOR_DS32 << 16), %ebx
+    mov %ebx, UREGS_eflags + 4(%rsp) /* EFRAME_shadow_scf/sel */
 
     ALTERNATIVE "", DO_SPEC_CTRL_DIV, X86_FEATURE_SC_DIV
 
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Rename VERW related options

VERW is going to be used for a 3rd purpose, and the existing nomenclature
didn't survive the Stale MMIO issues terribly well.

Rename the command line option from `md-clear=` to `verw=`.  This is more
consistent with other options which tend to be named based on what they're
doing, not which feature enumeration they use behind the scenes.  Retain
`md-clear=` as a deprecated alias.
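
For example (illustrative command lines, following the grammar in the
documentation hunk below):

    # New spelling: disable VERW scrubbing for both PV and HVM guests.
    spec-ctrl=verw=0

    # Deprecated alias, still accepted and equivalent.
    spec-ctrl=md-clear=0

    # Fine grained form: force scrubbing on for HVM guests, leaving PV at
    # its default.
    spec-ctrl=verw=hvm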

Rename opt_md_clear_{pv,hvm} and opt_fb_clear_mmio to opt_verw_{pv,hvm,mmio},
which has a side effect of making spec_ctrl_init_domain() rather clearer to
follow.

No functional change.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f7603ca252e4226739eb3129a5290ee3da3f8ea4)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index f3d1009f2daf..d020d79dde25 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2186,7 +2186,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
 
 ### spec-ctrl (x86)
 > `= List of [ <bool>, xen=<bool>, {pv,hvm}=<bool>,
->              {msr-sc,rsb,md-clear,ibpb-entry}=<bool>|{pv,hvm}=<bool>,
+>              {msr-sc,rsb,verw,ibpb-entry}=<bool>|{pv,hvm}=<bool>,
 >              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,psfd,
 >              eager-fpu,l1d-flush,branch-harden,srb-lock,
 >              unpriv-mmio,gds-mit,div-scrub}=<bool> ]`
@@ -2211,7 +2211,7 @@ in place for guests to use.
 
 Use of a positive boolean value for either of these options is invalid.
 
-The `pv=`, `hvm=`, `msr-sc=`, `rsb=`, `md-clear=` and `ibpb-entry=` options
+The `pv=`, `hvm=`, `msr-sc=`, `rsb=`, `verw=` and `ibpb-entry=` options
 offer fine grained control over the primitives by Xen.  These impact Xen's
 ability to protect itself, and/or Xen's ability to virtualise support for
 guests to use.
@@ -2228,11 +2228,12 @@ guests to use.
   guests and if disabled, guests will be unable to use IBRS/STIBP/SSBD/etc.
 * `rsb=` offers control over whether to overwrite the Return Stack Buffer /
   Return Address Stack on entry to Xen.
-* `md-clear=` offers control over whether to use VERW to flush
-  microarchitectural buffers on idle and exit from Xen.  *Note: For
-  compatibility with development versions of this fix, `mds=` is also accepted
-  on Xen 4.12 and earlier as an alias.  Consult vendor documentation in
-  preference to here.*
+* `verw=` offers control over whether to use VERW for its scrubbing side
+  effects at appropriate privilege transitions.  The exact side effects are
+  microarchitecture and microcode specific.  *Note: `md-clear=` is accepted as
+  a deprecated alias.  For compatibility with development versions of XSA-297,
+  `mds=` is also accepted on Xen 4.12 and earlier as an alias.  Consult vendor
+  documentation in preference to here.*
 * `ibpb-entry=` offers control over whether IBPB (Indirect Branch Prediction
   Barrier) is used on entry to Xen.  This is used by default on hardware
   vulnerable to Branch Type Confusion, and hardware vulnerable to Speculative
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index f75124117b6d..43449b4c7a19 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -37,8 +37,8 @@ static bool __initdata opt_msr_sc_pv = true;
 static bool __initdata opt_msr_sc_hvm = true;
 static bool __initdata opt_rsb_pv = true;
 static bool __initdata opt_rsb_hvm = true;
-static int8_t __read_mostly opt_md_clear_pv = -1;
-static int8_t __read_mostly opt_md_clear_hvm = -1;
+static int8_t __read_mostly opt_verw_pv = -1;
+static int8_t __read_mostly opt_verw_hvm = -1;
 
 static int8_t __read_mostly opt_ibpb_entry_pv = -1;
 static int8_t __read_mostly opt_ibpb_entry_hvm = -1;
@@ -77,7 +77,7 @@ static bool __initdata cpu_has_bug_mds; /* Any other M{LP,SB,FB}DS combination.
 
 static int8_t __initdata opt_srb_lock = -1;
 static bool __initdata opt_unpriv_mmio;
-static bool __read_mostly opt_fb_clear_mmio;
+static bool __read_mostly opt_verw_mmio;
 static int8_t __initdata opt_gds_mit = -1;
 static int8_t __initdata opt_div_scrub = -1;
 
@@ -119,8 +119,8 @@ static int __init parse_spec_ctrl(const char *s)
         disable_common:
             opt_rsb_pv = false;
             opt_rsb_hvm = false;
-            opt_md_clear_pv = 0;
-            opt_md_clear_hvm = 0;
+            opt_verw_pv = 0;
+            opt_verw_hvm = 0;
             opt_ibpb_entry_pv = 0;
             opt_ibpb_entry_hvm = 0;
             opt_ibpb_entry_dom0 = false;
@@ -151,14 +151,14 @@ static int __init parse_spec_ctrl(const char *s)
         {
             opt_msr_sc_pv = val;
             opt_rsb_pv = val;
-            opt_md_clear_pv = val;
+            opt_verw_pv = val;
             opt_ibpb_entry_pv = val;
         }
         else if ( (val = parse_boolean("hvm", s, ss)) >= 0 )
         {
             opt_msr_sc_hvm = val;
             opt_rsb_hvm = val;
-            opt_md_clear_hvm = val;
+            opt_verw_hvm = val;
             opt_ibpb_entry_hvm = val;
         }
         else if ( (val = parse_boolean("msr-sc", s, ss)) != -1 )
@@ -203,21 +203,22 @@ static int __init parse_spec_ctrl(const char *s)
                 break;
             }
         }
-        else if ( (val = parse_boolean("md-clear", s, ss)) != -1 )
+        else if ( (val = parse_boolean("verw", s, ss)) != -1 ||
+                  (val = parse_boolean("md-clear", s, ss)) != -1 )
         {
             switch ( val )
             {
             case 0:
             case 1:
-                opt_md_clear_pv = opt_md_clear_hvm = val;
+                opt_verw_pv = opt_verw_hvm = val;
                 break;
 
             case -2:
-                s += strlen("md-clear=");
+                s += (*s == 'v') ? strlen("verw=") : strlen("md-clear=");
                 if ( (val = parse_boolean("pv", s, ss)) >= 0 )
-                    opt_md_clear_pv = val;
+                    opt_verw_pv = val;
                 else if ( (val = parse_boolean("hvm", s, ss)) >= 0 )
-                    opt_md_clear_hvm = val;
+                    opt_verw_hvm = val;
                 else
             default:
                     rc = -EINVAL;
@@ -512,8 +513,8 @@ static void __init print_details(enum ind_thunk thunk)
            opt_srb_lock                              ? " SRB_LOCK+" : " SRB_LOCK-",
            opt_ibpb_ctxt_switch                      ? " IBPB-ctxt" : "",
            opt_l1d_flush                             ? " L1D_FLUSH" : "",
-           opt_md_clear_pv || opt_md_clear_hvm ||
-           opt_fb_clear_mmio                         ? " VERW"  : "",
+           opt_verw_pv || opt_verw_hvm ||
+           opt_verw_mmio                             ? " VERW"  : "",
            opt_div_scrub                             ? " DIV" : "",
            opt_branch_harden                         ? " BRANCH_HARDEN" : "");
 
@@ -533,11 +534,11 @@ static void __init print_details(enum ind_thunk thunk)
            (boot_cpu_has(X86_FEATURE_SC_MSR_HVM) ||
             boot_cpu_has(X86_FEATURE_SC_RSB_HVM) ||
             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM) ||
-            opt_eager_fpu || opt_md_clear_hvm)       ? ""               : " None",
+            opt_eager_fpu || opt_verw_hvm)           ? ""               : " None",
            boot_cpu_has(X86_FEATURE_SC_MSR_HVM)      ? " MSR_SPEC_CTRL" : "",
            boot_cpu_has(X86_FEATURE_SC_RSB_HVM)      ? " RSB"           : "",
            opt_eager_fpu                             ? " EAGER_FPU"     : "",
-           opt_md_clear_hvm                          ? " MD_CLEAR"      : "",
+           opt_verw_hvm                              ? " VERW"          : "",
            boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM)  ? " IBPB-entry"    : "");
 
 #endif
@@ -546,11 +547,11 @@ static void __init print_details(enum ind_thunk thunk)
            (boot_cpu_has(X86_FEATURE_SC_MSR_PV) ||
             boot_cpu_has(X86_FEATURE_SC_RSB_PV) ||
             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) ||
-            opt_eager_fpu || opt_md_clear_pv)        ? ""               : " None",
+            opt_eager_fpu || opt_verw_pv)            ? ""               : " None",
            boot_cpu_has(X86_FEATURE_SC_MSR_PV)       ? " MSR_SPEC_CTRL" : "",
            boot_cpu_has(X86_FEATURE_SC_RSB_PV)       ? " RSB"           : "",
            opt_eager_fpu                             ? " EAGER_FPU"     : "",
-           opt_md_clear_pv                           ? " MD_CLEAR"      : "",
+           opt_verw_pv                               ? " VERW"          : "",
            boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV)   ? " IBPB-entry"    : "");
 
     printk("  XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n",
@@ -1450,8 +1451,8 @@ void spec_ctrl_init_domain(struct domain *d)
 {
     bool pv = is_pv_domain(d);
 
-    bool verw = ((pv ? opt_md_clear_pv : opt_md_clear_hvm) ||
-                 (opt_fb_clear_mmio && is_iommu_enabled(d)));
+    bool verw = ((pv ? opt_verw_pv : opt_verw_hvm) ||
+                 (opt_verw_mmio && is_iommu_enabled(d)));
 
     bool ibpb = ((pv ? opt_ibpb_entry_pv : opt_ibpb_entry_hvm) &&
                  (d->domain_id != 0 || opt_ibpb_entry_dom0));
@@ -1765,19 +1766,20 @@ void __init init_speculation_mitigations(void)
      * the return-to-guest path.
      */
     if ( opt_unpriv_mmio )
-        opt_fb_clear_mmio = cpu_has_fb_clear;
+        opt_verw_mmio = cpu_has_fb_clear;
 
     /*
      * By default, enable PV and HVM mitigations on MDS-vulnerable hardware.
      * This will only be a token effort for MLPDS/MFBDS when HT is enabled,
      * but it is somewhat better than nothing.
      */
-    if ( opt_md_clear_pv == -1 )
-        opt_md_clear_pv = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                           boot_cpu_has(X86_FEATURE_MD_CLEAR));
-    if ( opt_md_clear_hvm == -1 )
-        opt_md_clear_hvm = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                            boot_cpu_has(X86_FEATURE_MD_CLEAR));
+    if ( opt_verw_pv == -1 )
+        opt_verw_pv = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
+                       cpu_has_md_clear);
+
+    if ( opt_verw_hvm == -1 )
+        opt_verw_hvm = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
+                        cpu_has_md_clear);
 
     /*
      * Enable MDS/MMIO defences as applicable.  The Idle blocks need using if
@@ -1790,12 +1792,12 @@ void __init init_speculation_mitigations(void)
      * MDS mitigations.  L1D_FLUSH is not safe for MMIO mitigations.)
      *
      * After calculating the appropriate idle setting, simplify
-     * opt_md_clear_hvm to mean just "should we VERW on the way into HVM
+     * opt_verw_hvm to mean just "should we VERW on the way into HVM
      * guests", so spec_ctrl_init_domain() can calculate suitable settings.
      */
-    if ( opt_md_clear_pv || opt_md_clear_hvm || opt_fb_clear_mmio )
+    if ( opt_verw_pv || opt_verw_hvm || opt_verw_mmio )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
-    opt_md_clear_hvm &= !cpu_has_skip_l1dfl && !opt_l1d_flush;
+    opt_verw_hvm &= !cpu_has_skip_l1dfl && !opt_l1d_flush;
 
     /*
      * Warn the user if they are on MLPDS/MFBDS-vulnerable hardware with HT
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: VERW-handling adjustments

... before we add yet more complexity to this logic.  Mostly expanded
comments, but with three minor changes.

1) Introduce cpu_has_useful_md_clear to simplify later logic in this patch and
   future ones.

2) We only ever need SC_VERW_IDLE when SMT is active.  If SMT isn't active,
   then there's no re-partition of pipeline resources based on thread-idleness
   to worry about.

3) The logic to adjust HVM VERW based on L1D_FLUSH is unmaintainable and, as
   it turns out, wrong.  SKIP_L1DFL is just a hint bit, whereas opt_l1d_flush
   is the relevant decision of whether to use L1D_FLUSH based on
   susceptibility and user preference.

   Rewrite the logic so it can be followed, and incorporate the fact that when
   FB_CLEAR is visible, L1D_FLUSH isn't a safe substitution.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 1eb91a8a06230b4b64228c9a380194f8cfe6c5e2)

diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 43449b4c7a19..84a382781b17 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -1467,7 +1467,7 @@ void __init init_speculation_mitigations(void)
 {
     enum ind_thunk thunk = THUNK_DEFAULT;
     bool has_spec_ctrl, ibrs = false, hw_smt_enabled;
-    bool cpu_has_bug_taa, retpoline_safe;
+    bool cpu_has_bug_taa, cpu_has_useful_md_clear, retpoline_safe;
 
     hw_smt_enabled = check_smt_enabled();
 
@@ -1754,50 +1754,97 @@ void __init init_speculation_mitigations(void)
             "enabled.  Please assess your configuration and choose an\n"
             "explicit 'smt=<bool>' setting.  See XSA-273.\n");
 
+    /*
+     * A brief summary of VERW-related changes.
+     *
+     * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-analysis-microarchitectural-data-sampling.html
+     * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
+     *
+     * Relevant ucodes:
+     *
+     * - May 2019, for MDS.  Introduces the MD_CLEAR CPUID bit and VERW side
+     *   effects to scrub Store/Load/Fill buffers as applicable.  MD_CLEAR
+     *   exists architecturally, even when the side effects have been removed.
+     *
+     *   Use VERW to scrub on return-to-guest.  Parts with L1D_FLUSH to
+     *   mitigate L1TF have the same side effect, so no need to do both.
+     *
+     *   Various Atoms suffer from Store-buffer sampling only.  Store buffers
+     *   are statically partitioned between non-idle threads, so scrubbing is
+     *   wanted when going idle too.
+     *
+     *   Load ports and Fill buffers are competitively shared between threads.
+     *   SMT must be disabled for VERW scrubbing to be fully effective.
+     *
+     * - November 2019, for TAA.  Extended VERW side effects to TSX-enabled
+     *   MDS_NO parts.
+     *
+     * - February 2022, for Client TSX de-feature.  Removed VERW side effects
+     *   from Client CPUs only.
+     *
+     * - May 2022, for MMIO Stale Data.  (Re)introduced Fill Buffer scrubbing
+     *   on all MMIO-affected parts which didn't already have it for MDS
+     *   reasons, enumerating FB_CLEAR on those parts only.
+     *
+     *   If FB_CLEAR is enumerated, L1D_FLUSH does not have the same scrubbing
+     *   side effects as VERW and cannot be used in its place.
+     */
     mds_calculations();
 
     /*
-     * Parts which enumerate FB_CLEAR are those which are post-MDS_NO and have
-     * reintroduced the VERW fill buffer flushing side effect because of a
-     * susceptibility to FBSDP.
+     * Parts which enumerate FB_CLEAR are those with now-updated microcode
+     * which weren't susceptible to the original MFBDS (and therefore didn't
+     * have Fill Buffer scrubbing side effects to begin with, or were Client
+     * MDS_NO non-TAA_NO parts where the scrubbing was removed), but have had
+     * the scrubbing reintroduced because of a susceptibility to FBSDP.
      *
      * If unprivileged guests have (or will have) MMIO mappings, we can
      * mitigate cross-domain leakage of fill buffer data by issuing VERW on
-     * the return-to-guest path.
+     * the return-to-guest path.  This is only a token effort if SMT is
+     * active.
      */
     if ( opt_unpriv_mmio )
         opt_verw_mmio = cpu_has_fb_clear;
 
     /*
-     * By default, enable PV and HVM mitigations on MDS-vulnerable hardware.
-     * This will only be a token effort for MLPDS/MFBDS when HT is enabled,
-     * but it is somewhat better than nothing.
+     * MD_CLEAR is enumerated architecturally forevermore, even after the
+     * scrubbing side effects have been removed.  Create ourselves a version
+     * which expresses whether we think MD_CLEAR is having any useful side
+     * effect.
+     */
+    cpu_has_useful_md_clear = (cpu_has_md_clear &&
+                               (cpu_has_bug_mds || cpu_has_bug_msbds_only));
+
+    /*
+     * By default, use VERW scrubbing on applicable hardware, if we think it's
+     * going to have an effect.  This will only be a token effort for
+     * MLPDS/MFBDS when SMT is enabled.
      */
     if ( opt_verw_pv == -1 )
-        opt_verw_pv = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                       cpu_has_md_clear);
+        opt_verw_pv = cpu_has_useful_md_clear;
 
     if ( opt_verw_hvm == -1 )
-        opt_verw_hvm = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                        cpu_has_md_clear);
+        opt_verw_hvm = cpu_has_useful_md_clear;
 
     /*
-     * Enable MDS/MMIO defences as applicable.  The Idle blocks need using if
-     * either the PV or HVM MDS defences are used, or if we may give MMIO
-     * access to untrusted guests.
-     *
-     * HVM is more complicated.  The MD_CLEAR microcode extends L1D_FLUSH with
-     * equivalent semantics to avoid needing to perform both flushes on the
-     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH (for
-     * MDS mitigations.  L1D_FLUSH is not safe for MMIO mitigations.)
-     *
-     * After calculating the appropriate idle setting, simplify
-     * opt_verw_hvm to mean just "should we VERW on the way into HVM
-     * guests", so spec_ctrl_init_domain() can calculate suitable settings.
+     * If SMT is active, and we're protecting against MDS or MMIO stale data,
+     * we need to scrub before going idle as well as on return to guest.
+     * Various pipeline resources are repartitioned amongst non-idle threads.
      */
-    if ( opt_verw_pv || opt_verw_hvm || opt_verw_mmio )
+    if ( ((cpu_has_useful_md_clear && (opt_verw_pv || opt_verw_hvm)) ||
+          opt_verw_mmio) && hw_smt_enabled )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
-    opt_verw_hvm &= !cpu_has_skip_l1dfl && !opt_l1d_flush;
+
+    /*
+     * After calculating the appropriate idle setting, simplify opt_verw_hvm
+     * to mean just "should we VERW on the way into HVM guests", so
+     * spec_ctrl_init_domain() can calculate suitable settings.
+     *
+     * It is only safe to use L1D_FLUSH in place of VERW when MD_CLEAR is the
+     * only *_CLEAR we can see.
+     */
+    if ( opt_l1d_flush && cpu_has_md_clear && !cpu_has_fb_clear )
+        opt_verw_hvm = false;
 
     /*
      * Warn the user if they are on MLPDS/MFBDS-vulnerable hardware with HT
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Mitigate Register File Data Sampling

RFDS affects Atom cores, also branded E-cores, between the Goldmont and
Gracemont microarchitectures.  This includes Alder Lake and Raptor Lake hybrid
client systems which have a mix of Gracemont and other types of cores.

Two new bits have been defined: RFDS_CLEAR to indicate that VERW has gained
more side effects, and RFDS_NO to indicate that the system is unaffected.  Plenty of
unaffected CPUs won't be getting RFDS_NO retrofitted in microcode, so we
synthesise it.  Alder Lake and Raptor Lake Xeon-E's are unaffected due to
their platform configuration, and we must use the Hybrid CPUID bit to
distinguish them from their non-Xeon counterparts.

Like MD_CLEAR and FB_CLEAR, RFDS_CLEAR needs OR-ing across a resource pool, so
set it in the max policies and reflect the host setting in default.
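
To illustrate the levelling (a hypothetical two-host pool, not code from the
patch):

    #include <assert.h>
    #include <stdbool.h>

    int main(void)
    {
        /* Host A: affected part with updated ucode.  Host B: unaffected. */
        bool hostA_rfds_clear = true, hostB_rfds_clear = false;

        /* Max policies: RFDS_CLEAR always advertisable, so a pool holding
         * any affected host can level guests onto it. */
        bool maxA = true, maxB = true;

        /* Default policies reflect only what each host enumerates. */
        bool defA = hostA_rfds_clear, defB = hostB_rfds_clear;

        /* A guest able to run anywhere needs the OR of the hosts' values,
         * which is permitted because it stays within both max policies. */
        bool pool_rfds_clear = defA || defB;

        assert(pool_rfds_clear && maxA && maxB);
        return 0;
    }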

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit fb5b6f6744713410c74cfc12b7176c108e3c9a31)

diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
index 860ef07352c1..3faf1e430487 100644
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -169,7 +169,7 @@ static const char *const str_7d0[32] =
     [ 8] = "avx512-vp2intersect", [ 9] = "srbds-ctrl",
     [10] = "md-clear",            [11] = "rtm-always-abort",
     /* 12 */                [13] = "tsx-force-abort",
-    [14] = "serialize",
+    [14] = "serialize",     [15] = "hybrid",
     [16] = "tsxldtrk",
     [18] = "pconfig",
     [20] = "cet-ibt",
@@ -224,7 +224,8 @@ static const char *const str_m10Al[32] =
     [20] = "bhi-no",              [21] = "xapic-status",
     /* 22 */                      [23] = "ovrclk-status",
     [24] = "pbrsb-no",            [25] = "gds-ctrl",
-    [26] = "gds-no",
+    [26] = "gds-no",              [27] = "rfds-no",
+    [28] = "rfds-clear",
 };
 
 static const char *const str_m10Ah[32] =
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index c6cd91db0a0f..c98b8fc93a06 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -443,6 +443,7 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
          */
         __set_bit(X86_FEATURE_MD_CLEAR, fs);
         __set_bit(X86_FEATURE_FB_CLEAR, fs);
+        __set_bit(X86_FEATURE_RFDS_CLEAR, fs);
 
         /*
          * The Gather Data Sampling microcode mitigation (August 2023) has an
@@ -492,6 +493,10 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
         if ( cpu_has_fb_clear )
             __set_bit(X86_FEATURE_FB_CLEAR, fs);
 
+        __clear_bit(X86_FEATURE_RFDS_CLEAR, fs);
+        if ( cpu_has_rfds_clear )
+            __set_bit(X86_FEATURE_RFDS_CLEAR, fs);
+
         /*
          * The Gather Data Sampling microcode mitigation (August 2023) has an
          * adverse performance impact on the CLWB instruction on SKX/CLX/CPX.
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 84a382781b17..dd86b89bb153 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -432,7 +432,7 @@ static void __init print_details(enum ind_thunk thunk)
      * Hardware read-only information, stating immunity to certain issues, or
      * suggestions of which mitigation to use.
      */
-    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (caps & ARCH_CAPS_RDCL_NO)                        ? " RDCL_NO"        : "",
            (caps & ARCH_CAPS_EIBRS)                          ? " EIBRS"          : "",
            (caps & ARCH_CAPS_RSBA)                           ? " RSBA"           : "",
@@ -448,6 +448,7 @@ static void __init print_details(enum ind_thunk thunk)
            (caps & ARCH_CAPS_FB_CLEAR)                       ? " FB_CLEAR"       : "",
            (caps & ARCH_CAPS_PBRSB_NO)                       ? " PBRSB_NO"       : "",
            (caps & ARCH_CAPS_GDS_NO)                         ? " GDS_NO"         : "",
+           (caps & ARCH_CAPS_RFDS_NO)                        ? " RFDS_NO"        : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_ALWAYS))    ? " IBRS_ALWAYS"    : "",
            (e8b  & cpufeat_mask(X86_FEATURE_STIBP_ALWAYS))   ? " STIBP_ALWAYS"   : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_FAST))      ? " IBRS_FAST"      : "",
@@ -458,7 +459,7 @@ static void __init print_details(enum ind_thunk thunk)
            (e21a & cpufeat_mask(X86_FEATURE_SRSO_NO))        ? " SRSO_NO"        : "");
 
     /* Hardware features which need driving to mitigate issues. */
-    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (e8b  & cpufeat_mask(X86_FEATURE_IBPB)) ||
            (_7d0 & cpufeat_mask(X86_FEATURE_IBRSB))          ? " IBPB"           : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS)) ||
@@ -476,6 +477,7 @@ static void __init print_details(enum ind_thunk thunk)
            (caps & ARCH_CAPS_TSX_CTRL)                       ? " TSX_CTRL"       : "",
            (caps & ARCH_CAPS_FB_CLEAR_CTRL)                  ? " FB_CLEAR_CTRL"  : "",
            (caps & ARCH_CAPS_GDS_CTRL)                       ? " GDS_CTRL"       : "",
+           (caps & ARCH_CAPS_RFDS_CLEAR)                     ? " RFDS_CLEAR"     : "",
            (e21a & cpufeat_mask(X86_FEATURE_SBPB))           ? " SBPB"           : "");
 
     /* Compiled-in support which pertains to mitigations. */
@@ -1295,6 +1297,83 @@ static __init void mds_calculations(void)
     }
 }
 
+/*
+ * Register File Data Sampling affects Atom cores from the Goldmont to
+ * Gracemont microarchitectures.  The March 2024 microcode adds RFDS_NO to
+ * some but not all unaffected parts, and RFDS_CLEAR to affected parts still
+ * in support.
+ *
+ * Alder Lake and Raptor Lake client CPUs have a mix of P cores
+ * (Golden/Raptor Cove, not vulnerable) and E cores (Gracemont,
+ * vulnerable), and both enumerate RFDS_CLEAR.
+ *
+ * Both exist in a Xeon SKU, which has the E cores (Gracemont) disabled by
+ * platform configuration, and enumerate RFDS_NO.
+ *
+ * With older parts, or with out-of-date microcode, synthesise RFDS_NO when
+ * safe to do so.
+ *
+ * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/register-file-data-sampling.html
+ */
+static void __init rfds_calculations(void)
+{
+    /* RFDS is only known to affect Intel Family 6 processors at this time. */
+    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+         boot_cpu_data.x86 != 6 )
+        return;
+
+    /*
+     * If RFDS_NO or RFDS_CLEAR are visible, we've either got suitable
+     * microcode, or an RFDS-aware hypervisor is levelling us in a pool.
+     */
+    if ( cpu_has_rfds_no || cpu_has_rfds_clear )
+        return;
+
+    /* If we're virtualised, don't attempt to synthesise RFDS_NO. */
+    if ( cpu_has_hypervisor )
+        return;
+
+    /*
+     * Not all CPUs are expected to get a microcode update enumerating one of
+     * RFDS_{NO,CLEAR}, or we might have out-of-date microcode.
+     */
+    switch ( boot_cpu_data.x86_model )
+    {
+    case 0x97: /* INTEL_FAM6_ALDERLAKE */
+    case 0xB7: /* INTEL_FAM6_RAPTORLAKE */
+        /*
+         * Alder Lake and Raptor Lake might be a client SKU (with the
+         * Gracemont cores active, and therefore vulnerable) or might be a
+         * server SKU (with the Gracemont cores disabled, and therefore not
+         * vulnerable).
+         *
+         * See if the CPU identifies as hybrid to distinguish the two cases.
+         */
+        if ( !cpu_has_hybrid )
+            break;
+        /* fallthrough */
+    case 0x9A: /* INTEL_FAM6_ALDERLAKE_L */
+    case 0xBA: /* INTEL_FAM6_RAPTORLAKE_P */
+    case 0xBF: /* INTEL_FAM6_RAPTORLAKE_S */
+
+    case 0x5C: /* INTEL_FAM6_ATOM_GOLDMONT */      /* Apollo Lake */
+    case 0x5F: /* INTEL_FAM6_ATOM_GOLDMONT_D */    /* Denverton */
+    case 0x7A: /* INTEL_FAM6_ATOM_GOLDMONT_PLUS */ /* Gemini Lake */
+    case 0x86: /* INTEL_FAM6_ATOM_TREMONT_D */     /* Snow Ridge / Parker Ridge */
+    case 0x96: /* INTEL_FAM6_ATOM_TREMONT */       /* Elkhart Lake */
+    case 0x9C: /* INTEL_FAM6_ATOM_TREMONT_L */     /* Jasper Lake */
+    case 0xBE: /* INTEL_FAM6_ATOM_GRACEMONT */     /* Alder Lake N */
+        return;
+    }
+
+    /*
+     * We appear to be on an unaffected CPU which didn't enumerate RFDS_NO,
+     * perhaps because of its age or because of out-of-date microcode.
+     * Synthesise it.
+     */
+    setup_force_cpu_cap(X86_FEATURE_RFDS_NO);
+}
+
 static bool __init cpu_has_gds(void)
 {
     /*
@@ -1759,6 +1838,7 @@ void __init init_speculation_mitigations(void)
      *
      * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-analysis-microarchitectural-data-sampling.html
      * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
+     * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/register-file-data-sampling.html
      *
      * Relevant ucodes:
      *
@@ -1788,8 +1868,12 @@ void __init init_speculation_mitigations(void)
      *
      *   If FB_CLEAR is enumerated, L1D_FLUSH does not have the same scrubbing
      *   side effects as VERW and cannot be used in its place.
+     *
+     * - March 2024, for RFDS.  Enumerate RFDS_CLEAR to mean that VERW now
+     *   scrubs non-architectural entries from certain register files.
      */
     mds_calculations();
+    rfds_calculations();
 
     /*
      * Parts which enumerate FB_CLEAR are those with now-updated microcode
@@ -1821,15 +1905,19 @@ void __init init_speculation_mitigations(void)
      * MLPDS/MFBDS when SMT is enabled.
      */
     if ( opt_verw_pv == -1 )
-        opt_verw_pv = cpu_has_useful_md_clear;
+        opt_verw_pv = cpu_has_useful_md_clear || cpu_has_rfds_clear;
 
     if ( opt_verw_hvm == -1 )
-        opt_verw_hvm = cpu_has_useful_md_clear;
+        opt_verw_hvm = cpu_has_useful_md_clear || cpu_has_rfds_clear;
 
     /*
      * If SMT is active, and we're protecting against MDS or MMIO stale data,
      * we need to scrub before going idle as well as on return to guest.
      * Various pipeline resources are repartitioned amongst non-idle threads.
+     *
+     * We don't need to scrub on idle for RFDS.  There are no affected cores
+     * which support SMT, despite there being affected cores in hybrid systems
+     * which have SMT elsewhere in the platform.
      */
     if ( ((cpu_has_useful_md_clear && (opt_verw_pv || opt_verw_hvm)) ||
           opt_verw_mmio) && hw_smt_enabled )
@@ -1843,7 +1931,8 @@ void __init init_speculation_mitigations(void)
      * It is only safe to use L1D_FLUSH in place of VERW when MD_CLEAR is the
      * only *_CLEAR we can see.
      */
-    if ( opt_l1d_flush && cpu_has_md_clear && !cpu_has_fb_clear )
+    if ( opt_l1d_flush && cpu_has_md_clear && !cpu_has_fb_clear &&
+         !cpu_has_rfds_clear )
         opt_verw_hvm = false;
 
     /*
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index 3b116c6dd364..892af113849a 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -141,6 +141,7 @@
 #define cpu_has_rtm_always_abort boot_cpu_has(X86_FEATURE_RTM_ALWAYS_ABORT)
 #define cpu_has_tsx_force_abort boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)
 #define cpu_has_serialize       boot_cpu_has(X86_FEATURE_SERIALIZE)
+#define cpu_has_hybrid          boot_cpu_has(X86_FEATURE_HYBRID)
 #define cpu_has_arch_caps       boot_cpu_has(X86_FEATURE_ARCH_CAPS)
 
 /* CPUID level 0x00000007:1.eax */
@@ -160,6 +161,8 @@
 #define cpu_has_rrsba           boot_cpu_has(X86_FEATURE_RRSBA)
 #define cpu_has_gds_ctrl        boot_cpu_has(X86_FEATURE_GDS_CTRL)
 #define cpu_has_gds_no          boot_cpu_has(X86_FEATURE_GDS_NO)
+#define cpu_has_rfds_no         boot_cpu_has(X86_FEATURE_RFDS_NO)
+#define cpu_has_rfds_clear      boot_cpu_has(X86_FEATURE_RFDS_CLEAR)
 
 /* Synthesized. */
 #define cpu_has_arch_perfmon    boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index 8b3ad575dbc0..17abfa0e4a2d 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -70,6 +70,8 @@
 #define  ARCH_CAPS_PBRSB_NO                 (_AC(1, ULL) << 24)
 #define  ARCH_CAPS_GDS_CTRL                 (_AC(1, ULL) << 25)
 #define  ARCH_CAPS_GDS_NO                   (_AC(1, ULL) << 26)
+#define  ARCH_CAPS_RFDS_NO                  (_AC(1, ULL) << 27)
+#define  ARCH_CAPS_RFDS_CLEAR               (_AC(1, ULL) << 28)
 
 #define MSR_FLUSH_CMD                       0x0000010b
 #define  FLUSH_CMD_L1D                      (_AC(1, ULL) <<  0)
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index b4f68a59b859..e027a31d67ac 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -277,6 +277,7 @@ XEN_CPUFEATURE(MD_CLEAR,      9*32+10) /*!A VERW clears microarchitectural buffe
 XEN_CPUFEATURE(RTM_ALWAYS_ABORT, 9*32+11) /*! June 2021 TSX defeaturing in microcode. */
 XEN_CPUFEATURE(TSX_FORCE_ABORT, 9*32+13) /* MSR_TSX_FORCE_ABORT.RTM_ABORT */
 XEN_CPUFEATURE(SERIALIZE,     9*32+14) /*a  SERIALIZE insn */
+XEN_CPUFEATURE(HYBRID,        9*32+15) /*   Heterogeneous platform */
 XEN_CPUFEATURE(TSXLDTRK,      9*32+16) /*a  TSX load tracking suspend/resume insns */
 XEN_CPUFEATURE(CET_IBT,       9*32+20) /*   CET - Indirect Branch Tracking */
 XEN_CPUFEATURE(IBRSB,         9*32+26) /*A  IBRS and IBPB support (used by Intel) */
@@ -331,6 +332,8 @@ XEN_CPUFEATURE(OVRCLK_STATUS,      16*32+23) /*   MSR_OVERCLOCKING_STATUS */
 XEN_CPUFEATURE(PBRSB_NO,           16*32+24) /*A  No Post-Barrier RSB predictions */
 XEN_CPUFEATURE(GDS_CTRL,           16*32+25) /*   MCU_OPT_CTRL.GDS_MIT_{DIS,LOCK} */
 XEN_CPUFEATURE(GDS_NO,             16*32+26) /*A  No Gather Data Sampling */
+XEN_CPUFEATURE(RFDS_NO,            16*32+27) /*A  No Register File Data Sampling */
+XEN_CPUFEATURE(RFDS_CLEAR,         16*32+28) /*!A Register File(s) cleared by VERW */
 
 /* Intel-defined CPU features, MSR_ARCH_CAPS 0x10a.edx, word 17 */
 
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/entry: Introduce EFRAME_* constants

restore_all_guest() does a lot of manipulation of the stack after popping the
GPRs, and uses raw %rsp displacements to do so.  Also, almost all entry paths
use raw %rsp displacements prior to pushing GPRs.

Provide better mnemonics, to aid readability and reduce the chance of errors
when editing.

No functional change.  The resulting binary is identical.
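
As an illustration only (a user-space sketch with an assumed, cut-down frame
layout), the EFRAME_* values reduce to exactly the raw displacements they
replace:

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Assumed, cut-down tail of struct cpu_user_regs, for illustration. */
    struct eframe_tail {
        uint32_t error_code;   /* %rsp points here in the entry/exit logic */
        uint32_t entry_vector;
        uint64_t rip;
        uint64_t cs;           /* cs plus its padding, assumed 8 bytes */
        uint64_t eflags;
        uint64_t rsp;
    };

    #define OFFSET_EF(mem)                           \
        (offsetof(struct eframe_tail, mem) -         \
         offsetof(struct eframe_tail, error_code))

    int main(void)
    {
        assert(OFFSET_EF(entry_vector) ==  4);   /* was  4(%rsp) */
        assert(OFFSET_EF(rip)          ==  8);   /* was  8(%rsp) */
        assert(OFFSET_EF(cs)           == 16);   /* was 16(%rsp) */
        assert(OFFSET_EF(eflags)       == 24);   /* was 24(%rsp) */
        assert(OFFSET_EF(rsp)          == 32);   /* was 32(%rsp) */
        return 0;
    }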

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 37541208f119a9c552c6c6c3246ea61be0d44035)

diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 287dac101ad4..31fa63b77fd1 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -51,6 +51,23 @@ void __dummy__(void)
     OFFSET(UREGS_kernel_sizeof, struct cpu_user_regs, es);
     BLANK();
 
+    /*
+     * EFRAME_* is for the entry/exit logic where %rsp is pointing at
+     * UREGS_error_code and GPRs are still/already guest values.
+     */
+#define OFFSET_EF(sym, mem)                                             \
+    DEFINE(sym, offsetof(struct cpu_user_regs, mem) -                   \
+                offsetof(struct cpu_user_regs, error_code))
+
+    OFFSET_EF(EFRAME_entry_vector,    entry_vector);
+    OFFSET_EF(EFRAME_rip,             rip);
+    OFFSET_EF(EFRAME_cs,              cs);
+    OFFSET_EF(EFRAME_eflags,          eflags);
+    OFFSET_EF(EFRAME_rsp,             rsp);
+    BLANK();
+
+#undef OFFSET_EF
+
     OFFSET(VCPU_processor, struct vcpu, processor);
     OFFSET(VCPU_domain, struct vcpu, domain);
     OFFSET(VCPU_vcpu_info, struct vcpu, vcpu_info);
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index 253bb1688c4f..7c211314d885 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -15,7 +15,7 @@ ENTRY(entry_int82)
         ENDBR64
         ALTERNATIVE "", clac, X86_FEATURE_XEN_SMAP
         pushq $0
-        movl  $HYPERCALL_VECTOR, 4(%rsp)
+        movl  $HYPERCALL_VECTOR, EFRAME_entry_vector(%rsp)
         SAVE_ALL compat=1 /* DPL1 gate, restricted to 32bit PV guests only. */
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 837a31b40524..10f11986d8b9 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -190,15 +190,15 @@ restore_all_guest:
         SPEC_CTRL_EXIT_TO_PV    /* Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */
 
         RESTORE_ALL
-        testw $TRAP_syscall,4(%rsp)
+        testw $TRAP_syscall, EFRAME_entry_vector(%rsp)
         jz    iret_exit_to_guest
 
-        movq  24(%rsp),%r11           # RFLAGS
+        mov   EFRAME_eflags(%rsp), %r11
         andq  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), %r11
         orq   $X86_EFLAGS_IF,%r11
 
         /* Don't use SYSRET path if the return address is not canonical. */
-        movq  8(%rsp),%rcx
+        mov   EFRAME_rip(%rsp), %rcx
         sarq  $47,%rcx
         incl  %ecx
         cmpl  $1,%ecx
@@ -213,20 +213,20 @@ restore_all_guest:
         ALTERNATIVE "", rag_clrssbsy, X86_FEATURE_XEN_SHSTK
 #endif
 
-        movq  8(%rsp), %rcx           # RIP
-        cmpw  $FLAT_USER_CS32,16(%rsp)# CS
-        movq  32(%rsp),%rsp           # RSP
+        mov   EFRAME_rip(%rsp), %rcx
+        cmpw  $FLAT_USER_CS32, EFRAME_cs(%rsp)
+        mov   EFRAME_rsp(%rsp), %rsp
         je    1f
         sysretq
 1:      sysretl
 
         ALIGN
 .Lrestore_rcx_iret_exit_to_guest:
-        movq  8(%rsp), %rcx           # RIP
+        mov   EFRAME_rip(%rsp), %rcx
 /* No special register assumptions. */
 iret_exit_to_guest:
-        andl  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), 24(%rsp)
-        orl   $X86_EFLAGS_IF,24(%rsp)
+        andl  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), EFRAME_eflags(%rsp)
+        orl   $X86_EFLAGS_IF, EFRAME_eflags(%rsp)
         addq  $8,%rsp
 .Lft0:  iretq
         _ASM_PRE_EXTABLE(.Lft0, handle_exception)
@@ -257,7 +257,7 @@ ENTRY(lstar_enter)
         pushq $FLAT_KERNEL_CS64
         pushq %rcx
         pushq $0
-        movl  $TRAP_syscall, 4(%rsp)
+        movl  $TRAP_syscall, EFRAME_entry_vector(%rsp)
         SAVE_ALL
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
@@ -294,7 +294,7 @@ ENTRY(cstar_enter)
         pushq $FLAT_USER_CS32
         pushq %rcx
         pushq $0
-        movl  $TRAP_syscall, 4(%rsp)
+        movl  $TRAP_syscall, EFRAME_entry_vector(%rsp)
         SAVE_ALL
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
@@ -335,7 +335,7 @@ GLOBAL(sysenter_eflags_saved)
         pushq $3 /* ring 3 null cs */
         pushq $0 /* null rip */
         pushq $0
-        movl  $TRAP_syscall, 4(%rsp)
+        movl  $TRAP_syscall, EFRAME_entry_vector(%rsp)
         SAVE_ALL
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
@@ -389,7 +389,7 @@ ENTRY(int80_direct_trap)
         ENDBR64
         ALTERNATIVE "", clac, X86_FEATURE_XEN_SMAP
         pushq $0
-        movl  $0x80, 4(%rsp)
+        movl  $0x80, EFRAME_entry_vector(%rsp)
         SAVE_ALL
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
@@ -707,7 +707,7 @@ ENTRY(common_interrupt)
 
 ENTRY(page_fault)
         ENDBR64
-        movl  $TRAP_page_fault,4(%rsp)
+        movl  $TRAP_page_fault, EFRAME_entry_vector(%rsp)
 /* No special register assumptions. */
 GLOBAL(handle_exception)
         ALTERNATIVE "", clac, X86_FEATURE_XEN_SMAP
@@ -849,90 +849,90 @@ FATAL_exception_with_ints_disabled:
 ENTRY(divide_error)
         ENDBR64
         pushq $0
-        movl  $TRAP_divide_error,4(%rsp)
+        movl  $TRAP_divide_error, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(coprocessor_error)
         ENDBR64
         pushq $0
-        movl  $TRAP_copro_error,4(%rsp)
+        movl  $TRAP_copro_error, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(simd_coprocessor_error)
         ENDBR64
         pushq $0
-        movl  $TRAP_simd_error,4(%rsp)
+        movl  $TRAP_simd_error, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(device_not_available)
         ENDBR64
         pushq $0
-        movl  $TRAP_no_device,4(%rsp)
+        movl  $TRAP_no_device, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(debug)
         ENDBR64
         pushq $0
-        movl  $TRAP_debug,4(%rsp)
+        movl  $TRAP_debug, EFRAME_entry_vector(%rsp)
         jmp   handle_ist_exception
 
 ENTRY(int3)
         ENDBR64
         pushq $0
-        movl  $TRAP_int3,4(%rsp)
+        movl  $TRAP_int3, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(overflow)
         ENDBR64
         pushq $0
-        movl  $TRAP_overflow,4(%rsp)
+        movl  $TRAP_overflow, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(bounds)
         ENDBR64
         pushq $0
-        movl  $TRAP_bounds,4(%rsp)
+        movl  $TRAP_bounds, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(invalid_op)
         ENDBR64
         pushq $0
-        movl  $TRAP_invalid_op,4(%rsp)
+        movl  $TRAP_invalid_op, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(invalid_TSS)
         ENDBR64
-        movl  $TRAP_invalid_tss,4(%rsp)
+        movl  $TRAP_invalid_tss, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(segment_not_present)
         ENDBR64
-        movl  $TRAP_no_segment,4(%rsp)
+        movl  $TRAP_no_segment, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(stack_segment)
         ENDBR64
-        movl  $TRAP_stack_error,4(%rsp)
+        movl  $TRAP_stack_error, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(general_protection)
         ENDBR64
-        movl  $TRAP_gp_fault,4(%rsp)
+        movl  $TRAP_gp_fault, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(alignment_check)
         ENDBR64
-        movl  $TRAP_alignment_check,4(%rsp)
+        movl  $TRAP_alignment_check, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(entry_CP)
         ENDBR64
-        movl  $X86_EXC_CP, 4(%rsp)
+        movl  $X86_EXC_CP, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(double_fault)
         ENDBR64
-        movl  $TRAP_double_fault,4(%rsp)
+        movl  $TRAP_double_fault, EFRAME_entry_vector(%rsp)
         /* Set AC to reduce chance of further SMAP faults */
         ALTERNATIVE "", stac, X86_FEATURE_XEN_SMAP
         SAVE_ALL
@@ -958,7 +958,7 @@ ENTRY(double_fault)
         .pushsection .init.text, "ax", @progbits
 ENTRY(early_page_fault)
         ENDBR64
-        movl  $TRAP_page_fault,4(%rsp)
+        movl  $TRAP_page_fault, EFRAME_entry_vector(%rsp)
         SAVE_ALL
         movq  %rsp,%rdi
         call  do_early_page_fault
@@ -968,7 +968,7 @@ ENTRY(early_page_fault)
 ENTRY(nmi)
         ENDBR64
         pushq $0
-        movl  $TRAP_nmi,4(%rsp)
+        movl  $TRAP_nmi, EFRAME_entry_vector(%rsp)
 handle_ist_exception:
         ALTERNATIVE "", clac, X86_FEATURE_XEN_SMAP
         SAVE_ALL
@@ -1075,7 +1075,7 @@ handle_ist_exception:
 ENTRY(machine_check)
         ENDBR64
         pushq $0
-        movl  $TRAP_machine_check,4(%rsp)
+        movl  $TRAP_machine_check, EFRAME_entry_vector(%rsp)
         jmp   handle_ist_exception
 
 /* No op trap handler.  Required for kexec crash path. */
@@ -1112,7 +1112,7 @@ autogen_stubs: /* Automatically generated stubs. */
 1:
         ENDBR64
         pushq $0
-        movb  $vec,4(%rsp)
+        movb  $vec, EFRAME_entry_vector(%rsp)
         jmp   common_interrupt
 
         entrypoint 1b
@@ -1126,7 +1126,7 @@ autogen_stubs: /* Automatically generated stubs. */
         test  $8,%spl        /* 64bit exception frames are 16 byte aligned, but the word */
         jz    2f             /* size is 8 bytes.  Check whether the processor gave us an */
         pushq $0             /* error code, and insert an empty one if not.              */
-2:      movb  $vec,4(%rsp)
+2:      movb  $vec, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
         entrypoint 1b
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/cpu-policy: Allow for levelling of VERW side effects

MD_CLEAR and FB_CLEAR need OR-ing across a migrate pool.  Allow this, by
having them unconditionally set in max, with the host values reflected in
default.  Annotate the bits as having special properties.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit de17162cafd27f2865a3102a2ec0f386a02ed03d)

diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index f38063b667b0..34f778dbafbb 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -434,6 +434,16 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
         __set_bit(X86_FEATURE_RSBA, fs);
         __set_bit(X86_FEATURE_RRSBA, fs);
 
+        /*
+         * These bits indicate that the VERW instruction may have gained
+         * scrubbing side effects.  With pooling, they mean "you might migrate
+         * somewhere where scrubbing is necessary", and may need exposing on
+         * unaffected hardware.  This is fine, because the VERW instruction
+         * has been around since the 286.
+         */
+        __set_bit(X86_FEATURE_MD_CLEAR, fs);
+        __set_bit(X86_FEATURE_FB_CLEAR, fs);
+
         /*
          * The Gather Data Sampling microcode mitigation (August 2023) has an
          * adverse performance impact on the CLWB instruction on SKX/CLX/CPX.
@@ -468,6 +478,20 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
              cpu_has_rdrand && !is_forced_cpu_cap(X86_FEATURE_RDRAND) )
             __clear_bit(X86_FEATURE_RDRAND, fs);
 
+        /*
+         * These bits indicate that the VERW instruction may have gained
+         * scrubbing side effects.  The max policy has them set for migration
+         * reasons, so reset the default policy back to the host values in
+         * case we're unaffected.
+         */
+        __clear_bit(X86_FEATURE_MD_CLEAR, fs);
+        if ( cpu_has_md_clear )
+            __set_bit(X86_FEATURE_MD_CLEAR, fs);
+
+        __clear_bit(X86_FEATURE_FB_CLEAR, fs);
+        if ( cpu_has_fb_clear )
+            __set_bit(X86_FEATURE_FB_CLEAR, fs);
+
         /*
          * The Gather Data Sampling microcode mitigation (August 2023) has an
          * adverse performance impact on the CLWB instruction on SKX/CLX/CPX.
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index 1ac3d3a1f946..81ac4d76eea6 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -134,6 +134,7 @@
 #define cpu_has_avx512_4fmaps   boot_cpu_has(X86_FEATURE_AVX512_4FMAPS)
 #define cpu_has_avx512_vp2intersect boot_cpu_has(X86_FEATURE_AVX512_VP2INTERSECT)
 #define cpu_has_srbds_ctrl      boot_cpu_has(X86_FEATURE_SRBDS_CTRL)
+#define cpu_has_md_clear        boot_cpu_has(X86_FEATURE_MD_CLEAR)
 #define cpu_has_rtm_always_abort boot_cpu_has(X86_FEATURE_RTM_ALWAYS_ABORT)
 #define cpu_has_tsx_force_abort boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)
 #define cpu_has_serialize       boot_cpu_has(X86_FEATURE_SERIALIZE)
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index 0ee1d1d90330..2906eaa6c290 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -275,7 +275,7 @@ XEN_CPUFEATURE(AVX512_4FMAPS, 9*32+ 3) /*A  AVX512 Multiply Accumulation Single
 XEN_CPUFEATURE(FSRM,          9*32+ 4) /*A  Fast Short REP MOVS */
 XEN_CPUFEATURE(AVX512_VP2INTERSECT, 9*32+8) /*a  VP2INTERSECT{D,Q} insns */
 XEN_CPUFEATURE(SRBDS_CTRL,    9*32+ 9) /*   MSR_MCU_OPT_CTRL and RNGDS_MITG_DIS. */
-XEN_CPUFEATURE(MD_CLEAR,      9*32+10) /*A  VERW clears microarchitectural buffers */
+XEN_CPUFEATURE(MD_CLEAR,      9*32+10) /*!A VERW clears microarchitectural buffers */
 XEN_CPUFEATURE(RTM_ALWAYS_ABORT, 9*32+11) /*! June 2021 TSX defeaturing in microcode. */
 XEN_CPUFEATURE(TSX_FORCE_ABORT, 9*32+13) /* MSR_TSX_FORCE_ABORT.RTM_ABORT */
 XEN_CPUFEATURE(SERIALIZE,     9*32+14) /*a  SERIALIZE insn */
@@ -329,7 +329,7 @@ XEN_CPUFEATURE(DOITM,              16*32+12) /*   Data Operand Invariant Timing
 XEN_CPUFEATURE(SBDR_SSDP_NO,       16*32+13) /*A  No Shared Buffer Data Read or Sideband Stale Data Propagation */
 XEN_CPUFEATURE(FBSDP_NO,           16*32+14) /*A  No Fill Buffer Stale Data Propagation */
 XEN_CPUFEATURE(PSDP_NO,            16*32+15) /*A  No Primary Stale Data Propagation */
-XEN_CPUFEATURE(FB_CLEAR,           16*32+17) /*A  Fill Buffers cleared by VERW */
+XEN_CPUFEATURE(FB_CLEAR,           16*32+17) /*!A Fill Buffers cleared by VERW */
 XEN_CPUFEATURE(FB_CLEAR_CTRL,      16*32+18) /*   MSR_OPT_CPU_CTRL.FB_CLEAR_DIS */
 XEN_CPUFEATURE(RRSBA,              16*32+19) /*!  Restricted RSB Alternative */
 XEN_CPUFEATURE(BHI_NO,             16*32+20) /*A  No Branch History Injection  */
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vmx: Perform VERW flushing later in the VMExit path

Broken out of the following patch because this change is subtle enough on its
own.  See it for the rationale behind moving VERW.

As for how, extend the trick already used to hold one condition in
flags (RESUME vs LAUNCH) through the popping of GPRs.

Move the MOV CR earlier.  Intel specify flags to be undefined across it.

Encode the two conditions we want using SF and PF.  See the code comment for
exactly how.
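
A user-space sketch of the encoding (SCF_verw's numeric value below is an
assumption; the asm only relies on it fitting in the low byte, per the
BUILD_BUG_ON):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SCF_verw (1u << 3)          /* assumed value, low byte only */

    /* x86 PF is set when the low byte of a result has even parity. */
    static bool pf(uint32_t res)
    {
        return !(__builtin_popcount(res & 0xff) & 1);
    }

    int main(void)
    {
        for ( unsigned launched = 0; launched <= 1; launched++ )
            for ( unsigned verw = 0; verw <= 1; verw++ )
            {
                uint32_t ecx = launched << 31;       /* shl  $31, %ecx   */
                uint32_t eax = verw ? SCF_verw : 0;  /* and  $SCF_verw   */
                uint32_t res = ecx | eax;            /* or   %eax, %ecx  */

                /* jpe skips the VERW (PF set <=> SCF_verw was clear);
                 * jns takes the LAUNCH path (SF clear <=> !launched). */
                printf("launched=%u verw=%u => SF=%u PF=%u\n",
                       launched, verw, res >> 31, (unsigned)pf(res));
            }
        return 0;
    }

Both conditions survive the GPR pops because POP does not modify eflags.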

Leave a comment to explain the lack of any content around
SPEC_CTRL_EXIT_TO_VMX, but leave the block in place.  Sod's law says if we
delete it, we'll need to reintroduce it.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 475fa20b7384464210f42bad7195f87bd6f1c63f)

diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S
index 5f5de45a1309..cdde76e13892 100644
--- a/xen/arch/x86/hvm/vmx/entry.S
+++ b/xen/arch/x86/hvm/vmx/entry.S
@@ -87,17 +87,39 @@ UNLIKELY_END(realmode)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
         /* SPEC_CTRL_EXIT_TO_VMX   Req: %rsp=regs/cpuinfo              Clob:    */
-        DO_SPEC_CTRL_COND_VERW
+        /*
+         * All speculation safety work happens to be elsewhere.  VERW is after
+         * popping the GPRs, while restoring the guest MSR_SPEC_CTRL is left
+         * to the MSR load list.
+         */
 
         mov  VCPU_hvm_guest_cr2(%rbx),%rax
+        mov  %rax, %cr2
+
+        /*
+         * We need to perform two conditional actions (VERW, and Resume vs
+         * Launch) after popping GPRs.  With some cunning, we can encode both
+         * of these in eflags together.
+         *
+         * Parity is only calculated over the bottom byte of the answer, while
+         * Sign is simply the top bit.
+         *
+         * Therefore, the final OR instruction ends up producing:
+         *   SF = VCPU_vmx_launched
+         *   PF = !SCF_verw
+         */
+        BUILD_BUG_ON(SCF_verw & ~0xff)
+        movzbl VCPU_vmx_launched(%rbx), %ecx
+        shl  $31, %ecx
+        movzbl CPUINFO_spec_ctrl_flags(%rsp), %eax
+        and  $SCF_verw, %eax
+        or   %eax, %ecx
 
         pop  %r15
         pop  %r14
         pop  %r13
         pop  %r12
         pop  %rbp
-        mov  %rax,%cr2
-        cmpb $0,VCPU_vmx_launched(%rbx)
         pop  %rbx
         pop  %r11
         pop  %r10
@@ -108,7 +130,13 @@ UNLIKELY_END(realmode)
         pop  %rdx
         pop  %rsi
         pop  %rdi
-        je   .Lvmx_launch
+
+        jpe  .L_skip_verw
+        /* VERW clobbers ZF, but preserves all others, including SF. */
+        verw STK_REL(CPUINFO_verw_sel, CPUINFO_error_code)(%rsp)
+.L_skip_verw:
+
+        jns  .Lvmx_launch
 
 /*.Lvmx_resume:*/
         VMRESUME
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 31fa63b77fd1..a4e94d693024 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -135,6 +135,7 @@ void __dummy__(void)
 #endif
 
     OFFSET(CPUINFO_guest_cpu_user_regs, struct cpu_info, guest_cpu_user_regs);
+    OFFSET(CPUINFO_error_code, struct cpu_info, guest_cpu_user_regs.error_code);
     OFFSET(CPUINFO_verw_sel, struct cpu_info, verw_sel);
     OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);
     OFFSET(CPUINFO_per_cpu_offset, struct cpu_info, per_cpu_offset);
diff --git a/xen/include/asm-x86/asm_defns.h b/xen/include/asm-x86/asm_defns.h
index d9431180cfba..abc6822b08c8 100644
--- a/xen/include/asm-x86/asm_defns.h
+++ b/xen/include/asm-x86/asm_defns.h
@@ -81,6 +81,14 @@ register unsigned long current_stack_pointer asm("rsp");
 
 #ifdef __ASSEMBLY__
 
+.macro BUILD_BUG_ON condstr, cond:vararg
+        .if \cond
+        .error "Condition \"\condstr\" not satisfied"
+        .endif
+.endm
+/* preprocessor macro to make error message more user friendly */
+#define BUILD_BUG_ON(cond) BUILD_BUG_ON #cond, cond
+
 #ifdef HAVE_AS_QUOTED_SYM
 #define SUBSECTION_LBL(tag)                        \
         .ifndef .L.tag;                            \
diff --git a/xen/include/asm-x86/spec_ctrl_asm.h b/xen/include/asm-x86/spec_ctrl_asm.h
index 0e69971f663f..e807ff6d1db2 100644
--- a/xen/include/asm-x86/spec_ctrl_asm.h
+++ b/xen/include/asm-x86/spec_ctrl_asm.h
@@ -169,6 +169,13 @@
 #endif
 .endm
 
+/*
+ * Helper to improve the readability of stack displacements with %rsp in
+ * unusual positions.  Both @field and @top_of_stk should be constants from
+ * the same object.  @top_of_stk should be where %rsp is currently pointing.
+ */
+#define STK_REL(field, top_of_stk) ((field) - (top_of_stk))
+
 .macro DO_SPEC_CTRL_COND_VERW
 /*
  * Requires %rsp=cpuinfo
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Perform VERW flushing later in exit paths

On parts vulnerable to RFDS, VERW's side effects are extended to scrub all
non-architectural entries in various Physical Register Files.  To remove all
of Xen's values, the VERW must be after popping the GPRs.

Rework SPEC_CTRL_COND_VERW to default to a CPUINFO_error_code %rsp position,
but with overrides for other contexts.  Identify that it clobbers eflags; this
is particularly relevant for the SYSRET path.

For the IST exit return to Xen, have the main SPEC_CTRL_EXIT_TO_XEN put a
shadow copy of spec_ctrl_flags, as GPRs can't be used at the point we want to
issue the VERW.
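
A little-endian sketch of the stashing (the scf and selector values below are
hypothetical); the CPU zero-extended the pushed eflags, so the upper half of
that stack slot is free:

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        uint64_t eflags_slot = 0x246;   /* zero-extended by the CPU */
        uint8_t  scf = 0x04;            /* hypothetical spec_ctrl_flags */
        uint16_t sel = 0xe010;          /* hypothetical __HYPERVISOR_DS32 */

        /* movzbl scf, %ebx; or $(sel << 16), %ebx; mov %ebx, eflags+4 */
        uint32_t ebx = scf | ((uint32_t)sel << 16);
        memcpy((uint8_t *)&eflags_slot + 4, &ebx, sizeof(ebx));

        /* EFRAME_shadow_scf = eflags + 4, EFRAME_shadow_sel = eflags + 6 */
        uint8_t  shadow_scf;
        uint16_t shadow_sel;
        memcpy(&shadow_scf, (uint8_t *)&eflags_slot + 4, sizeof(shadow_scf));
        memcpy(&shadow_sel, (uint8_t *)&eflags_slot + 6, sizeof(shadow_sel));

        assert(shadow_scf == scf && shadow_sel == sel);
        assert((uint32_t)eflags_slot == 0x246);  /* real eflags untouched */
        return 0;
    }

On a non-IST exit the upper half stays zero, so the stashed SCF reads back as
0 and the VERW is skipped, which is what makes the unconditional read safe.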

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 0a666cf2cd99df6faf3eebc81a1fc286e4eca4c7)

diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index a4e94d693024..4cd5938d7b9d 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -55,14 +55,22 @@ void __dummy__(void)
      * EFRAME_* is for the entry/exit logic where %rsp is pointing at
      * UREGS_error_code and GPRs are still/already guest values.
      */
-#define OFFSET_EF(sym, mem)                                             \
+#define OFFSET_EF(sym, mem, ...)                                        \
     DEFINE(sym, offsetof(struct cpu_user_regs, mem) -                   \
-                offsetof(struct cpu_user_regs, error_code))
+                offsetof(struct cpu_user_regs, error_code) __VA_ARGS__)
 
     OFFSET_EF(EFRAME_entry_vector,    entry_vector);
     OFFSET_EF(EFRAME_rip,             rip);
     OFFSET_EF(EFRAME_cs,              cs);
     OFFSET_EF(EFRAME_eflags,          eflags);
+
+    /*
+     * These aren't real fields.  They're spare space, used by the IST
+     * exit-to-xen path.
+     */
+    OFFSET_EF(EFRAME_shadow_scf,      eflags, +4);
+    OFFSET_EF(EFRAME_shadow_sel,      eflags, +6);
+
     OFFSET_EF(EFRAME_rsp,             rsp);
     BLANK();
 
@@ -136,6 +144,7 @@ void __dummy__(void)
 
     OFFSET(CPUINFO_guest_cpu_user_regs, struct cpu_info, guest_cpu_user_regs);
     OFFSET(CPUINFO_error_code, struct cpu_info, guest_cpu_user_regs.error_code);
+    OFFSET(CPUINFO_rip, struct cpu_info, guest_cpu_user_regs.rip);
     OFFSET(CPUINFO_verw_sel, struct cpu_info, verw_sel);
     OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);
     OFFSET(CPUINFO_per_cpu_offset, struct cpu_info, per_cpu_offset);
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index 7c211314d885..3b2fbcd8733a 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -161,6 +161,12 @@ ENTRY(compat_restore_all_guest)
         SPEC_CTRL_EXIT_TO_PV    /* Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */
 
         RESTORE_ALL adj=8 compat=1
+
+        /* Account for ev/ec having already been popped off the stack. */
+        SPEC_CTRL_COND_VERW \
+            scf=STK_REL(CPUINFO_spec_ctrl_flags, CPUINFO_rip), \
+            sel=STK_REL(CPUINFO_verw_sel,        CPUINFO_rip)
+
 .Lft0:  iretq
         _ASM_PRE_EXTABLE(.Lft0, handle_exception)
 
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 10f11986d8b9..9b1fa9ed192f 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -214,6 +214,9 @@ restore_all_guest:
 #endif
 
         mov   EFRAME_rip(%rsp), %rcx
+
+        SPEC_CTRL_COND_VERW     /* Req: %rsp=eframe                    Clob: efl */
+
         cmpw  $FLAT_USER_CS32, EFRAME_cs(%rsp)
         mov   EFRAME_rsp(%rsp), %rsp
         je    1f
@@ -227,6 +230,9 @@ restore_all_guest:
 iret_exit_to_guest:
         andl  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), EFRAME_eflags(%rsp)
         orl   $X86_EFLAGS_IF, EFRAME_eflags(%rsp)
+
+        SPEC_CTRL_COND_VERW     /* Req: %rsp=eframe                    Clob: efl */
+
         addq  $8,%rsp
 .Lft0:  iretq
         _ASM_PRE_EXTABLE(.Lft0, handle_exception)
@@ -670,9 +676,22 @@ UNLIKELY_START(ne, exit_cr3)
 UNLIKELY_END(exit_cr3)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
-        SPEC_CTRL_EXIT_TO_XEN     /* Req: %r12=ist_exit %r14=end, Clob: abcd */
+        SPEC_CTRL_EXIT_TO_XEN /* Req: %r12=ist_exit %r14=end %rsp=regs, Clob: abcd */
 
         RESTORE_ALL adj=8
+
+        /*
+         * When the CPU pushed this exception frame, it zero-extended eflags.
+         * For an IST exit, SPEC_CTRL_EXIT_TO_XEN stashed shadow copies of
+         * spec_ctrl_flags and verw_sel above eflags, as we can't use any GPRs,
+         * and we're at a random place on the stack, not in a CPUINFO block.
+         *
+         * Account for ev/ec having already been popped off the stack.
+         */
+        SPEC_CTRL_COND_VERW \
+            scf=STK_REL(EFRAME_shadow_scf, EFRAME_rip), \
+            sel=STK_REL(EFRAME_shadow_sel, EFRAME_rip)
+
         iretq
 
 ENTRY(common_interrupt)
diff --git a/xen/include/asm-x86/spec_ctrl_asm.h b/xen/include/asm-x86/spec_ctrl_asm.h
index e807ff6d1db2..6e7725c11f3a 100644
--- a/xen/include/asm-x86/spec_ctrl_asm.h
+++ b/xen/include/asm-x86/spec_ctrl_asm.h
@@ -176,16 +176,23 @@
  */
 #define STK_REL(field, top_of_stk) ((field) - (top_of_stk))
 
-.macro DO_SPEC_CTRL_COND_VERW
+.macro SPEC_CTRL_COND_VERW \
+    scf=STK_REL(CPUINFO_spec_ctrl_flags, CPUINFO_error_code), \
+    sel=STK_REL(CPUINFO_verw_sel,        CPUINFO_error_code)
 /*
- * Requires %rsp=cpuinfo
+ * Requires \scf and \sel as %rsp-relative expressions
+ * Clobbers eflags
+ *
+ * VERW needs to run after guest GPRs have been restored, where only %rsp is
+ * good to use.  Default to expecting %rsp pointing at CPUINFO_error_code.
+ * Contexts where this is not true must provide an alternative \scf and \sel.
  *
  * Issue a VERW for its flushing side effect, if indicated.  This is a Spectre
  * v1 gadget, but the IRET/VMEntry is serialising.
  */
-    testb $SCF_verw, CPUINFO_spec_ctrl_flags(%rsp)
+    testb $SCF_verw, \scf(%rsp)
     jz .L\@_verw_skip
-    verw CPUINFO_verw_sel(%rsp)
+    verw \sel(%rsp)
 .L\@_verw_skip:
 .endm
 
@@ -303,8 +310,6 @@
  */
     ALTERNATIVE "", DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_PV
 
-    DO_SPEC_CTRL_COND_VERW
-
     ALTERNATIVE "", DO_SPEC_CTRL_DIV, X86_FEATURE_SC_DIV
 .endm
 
@@ -384,7 +389,7 @@ UNLIKELY_DISPATCH_LABEL(\@_serialise):
  */
 .macro SPEC_CTRL_EXIT_TO_XEN
 /*
- * Requires %r12=ist_exit, %r14=stack_end
+ * Requires %r12=ist_exit, %r14=stack_end, %rsp=regs
  * Clobbers %rax, %rbx, %rcx, %rdx
  */
     movzbl STACK_CPUINFO_FIELD(spec_ctrl_flags)(%r14), %ebx
@@ -412,11 +417,18 @@ UNLIKELY_DISPATCH_LABEL(\@_serialise):
     test %r12, %r12
     jz .L\@_skip_ist_exit
 
-    /* Logically DO_SPEC_CTRL_COND_VERW but without the %rsp=cpuinfo dependency */
-    testb $SCF_verw, %bl
-    jz .L\@_skip_verw
-    verw STACK_CPUINFO_FIELD(verw_sel)(%r14)
-.L\@_skip_verw:
+    /*
+     * Stash SCF and verw_sel above eflags in the case of an IST_exit.  The
+     * VERW logic needs to run after guest GPRs have been restored; i.e. where
+     * we cannot use %r12 or %r14 for the purposes they have here.
+     *
+     * When the CPU pushed this exception frame, it zero-extended eflags.
+     * Therefore it is safe for the VERW logic to look at the stashed SCF
+     * outside of the ist_exit condition.  Also, this stashing won't influence
+     * any other restore_all_guest() paths.
+     */
+    or $(__HYPERVISOR_DS32 << 16), %ebx
+    mov %ebx, UREGS_eflags + 4(%rsp) /* EFRAME_shadow_scf/sel */
 
     ALTERNATIVE "", DO_SPEC_CTRL_DIV, X86_FEATURE_SC_DIV
 
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Rename VERW related options

VERW is going to be used for a 3rd purpose, and the existing nomenclature
didn't survive the Stale MMIO issues terribly well.

Rename the command line option from `md-clear=` to `verw=`.  This is more
consistent with other options which tend to be named based on what they're
doing, not which feature enumeration they use behind the scenes.  Retain
`md-clear=` as a deprecated alias.

Rename opt_md_clear_{pv,hvm} and opt_fb_clear_mmio to opt_verw_{pv,hvm,mmio},
which has a side effect of making spec_ctrl_init_domain() rather clearer to
follow.

No functional change.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f7603ca252e4226739eb3129a5290ee3da3f8ea4)
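
For example, with the rename in place the following two settings are intended
to behave identically, the second now going via the deprecated alias:

    spec-ctrl=verw=no-hvm
    spec-ctrl=md-clear=no-hvm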

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index a7a1362bac28..029002fa82d6 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2260,7 +2260,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
 
 ### spec-ctrl (x86)
 > `= List of [ <bool>, xen=<bool>, {pv,hvm}=<bool>,
->              {msr-sc,rsb,md-clear,ibpb-entry}=<bool>|{pv,hvm}=<bool>,
+>              {msr-sc,rsb,verw,ibpb-entry}=<bool>|{pv,hvm}=<bool>,
 >              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,psfd,
 >              eager-fpu,l1d-flush,branch-harden,srb-lock,
 >              unpriv-mmio,gds-mit,div-scrub}=<bool> ]`
@@ -2285,7 +2285,7 @@ in place for guests to use.
 
 Use of a positive boolean value for either of these options is invalid.
 
-The `pv=`, `hvm=`, `msr-sc=`, `rsb=`, `md-clear=` and `ibpb-entry=` options
+The `pv=`, `hvm=`, `msr-sc=`, `rsb=`, `verw=` and `ibpb-entry=` options
 offer fine grained control over the primitives by Xen.  These impact Xen's
 ability to protect itself, and/or Xen's ability to virtualise support for
 guests to use.
@@ -2302,11 +2302,12 @@ guests to use.
   guests and if disabled, guests will be unable to use IBRS/STIBP/SSBD/etc.
 * `rsb=` offers control over whether to overwrite the Return Stack Buffer /
   Return Address Stack on entry to Xen and on idle.
-* `md-clear=` offers control over whether to use VERW to flush
-  microarchitectural buffers on idle and exit from Xen.  *Note: For
-  compatibility with development versions of this fix, `mds=` is also accepted
-  on Xen 4.12 and earlier as an alias.  Consult vendor documentation in
-  preference to here.*
+* `verw=` offers control over whether to use VERW for its scrubbing side
+  effects at appropriate privilege transitions.  The exact side effects are
+  microarchitecture and microcode specific.  *Note: `md-clear=` is accepted as
+  a deprecated alias.  For compatibility with development versions of XSA-297,
+  `mds=` is also accepted on Xen 4.12 and earlier as an alias.  Consult vendor
+  documentation in preference to here.*
 * `ibpb-entry=` offers control over whether IBPB (Indirect Branch Prediction
   Barrier) is used on entry to Xen.  This is used by default on hardware
   vulnerable to Branch Type Confusion, and hardware vulnerable to Speculative
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 6e82a126a3e2..292b5b1c7ba1 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -37,8 +37,8 @@ static bool __initdata opt_msr_sc_pv = true;
 static bool __initdata opt_msr_sc_hvm = true;
 static int8_t __initdata opt_rsb_pv = -1;
 static bool __initdata opt_rsb_hvm = true;
-static int8_t __read_mostly opt_md_clear_pv = -1;
-static int8_t __read_mostly opt_md_clear_hvm = -1;
+static int8_t __read_mostly opt_verw_pv = -1;
+static int8_t __read_mostly opt_verw_hvm = -1;
 
 static int8_t __read_mostly opt_ibpb_entry_pv = -1;
 static int8_t __read_mostly opt_ibpb_entry_hvm = -1;
@@ -77,7 +77,7 @@ static bool __initdata cpu_has_bug_mds; /* Any other M{LP,SB,FB}DS combination.
 
 static int8_t __initdata opt_srb_lock = -1;
 static bool __initdata opt_unpriv_mmio;
-static bool __read_mostly opt_fb_clear_mmio;
+static bool __read_mostly opt_verw_mmio;
 static int8_t __initdata opt_gds_mit = -1;
 static int8_t __initdata opt_div_scrub = -1;
 
@@ -119,8 +119,8 @@ static int __init parse_spec_ctrl(const char *s)
         disable_common:
             opt_rsb_pv = false;
             opt_rsb_hvm = false;
-            opt_md_clear_pv = 0;
-            opt_md_clear_hvm = 0;
+            opt_verw_pv = 0;
+            opt_verw_hvm = 0;
             opt_ibpb_entry_pv = 0;
             opt_ibpb_entry_hvm = 0;
             opt_ibpb_entry_dom0 = false;
@@ -151,14 +151,14 @@ static int __init parse_spec_ctrl(const char *s)
         {
             opt_msr_sc_pv = val;
             opt_rsb_pv = val;
-            opt_md_clear_pv = val;
+            opt_verw_pv = val;
             opt_ibpb_entry_pv = val;
         }
         else if ( (val = parse_boolean("hvm", s, ss)) >= 0 )
         {
             opt_msr_sc_hvm = val;
             opt_rsb_hvm = val;
-            opt_md_clear_hvm = val;
+            opt_verw_hvm = val;
             opt_ibpb_entry_hvm = val;
         }
         else if ( (val = parse_boolean("msr-sc", s, ss)) != -1 )
@@ -203,21 +203,22 @@ static int __init parse_spec_ctrl(const char *s)
                 break;
             }
         }
-        else if ( (val = parse_boolean("md-clear", s, ss)) != -1 )
+        else if ( (val = parse_boolean("verw", s, ss)) != -1 ||
+                  (val = parse_boolean("md-clear", s, ss)) != -1 )
         {
             switch ( val )
             {
             case 0:
             case 1:
-                opt_md_clear_pv = opt_md_clear_hvm = val;
+                opt_verw_pv = opt_verw_hvm = val;
                 break;
 
             case -2:
-                s += strlen("md-clear=");
+                s += (*s == 'v') ? strlen("verw=") : strlen("md-clear=");
                 if ( (val = parse_boolean("pv", s, ss)) >= 0 )
-                    opt_md_clear_pv = val;
+                    opt_verw_pv = val;
                 else if ( (val = parse_boolean("hvm", s, ss)) >= 0 )
-                    opt_md_clear_hvm = val;
+                    opt_verw_hvm = val;
                 else
             default:
                     rc = -EINVAL;
@@ -512,8 +513,8 @@ static void __init print_details(enum ind_thunk thunk)
            opt_srb_lock                              ? " SRB_LOCK+" : " SRB_LOCK-",
            opt_ibpb_ctxt_switch                      ? " IBPB-ctxt" : "",
            opt_l1d_flush                             ? " L1D_FLUSH" : "",
-           opt_md_clear_pv || opt_md_clear_hvm ||
-           opt_fb_clear_mmio                         ? " VERW"  : "",
+           opt_verw_pv || opt_verw_hvm ||
+           opt_verw_mmio                             ? " VERW"  : "",
            opt_div_scrub                             ? " DIV" : "",
            opt_branch_harden                         ? " BRANCH_HARDEN" : "");
 
@@ -533,11 +534,11 @@ static void __init print_details(enum ind_thunk thunk)
            (boot_cpu_has(X86_FEATURE_SC_MSR_HVM) ||
             boot_cpu_has(X86_FEATURE_SC_RSB_HVM) ||
             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM) ||
-            opt_eager_fpu || opt_md_clear_hvm)       ? ""               : " None",
+            opt_eager_fpu || opt_verw_hvm)           ? ""               : " None",
            boot_cpu_has(X86_FEATURE_SC_MSR_HVM)      ? " MSR_SPEC_CTRL" : "",
            boot_cpu_has(X86_FEATURE_SC_RSB_HVM)      ? " RSB"           : "",
            opt_eager_fpu                             ? " EAGER_FPU"     : "",
-           opt_md_clear_hvm                          ? " MD_CLEAR"      : "",
+           opt_verw_hvm                              ? " VERW"          : "",
            boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM)  ? " IBPB-entry"    : "");
 
 #endif
@@ -546,11 +547,11 @@ static void __init print_details(enum ind_thunk thunk)
            (boot_cpu_has(X86_FEATURE_SC_MSR_PV) ||
             boot_cpu_has(X86_FEATURE_SC_RSB_PV) ||
             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) ||
-            opt_eager_fpu || opt_md_clear_pv)        ? ""               : " None",
+            opt_eager_fpu || opt_verw_pv)            ? ""               : " None",
            boot_cpu_has(X86_FEATURE_SC_MSR_PV)       ? " MSR_SPEC_CTRL" : "",
            boot_cpu_has(X86_FEATURE_SC_RSB_PV)       ? " RSB"           : "",
            opt_eager_fpu                             ? " EAGER_FPU"     : "",
-           opt_md_clear_pv                           ? " MD_CLEAR"      : "",
+           opt_verw_pv                               ? " VERW"          : "",
            boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV)   ? " IBPB-entry"    : "");
 
     printk("  XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n",
@@ -1479,8 +1480,8 @@ void spec_ctrl_init_domain(struct domain *d)
 {
     bool pv = is_pv_domain(d);
 
-    bool verw = ((pv ? opt_md_clear_pv : opt_md_clear_hvm) ||
-                 (opt_fb_clear_mmio && is_iommu_enabled(d)));
+    bool verw = ((pv ? opt_verw_pv : opt_verw_hvm) ||
+                 (opt_verw_mmio && is_iommu_enabled(d)));
 
     bool ibpb = ((pv ? opt_ibpb_entry_pv : opt_ibpb_entry_hvm) &&
                  (d->domain_id != 0 || opt_ibpb_entry_dom0));
@@ -1838,19 +1839,20 @@ void __init init_speculation_mitigations(void)
      * the return-to-guest path.
      */
     if ( opt_unpriv_mmio )
-        opt_fb_clear_mmio = cpu_has_fb_clear;
+        opt_verw_mmio = cpu_has_fb_clear;
 
     /*
      * By default, enable PV and HVM mitigations on MDS-vulnerable hardware.
      * This will only be a token effort for MLPDS/MFBDS when HT is enabled,
      * but it is somewhat better than nothing.
      */
-    if ( opt_md_clear_pv == -1 )
-        opt_md_clear_pv = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                           boot_cpu_has(X86_FEATURE_MD_CLEAR));
-    if ( opt_md_clear_hvm == -1 )
-        opt_md_clear_hvm = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                            boot_cpu_has(X86_FEATURE_MD_CLEAR));
+    if ( opt_verw_pv == -1 )
+        opt_verw_pv = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
+                       cpu_has_md_clear);
+
+    if ( opt_verw_hvm == -1 )
+        opt_verw_hvm = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
+                        cpu_has_md_clear);
 
     /*
      * Enable MDS/MMIO defences as applicable.  The Idle blocks need using if
@@ -1863,12 +1865,12 @@ void __init init_speculation_mitigations(void)
      * MDS mitigations.  L1D_FLUSH is not safe for MMIO mitigations.)
      *
      * After calculating the appropriate idle setting, simplify
-     * opt_md_clear_hvm to mean just "should we VERW on the way into HVM
+     * opt_verw_hvm to mean just "should we VERW on the way into HVM
      * guests", so spec_ctrl_init_domain() can calculate suitable settings.
      */
-    if ( opt_md_clear_pv || opt_md_clear_hvm || opt_fb_clear_mmio )
+    if ( opt_verw_pv || opt_verw_hvm || opt_verw_mmio )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
-    opt_md_clear_hvm &= !cpu_has_skip_l1dfl && !opt_l1d_flush;
+    opt_verw_hvm &= !cpu_has_skip_l1dfl && !opt_l1d_flush;
 
     /*
      * Warn the user if they are on MLPDS/MFBDS-vulnerable hardware with HT
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: VERW-handling adjustments

... before we add yet more complexity to this logic.  Mostly expanded
comments, but with three minor changes.

1) Introduce cpu_has_useful_md_clear to simplify later logic in this patch and
   future ones.

2) We only ever need SC_VERW_IDLE when SMT is active.  If SMT isn't active,
   then there's no re-partition of pipeline resources based on thread-idleness
   to worry about.

3) The logic to adjust HVM VERW based on L1D_FLUSH is unmaintainable and, as
   it turns out, wrong.  SKIP_L1DFL is just a hint bit, whereas opt_l1d_flush
   is the relevant decision of whether to use L1D_FLUSH based on
   susceptibility and user preference.

   Rewrite the logic so it can be followed, and incorporate the fact that when
   FB_CLEAR is visible, L1D_FLUSH isn't a safe substitution.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 1eb91a8a06230b4b64228c9a380194f8cfe6c5e2)
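
As a condensed C sketch of the resulting decision rules (hypothetical
helpers, not the real code; the actual logic lives in
init_speculation_mitigations()):

    #include <stdbool.h>

    /* MD_CLEAR stays enumerated even when its scrubbing side effects were
     * removed, so derive whether we believe it still does anything. */
    static bool useful_md_clear(bool md_clear, bool bug_mds,
                                bool bug_msbds_only)
    {
        return md_clear && (bug_mds || bug_msbds_only);
    }

    /* L1D_FLUSH may stand in for VERW on the HVM path only when MD_CLEAR
     * is the sole *_CLEAR enumeration visible; FB_CLEAR scrubbing is not
     * implied by L1D_FLUSH. */
    static bool l1d_flush_covers_verw(bool l1d_flush, bool md_clear,
                                      bool fb_clear)
    {
        return l1d_flush && md_clear && !fb_clear;
    }

    int main(void)
    {
        bool useful = useful_md_clear(true, true, false);      /* true  */
        bool covers = l1d_flush_covers_verw(true, true, true); /* false */
        return !(useful && !covers);                           /* 0, ok */
    }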

diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 292b5b1c7ba1..2e80e0871642 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -1496,7 +1496,7 @@ void __init init_speculation_mitigations(void)
 {
     enum ind_thunk thunk = THUNK_DEFAULT;
     bool has_spec_ctrl, ibrs = false, hw_smt_enabled;
-    bool cpu_has_bug_taa, retpoline_safe;
+    bool cpu_has_bug_taa, cpu_has_useful_md_clear, retpoline_safe;
 
     hw_smt_enabled = check_smt_enabled();
 
@@ -1827,50 +1827,97 @@ void __init init_speculation_mitigations(void)
             "enabled.  Please assess your configuration and choose an\n"
             "explicit 'smt=<bool>' setting.  See XSA-273.\n");
 
+    /*
+     * A brief summary of VERW-related changes.
+     *
+     * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-analysis-microarchitectural-data-sampling.html
+     * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
+     *
+     * Relevant ucodes:
+     *
+     * - May 2019, for MDS.  Introduces the MD_CLEAR CPUID bit and VERW side
+     *   effects to scrub Store/Load/Fill buffers as applicable.  MD_CLEAR
+     *   exists architecturally, even when the side effects have been removed.
+     *
+     *   Use VERW to scrub on return-to-guest.  Parts with L1D_FLUSH to
+     *   mitigate L1TF have the same side effect, so no need to do both.
+     *
+     *   Various Atoms suffer from Store-buffer sampling only.  Store buffers
+     *   are statically partitioned between non-idle threads, so scrubbing is
+     *   wanted when going idle too.
+     *
+     *   Load ports and Fill buffers are competitively shared between threads.
+     *   SMT must be disabled for VERW scrubbing to be fully effective.
+     *
+     * - November 2019, for TAA.  Extended VERW side effects to TSX-enabled
+     *   MDS_NO parts.
+     *
+     * - February 2022, for Client TSX de-feature.  Removed VERW side effects
+     *   from Client CPUs only.
+     *
+     * - May 2022, for MMIO Stale Data.  (Re)introduced Fill Buffer scrubbing
+     *   on all MMIO-affected parts which didn't already have it for MDS
+     *   reasons, enumerating FB_CLEAR on those parts only.
+     *
+     *   If FB_CLEAR is enumerated, L1D_FLUSH does not have the same scrubbing
+     *   side effects as VERW and cannot be used in its place.
+     */
     mds_calculations();
 
     /*
-     * Parts which enumerate FB_CLEAR are those which are post-MDS_NO and have
-     * reintroduced the VERW fill buffer flushing side effect because of a
-     * susceptibility to FBSDP.
+     * Parts which enumerate FB_CLEAR are those with now-updated microcode
+     * which weren't susceptible to the original MFBDS (and therefore didn't
+     * have Fill Buffer scrubbing side effects to begin with, or were Client
+     * MDS_NO non-TAA_NO parts where the scrubbing was removed), but have had
+     * the scrubbing reintroduced because of a susceptibility to FBSDP.
      *
      * If unprivileged guests have (or will have) MMIO mappings, we can
      * mitigate cross-domain leakage of fill buffer data by issuing VERW on
-     * the return-to-guest path.
+     * the return-to-guest path.  This is only a token effort if SMT is
+     * active.
      */
     if ( opt_unpriv_mmio )
         opt_verw_mmio = cpu_has_fb_clear;
 
     /*
-     * By default, enable PV and HVM mitigations on MDS-vulnerable hardware.
-     * This will only be a token effort for MLPDS/MFBDS when HT is enabled,
-     * but it is somewhat better than nothing.
+     * MD_CLEAR is enumerated architecturally forevermore, even after the
+     * scrubbing side effects have been removed.  Create ourselves a version
+     * which expresses whether we think MD_CLEAR is having any useful side
+     * effect.
+     */
+    cpu_has_useful_md_clear = (cpu_has_md_clear &&
+                               (cpu_has_bug_mds || cpu_has_bug_msbds_only));
+
+    /*
+     * By default, use VERW scrubbing on applicable hardware, if we think it's
+     * going to have an effect.  This will only be a token effort for
+     * MLPDS/MFBDS when SMT is enabled.
      */
     if ( opt_verw_pv == -1 )
-        opt_verw_pv = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                       cpu_has_md_clear);
+        opt_verw_pv = cpu_has_useful_md_clear;
 
     if ( opt_verw_hvm == -1 )
-        opt_verw_hvm = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                        cpu_has_md_clear);
+        opt_verw_hvm = cpu_has_useful_md_clear;
 
     /*
-     * Enable MDS/MMIO defences as applicable.  The Idle blocks need using if
-     * either the PV or HVM MDS defences are used, or if we may give MMIO
-     * access to untrusted guests.
-     *
-     * HVM is more complicated.  The MD_CLEAR microcode extends L1D_FLUSH with
-     * equivalent semantics to avoid needing to perform both flushes on the
-     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH (for
-     * MDS mitigations.  L1D_FLUSH is not safe for MMIO mitigations.)
-     *
-     * After calculating the appropriate idle setting, simplify
-     * opt_verw_hvm to mean just "should we VERW on the way into HVM
-     * guests", so spec_ctrl_init_domain() can calculate suitable settings.
+     * If SMT is active, and we're protecting against MDS or MMIO stale data,
+     * we need to scrub before going idle as well as on return to guest.
+     * Various pipeline resources are repartitioned amongst non-idle threads.
      */
-    if ( opt_verw_pv || opt_verw_hvm || opt_verw_mmio )
+    if ( ((cpu_has_useful_md_clear && (opt_verw_pv || opt_verw_hvm)) ||
+          opt_verw_mmio) && hw_smt_enabled )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
-    opt_verw_hvm &= !cpu_has_skip_l1dfl && !opt_l1d_flush;
+
+    /*
+     * After calculating the appropriate idle setting, simplify opt_verw_hvm
+     * to mean just "should we VERW on the way into HVM guests", so
+     * spec_ctrl_init_domain() can calculate suitable settings.
+     *
+     * It is only safe to use L1D_FLUSH in place of VERW when MD_CLEAR is the
+     * only *_CLEAR we can see.
+     */
+    if ( opt_l1d_flush && cpu_has_md_clear && !cpu_has_fb_clear )
+        opt_verw_hvm = false;
 
     /*
      * Warn the user if they are on MLPDS/MFBDS-vulnerable hardware with HT
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Mitigation Register File Data Sampling

RFDS affects Atom cores, also branded E-cores, between the Goldmont and
Gracemont microarchitectures.  This includes Alder Lake and Raptor Lake hybrid
client systems which have a mix of Gracemont and other types of cores.

Two new bits have been defined: RFDS_CLEAR to indicate that VERW has more side
effects, and RFDS_NO to indicate that the system is unaffected.  Plenty of
unaffected CPUs won't be getting RFDS_NO retrofitted in microcode, so we
synthesise it.  Alder Lake and Raptor Lake Xeon-E parts are unaffected due to
their platform configuration, and we must use the Hybrid CPUID bit to
distinguish them from their non-Xeon counterparts.

Like MD_CLEAR and FB_CLEAR, RFDS_CLEAR needs OR-ing across a resource pool, so
set it in the max policies and reflect the host setting in default.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit fb5b6f6744713410c74cfc12b7176c108e3c9a31)
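
As a sketch of the model-number handling (a hypothetical helper, not the
rfds_calculations() added below; model numbers as per intel-family.h):

    #include <stdbool.h>

    /* May RFDS_NO be synthesised on a part enumerating neither RFDS_NO
     * nor RFDS_CLEAR? */
    static bool can_synth_rfds_no(unsigned int model, bool hybrid)
    {
        switch ( model )
        {
        case 0x97: case 0xB7:    /* ALDERLAKE / RAPTORLAKE */
            /* The same model number covers client (hybrid, Gracemont
             * active, vulnerable) and Xeon-E (E-cores disabled by
             * platform configuration, safe) SKUs. */
            return !hybrid;

        case 0x9A: case 0xBA: case 0xBF: /* other hybrid client parts */
        case 0x5C: case 0x5F: case 0x7A: /* Goldmont(+) Atoms */
        case 0x86: case 0x96: case 0x9C: /* Tremont Atoms */
        case 0xBE:                       /* Gracemont, Alder Lake N */
            return false;                /* affected, or might be */

        default:
            return true;                 /* not known to be affected */
        }
    }

    int main(void)
    {
        /* Alder Lake: hybrid client SKU no, Xeon-E SKU yes. */
        return !(!can_synth_rfds_no(0x97, true) &&
                  can_synth_rfds_no(0x97, false));
    }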

diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
index c55a6e767809..0c792679e594 100644
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -170,7 +170,7 @@ static const char *const str_7d0[32] =
     [ 8] = "avx512-vp2intersect", [ 9] = "srbds-ctrl",
     [10] = "md-clear",            [11] = "rtm-always-abort",
     /* 12 */                [13] = "tsx-force-abort",
-    [14] = "serialize",
+    [14] = "serialize",     [15] = "hybrid",
     [16] = "tsxldtrk",
     [18] = "pconfig",
     [20] = "cet-ibt",
@@ -230,7 +230,8 @@ static const char *const str_m10Al[32] =
     [20] = "bhi-no",              [21] = "xapic-status",
     /* 22 */                      [23] = "ovrclk-status",
     [24] = "pbrsb-no",            [25] = "gds-ctrl",
-    [26] = "gds-no",
+    [26] = "gds-no",              [27] = "rfds-no",
+    [28] = "rfds-clear",
 };
 
 static const char *const str_m10Ah[32] =
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 34f778dbafbb..c872afda3e2b 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -443,6 +443,7 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
          */
         __set_bit(X86_FEATURE_MD_CLEAR, fs);
         __set_bit(X86_FEATURE_FB_CLEAR, fs);
+        __set_bit(X86_FEATURE_RFDS_CLEAR, fs);
 
         /*
          * The Gather Data Sampling microcode mitigation (August 2023) has an
@@ -492,6 +493,10 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
         if ( cpu_has_fb_clear )
             __set_bit(X86_FEATURE_FB_CLEAR, fs);
 
+        __clear_bit(X86_FEATURE_RFDS_CLEAR, fs);
+        if ( cpu_has_rfds_clear )
+            __set_bit(X86_FEATURE_RFDS_CLEAR, fs);
+
         /*
          * The Gather Data Sampling microcode mitigation (August 2023) has an
          * adverse performance impact on the CLWB instruction on SKX/CLX/CPX.
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 2e80e0871642..24bf98a018a0 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -432,7 +432,7 @@ static void __init print_details(enum ind_thunk thunk)
      * Hardware read-only information, stating immunity to certain issues, or
      * suggestions of which mitigation to use.
      */
-    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (caps & ARCH_CAPS_RDCL_NO)                        ? " RDCL_NO"        : "",
            (caps & ARCH_CAPS_EIBRS)                          ? " EIBRS"          : "",
            (caps & ARCH_CAPS_RSBA)                           ? " RSBA"           : "",
@@ -448,6 +448,7 @@ static void __init print_details(enum ind_thunk thunk)
            (caps & ARCH_CAPS_FB_CLEAR)                       ? " FB_CLEAR"       : "",
            (caps & ARCH_CAPS_PBRSB_NO)                       ? " PBRSB_NO"       : "",
            (caps & ARCH_CAPS_GDS_NO)                         ? " GDS_NO"         : "",
+           (caps & ARCH_CAPS_RFDS_NO)                        ? " RFDS_NO"        : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_ALWAYS))    ? " IBRS_ALWAYS"    : "",
            (e8b  & cpufeat_mask(X86_FEATURE_STIBP_ALWAYS))   ? " STIBP_ALWAYS"   : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_FAST))      ? " IBRS_FAST"      : "",
@@ -458,7 +459,7 @@ static void __init print_details(enum ind_thunk thunk)
            (e21a & cpufeat_mask(X86_FEATURE_SRSO_NO))        ? " SRSO_NO"        : "");
 
     /* Hardware features which need driving to mitigate issues. */
-    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (e8b  & cpufeat_mask(X86_FEATURE_IBPB)) ||
            (_7d0 & cpufeat_mask(X86_FEATURE_IBRSB))          ? " IBPB"           : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS)) ||
@@ -476,6 +477,7 @@ static void __init print_details(enum ind_thunk thunk)
            (caps & ARCH_CAPS_TSX_CTRL)                       ? " TSX_CTRL"       : "",
            (caps & ARCH_CAPS_FB_CLEAR_CTRL)                  ? " FB_CLEAR_CTRL"  : "",
            (caps & ARCH_CAPS_GDS_CTRL)                       ? " GDS_CTRL"       : "",
+           (caps & ARCH_CAPS_RFDS_CLEAR)                     ? " RFDS_CLEAR"     : "",
            (e21a & cpufeat_mask(X86_FEATURE_SBPB))           ? " SBPB"           : "");
 
     /* Compiled-in support which pertains to mitigations. */
@@ -1324,6 +1326,83 @@ static __init void mds_calculations(void)
     }
 }
 
+/*
+ * Register File Data Sampling affects Atom cores from the Goldmont to
+ * Gracemont microarchitectures.  The March 2024 microcode adds RFDS_NO to
+ * some but not all unaffected parts, and RFDS_CLEAR to affected parts still
+ * in support.
+ *
+ * Alder Lake and Raptor Lake client CPUs have a mix of P cores
+ * (Golden/Raptor Cove, not vulnerable) and E cores (Gracemont,
+ * vulnerable), and both enumerate RFDS_CLEAR.
+ *
+ * Both also exist as Xeon SKUs, which have the E cores (Gracemont) disabled
+ * by platform configuration and which enumerate RFDS_NO.
+ *
+ * With older parts, or with out-of-date microcode, synthesise RFDS_NO when
+ * safe to do so.
+ *
+ * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/register-file-data-sampling.html
+ */
+static void __init rfds_calculations(void)
+{
+    /* RFDS is only known to affect Intel Family 6 processors at this time. */
+    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+         boot_cpu_data.x86 != 6 )
+        return;
+
+    /*
+     * If RFDS_NO or RFDS_CLEAR are visible, we've either got suitable
+     * microcode, or an RFDS-aware hypervisor is levelling us in a pool.
+     */
+    if ( cpu_has_rfds_no || cpu_has_rfds_clear )
+        return;
+
+    /* If we're virtualised, don't attempt to synthesise RFDS_NO. */
+    if ( cpu_has_hypervisor )
+        return;
+
+    /*
+     * Not all CPUs are expected to get a microcode update enumerating one of
+     * RFDS_{NO,CLEAR}, or we might have out-of-date microcode.
+     */
+    switch ( boot_cpu_data.x86_model )
+    {
+    case 0x97: /* INTEL_FAM6_ALDERLAKE */
+    case 0xB7: /* INTEL_FAM6_RAPTORLAKE */
+        /*
+         * Alder Lake and Raptor Lake might be a client SKU (with the
+         * Gracemont cores active, and therefore vulnerable) or might be a
+         * server SKU (with the Gracemont cores disabled, and therefore not
+         * vulnerable).
+         *
+         * See if the CPU identifies as hybrid to distinguish the two cases.
+         */
+        if ( !cpu_has_hybrid )
+            break;
+        /* fallthrough */
+    case 0x9A: /* INTEL_FAM6_ALDERLAKE_L */
+    case 0xBA: /* INTEL_FAM6_RAPTORLAKE_P */
+    case 0xBF: /* INTEL_FAM6_RAPTORLAKE_S */
+
+    case 0x5C: /* INTEL_FAM6_ATOM_GOLDMONT */      /* Apollo Lake */
+    case 0x5F: /* INTEL_FAM6_ATOM_GOLDMONT_D */    /* Denverton */
+    case 0x7A: /* INTEL_FAM6_ATOM_GOLDMONT_PLUS */ /* Gemini Lake */
+    case 0x86: /* INTEL_FAM6_ATOM_TREMONT_D */     /* Snow Ridge / Parker Ridge */
+    case 0x96: /* INTEL_FAM6_ATOM_TREMONT */       /* Elkhart Lake */
+    case 0x9C: /* INTEL_FAM6_ATOM_TREMONT_L */     /* Jasper Lake */
+    case 0xBE: /* INTEL_FAM6_ATOM_GRACEMONT */     /* Alder Lake N */
+        return;
+    }
+
+    /*
+     * We appear to be on an unaffected CPU which didn't enumerate RFDS_NO,
+     * perhaps because of its age or because of out-of-date microcode.
+     * Synthesise it.
+     */
+    setup_force_cpu_cap(X86_FEATURE_RFDS_NO);
+}
+
 static bool __init cpu_has_gds(void)
 {
     /*
@@ -1832,6 +1911,7 @@ void __init init_speculation_mitigations(void)
      *
      * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-analysis-microarchitectural-data-sampling.html
      * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
+     * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/register-file-data-sampling.html
      *
      * Relevant ucodes:
      *
@@ -1861,8 +1941,12 @@ void __init init_speculation_mitigations(void)
      *
      *   If FB_CLEAR is enumerated, L1D_FLUSH does not have the same scrubbing
      *   side effects as VERW and cannot be used in its place.
+     *
+     * - March 2024, for RFDS.  Enumerate RFDS_CLEAR to mean that VERW now
+     *   scrubs non-architectural entries from certain register files.
      */
     mds_calculations();
+    rfds_calculations();
 
     /*
      * Parts which enumerate FB_CLEAR are those with now-updated microcode
@@ -1894,15 +1978,19 @@ void __init init_speculation_mitigations(void)
      * MLPDS/MFBDS when SMT is enabled.
      */
     if ( opt_verw_pv == -1 )
-        opt_verw_pv = cpu_has_useful_md_clear;
+        opt_verw_pv = cpu_has_useful_md_clear || cpu_has_rfds_clear;
 
     if ( opt_verw_hvm == -1 )
-        opt_verw_hvm = cpu_has_useful_md_clear;
+        opt_verw_hvm = cpu_has_useful_md_clear || cpu_has_rfds_clear;
 
     /*
      * If SMT is active, and we're protecting against MDS or MMIO stale data,
      * we need to scrub before going idle as well as on return to guest.
      * Various pipeline resources are repartitioned amongst non-idle threads.
+     *
+     * We don't need to scrub on idle for RFDS.  There are no affected cores
+     * which support SMT, despite there being affected cores in hybrid systems
+     * which have SMT elsewhere in the platform.
      */
     if ( ((cpu_has_useful_md_clear && (opt_verw_pv || opt_verw_hvm)) ||
           opt_verw_mmio) && hw_smt_enabled )
@@ -1916,7 +2004,8 @@ void __init init_speculation_mitigations(void)
      * It is only safe to use L1D_FLUSH in place of VERW when MD_CLEAR is the
      * only *_CLEAR we can see.
      */
-    if ( opt_l1d_flush && cpu_has_md_clear && !cpu_has_fb_clear )
+    if ( opt_l1d_flush && cpu_has_md_clear && !cpu_has_fb_clear &&
+         !cpu_has_rfds_clear )
         opt_verw_hvm = false;
 
     /*
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index 81ac4d76eea6..1869732bcb9b 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -138,6 +138,7 @@
 #define cpu_has_rtm_always_abort boot_cpu_has(X86_FEATURE_RTM_ALWAYS_ABORT)
 #define cpu_has_tsx_force_abort boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)
 #define cpu_has_serialize       boot_cpu_has(X86_FEATURE_SERIALIZE)
+#define cpu_has_hybrid          boot_cpu_has(X86_FEATURE_HYBRID)
 #define cpu_has_arch_caps       boot_cpu_has(X86_FEATURE_ARCH_CAPS)
 
 /* CPUID level 0x00000007:1.eax */
@@ -157,6 +158,8 @@
 #define cpu_has_rrsba           boot_cpu_has(X86_FEATURE_RRSBA)
 #define cpu_has_gds_ctrl        boot_cpu_has(X86_FEATURE_GDS_CTRL)
 #define cpu_has_gds_no          boot_cpu_has(X86_FEATURE_GDS_NO)
+#define cpu_has_rfds_no         boot_cpu_has(X86_FEATURE_RFDS_NO)
+#define cpu_has_rfds_clear      boot_cpu_has(X86_FEATURE_RFDS_CLEAR)
 
 /* Synthesized. */
 #define cpu_has_arch_perfmon    boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index 8251b8258b79..eb6295d8a7a4 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -77,6 +77,8 @@
 #define  ARCH_CAPS_PBRSB_NO                 (_AC(1, ULL) << 24)
 #define  ARCH_CAPS_GDS_CTRL                 (_AC(1, ULL) << 25)
 #define  ARCH_CAPS_GDS_NO                   (_AC(1, ULL) << 26)
+#define  ARCH_CAPS_RFDS_NO                  (_AC(1, ULL) << 27)
+#define  ARCH_CAPS_RFDS_CLEAR               (_AC(1, ULL) << 28)
 
 #define MSR_FLUSH_CMD                       0x0000010b
 #define  FLUSH_CMD_L1D                      (_AC(1, ULL) <<  0)
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index 2906eaa6c290..7a9d8d05d3fb 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -279,6 +279,7 @@ XEN_CPUFEATURE(MD_CLEAR,      9*32+10) /*!A VERW clears microarchitectural buffe
 XEN_CPUFEATURE(RTM_ALWAYS_ABORT, 9*32+11) /*! June 2021 TSX defeaturing in microcode. */
 XEN_CPUFEATURE(TSX_FORCE_ABORT, 9*32+13) /* MSR_TSX_FORCE_ABORT.RTM_ABORT */
 XEN_CPUFEATURE(SERIALIZE,     9*32+14) /*a  SERIALIZE insn */
+XEN_CPUFEATURE(HYBRID,        9*32+15) /*   Heterogeneous platform */
 XEN_CPUFEATURE(TSXLDTRK,      9*32+16) /*a  TSX load tracking suspend/resume insns */
 XEN_CPUFEATURE(CET_IBT,       9*32+20) /*   CET - Indirect Branch Tracking */
 XEN_CPUFEATURE(IBRSB,         9*32+26) /*A  IBRS and IBPB support (used by Intel) */
@@ -338,6 +339,8 @@ XEN_CPUFEATURE(OVRCLK_STATUS,      16*32+23) /*   MSR_OVERCLOCKING_STATUS */
 XEN_CPUFEATURE(PBRSB_NO,           16*32+24) /*A  No Post-Barrier RSB predictions */
 XEN_CPUFEATURE(GDS_CTRL,           16*32+25) /*   MCU_OPT_CTRL.GDS_MIT_{DIS,LOCK} */
 XEN_CPUFEATURE(GDS_NO,             16*32+26) /*A  No Gather Data Sampling */
+XEN_CPUFEATURE(RFDS_NO,            16*32+27) /*A  No Register File Data Sampling */
+XEN_CPUFEATURE(RFDS_CLEAR,         16*32+28) /*!A Register File(s) cleared by VERW */
 
 /* Intel-defined CPU features, MSR_ARCH_CAPS 0x10a.edx, word 17 */
 
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/entry: Introduce EFRAME_* constants

restore_all_guest() does a lot of manipulation of the stack after popping the
GPRs, and uses raw %rsp displacements to do so.  Also, almost all entry
paths use raw %rsp displacements prior to pushing GPRs.

Provide better mnemonics, to aid readability and reduce the chance of errors
when editing.

No functional change.  The resulting binary is identical.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 37541208f119a9c552c6c6c3246ea61be0d44035)
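
In C terms, each EFRAME_* constant is a field's offset within struct
cpu_user_regs, rebased so that error_code sits at zero.  A minimal sketch
with a hypothetical stand-in structure (field widths simplified; cs is
really a u16 plus padding in the real struct):

    #include <stddef.h>
    #include <stdio.h>

    struct eframe {                /* stand-in for the tail of
                                    * struct cpu_user_regs */
        unsigned int error_code;
        unsigned int entry_vector; /* hence the old 4(%rsp) */
        unsigned long rip;
        unsigned long cs;
        unsigned long eflags;
        unsigned long rsp;
    };

    #define OFFSET_EF(mem)                          \
        (offsetof(struct eframe, mem) -             \
         offsetof(struct eframe, error_code))

    int main(void)
    {
        printf("%zu %zu %zu\n", OFFSET_EF(entry_vector),
               OFFSET_EF(rip), OFFSET_EF(eflags)); /* 4 8 24 */
        return 0;
    }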

diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 287dac101ad4..31fa63b77fd1 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -51,6 +51,23 @@ void __dummy__(void)
     OFFSET(UREGS_kernel_sizeof, struct cpu_user_regs, es);
     BLANK();
 
+    /*
+     * EFRAME_* is for the entry/exit logic where %rsp is pointing at
+     * UREGS_error_code and GPRs are still/already guest values.
+     */
+#define OFFSET_EF(sym, mem)                                             \
+    DEFINE(sym, offsetof(struct cpu_user_regs, mem) -                   \
+                offsetof(struct cpu_user_regs, error_code))
+
+    OFFSET_EF(EFRAME_entry_vector,    entry_vector);
+    OFFSET_EF(EFRAME_rip,             rip);
+    OFFSET_EF(EFRAME_cs,              cs);
+    OFFSET_EF(EFRAME_eflags,          eflags);
+    OFFSET_EF(EFRAME_rsp,             rsp);
+    BLANK();
+
+#undef OFFSET_EF
+
     OFFSET(VCPU_processor, struct vcpu, processor);
     OFFSET(VCPU_domain, struct vcpu, domain);
     OFFSET(VCPU_vcpu_info, struct vcpu, vcpu_info);
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index 253bb1688c4f..7c211314d885 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -15,7 +15,7 @@ ENTRY(entry_int82)
         ENDBR64
         ALTERNATIVE "", clac, X86_FEATURE_XEN_SMAP
         pushq $0
-        movl  $HYPERCALL_VECTOR, 4(%rsp)
+        movl  $HYPERCALL_VECTOR, EFRAME_entry_vector(%rsp)
         SAVE_ALL compat=1 /* DPL1 gate, restricted to 32bit PV guests only. */
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 585b0c955191..412cbeb3eca4 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -190,15 +190,15 @@ restore_all_guest:
         SPEC_CTRL_EXIT_TO_PV    /* Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */
 
         RESTORE_ALL
-        testw $TRAP_syscall,4(%rsp)
+        testw $TRAP_syscall, EFRAME_entry_vector(%rsp)
         jz    iret_exit_to_guest
 
-        movq  24(%rsp),%r11           # RFLAGS
+        mov   EFRAME_eflags(%rsp), %r11
         andq  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), %r11
         orq   $X86_EFLAGS_IF,%r11
 
         /* Don't use SYSRET path if the return address is not canonical. */
-        movq  8(%rsp),%rcx
+        mov   EFRAME_rip(%rsp), %rcx
         sarq  $47,%rcx
         incl  %ecx
         cmpl  $1,%ecx
@@ -213,20 +213,20 @@ restore_all_guest:
         ALTERNATIVE "", rag_clrssbsy, X86_FEATURE_XEN_SHSTK
 #endif
 
-        movq  8(%rsp), %rcx           # RIP
-        cmpw  $FLAT_USER_CS32,16(%rsp)# CS
-        movq  32(%rsp),%rsp           # RSP
+        mov   EFRAME_rip(%rsp), %rcx
+        cmpw  $FLAT_USER_CS32, EFRAME_cs(%rsp)
+        mov   EFRAME_rsp(%rsp), %rsp
         je    1f
         sysretq
 1:      sysretl
 
         ALIGN
 .Lrestore_rcx_iret_exit_to_guest:
-        movq  8(%rsp), %rcx           # RIP
+        mov   EFRAME_rip(%rsp), %rcx
 /* No special register assumptions. */
 iret_exit_to_guest:
-        andl  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), 24(%rsp)
-        orl   $X86_EFLAGS_IF,24(%rsp)
+        andl  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), EFRAME_eflags(%rsp)
+        orl   $X86_EFLAGS_IF, EFRAME_eflags(%rsp)
         addq  $8,%rsp
 .Lft0:  iretq
         _ASM_PRE_EXTABLE(.Lft0, handle_exception)
@@ -257,7 +257,7 @@ ENTRY(lstar_enter)
         pushq $FLAT_KERNEL_CS64
         pushq %rcx
         pushq $0
-        movl  $TRAP_syscall, 4(%rsp)
+        movl  $TRAP_syscall, EFRAME_entry_vector(%rsp)
         SAVE_ALL
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
@@ -294,7 +294,7 @@ ENTRY(cstar_enter)
         pushq $FLAT_USER_CS32
         pushq %rcx
         pushq $0
-        movl  $TRAP_syscall, 4(%rsp)
+        movl  $TRAP_syscall, EFRAME_entry_vector(%rsp)
         SAVE_ALL
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
@@ -335,7 +335,7 @@ GLOBAL(sysenter_eflags_saved)
         pushq $3 /* ring 3 null cs */
         pushq $0 /* null rip */
         pushq $0
-        movl  $TRAP_syscall, 4(%rsp)
+        movl  $TRAP_syscall, EFRAME_entry_vector(%rsp)
         SAVE_ALL
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
@@ -389,7 +389,7 @@ ENTRY(int80_direct_trap)
         ENDBR64
         ALTERNATIVE "", clac, X86_FEATURE_XEN_SMAP
         pushq $0
-        movl  $0x80, 4(%rsp)
+        movl  $0x80, EFRAME_entry_vector(%rsp)
         SAVE_ALL
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
@@ -649,7 +649,7 @@ ret_from_intr:
         .section .init.text, "ax", @progbits
 ENTRY(early_page_fault)
         ENDBR64
-        movl  $TRAP_page_fault, 4(%rsp)
+        movl  $TRAP_page_fault, EFRAME_entry_vector(%rsp)
         SAVE_ALL
         movq  %rsp, %rdi
         call  do_early_page_fault
@@ -716,7 +716,7 @@ ENTRY(common_interrupt)
 
 ENTRY(page_fault)
         ENDBR64
-        movl  $TRAP_page_fault,4(%rsp)
+        movl  $TRAP_page_fault, EFRAME_entry_vector(%rsp)
 /* No special register assumptions. */
 GLOBAL(handle_exception)
         ALTERNATIVE "", clac, X86_FEATURE_XEN_SMAP
@@ -892,90 +892,90 @@ FATAL_exception_with_ints_disabled:
 ENTRY(divide_error)
         ENDBR64
         pushq $0
-        movl  $TRAP_divide_error,4(%rsp)
+        movl  $TRAP_divide_error, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(coprocessor_error)
         ENDBR64
         pushq $0
-        movl  $TRAP_copro_error,4(%rsp)
+        movl  $TRAP_copro_error, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(simd_coprocessor_error)
         ENDBR64
         pushq $0
-        movl  $TRAP_simd_error,4(%rsp)
+        movl  $TRAP_simd_error, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(device_not_available)
         ENDBR64
         pushq $0
-        movl  $TRAP_no_device,4(%rsp)
+        movl  $TRAP_no_device, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(debug)
         ENDBR64
         pushq $0
-        movl  $TRAP_debug,4(%rsp)
+        movl  $TRAP_debug, EFRAME_entry_vector(%rsp)
         jmp   handle_ist_exception
 
 ENTRY(int3)
         ENDBR64
         pushq $0
-        movl  $TRAP_int3,4(%rsp)
+        movl  $TRAP_int3, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(overflow)
         ENDBR64
         pushq $0
-        movl  $TRAP_overflow,4(%rsp)
+        movl  $TRAP_overflow, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(bounds)
         ENDBR64
         pushq $0
-        movl  $TRAP_bounds,4(%rsp)
+        movl  $TRAP_bounds, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(invalid_op)
         ENDBR64
         pushq $0
-        movl  $TRAP_invalid_op,4(%rsp)
+        movl  $TRAP_invalid_op, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(invalid_TSS)
         ENDBR64
-        movl  $TRAP_invalid_tss,4(%rsp)
+        movl  $TRAP_invalid_tss, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(segment_not_present)
         ENDBR64
-        movl  $TRAP_no_segment,4(%rsp)
+        movl  $TRAP_no_segment, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(stack_segment)
         ENDBR64
-        movl  $TRAP_stack_error,4(%rsp)
+        movl  $TRAP_stack_error, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(general_protection)
         ENDBR64
-        movl  $TRAP_gp_fault,4(%rsp)
+        movl  $TRAP_gp_fault, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(alignment_check)
         ENDBR64
-        movl  $TRAP_alignment_check,4(%rsp)
+        movl  $TRAP_alignment_check, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(entry_CP)
         ENDBR64
-        movl  $X86_EXC_CP, 4(%rsp)
+        movl  $X86_EXC_CP, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(double_fault)
         ENDBR64
-        movl  $TRAP_double_fault,4(%rsp)
+        movl  $TRAP_double_fault, EFRAME_entry_vector(%rsp)
         /* Set AC to reduce chance of further SMAP faults */
         ALTERNATIVE "", stac, X86_FEATURE_XEN_SMAP
         SAVE_ALL
@@ -1001,7 +1001,7 @@ ENTRY(double_fault)
 ENTRY(nmi)
         ENDBR64
         pushq $0
-        movl  $TRAP_nmi,4(%rsp)
+        movl  $TRAP_nmi, EFRAME_entry_vector(%rsp)
 handle_ist_exception:
         ALTERNATIVE "", clac, X86_FEATURE_XEN_SMAP
         SAVE_ALL
@@ -1134,7 +1134,7 @@ handle_ist_exception:
 ENTRY(machine_check)
         ENDBR64
         pushq $0
-        movl  $TRAP_machine_check,4(%rsp)
+        movl  $TRAP_machine_check, EFRAME_entry_vector(%rsp)
         jmp   handle_ist_exception
 
 /* No op trap handler.  Required for kexec crash path. */
@@ -1171,7 +1171,7 @@ autogen_stubs: /* Automatically generated stubs. */
 1:
         ENDBR64
         pushq $0
-        movb  $vec,4(%rsp)
+        movb  $vec, EFRAME_entry_vector(%rsp)
         jmp   common_interrupt
 
         entrypoint 1b
@@ -1185,7 +1185,7 @@ autogen_stubs: /* Automatically generated stubs. */
         test  $8,%spl        /* 64bit exception frames are 16 byte aligned, but the word */
         jz    2f             /* size is 8 bytes.  Check whether the processor gave us an */
         pushq $0             /* error code, and insert an empty one if not.              */
-2:      movb  $vec,4(%rsp)
+2:      movb  $vec, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
         entrypoint 1b
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86: Resync intel-family.h from Linux

From v6.8-rc6

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 195e75371b13c4f7ecdf7b5c50aed0d02f2d7ce8)

diff --git a/xen/arch/x86/include/asm/intel-family.h b/xen/arch/x86/include/asm/intel-family.h
index ffc49151befe..b65e9c46b922 100644
--- a/xen/arch/x86/include/asm/intel-family.h
+++ b/xen/arch/x86/include/asm/intel-family.h
@@ -26,6 +26,9 @@
  *		_G	- parts with extra graphics on
  *		_X	- regular server parts
  *		_D	- micro server parts
+ *		_N,_P	- other mobile parts
+ *		_H	- premium mobile parts
+ *		_S	- other client parts
  *
  *		Historical OPTDIFFs:
  *
@@ -37,6 +40,9 @@
  * their own names :-(
  */
 
+/* Wildcard match for FAM6 so X86_MATCH_INTEL_FAM6_MODEL(ANY) works */
+#define INTEL_FAM6_ANY			X86_MODEL_ANY
+
 #define INTEL_FAM6_CORE_YONAH		0x0E
 
 #define INTEL_FAM6_CORE2_MEROM		0x0F
@@ -93,8 +99,6 @@
 #define INTEL_FAM6_ICELAKE_L		0x7E	/* Sunny Cove */
 #define INTEL_FAM6_ICELAKE_NNPI		0x9D	/* Sunny Cove */
 
-#define INTEL_FAM6_LAKEFIELD		0x8A	/* Sunny Cove / Tremont */
-
 #define INTEL_FAM6_ROCKETLAKE		0xA7	/* Cypress Cove */
 
 #define INTEL_FAM6_TIGERLAKE_L		0x8C	/* Willow Cove */
@@ -102,12 +106,31 @@
 
 #define INTEL_FAM6_SAPPHIRERAPIDS_X	0x8F	/* Golden Cove */
 
+#define INTEL_FAM6_EMERALDRAPIDS_X	0xCF
+
+#define INTEL_FAM6_GRANITERAPIDS_X	0xAD
+#define INTEL_FAM6_GRANITERAPIDS_D	0xAE
+
+/* "Hybrid" Processors (P-Core/E-Core) */
+
+#define INTEL_FAM6_LAKEFIELD		0x8A	/* Sunny Cove / Tremont */
+
 #define INTEL_FAM6_ALDERLAKE		0x97	/* Golden Cove / Gracemont */
 #define INTEL_FAM6_ALDERLAKE_L		0x9A	/* Golden Cove / Gracemont */
 
-#define INTEL_FAM6_RAPTORLAKE		0xB7
+#define INTEL_FAM6_RAPTORLAKE		0xB7	/* Raptor Cove / Enhanced Gracemont */
+#define INTEL_FAM6_RAPTORLAKE_P		0xBA
+#define INTEL_FAM6_RAPTORLAKE_S		0xBF
+
+#define INTEL_FAM6_METEORLAKE		0xAC
+#define INTEL_FAM6_METEORLAKE_L		0xAA
+
+#define INTEL_FAM6_ARROWLAKE_H		0xC5
+#define INTEL_FAM6_ARROWLAKE		0xC6
+
+#define INTEL_FAM6_LUNARLAKE_M		0xBD
 
-/* "Small Core" Processors (Atom) */
+/* "Small Core" Processors (Atom/E-Core) */
 
 #define INTEL_FAM6_ATOM_BONNELL		0x1C /* Diamondville, Pineview */
 #define INTEL_FAM6_ATOM_BONNELL_MID	0x26 /* Silverthorne, Lincroft */
@@ -134,6 +157,13 @@
 #define INTEL_FAM6_ATOM_TREMONT		0x96 /* Elkhart Lake */
 #define INTEL_FAM6_ATOM_TREMONT_L	0x9C /* Jasper Lake */
 
+#define INTEL_FAM6_ATOM_GRACEMONT	0xBE /* Alderlake N */
+
+#define INTEL_FAM6_ATOM_CRESTMONT_X	0xAF /* Sierra Forest */
+#define INTEL_FAM6_ATOM_CRESTMONT	0xB6 /* Grand Ridge */
+
+#define INTEL_FAM6_ATOM_DARKMONT_X	0xDD /* Clearwater Forest */
+
 /* Xeon Phi */
 
 #define INTEL_FAM6_XEON_PHI_KNL		0x57 /* Knights Landing */
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vmx: Perform VERW flushing later in the VMExit path

Broken out of the following patch because this change is subtle enough on its
own.  See it for the rationale of why we're moving VERW.

As for how, extend the trick already used to hold one condition in
flags (RESUME vs LAUNCH) through the popping of the GPRs.

Move the MOV CR earlier.  Intel specify flags to be undefined across it.

Encode the two conditions we want using SF and PF.  See the code comment for
exactly how.

Leave a comment to explain the lack of any content around
SPEC_CTRL_EXIT_TO_VMX, but leave the block in place.  Sod's law says if we
delete it, we'll need to reintroduce it.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 475fa20b7384464210f42bad7195f87bd6f1c63f)
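
The flag-encoding trick can be illustrated in C.  This is a sketch under the
assumption (which the BUILD_BUG_ON in the patch enforces only to within the
low byte) that SCF_verw is a single low-byte bit:

    #include <stdbool.h>
    #include <stdint.h>

    #define SCF_verw (1u << 3)  /* assumed bit position for illustration */

    /* Fold both conditions into one value, so that the `or` producing it
     * leaves SF = vmx_launched (bit 31) and PF = parity of the low byte,
     * i.e. PF set exactly when no VERW is wanted. */
    static uint32_t encode(bool launched, uint8_t scf)
    {
        return ((uint32_t)launched << 31) | (scf & SCF_verw);
    }

    int main(void)
    {
        uint32_t v = encode(true, SCF_verw);
        bool resume    = v >> 31;                     /* SF: 1 => VMRESUME */
        bool skip_verw = !__builtin_parity(v & 0xff); /* PF: 1 => no VERW */
        return !(resume && !skip_verw);               /* 0 on success */
    }

With SCF_verw set, the low byte holds a single set bit, so parity is odd, PF
ends up clear, `jpe` falls through and the VERW runs; with it clear, the low
byte is zero, parity is even and the VERW is skipped.  SF then selects
VMRESUME vs VMLAUNCH via `jns`.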

diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S
index 5f5de45a1309..cdde76e13892 100644
--- a/xen/arch/x86/hvm/vmx/entry.S
+++ b/xen/arch/x86/hvm/vmx/entry.S
@@ -87,17 +87,39 @@ UNLIKELY_END(realmode)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
         /* SPEC_CTRL_EXIT_TO_VMX   Req: %rsp=regs/cpuinfo              Clob:    */
-        DO_SPEC_CTRL_COND_VERW
+        /*
+         * All speculation safety work happens to be elsewhere.  VERW is after
+         * popping the GPRs, while restoring the guest MSR_SPEC_CTRL is left
+         * to the MSR load list.
+         */
 
         mov  VCPU_hvm_guest_cr2(%rbx),%rax
+        mov  %rax, %cr2
+
+        /*
+         * We need to perform two conditional actions (VERW, and Resume vs
+         * Launch) after popping GPRs.  With some cunning, we can encode both
+         * of these in eflags together.
+         *
+         * Parity is only calculated over the bottom byte of the answer, while
+         * Sign is simply the top bit.
+         *
+         * Therefore, the final OR instruction ends up producing:
+         *   SF = VCPU_vmx_launched
+         *   PF = !SCF_verw
+         */
+        BUILD_BUG_ON(SCF_verw & ~0xff)
+        movzbl VCPU_vmx_launched(%rbx), %ecx
+        shl  $31, %ecx
+        movzbl CPUINFO_spec_ctrl_flags(%rsp), %eax
+        and  $SCF_verw, %eax
+        or   %eax, %ecx
 
         pop  %r15
         pop  %r14
         pop  %r13
         pop  %r12
         pop  %rbp
-        mov  %rax,%cr2
-        cmpb $0,VCPU_vmx_launched(%rbx)
         pop  %rbx
         pop  %r11
         pop  %r10
@@ -108,7 +130,13 @@ UNLIKELY_END(realmode)
         pop  %rdx
         pop  %rsi
         pop  %rdi
-        je   .Lvmx_launch
+
+        jpe  .L_skip_verw
+        /* VERW clobbers ZF, but preserves all others, including SF. */
+        verw STK_REL(CPUINFO_verw_sel, CPUINFO_error_code)(%rsp)
+.L_skip_verw:
+
+        jns  .Lvmx_launch
 
 /*.Lvmx_resume:*/
         VMRESUME
diff --git a/xen/arch/x86/include/asm/asm_defns.h b/xen/arch/x86/include/asm/asm_defns.h
index d9431180cfba..abc6822b08c8 100644
--- a/xen/arch/x86/include/asm/asm_defns.h
+++ b/xen/arch/x86/include/asm/asm_defns.h
@@ -81,6 +81,14 @@ register unsigned long current_stack_pointer asm("rsp");
 
 #ifdef __ASSEMBLY__
 
+.macro BUILD_BUG_ON condstr, cond:vararg
+        .if \cond
+        .error "Condition \"\condstr\" not satisfied"
+        .endif
+.endm
+/* Preprocessor macro to make the error message more user-friendly. */
+#define BUILD_BUG_ON(cond) BUILD_BUG_ON #cond, cond
+
 #ifdef HAVE_AS_QUOTED_SYM
 #define SUBSECTION_LBL(tag)                        \
         .ifndef .L.tag;                            \
diff --git a/xen/arch/x86/include/asm/spec_ctrl_asm.h b/xen/arch/x86/include/asm/spec_ctrl_asm.h
index f4b8b9d9561c..ca9cb0f5dd1d 100644
--- a/xen/arch/x86/include/asm/spec_ctrl_asm.h
+++ b/xen/arch/x86/include/asm/spec_ctrl_asm.h
@@ -164,6 +164,13 @@
 #endif
 .endm
 
+/*
+ * Helper to improve the readability of stack displacements with %rsp in
+ * unusual positions.  Both @field and @top_of_stk should be constants from
+ * the same object.  @top_of_stk should be where %rsp is currently pointing.
+ */
+#define STK_REL(field, top_of_stk) ((field) - (top_of_stk))
+
 .macro DO_SPEC_CTRL_COND_VERW
 /*
  * Requires %rsp=cpuinfo
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 31fa63b77fd1..a4e94d693024 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -135,6 +135,7 @@ void __dummy__(void)
 #endif
 
     OFFSET(CPUINFO_guest_cpu_user_regs, struct cpu_info, guest_cpu_user_regs);
+    OFFSET(CPUINFO_error_code, struct cpu_info, guest_cpu_user_regs.error_code);
     OFFSET(CPUINFO_verw_sel, struct cpu_info, verw_sel);
     OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);
     OFFSET(CPUINFO_per_cpu_offset, struct cpu_info, per_cpu_offset);
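To see concretely why one OR result can carry both decisions, the following
standalone C sketch models the flag computation performed above.  It is
illustrative only: the numeric value of SCF_verw is assumed (any single bit in
the bottom byte behaves the same), and x86 sets PF when the low byte of a
result has an even number of bits set.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SCF_verw (1u << 3)  /* assumed value, for illustration only */

    int main(void)
    {
        for ( unsigned int launched = 0; launched <= 1; ++launched )
            for ( unsigned int verw = 0; verw <= 1; ++verw )
            {
                uint32_t ecx = launched << 31;      /* movzbl; shl $31       */
                uint32_t eax = verw ? SCF_verw : 0; /* movzbl; and $SCF_verw */
                uint32_t res = ecx | eax;           /* or %eax, %ecx         */

                /* SF is bit 31; PF is set on even parity of the low byte,
                 * so a single SCF_verw bit clears it. */
                bool sf = res >> 31;
                bool pf = !__builtin_parity(res & 0xff);

                printf("launched=%u verw=%u -> SF=%d PF=%d\n",
                       launched, verw, sf, pf);
            }

        return 0;
    }

Every combination leaves SF equal to VCPU_vmx_launched (so jns selects the
VMLAUNCH path) and PF equal to !SCF_verw (so jpe skips the VERW), which is
exactly what the two jumps after the GPR pops rely on.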
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Perform VERW flushing later in exit paths

On parts vulnerable to RFDS, VERW's side effects are extended to scrub all
non-architectural entries in various Physical Register Files.  To remove all
of Xen's values, the VERW must be after popping the GPRs.

Rework SPEC_CTRL_COND_VERW to default to a CPUINFO_error_code %rsp position,
but with overrides for other contexts.  Identify that it clobbers eflags; this
is particularly relevant for the SYSRET path.

For the IST exit return to Xen, have the main SPEC_CTRL_EXIT_TO_XEN put a
shadow copy of spec_ctrl_flags, as GPRs can't be used at the point we want to
issue the VERW.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 0a666cf2cd99df6faf3eebc81a1fc286e4eca4c7)

diff --git a/xen/arch/x86/include/asm/spec_ctrl_asm.h b/xen/arch/x86/include/asm/spec_ctrl_asm.h
index ca9cb0f5dd1d..97a97b2b82c9 100644
--- a/xen/arch/x86/include/asm/spec_ctrl_asm.h
+++ b/xen/arch/x86/include/asm/spec_ctrl_asm.h
@@ -171,16 +171,23 @@
  */
 #define STK_REL(field, top_of_stk) ((field) - (top_of_stk))
 
-.macro DO_SPEC_CTRL_COND_VERW
+.macro SPEC_CTRL_COND_VERW \
+    scf=STK_REL(CPUINFO_spec_ctrl_flags, CPUINFO_error_code), \
+    sel=STK_REL(CPUINFO_verw_sel,        CPUINFO_error_code)
 /*
- * Requires %rsp=cpuinfo
+ * Requires \scf and \sel as %rsp-relative expressions
+ * Clobbers eflags
+ *
+ * VERW needs to run after guest GPRs have been restored, where only %rsp is
+ * good to use.  Default to expecting %rsp pointing at CPUINFO_error_code.
+ * Contexts where this is not true must provide an alternative \scf and \sel.
  *
  * Issue a VERW for its flushing side effect, if indicated.  This is a Spectre
  * v1 gadget, but the IRET/VMEntry is serialising.
  */
-    testb $SCF_verw, CPUINFO_spec_ctrl_flags(%rsp)
+    testb $SCF_verw, \scf(%rsp)
     jz .L\@_verw_skip
-    verw CPUINFO_verw_sel(%rsp)
+    verw \sel(%rsp)
 .L\@_verw_skip:
 .endm
 
@@ -298,8 +305,6 @@
  */
     ALTERNATIVE "", DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_PV
 
-    DO_SPEC_CTRL_COND_VERW
-
     ALTERNATIVE "", DO_SPEC_CTRL_DIV, X86_FEATURE_SC_DIV
 .endm
 
@@ -379,7 +384,7 @@ UNLIKELY_DISPATCH_LABEL(\@_serialise):
  */
 .macro SPEC_CTRL_EXIT_TO_XEN
 /*
- * Requires %r12=ist_exit, %r14=stack_end
+ * Requires %r12=ist_exit, %r14=stack_end, %rsp=regs
  * Clobbers %rax, %rbx, %rcx, %rdx
  */
     movzbl STACK_CPUINFO_FIELD(spec_ctrl_flags)(%r14), %ebx
@@ -407,11 +412,18 @@ UNLIKELY_DISPATCH_LABEL(\@_serialise):
     test %r12, %r12
     jz .L\@_skip_ist_exit
 
-    /* Logically DO_SPEC_CTRL_COND_VERW but without the %rsp=cpuinfo dependency */
-    testb $SCF_verw, %bl
-    jz .L\@_skip_verw
-    verw STACK_CPUINFO_FIELD(verw_sel)(%r14)
-.L\@_skip_verw:
+    /*
+     * Stash SCF and verw_sel above eflags in the case of an IST_exit.  The
+     * VERW logic needs to run after guest GPRs have been restored; i.e. where
+     * we cannot use %r12 or %r14 for the purposes they have here.
+     *
+     * When the CPU pushed this exception frame, it zero-extended eflags.
+     * Therefore it is safe for the VERW logic to look at the stashed SCF
+     * outside of the ist_exit condition.  Also, this stashing won't influence
+     * any other restore_all_guest() paths.
+     */
+    or $(__HYPERVISOR_DS32 << 16), %ebx
+    mov %ebx, UREGS_eflags + 4(%rsp) /* EFRAME_shadow_scf/sel */
 
     ALTERNATIVE "", DO_SPEC_CTRL_DIV, X86_FEATURE_SC_DIV
 
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index a4e94d693024..4cd5938d7b9d 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -55,14 +55,22 @@ void __dummy__(void)
      * EFRAME_* is for the entry/exit logic where %rsp is pointing at
      * UREGS_error_code and GPRs are still/already guest values.
      */
-#define OFFSET_EF(sym, mem)                                             \
+#define OFFSET_EF(sym, mem, ...)                                        \
     DEFINE(sym, offsetof(struct cpu_user_regs, mem) -                   \
-                offsetof(struct cpu_user_regs, error_code))
+                offsetof(struct cpu_user_regs, error_code) __VA_ARGS__)
 
     OFFSET_EF(EFRAME_entry_vector,    entry_vector);
     OFFSET_EF(EFRAME_rip,             rip);
     OFFSET_EF(EFRAME_cs,              cs);
     OFFSET_EF(EFRAME_eflags,          eflags);
+
+    /*
+     * These aren't real fields.  They're spare space, used by the IST
+     * exit-to-xen path.
+     */
+    OFFSET_EF(EFRAME_shadow_scf,      eflags, +4);
+    OFFSET_EF(EFRAME_shadow_sel,      eflags, +6);
+
     OFFSET_EF(EFRAME_rsp,             rsp);
     BLANK();
 
@@ -136,6 +144,7 @@ void __dummy__(void)
 
     OFFSET(CPUINFO_guest_cpu_user_regs, struct cpu_info, guest_cpu_user_regs);
     OFFSET(CPUINFO_error_code, struct cpu_info, guest_cpu_user_regs.error_code);
+    OFFSET(CPUINFO_rip, struct cpu_info, guest_cpu_user_regs.rip);
     OFFSET(CPUINFO_verw_sel, struct cpu_info, verw_sel);
     OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);
     OFFSET(CPUINFO_per_cpu_offset, struct cpu_info, per_cpu_offset);
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index 7c211314d885..3b2fbcd8733a 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -161,6 +161,12 @@ ENTRY(compat_restore_all_guest)
         SPEC_CTRL_EXIT_TO_PV    /* Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */
 
         RESTORE_ALL adj=8 compat=1
+
+        /* Account for ev/ec having already been popped off the stack. */
+        SPEC_CTRL_COND_VERW \
+            scf=STK_REL(CPUINFO_spec_ctrl_flags, CPUINFO_rip), \
+            sel=STK_REL(CPUINFO_verw_sel,        CPUINFO_rip)
+
 .Lft0:  iretq
         _ASM_PRE_EXTABLE(.Lft0, handle_exception)
 
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 412cbeb3eca4..ef517e2945b0 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -214,6 +214,9 @@ restore_all_guest:
 #endif
 
         mov   EFRAME_rip(%rsp), %rcx
+
+        SPEC_CTRL_COND_VERW     /* Req: %rsp=eframe                    Clob: efl */
+
         cmpw  $FLAT_USER_CS32, EFRAME_cs(%rsp)
         mov   EFRAME_rsp(%rsp), %rsp
         je    1f
@@ -227,6 +230,9 @@ restore_all_guest:
 iret_exit_to_guest:
         andl  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), EFRAME_eflags(%rsp)
         orl   $X86_EFLAGS_IF, EFRAME_eflags(%rsp)
+
+        SPEC_CTRL_COND_VERW     /* Req: %rsp=eframe                    Clob: efl */
+
         addq  $8,%rsp
 .Lft0:  iretq
         _ASM_PRE_EXTABLE(.Lft0, handle_exception)
@@ -679,9 +685,22 @@ UNLIKELY_START(ne, exit_cr3)
 UNLIKELY_END(exit_cr3)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
-        SPEC_CTRL_EXIT_TO_XEN     /* Req: %r12=ist_exit %r14=end, Clob: abcd */
+        SPEC_CTRL_EXIT_TO_XEN /* Req: %r12=ist_exit %r14=end %rsp=regs, Clob: abcd */
 
         RESTORE_ALL adj=8
+
+        /*
+         * When the CPU pushed this exception frame, it zero-extended eflags.
+         * For an IST exit, SPEC_CTRL_EXIT_TO_XEN stashed shadow copies of
+         * spec_ctrl_flags and verw_sel above eflags, as we can't use any GPRs,
+         * and we're at a random place on the stack, not in a CPUINFO block.
+         *
+         * Account for ev/ec having already been popped off the stack.
+         */
+        SPEC_CTRL_COND_VERW \
+            scf=STK_REL(EFRAME_shadow_scf, EFRAME_rip), \
+            sel=STK_REL(EFRAME_shadow_sel, EFRAME_rip)
+
         iretq
 
 ENTRY(common_interrupt)
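The STK_REL() arithmetic is plain offset subtraction.  A minimal C model,
using an invented frame layout in place of struct cpu_user_regs, shows how the
same field ends up with different %rsp-relative displacements as the stack
pointer moves:

    #include <stddef.h>
    #include <stdio.h>

    /* Invented stand-in for the tail of struct cpu_user_regs. */
    struct eframe {
        unsigned int error_code, entry_vector;
        unsigned long rip, cs, eflags, rsp, ss;
    };

    #define STK_REL(field, top_of_stk) ((field) - (top_of_stk))
    #define EF(mem) offsetof(struct eframe, mem)

    int main(void)
    {
        /* %rsp at error_code: rip sits 8 bytes up the stack. */
        printf("rip from error_code: %+td\n",
               (ptrdiff_t)STK_REL(EF(rip), EF(error_code)));

        /* Once the combined ev/ec word is popped, %rsp points at rip and
         * every displacement shrinks by 8 - the reason for the
         * CPUINFO_rip override in compat_restore_all_guest. */
        printf("eflags from rip:     %+td\n",
               (ptrdiff_t)STK_REL(EF(eflags), EF(rip)));

        return 0;
    }

In this model the first displacement is +8 and the second +16; the real values
come from asm-offsets.c, but the subtraction is identical.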
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Rename VERW related options

VERW is going to be used for a 3rd purpose, and the existing nomenclature
didn't survive the Stale MMIO issues terribly well.

Rename the command line option from `md-clear=` to `verw=`.  This is more
consistent with other options which tend to be named based on what they're
doing, not which feature enumeration they use behind the scenes.  Retain
`md-clear=` as a deprecated alias.

Rename opt_md_clear_{pv,hvm} and opt_fb_clear_mmio to opt_verw_{pv,hvm,mmio},
which has a side effect of making spec_ctrl_init_domain() rather clearer to
follow.

No functional change.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f7603ca252e4226739eb3129a5290ee3da3f8ea4)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 2006697226de..d909ec94fe7c 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2324,7 +2324,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
 
 ### spec-ctrl (x86)
 > `= List of [ <bool>, xen=<bool>, {pv,hvm}=<bool>,
->              {msr-sc,rsb,md-clear,ibpb-entry}=<bool>|{pv,hvm}=<bool>,
+>              {msr-sc,rsb,verw,ibpb-entry}=<bool>|{pv,hvm}=<bool>,
 >              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,psfd,
 >              eager-fpu,l1d-flush,branch-harden,srb-lock,
 >              unpriv-mmio,gds-mit,div-scrub}=<bool> ]`
@@ -2349,7 +2349,7 @@ in place for guests to use.
 
 Use of a positive boolean value for either of these options is invalid.
 
-The `pv=`, `hvm=`, `msr-sc=`, `rsb=`, `md-clear=` and `ibpb-entry=` options
+The `pv=`, `hvm=`, `msr-sc=`, `rsb=`, `verw=` and `ibpb-entry=` options
 offer fine-grained control over the primitives used by Xen.  These impact Xen's
 ability to protect itself, and/or Xen's ability to virtualise support for
 guests to use.
@@ -2366,11 +2366,12 @@ guests to use.
   guests and if disabled, guests will be unable to use IBRS/STIBP/SSBD/etc.
 * `rsb=` offers control over whether to overwrite the Return Stack Buffer /
   Return Address Stack on entry to Xen and on idle.
-* `md-clear=` offers control over whether to use VERW to flush
-  microarchitectural buffers on idle and exit from Xen.  *Note: For
-  compatibility with development versions of this fix, `mds=` is also accepted
-  on Xen 4.12 and earlier as an alias.  Consult vendor documentation in
-  preference to here.*
+* `verw=` offers control over whether to use VERW for its scrubbing side
+  effects at appropriate privilege transitions.  The exact side effects are
+  microarchitecture and microcode specific.  *Note: `md-clear=` is accepted as
+  a deprecated alias.  For compatibility with development versions of XSA-297,
+  `mds=` is also accepted on Xen 4.12 and earlier as an alias.  Consult vendor
+  documentation in preference to here.*
 * `ibpb-entry=` offers control over whether IBPB (Indirect Branch Prediction
   Barrier) is used on entry to Xen.  This is used by default on hardware
   vulnerable to Branch Type Confusion, and hardware vulnerable to Speculative
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 25a18ac598fa..e12ec9930cf7 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -37,8 +37,8 @@ static bool __initdata opt_msr_sc_pv = true;
 static bool __initdata opt_msr_sc_hvm = true;
 static int8_t __initdata opt_rsb_pv = -1;
 static bool __initdata opt_rsb_hvm = true;
-static int8_t __ro_after_init opt_md_clear_pv = -1;
-static int8_t __ro_after_init opt_md_clear_hvm = -1;
+static int8_t __ro_after_init opt_verw_pv = -1;
+static int8_t __ro_after_init opt_verw_hvm = -1;
 
 static int8_t __ro_after_init opt_ibpb_entry_pv = -1;
 static int8_t __ro_after_init opt_ibpb_entry_hvm = -1;
@@ -78,7 +78,7 @@ static bool __initdata cpu_has_bug_mds; /* Any other M{LP,SB,FB}DS combination.
 
 static int8_t __initdata opt_srb_lock = -1;
 static bool __initdata opt_unpriv_mmio;
-static bool __ro_after_init opt_fb_clear_mmio;
+static bool __ro_after_init opt_verw_mmio;
 static int8_t __initdata opt_gds_mit = -1;
 static int8_t __initdata opt_div_scrub = -1;
 
@@ -120,8 +120,8 @@ static int __init cf_check parse_spec_ctrl(const char *s)
         disable_common:
             opt_rsb_pv = false;
             opt_rsb_hvm = false;
-            opt_md_clear_pv = 0;
-            opt_md_clear_hvm = 0;
+            opt_verw_pv = 0;
+            opt_verw_hvm = 0;
             opt_ibpb_entry_pv = 0;
             opt_ibpb_entry_hvm = 0;
             opt_ibpb_entry_dom0 = false;
@@ -152,14 +152,14 @@ static int __init cf_check parse_spec_ctrl(const char *s)
         {
             opt_msr_sc_pv = val;
             opt_rsb_pv = val;
-            opt_md_clear_pv = val;
+            opt_verw_pv = val;
             opt_ibpb_entry_pv = val;
         }
         else if ( (val = parse_boolean("hvm", s, ss)) >= 0 )
         {
             opt_msr_sc_hvm = val;
             opt_rsb_hvm = val;
-            opt_md_clear_hvm = val;
+            opt_verw_hvm = val;
             opt_ibpb_entry_hvm = val;
         }
         else if ( (val = parse_boolean("msr-sc", s, ss)) != -1 )
@@ -204,21 +204,22 @@ static int __init cf_check parse_spec_ctrl(const char *s)
                 break;
             }
         }
-        else if ( (val = parse_boolean("md-clear", s, ss)) != -1 )
+        else if ( (val = parse_boolean("verw", s, ss)) != -1 ||
+                  (val = parse_boolean("md-clear", s, ss)) != -1 )
         {
             switch ( val )
             {
             case 0:
             case 1:
-                opt_md_clear_pv = opt_md_clear_hvm = val;
+                opt_verw_pv = opt_verw_hvm = val;
                 break;
 
             case -2:
-                s += strlen("md-clear=");
+                s += (*s == 'v') ? strlen("verw=") : strlen("md-clear=");
                 if ( (val = parse_boolean("pv", s, ss)) >= 0 )
-                    opt_md_clear_pv = val;
+                    opt_verw_pv = val;
                 else if ( (val = parse_boolean("hvm", s, ss)) >= 0 )
-                    opt_md_clear_hvm = val;
+                    opt_verw_hvm = val;
                 else
             default:
                     rc = -EINVAL;
@@ -540,8 +541,8 @@ static void __init print_details(enum ind_thunk thunk)
            opt_srb_lock                              ? " SRB_LOCK+" : " SRB_LOCK-",
            opt_ibpb_ctxt_switch                      ? " IBPB-ctxt" : "",
            opt_l1d_flush                             ? " L1D_FLUSH" : "",
-           opt_md_clear_pv || opt_md_clear_hvm ||
-           opt_fb_clear_mmio                         ? " VERW"  : "",
+           opt_verw_pv || opt_verw_hvm ||
+           opt_verw_mmio                             ? " VERW"  : "",
            opt_div_scrub                             ? " DIV" : "",
            opt_branch_harden                         ? " BRANCH_HARDEN" : "");
 
@@ -562,13 +563,13 @@ static void __init print_details(enum ind_thunk thunk)
             boot_cpu_has(X86_FEATURE_SC_RSB_HVM) ||
             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM) ||
             amd_virt_spec_ctrl ||
-            opt_eager_fpu || opt_md_clear_hvm)       ? ""               : " None",
+            opt_eager_fpu || opt_verw_hvm)           ? ""               : " None",
            boot_cpu_has(X86_FEATURE_SC_MSR_HVM)      ? " MSR_SPEC_CTRL" : "",
            (boot_cpu_has(X86_FEATURE_SC_MSR_HVM) ||
             amd_virt_spec_ctrl)                      ? " MSR_VIRT_SPEC_CTRL" : "",
            boot_cpu_has(X86_FEATURE_SC_RSB_HVM)      ? " RSB"           : "",
            opt_eager_fpu                             ? " EAGER_FPU"     : "",
-           opt_md_clear_hvm                          ? " MD_CLEAR"      : "",
+           opt_verw_hvm                              ? " VERW"          : "",
            boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM)  ? " IBPB-entry"    : "");
 
 #endif
@@ -577,11 +578,11 @@ static void __init print_details(enum ind_thunk thunk)
            (boot_cpu_has(X86_FEATURE_SC_MSR_PV) ||
             boot_cpu_has(X86_FEATURE_SC_RSB_PV) ||
             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) ||
-            opt_eager_fpu || opt_md_clear_pv)        ? ""               : " None",
+            opt_eager_fpu || opt_verw_pv)            ? ""               : " None",
            boot_cpu_has(X86_FEATURE_SC_MSR_PV)       ? " MSR_SPEC_CTRL" : "",
            boot_cpu_has(X86_FEATURE_SC_RSB_PV)       ? " RSB"           : "",
            opt_eager_fpu                             ? " EAGER_FPU"     : "",
-           opt_md_clear_pv                           ? " MD_CLEAR"      : "",
+           opt_verw_pv                               ? " VERW"          : "",
            boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV)   ? " IBPB-entry"    : "");
 
     printk("  XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n",
@@ -1514,8 +1515,8 @@ void spec_ctrl_init_domain(struct domain *d)
 {
     bool pv = is_pv_domain(d);
 
-    bool verw = ((pv ? opt_md_clear_pv : opt_md_clear_hvm) ||
-                 (opt_fb_clear_mmio && is_iommu_enabled(d)));
+    bool verw = ((pv ? opt_verw_pv : opt_verw_hvm) ||
+                 (opt_verw_mmio && is_iommu_enabled(d)));
 
     bool ibpb = ((pv ? opt_ibpb_entry_pv : opt_ibpb_entry_hvm) &&
                  (d->domain_id != 0 || opt_ibpb_entry_dom0));
@@ -1878,19 +1879,20 @@ void __init init_speculation_mitigations(void)
      * the return-to-guest path.
      */
     if ( opt_unpriv_mmio )
-        opt_fb_clear_mmio = cpu_has_fb_clear;
+        opt_verw_mmio = cpu_has_fb_clear;
 
     /*
      * By default, enable PV and HVM mitigations on MDS-vulnerable hardware.
      * This will only be a token effort for MLPDS/MFBDS when HT is enabled,
      * but it is somewhat better than nothing.
      */
-    if ( opt_md_clear_pv == -1 )
-        opt_md_clear_pv = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                           boot_cpu_has(X86_FEATURE_MD_CLEAR));
-    if ( opt_md_clear_hvm == -1 )
-        opt_md_clear_hvm = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                            boot_cpu_has(X86_FEATURE_MD_CLEAR));
+    if ( opt_verw_pv == -1 )
+        opt_verw_pv = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
+                       cpu_has_md_clear);
+
+    if ( opt_verw_hvm == -1 )
+        opt_verw_hvm = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
+                        cpu_has_md_clear);
 
     /*
      * Enable MDS/MMIO defences as applicable.  The Idle blocks need using if
@@ -1903,12 +1905,12 @@ void __init init_speculation_mitigations(void)
      * MDS mitigations.  L1D_FLUSH is not safe for MMIO mitigations.)
      *
      * After calculating the appropriate idle setting, simplify
-     * opt_md_clear_hvm to mean just "should we VERW on the way into HVM
+     * opt_verw_hvm to mean just "should we VERW on the way into HVM
      * guests", so spec_ctrl_init_domain() can calculate suitable settings.
      */
-    if ( opt_md_clear_pv || opt_md_clear_hvm || opt_fb_clear_mmio )
+    if ( opt_verw_pv || opt_verw_hvm || opt_verw_mmio )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
-    opt_md_clear_hvm &= !cpu_has_skip_l1dfl && !opt_l1d_flush;
+    opt_verw_hvm &= !cpu_has_skip_l1dfl && !opt_l1d_flush;
 
     /*
      * Warn the user if they are on MLPDS/MFBDS-vulnerable hardware with HT
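The only subtle step in the alias handling is that both spellings reach the
same branch, so the sub-option scan has to skip whichever prefix actually
matched.  A simplified, self-contained sketch of that step (Xen's
parse_boolean() is replaced by strcmp() here):

    #include <stdio.h>
    #include <string.h>

    static int opt_verw_pv = -1, opt_verw_hvm = -1;

    static void parse_verw_suboption(const char *s)
    {
        /* "verw=..." and the deprecated "md-clear=..." share this
         * handler; disambiguate by first character before skipping. */
        s += (*s == 'v') ? strlen("verw=") : strlen("md-clear=");

        if ( !strcmp(s, "pv") )
            opt_verw_pv = 1;
        else if ( !strcmp(s, "hvm") )
            opt_verw_hvm = 1;
    }

    int main(void)
    {
        parse_verw_suboption("verw=pv");       /* new spelling */
        parse_verw_suboption("md-clear=hvm");  /* deprecated alias */

        printf("pv=%d hvm=%d\n", opt_verw_pv, opt_verw_hvm);
        return 0;
    }

Both command line spellings funnel into the same opt_verw_{pv,hvm} variables,
which is what makes the rename transparent to existing configurations.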
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: VERW-handling adjustments

... before we add yet more complexity to this logic.  Mostly expanded
comments, but with three minor changes.

1) Introduce cpu_has_useful_md_clear to simplify later logic in this patch and
   future ones.

2) We only ever need SC_VERW_IDLE when SMT is active.  If SMT isn't active,
   then there's no re-partition of pipeline resources based on thread-idleness
   to worry about.

3) The logic to adjust HVM VERW based on L1D_FLUSH is unmaintainable and, as
   it turns out, wrong.  SKIP_L1DFL is just a hint bit, whereas opt_l1d_flush
   is the relevant decision of whether to use L1D_FLUSH based on
   susceptibility and user preference.

   Rewrite the logic so it can be followed, and incorporate the fact that when
   FB_CLEAR is visible, L1D_FLUSH isn't a safe substitution.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 1eb91a8a06230b4b64228c9a380194f8cfe6c5e2)

diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index e12ec9930cf7..adb6bc74e8e6 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -1531,7 +1531,7 @@ void __init init_speculation_mitigations(void)
 {
     enum ind_thunk thunk = THUNK_DEFAULT;
     bool has_spec_ctrl, ibrs = false, hw_smt_enabled;
-    bool cpu_has_bug_taa, retpoline_safe;
+    bool cpu_has_bug_taa, cpu_has_useful_md_clear, retpoline_safe;
 
     hw_smt_enabled = check_smt_enabled();
 
@@ -1867,50 +1867,97 @@ void __init init_speculation_mitigations(void)
             "enabled.  Please assess your configuration and choose an\n"
             "explicit 'smt=<bool>' setting.  See XSA-273.\n");
 
+    /*
+     * A brief summary of VERW-related changes.
+     *
+     * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-analysis-microarchitectural-data-sampling.html
+     * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
+     *
+     * Relevant ucodes:
+     *
+     * - May 2019, for MDS.  Introduces the MD_CLEAR CPUID bit and VERW side
+     *   effects to scrub Store/Load/Fill buffers as applicable.  MD_CLEAR
+     *   exists architecturally, even when the side effects have been removed.
+     *
+     *   Use VERW to scrub on return-to-guest.  Parts with L1D_FLUSH to
+     *   mitigate L1TF have the same side effect, so no need to do both.
+     *
+     *   Various Atoms suffer from Store-buffer sampling only.  Store buffers
+     *   are statically partitioned between non-idle threads, so scrubbing is
+     *   wanted when going idle too.
+     *
+     *   Load ports and Fill buffers are competitively shared between threads.
+     *   SMT must be disabled for VERW scrubbing to be fully effective.
+     *
+     * - November 2019, for TAA.  Extended VERW side effects to TSX-enabled
+     *   MDS_NO parts.
+     *
+     * - February 2022, for Client TSX de-feature.  Removed VERW side effects
+     *   from Client CPUs only.
+     *
+     * - May 2022, for MMIO Stale Data.  (Re)introduced Fill Buffer scrubbing
+     *   on all MMIO-affected parts which didn't already have it for MDS
+     *   reasons, enumerating FB_CLEAR on those parts only.
+     *
+     *   If FB_CLEAR is enumerated, L1D_FLUSH does not have the same scrubbing
+     *   side effects as VERW and cannot be used in its place.
+     */
     mds_calculations();
 
     /*
-     * Parts which enumerate FB_CLEAR are those which are post-MDS_NO and have
-     * reintroduced the VERW fill buffer flushing side effect because of a
-     * susceptibility to FBSDP.
+     * Parts which enumerate FB_CLEAR are those with now-updated microcode
+     * which weren't susceptible to the original MFBDS (and therefore didn't
+     * have Fill Buffer scrubbing side effects to begin with, or were Client
+     * MDS_NO non-TAA_NO parts where the scrubbing was removed), but have had
+     * the scrubbing reintroduced because of a susceptibility to FBSDP.
      *
      * If unprivileged guests have (or will have) MMIO mappings, we can
      * mitigate cross-domain leakage of fill buffer data by issuing VERW on
-     * the return-to-guest path.
+     * the return-to-guest path.  This is only a token effort if SMT is
+     * active.
      */
     if ( opt_unpriv_mmio )
         opt_verw_mmio = cpu_has_fb_clear;
 
     /*
-     * By default, enable PV and HVM mitigations on MDS-vulnerable hardware.
-     * This will only be a token effort for MLPDS/MFBDS when HT is enabled,
-     * but it is somewhat better than nothing.
+     * MD_CLEAR is enumerated architecturally forevermore, even after the
+     * scrubbing side effects have been removed.  Create ourselves a version
+     * which expresses whether we think MD_CLEAR is having any useful side
+     * effect.
+     */
+    cpu_has_useful_md_clear = (cpu_has_md_clear &&
+                               (cpu_has_bug_mds || cpu_has_bug_msbds_only));
+
+    /*
+     * By default, use VERW scrubbing on applicable hardware, if we think it's
+     * going to have an effect.  This will only be a token effort for
+     * MLPDS/MFBDS when SMT is enabled.
      */
     if ( opt_verw_pv == -1 )
-        opt_verw_pv = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                       cpu_has_md_clear);
+        opt_verw_pv = cpu_has_useful_md_clear;
 
     if ( opt_verw_hvm == -1 )
-        opt_verw_hvm = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                        cpu_has_md_clear);
+        opt_verw_hvm = cpu_has_useful_md_clear;
 
     /*
-     * Enable MDS/MMIO defences as applicable.  The Idle blocks need using if
-     * either the PV or HVM MDS defences are used, or if we may give MMIO
-     * access to untrusted guests.
-     *
-     * HVM is more complicated.  The MD_CLEAR microcode extends L1D_FLUSH with
-     * equivalent semantics to avoid needing to perform both flushes on the
-     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH (for
-     * MDS mitigations.  L1D_FLUSH is not safe for MMIO mitigations.)
-     *
-     * After calculating the appropriate idle setting, simplify
-     * opt_verw_hvm to mean just "should we VERW on the way into HVM
-     * guests", so spec_ctrl_init_domain() can calculate suitable settings.
+     * If SMT is active, and we're protecting against MDS or MMIO stale data,
+     * we need to scrub before going idle as well as on return to guest.
+     * Various pipeline resources are repartitioned amongst non-idle threads.
      */
-    if ( opt_verw_pv || opt_verw_hvm || opt_verw_mmio )
+    if ( ((cpu_has_useful_md_clear && (opt_verw_pv || opt_verw_hvm)) ||
+          opt_verw_mmio) && hw_smt_enabled )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
-    opt_verw_hvm &= !cpu_has_skip_l1dfl && !opt_l1d_flush;
+
+    /*
+     * After calculating the appropriate idle setting, simplify opt_verw_hvm
+     * to mean just "should we VERW on the way into HVM guests", so
+     * spec_ctrl_init_domain() can calculate suitable settings.
+     *
+     * It is only safe to use L1D_FLUSH in place of VERW when MD_CLEAR is the
+     * only *_CLEAR we can see.
+     */
+    if ( opt_l1d_flush && cpu_has_md_clear && !cpu_has_fb_clear )
+        opt_verw_hvm = false;
 
     /*
      * Warn the user if they are on MLPDS/MFBDS-vulnerable hardware with HT
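Condensed into one place, the reworked decisions look as follows.  This is an
illustrative model rather than hypervisor code; the field names mirror the
variables in the patch:

    #include <stdbool.h>
    #include <stdio.h>

    struct inputs {
        bool md_clear;          /* cpu_has_md_clear */
        bool useful_md_clear;   /* ... and scrubbing still present */
        bool fb_clear;          /* cpu_has_fb_clear */
        bool verw_pv, verw_hvm, verw_mmio;
        bool l1d_flush;         /* opt_l1d_flush */
        bool smt;               /* hw_smt_enabled */
    };

    /* Idle scrubbing matters only when SMT repartitions resources. */
    static bool verw_idle(const struct inputs *i)
    {
        return ((i->useful_md_clear && (i->verw_pv || i->verw_hvm)) ||
                i->verw_mmio) && i->smt;
    }

    /* L1D_FLUSH substitutes for VERW only when MD_CLEAR is the only
     * *_CLEAR visible. */
    static bool verw_entering_hvm(const struct inputs *i)
    {
        if ( i->l1d_flush && i->md_clear && !i->fb_clear )
            return false;
        return i->verw_hvm;
    }

    int main(void)
    {
        struct inputs i = { .md_clear = true, .useful_md_clear = true,
                            .verw_hvm = true, .l1d_flush = true };

        printf("idle=%d hvm=%d\n", verw_idle(&i), verw_entering_hvm(&i));
        return 0;
    }

With L1D_FLUSH active and no FB_CLEAR in sight, the HVM path drops its VERW,
matching the final hunk above.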
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Mitigate Register File Data Sampling

RFDS affects Atom cores, also branded E-cores, between the Goldmont and
Gracemont microarchitectures.  This includes Alder Lake and Raptor Lake hybrid
client systems which have a mix of Gracemont and other types of cores.

Two new bits have been defined: RFDS_CLEAR to indicate VERW has more side
effects, and RFDS_NO to indicate that the system is unaffected.  Plenty of
unaffected CPUs won't be getting RFDS_NO retrofitted in microcode, so we
synthesise it.  Alder Lake and Raptor Lake Xeon-E's are unaffected due to
their platform configuration, and we must use the Hybrid CPUID bit to
distinguish them from their non-Xeon counterparts.

Like MD_CLEAR and FB_CLEAR, RFDS_CLEAR needs OR-ing across a resource pool, so
set it in the max policies and reflect the host setting in default.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit fb5b6f6744713410c74cfc12b7176c108e3c9a31)

diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
index aefc140d6651..5ceea8be073b 100644
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -172,7 +172,7 @@ static const char *const str_7d0[32] =
     [ 8] = "avx512-vp2intersect", [ 9] = "srbds-ctrl",
     [10] = "md-clear",            [11] = "rtm-always-abort",
     /* 12 */                [13] = "tsx-force-abort",
-    [14] = "serialize",
+    [14] = "serialize",     [15] = "hybrid",
     [16] = "tsxldtrk",
     [18] = "pconfig",
     [20] = "cet-ibt",
@@ -237,7 +237,8 @@ static const char *const str_m10Al[32] =
     [20] = "bhi-no",              [21] = "xapic-status",
     /* 22 */                      [23] = "ovrclk-status",
     [24] = "pbrsb-no",            [25] = "gds-ctrl",
-    [26] = "gds-no",
+    [26] = "gds-no",              [27] = "rfds-no",
+    [28] = "rfds-clear",
 };
 
 static const char *const str_m10Ah[32] =
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 7b875a722142..96c2cee1a857 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -444,6 +444,7 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
          */
         __set_bit(X86_FEATURE_MD_CLEAR, fs);
         __set_bit(X86_FEATURE_FB_CLEAR, fs);
+        __set_bit(X86_FEATURE_RFDS_CLEAR, fs);
 
         /*
          * The Gather Data Sampling microcode mitigation (August 2023) has an
@@ -493,6 +494,10 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
         if ( cpu_has_fb_clear )
             __set_bit(X86_FEATURE_FB_CLEAR, fs);
 
+        __clear_bit(X86_FEATURE_RFDS_CLEAR, fs);
+        if ( cpu_has_rfds_clear )
+            __set_bit(X86_FEATURE_RFDS_CLEAR, fs);
+
         /*
          * The Gather Data Sampling microcode mitigation (August 2023) has an
          * adverse performance impact on the CLWB instruction on SKX/CLX/CPX.
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index ec824e895498..a6b8af12964c 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -140,6 +140,7 @@
 #define cpu_has_rtm_always_abort boot_cpu_has(X86_FEATURE_RTM_ALWAYS_ABORT)
 #define cpu_has_tsx_force_abort boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)
 #define cpu_has_serialize       boot_cpu_has(X86_FEATURE_SERIALIZE)
+#define cpu_has_hybrid          boot_cpu_has(X86_FEATURE_HYBRID)
 #define cpu_has_avx512_fp16     boot_cpu_has(X86_FEATURE_AVX512_FP16)
 #define cpu_has_arch_caps       boot_cpu_has(X86_FEATURE_ARCH_CAPS)
 
@@ -161,6 +162,8 @@
 #define cpu_has_rrsba           boot_cpu_has(X86_FEATURE_RRSBA)
 #define cpu_has_gds_ctrl        boot_cpu_has(X86_FEATURE_GDS_CTRL)
 #define cpu_has_gds_no          boot_cpu_has(X86_FEATURE_GDS_NO)
+#define cpu_has_rfds_no         boot_cpu_has(X86_FEATURE_RFDS_NO)
+#define cpu_has_rfds_clear      boot_cpu_has(X86_FEATURE_RFDS_CLEAR)
 
 /* Synthesized. */
 #define cpu_has_arch_perfmon    boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index 6abf7bc34a4f..9b5f67711f0c 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -88,6 +88,8 @@
 #define  ARCH_CAPS_PBRSB_NO                 (_AC(1, ULL) << 24)
 #define  ARCH_CAPS_GDS_CTRL                 (_AC(1, ULL) << 25)
 #define  ARCH_CAPS_GDS_NO                   (_AC(1, ULL) << 26)
+#define  ARCH_CAPS_RFDS_NO                  (_AC(1, ULL) << 27)
+#define  ARCH_CAPS_RFDS_CLEAR               (_AC(1, ULL) << 28)
 
 #define MSR_FLUSH_CMD                       0x0000010b
 #define  FLUSH_CMD_L1D                      (_AC(1, ULL) <<  0)
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index adb6bc74e8e6..1ee81e2dfe79 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -24,6 +24,7 @@
 
 #include <asm/amd.h>
 #include <asm/hvm/svm/svm.h>
+#include <asm/intel-family.h>
 #include <asm/microcode.h>
 #include <asm/msr.h>
 #include <asm/pv/domain.h>
@@ -447,7 +448,7 @@ static void __init print_details(enum ind_thunk thunk)
      * Hardware read-only information, stating immunity to certain issues, or
      * suggestions of which mitigation to use.
      */
-    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (caps & ARCH_CAPS_RDCL_NO)                        ? " RDCL_NO"        : "",
            (caps & ARCH_CAPS_EIBRS)                          ? " EIBRS"          : "",
            (caps & ARCH_CAPS_RSBA)                           ? " RSBA"           : "",
@@ -463,6 +464,7 @@ static void __init print_details(enum ind_thunk thunk)
            (caps & ARCH_CAPS_FB_CLEAR)                       ? " FB_CLEAR"       : "",
            (caps & ARCH_CAPS_PBRSB_NO)                       ? " PBRSB_NO"       : "",
            (caps & ARCH_CAPS_GDS_NO)                         ? " GDS_NO"         : "",
+           (caps & ARCH_CAPS_RFDS_NO)                        ? " RFDS_NO"        : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_ALWAYS))    ? " IBRS_ALWAYS"    : "",
            (e8b  & cpufeat_mask(X86_FEATURE_STIBP_ALWAYS))   ? " STIBP_ALWAYS"   : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_FAST))      ? " IBRS_FAST"      : "",
@@ -473,7 +475,7 @@ static void __init print_details(enum ind_thunk thunk)
            (e21a & cpufeat_mask(X86_FEATURE_SRSO_NO))        ? " SRSO_NO"        : "");
 
     /* Hardware features which need driving to mitigate issues. */
-    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (e8b  & cpufeat_mask(X86_FEATURE_IBPB)) ||
            (_7d0 & cpufeat_mask(X86_FEATURE_IBRSB))          ? " IBPB"           : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS)) ||
@@ -491,6 +493,7 @@ static void __init print_details(enum ind_thunk thunk)
            (caps & ARCH_CAPS_TSX_CTRL)                       ? " TSX_CTRL"       : "",
            (caps & ARCH_CAPS_FB_CLEAR_CTRL)                  ? " FB_CLEAR_CTRL"  : "",
            (caps & ARCH_CAPS_GDS_CTRL)                       ? " GDS_CTRL"       : "",
+           (caps & ARCH_CAPS_RFDS_CLEAR)                     ? " RFDS_CLEAR"     : "",
            (e21a & cpufeat_mask(X86_FEATURE_SBPB))           ? " SBPB"           : "");
 
     /* Compiled-in support which pertains to mitigations. */
@@ -1359,6 +1362,83 @@ static __init void mds_calculations(void)
     }
 }
 
+/*
+ * Register File Data Sampling affects Atom cores from the Goldmont to
+ * Gracemont microarchitectures.  The March 2024 microcode adds RFDS_NO to
+ * some but not all unaffected parts, and RFDS_CLEAR to affected parts still
+ * in support.
+ *
+ * Alder Lake and Raptor Lake client CPUs have a mix of P cores
+ * (Golden/Raptor Cove, not vulnerable) and E cores (Gracemont,
+ * vulnerable), and both enumerate RFDS_CLEAR.
+ *
+ * Both exist in a Xeon SKU, which has the E cores (Gracemont) disabled by
+ * platform configuration, and enumerate RFDS_NO.
+ *
+ * With older parts, or with out-of-date microcode, synthesise RFDS_NO when
+ * safe to do so.
+ *
+ * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/register-file-data-sampling.html
+ */
+static void __init rfds_calculations(void)
+{
+    /* RFDS is only known to affect Intel Family 6 processors at this time. */
+    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+         boot_cpu_data.x86 != 6 )
+        return;
+
+    /*
+     * If RFDS_NO or RFDS_CLEAR are visible, we've either got suitable
+     * microcode, or an RFDS-aware hypervisor is levelling us in a pool.
+     */
+    if ( cpu_has_rfds_no || cpu_has_rfds_clear )
+        return;
+
+    /* If we're virtualised, don't attempt to synthesise RFDS_NO. */
+    if ( cpu_has_hypervisor )
+        return;
+
+    /*
+     * Not all CPUs are expected to get a microcode update enumerating one of
+     * RFDS_{NO,CLEAR}, or we might have out-of-date microcode.
+     */
+    switch ( boot_cpu_data.x86_model )
+    {
+    case INTEL_FAM6_ALDERLAKE:
+    case INTEL_FAM6_RAPTORLAKE:
+        /*
+         * Alder Lake and Raptor Lake might be a client SKU (with the
+         * Gracemont cores active, and therefore vulnerable) or might be a
+         * server SKU (with the Gracemont cores disabled, and therefore not
+         * vulnerable).
+         *
+         * See if the CPU identifies as hybrid to distinguish the two cases.
+         */
+        if ( !cpu_has_hybrid )
+            break;
+        fallthrough;
+    case INTEL_FAM6_ALDERLAKE_L:
+    case INTEL_FAM6_RAPTORLAKE_P:
+    case INTEL_FAM6_RAPTORLAKE_S:
+
+    case INTEL_FAM6_ATOM_GOLDMONT:      /* Apollo Lake */
+    case INTEL_FAM6_ATOM_GOLDMONT_D:    /* Denverton */
+    case INTEL_FAM6_ATOM_GOLDMONT_PLUS: /* Gemini Lake */
+    case INTEL_FAM6_ATOM_TREMONT_D:     /* Snow Ridge / Parker Ridge */
+    case INTEL_FAM6_ATOM_TREMONT:       /* Elkhart Lake */
+    case INTEL_FAM6_ATOM_TREMONT_L:     /* Jasper Lake */
+    case INTEL_FAM6_ATOM_GRACEMONT:     /* Alder Lake N */
+        return;
+    }
+
+    /*
+     * We appear to be on an unaffected CPU which didn't enumerate RFDS_NO,
+     * perhaps because of its age or because of out-of-date microcode.
+     * Synthesise it.
+     */
+    setup_force_cpu_cap(X86_FEATURE_RFDS_NO);
+}
+
 static bool __init cpu_has_gds(void)
 {
     /*
@@ -1872,6 +1952,7 @@ void __init init_speculation_mitigations(void)
      *
      * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-analysis-microarchitectural-data-sampling.html
      * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
+     * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/register-file-data-sampling.html
      *
      * Relevant ucodes:
      *
@@ -1901,8 +1982,12 @@ void __init init_speculation_mitigations(void)
      *
      *   If FB_CLEAR is enumerated, L1D_FLUSH does not have the same scrubbing
      *   side effects as VERW and cannot be used in its place.
+     *
+     * - March 2024, for RFDS.  Enumerate RFDS_CLEAR to mean that VERW now
+     *   scrubs non-architectural entries from certain register files.
      */
     mds_calculations();
+    rfds_calculations();
 
     /*
      * Parts which enumerate FB_CLEAR are those with now-updated microcode
@@ -1934,15 +2019,19 @@ void __init init_speculation_mitigations(void)
      * MLPDS/MFBDS when SMT is enabled.
      */
     if ( opt_verw_pv == -1 )
-        opt_verw_pv = cpu_has_useful_md_clear;
+        opt_verw_pv = cpu_has_useful_md_clear || cpu_has_rfds_clear;
 
     if ( opt_verw_hvm == -1 )
-        opt_verw_hvm = cpu_has_useful_md_clear;
+        opt_verw_hvm = cpu_has_useful_md_clear || cpu_has_rfds_clear;
 
     /*
      * If SMT is active, and we're protecting against MDS or MMIO stale data,
      * we need to scrub before going idle as well as on return to guest.
      * Various pipeline resources are repartitioned amongst non-idle threads.
+     *
+     * We don't need to scrub on idle for RFDS.  There are no affected cores
+     * which support SMT, despite there being affected cores in hybrid systems
+     * which have SMT elsewhere in the platform.
      */
     if ( ((cpu_has_useful_md_clear && (opt_verw_pv || opt_verw_hvm)) ||
           opt_verw_mmio) && hw_smt_enabled )
@@ -1956,7 +2045,8 @@ void __init init_speculation_mitigations(void)
      * It is only safe to use L1D_FLUSH in place of VERW when MD_CLEAR is the
      * only *_CLEAR we can see.
      */
-    if ( opt_l1d_flush && cpu_has_md_clear && !cpu_has_fb_clear )
+    if ( opt_l1d_flush && cpu_has_md_clear && !cpu_has_fb_clear &&
+         !cpu_has_rfds_clear )
         opt_verw_hvm = false;
 
     /*
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index aec1407613c3..113e6cadc17d 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -264,6 +264,7 @@ XEN_CPUFEATURE(MD_CLEAR,      9*32+10) /*!A VERW clears microarchitectural buffe
 XEN_CPUFEATURE(RTM_ALWAYS_ABORT, 9*32+11) /*! June 2021 TSX defeaturing in microcode. */
 XEN_CPUFEATURE(TSX_FORCE_ABORT, 9*32+13) /* MSR_TSX_FORCE_ABORT.RTM_ABORT */
 XEN_CPUFEATURE(SERIALIZE,     9*32+14) /*A  SERIALIZE insn */
+XEN_CPUFEATURE(HYBRID,        9*32+15) /*   Heterogeneous platform */
 XEN_CPUFEATURE(TSXLDTRK,      9*32+16) /*a  TSX load tracking suspend/resume insns */
 XEN_CPUFEATURE(CET_IBT,       9*32+20) /*   CET - Indirect Branch Tracking */
 XEN_CPUFEATURE(AVX512_FP16,   9*32+23) /*   AVX512 FP16 instructions */
@@ -330,6 +331,8 @@ XEN_CPUFEATURE(OVRCLK_STATUS,      16*32+23) /*   MSR_OVERCLOCKING_STATUS */
 XEN_CPUFEATURE(PBRSB_NO,           16*32+24) /*A  No Post-Barrier RSB predictions */
 XEN_CPUFEATURE(GDS_CTRL,           16*32+25) /*   MCU_OPT_CTRL.GDS_MIT_{DIS,LOCK} */
 XEN_CPUFEATURE(GDS_NO,             16*32+26) /*A  No Gather Data Sampling */
+XEN_CPUFEATURE(RFDS_NO,            16*32+27) /*A  No Register File Data Sampling */
+XEN_CPUFEATURE(RFDS_CLEAR,         16*32+28) /*!A Register File(s) cleared by VERW */
 
 /* Intel-defined CPU features, MSR_ARCH_CAPS 0x10a.edx, word 17 */
 
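From a guest kernel's point of view the two new bits are ordinary
MSR_ARCH_CAPS flags.  A sketch of how a kernel might classify itself (bit
positions taken from the patch; the enum and helper are invented for
illustration):

    #include <stdint.h>

    #define ARCH_CAPS_RFDS_NO    (1ULL << 27)
    #define ARCH_CAPS_RFDS_CLEAR (1ULL << 28)

    enum rfds_state {
        RFDS_UNAFFECTED,  /* RFDS_NO: nothing to do */
        RFDS_MITIGABLE,   /* RFDS_CLEAR: VERW scrubs the register files */
        RFDS_UNKNOWN,     /* neither: old ucode, or an unaware hypervisor */
    };

    static enum rfds_state classify_rfds(uint64_t arch_caps)
    {
        if ( arch_caps & ARCH_CAPS_RFDS_NO )
            return RFDS_UNAFFECTED;
        if ( arch_caps & ARCH_CAPS_RFDS_CLEAR )
            return RFDS_MITIGABLE;
        return RFDS_UNKNOWN;
    }

    int main(void)
    {
        return classify_rfds(ARCH_CAPS_RFDS_CLEAR) == RFDS_MITIGABLE ? 0 : 1;
    }

Because RFDS_CLEAR is OR-ed across a migration pool, a guest seeing it should
use VERW even if the physical core it currently runs on happens to be
unaffected.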
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/entry: Introduce EFRAME_* constants

restore_all_guest() does a lot of manipulation of the stack after popping the
GPRs, and uses raw %rsp displacements to do so.  Also, almost all entrypaths
use raw %rsp displacements prior to pushing GPRs.

Provide better mnemonics, to aid readability and reduce the chance of errors
when editing.

No functional change.  The resulting binary is identical.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 37541208f119a9c552c6c6c3246ea61be0d44035)

diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 57b73a4e6214..2fc4d9130a4d 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -51,6 +51,23 @@ void __dummy__(void)
     OFFSET(UREGS_kernel_sizeof, struct cpu_user_regs, es);
     BLANK();
 
+    /*
+     * EFRAME_* is for the entry/exit logic where %rsp is pointing at
+     * UREGS_error_code and GPRs are still/already guest values.
+     */
+#define OFFSET_EF(sym, mem)                                             \
+    DEFINE(sym, offsetof(struct cpu_user_regs, mem) -                   \
+                offsetof(struct cpu_user_regs, error_code))
+
+    OFFSET_EF(EFRAME_entry_vector,    entry_vector);
+    OFFSET_EF(EFRAME_rip,             rip);
+    OFFSET_EF(EFRAME_cs,              cs);
+    OFFSET_EF(EFRAME_eflags,          eflags);
+    OFFSET_EF(EFRAME_rsp,             rsp);
+    BLANK();
+
+#undef OFFSET_EF
+
     OFFSET(VCPU_processor, struct vcpu, processor);
     OFFSET(VCPU_domain, struct vcpu, domain);
     OFFSET(VCPU_vcpu_info, struct vcpu, vcpu_info_area.map);
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index fcc3a721f147..cb473f08eebd 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -15,7 +15,7 @@ ENTRY(entry_int82)
         ENDBR64
         ALTERNATIVE "", clac, X86_FEATURE_XEN_SMAP
         pushq $0
-        movl  $HYPERCALL_VECTOR, 4(%rsp)
+        movl  $HYPERCALL_VECTOR, EFRAME_entry_vector(%rsp)
         SAVE_ALL compat=1 /* DPL1 gate, restricted to 32bit PV guests only. */
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 9a7b129aa7e4..968da9d727b1 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -190,15 +190,15 @@ restore_all_guest:
         SPEC_CTRL_EXIT_TO_PV    /* Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */
 
         RESTORE_ALL
-        testw $TRAP_syscall,4(%rsp)
+        testw $TRAP_syscall, EFRAME_entry_vector(%rsp)
         jz    iret_exit_to_guest
 
-        movq  24(%rsp),%r11           # RFLAGS
+        mov   EFRAME_eflags(%rsp), %r11
         andq  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), %r11
         orq   $X86_EFLAGS_IF,%r11
 
         /* Don't use SYSRET path if the return address is not canonical. */
-        movq  8(%rsp),%rcx
+        mov   EFRAME_rip(%rsp), %rcx
         sarq  $47,%rcx
         incl  %ecx
         cmpl  $1,%ecx
@@ -213,20 +213,20 @@ restore_all_guest:
         ALTERNATIVE "", rag_clrssbsy, X86_FEATURE_XEN_SHSTK
 #endif
 
-        movq  8(%rsp), %rcx           # RIP
-        cmpw  $FLAT_USER_CS32,16(%rsp)# CS
-        movq  32(%rsp),%rsp           # RSP
+        mov   EFRAME_rip(%rsp), %rcx
+        cmpw  $FLAT_USER_CS32, EFRAME_cs(%rsp)
+        mov   EFRAME_rsp(%rsp), %rsp
         je    1f
         sysretq
 1:      sysretl
 
         ALIGN
 .Lrestore_rcx_iret_exit_to_guest:
-        movq  8(%rsp), %rcx           # RIP
+        mov   EFRAME_rip(%rsp), %rcx
 /* No special register assumptions. */
 iret_exit_to_guest:
-        andl  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), 24(%rsp)
-        orl   $X86_EFLAGS_IF,24(%rsp)
+        andl  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), EFRAME_eflags(%rsp)
+        orl   $X86_EFLAGS_IF, EFRAME_eflags(%rsp)
         addq  $8,%rsp
 .Lft0:  iretq
         _ASM_PRE_EXTABLE(.Lft0, handle_exception)
@@ -257,7 +257,7 @@ ENTRY(lstar_enter)
         pushq $FLAT_KERNEL_CS64
         pushq %rcx
         pushq $0
-        movl  $TRAP_syscall, 4(%rsp)
+        movl  $TRAP_syscall, EFRAME_entry_vector(%rsp)
         SAVE_ALL
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
@@ -294,7 +294,7 @@ ENTRY(cstar_enter)
         pushq $FLAT_USER_CS32
         pushq %rcx
         pushq $0
-        movl  $TRAP_syscall, 4(%rsp)
+        movl  $TRAP_syscall, EFRAME_entry_vector(%rsp)
         SAVE_ALL
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
@@ -335,7 +335,7 @@ GLOBAL(sysenter_eflags_saved)
         pushq $3 /* ring 3 null cs */
         pushq $0 /* null rip */
         pushq $0
-        movl  $TRAP_syscall, 4(%rsp)
+        movl  $TRAP_syscall, EFRAME_entry_vector(%rsp)
         SAVE_ALL
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
@@ -389,7 +389,7 @@ ENTRY(int80_direct_trap)
         ENDBR64
         ALTERNATIVE "", clac, X86_FEATURE_XEN_SMAP
         pushq $0
-        movl  $0x80, 4(%rsp)
+        movl  $0x80, EFRAME_entry_vector(%rsp)
         SAVE_ALL
 
         SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, %rdx=0, Clob: acd */
@@ -649,7 +649,7 @@ ret_from_intr:
         .section .init.text, "ax", @progbits
 ENTRY(early_page_fault)
         ENDBR64
-        movl  $X86_EXC_PF, 4(%rsp)
+        movl  $X86_EXC_PF, EFRAME_entry_vector(%rsp)
         SAVE_ALL
         movq  %rsp, %rdi
         call  do_early_page_fault
@@ -716,7 +716,7 @@ ENTRY(common_interrupt)
 
 ENTRY(entry_PF)
         ENDBR64
-        movl  $X86_EXC_PF, 4(%rsp)
+        movl  $X86_EXC_PF, EFRAME_entry_vector(%rsp)
 /* No special register assumptions. */
 GLOBAL(handle_exception)
         ALTERNATIVE "", clac, X86_FEATURE_XEN_SMAP
@@ -890,90 +890,90 @@ FATAL_exception_with_ints_disabled:
 ENTRY(entry_DE)
         ENDBR64
         pushq $0
-        movl  $X86_EXC_DE, 4(%rsp)
+        movl  $X86_EXC_DE, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(entry_MF)
         ENDBR64
         pushq $0
-        movl  $X86_EXC_MF, 4(%rsp)
+        movl  $X86_EXC_MF, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(entry_XM)
         ENDBR64
         pushq $0
-        movl  $X86_EXC_XM, 4(%rsp)
+        movl  $X86_EXC_XM, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(entry_NM)
         ENDBR64
         pushq $0
-        movl  $X86_EXC_NM, 4(%rsp)
+        movl  $X86_EXC_NM, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(entry_DB)
         ENDBR64
         pushq $0
-        movl  $X86_EXC_DB, 4(%rsp)
+        movl  $X86_EXC_DB, EFRAME_entry_vector(%rsp)
         jmp   handle_ist_exception
 
 ENTRY(entry_BP)
         ENDBR64
         pushq $0
-        movl  $X86_EXC_BP, 4(%rsp)
+        movl  $X86_EXC_BP, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(entry_OF)
         ENDBR64
         pushq $0
-        movl  $X86_EXC_OF, 4(%rsp)
+        movl  $X86_EXC_OF, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(entry_BR)
         ENDBR64
         pushq $0
-        movl  $X86_EXC_BR, 4(%rsp)
+        movl  $X86_EXC_BR, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(entry_UD)
         ENDBR64
         pushq $0
-        movl  $X86_EXC_UD, 4(%rsp)
+        movl  $X86_EXC_UD, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(entry_TS)
         ENDBR64
-        movl  $X86_EXC_TS, 4(%rsp)
+        movl  $X86_EXC_TS, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(entry_NP)
         ENDBR64
-        movl  $X86_EXC_NP, 4(%rsp)
+        movl  $X86_EXC_NP, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(entry_SS)
         ENDBR64
-        movl  $X86_EXC_SS, 4(%rsp)
+        movl  $X86_EXC_SS, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(entry_GP)
         ENDBR64
-        movl  $X86_EXC_GP, 4(%rsp)
+        movl  $X86_EXC_GP, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(entry_AC)
         ENDBR64
-        movl  $X86_EXC_AC, 4(%rsp)
+        movl  $X86_EXC_AC, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(entry_CP)
         ENDBR64
-        movl  $X86_EXC_CP, 4(%rsp)
+        movl  $X86_EXC_CP, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
 ENTRY(entry_DF)
         ENDBR64
-        movl  $X86_EXC_DF, 4(%rsp)
+        movl  $X86_EXC_DF, EFRAME_entry_vector(%rsp)
         /* Set AC to reduce chance of further SMAP faults */
         ALTERNATIVE "", stac, X86_FEATURE_XEN_SMAP
         SAVE_ALL
@@ -998,7 +998,7 @@ ENTRY(entry_DF)
 ENTRY(entry_NMI)
         ENDBR64
         pushq $0
-        movl  $X86_EXC_NMI, 4(%rsp)
+        movl  $X86_EXC_NMI, EFRAME_entry_vector(%rsp)
 handle_ist_exception:
         ALTERNATIVE "", clac, X86_FEATURE_XEN_SMAP
         SAVE_ALL
@@ -1130,7 +1130,7 @@ handle_ist_exception:
 ENTRY(entry_MC)
         ENDBR64
         pushq $0
-        movl  $X86_EXC_MC, 4(%rsp)
+        movl  $X86_EXC_MC, EFRAME_entry_vector(%rsp)
         jmp   handle_ist_exception
 
 /* No op trap handler.  Required for kexec crash path. */
@@ -1167,7 +1167,7 @@ autogen_stubs: /* Automatically generated stubs. */
 1:
         ENDBR64
         pushq $0
-        movb  $vec,4(%rsp)
+        movb  $vec, EFRAME_entry_vector(%rsp)
         jmp   common_interrupt
 
         entrypoint 1b
@@ -1181,7 +1181,7 @@ autogen_stubs: /* Automatically generated stubs. */
         test  $8,%spl        /* 64bit exception frames are 16 byte aligned, but the word */
         jz    2f             /* size is 8 bytes.  Check whether the processor gave us an */
         pushq $0             /* error code, and insert an empty one if not.              */
-2:      movb  $vec,4(%rsp)
+2:      movb  $vec, EFRAME_entry_vector(%rsp)
         jmp   handle_exception
 
         entrypoint 1b
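The EFRAME_* numbers are produced by the usual asm-offsets trick: compile a C
file, let the compiler fold offsetof() differences into immediates, and scrape
them out of the generated assembly.  A minimal standalone sketch of the
mechanism (the struct layout is invented; real values come from struct
cpu_user_regs):

    #include <stddef.h>

    struct regs {
        unsigned long r15, r14;             /* ...abridged GPR block... */
        unsigned int error_code, entry_vector;
        unsigned long rip, cs, eflags, rsp, ss;
    };

    /* Emit "==> SYM $<value> <==" into the .s output for a build script
     * to scrape, in the style of asm-offsets.c's DEFINE(). */
    #define DEFINE(sym, val) \
        __asm__ volatile ( ".ascii \"==> " #sym " %0 <==\"" :: "i" (val) )

    #define OFFSET_EF(sym, mem)                              \
        DEFINE(sym, offsetof(struct regs, mem) -             \
                    offsetof(struct regs, error_code))

    void dummy(void)
    {
        OFFSET_EF(EFRAME_entry_vector, entry_vector);  /*  4 here */
        OFFSET_EF(EFRAME_rip,          rip);           /*  8 here */
        OFFSET_EF(EFRAME_eflags,       eflags);        /* 24 here */
    }

Compiling with -S and grepping for "==>" yields the constants, which is why a
pure rename from raw displacements to EFRAME_* can guarantee an identical
binary.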
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86: Resync intel-family.h from Linux

From v6.8-rc6

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 195e75371b13c4f7ecdf7b5c50aed0d02f2d7ce8)

diff --git a/xen/arch/x86/include/asm/intel-family.h b/xen/arch/x86/include/asm/intel-family.h
index ffc49151befe..b65e9c46b922 100644
--- a/xen/arch/x86/include/asm/intel-family.h
+++ b/xen/arch/x86/include/asm/intel-family.h
@@ -26,6 +26,9 @@
  *		_G	- parts with extra graphics on
  *		_X	- regular server parts
  *		_D	- micro server parts
+ *		_N,_P	- other mobile parts
+ *		_H	- premium mobile parts
+ *		_S	- other client parts
  *
  *		Historical OPTDIFFs:
  *
@@ -37,6 +40,9 @@
  * their own names :-(
  */
 
+/* Wildcard match for FAM6 so X86_MATCH_INTEL_FAM6_MODEL(ANY) works */
+#define INTEL_FAM6_ANY			X86_MODEL_ANY
+
 #define INTEL_FAM6_CORE_YONAH		0x0E
 
 #define INTEL_FAM6_CORE2_MEROM		0x0F
@@ -93,8 +99,6 @@
 #define INTEL_FAM6_ICELAKE_L		0x7E	/* Sunny Cove */
 #define INTEL_FAM6_ICELAKE_NNPI		0x9D	/* Sunny Cove */
 
-#define INTEL_FAM6_LAKEFIELD		0x8A	/* Sunny Cove / Tremont */
-
 #define INTEL_FAM6_ROCKETLAKE		0xA7	/* Cypress Cove */
 
 #define INTEL_FAM6_TIGERLAKE_L		0x8C	/* Willow Cove */
@@ -102,12 +106,31 @@
 
 #define INTEL_FAM6_SAPPHIRERAPIDS_X	0x8F	/* Golden Cove */
 
+#define INTEL_FAM6_EMERALDRAPIDS_X	0xCF
+
+#define INTEL_FAM6_GRANITERAPIDS_X	0xAD
+#define INTEL_FAM6_GRANITERAPIDS_D	0xAE
+
+/* "Hybrid" Processors (P-Core/E-Core) */
+
+#define INTEL_FAM6_LAKEFIELD		0x8A	/* Sunny Cove / Tremont */
+
 #define INTEL_FAM6_ALDERLAKE		0x97	/* Golden Cove / Gracemont */
 #define INTEL_FAM6_ALDERLAKE_L		0x9A	/* Golden Cove / Gracemont */
 
-#define INTEL_FAM6_RAPTORLAKE		0xB7
+#define INTEL_FAM6_RAPTORLAKE		0xB7	/* Raptor Cove / Enhanced Gracemont */
+#define INTEL_FAM6_RAPTORLAKE_P		0xBA
+#define INTEL_FAM6_RAPTORLAKE_S		0xBF
+
+#define INTEL_FAM6_METEORLAKE		0xAC
+#define INTEL_FAM6_METEORLAKE_L		0xAA
+
+#define INTEL_FAM6_ARROWLAKE_H		0xC5
+#define INTEL_FAM6_ARROWLAKE		0xC6
+
+#define INTEL_FAM6_LUNARLAKE_M		0xBD
 
-/* "Small Core" Processors (Atom) */
+/* "Small Core" Processors (Atom/E-Core) */
 
 #define INTEL_FAM6_ATOM_BONNELL		0x1C /* Diamondville, Pineview */
 #define INTEL_FAM6_ATOM_BONNELL_MID	0x26 /* Silverthorne, Lincroft */
@@ -134,6 +157,13 @@
 #define INTEL_FAM6_ATOM_TREMONT		0x96 /* Elkhart Lake */
 #define INTEL_FAM6_ATOM_TREMONT_L	0x9C /* Jasper Lake */
 
+#define INTEL_FAM6_ATOM_GRACEMONT	0xBE /* Alderlake N */
+
+#define INTEL_FAM6_ATOM_CRESTMONT_X	0xAF /* Sierra Forest */
+#define INTEL_FAM6_ATOM_CRESTMONT	0xB6 /* Grand Ridge */
+
+#define INTEL_FAM6_ATOM_DARKMONT_X	0xDD /* Clearwater Forest */
+
 /* Xeon Phi */
 
 #define INTEL_FAM6_XEON_PHI_KNL		0x57 /* Knights Landing */
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vmx: Perform VERW flushing later in the VMExit path

Broken out of the following patch because this change is subtle enough on its
own.  See it for the rationale behind moving VERW.

As for how, extend the trick already used to hold one condition in
flags (RESUME vs LAUNCH) through the POPing of GPRs.

Move the MOV CR2 earlier.  Intel specify flags to be undefined across it.

Encode the two conditions we want using SF and PF.  See the code comment for
exactly how.

Leave a comment to explain the lack of any content around
SPEC_CTRL_EXIT_TO_VMX, but leave the block in place.  Sod's law says if we
delete it, we'll need to reintroduce it.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 475fa20b7384464210f42bad7195f87bd6f1c63f)

diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S
index e3f60d5a82f7..1bead826caa3 100644
--- a/xen/arch/x86/hvm/vmx/entry.S
+++ b/xen/arch/x86/hvm/vmx/entry.S
@@ -87,17 +87,39 @@ UNLIKELY_END(realmode)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
         /* SPEC_CTRL_EXIT_TO_VMX   Req: %rsp=regs/cpuinfo              Clob:    */
-        DO_SPEC_CTRL_COND_VERW
+        /*
+         * All speculation safety work happens to be elsewhere.  VERW is after
+         * popping the GPRs, while restoring the guest MSR_SPEC_CTRL is left
+         * to the MSR load list.
+         */
 
         mov  VCPU_hvm_guest_cr2(%rbx),%rax
+        mov  %rax, %cr2
+
+        /*
+         * We need to perform two conditional actions (VERW, and Resume vs
+         * Launch) after popping GPRs.  With some cunning, we can encode both
+         * of these in eflags together.
+         *
+         * Parity is only calculated over the bottom byte of the answer, while
+         * Sign is simply the top bit.
+         *
+         * Therefore, the final OR instruction ends up producing:
+         *   SF = VCPU_vmx_launched
+         *   PF = !SCF_verw
+         */
+        BUILD_BUG_ON(SCF_verw & ~0xff)
+        movzbl VCPU_vmx_launched(%rbx), %ecx
+        shl  $31, %ecx
+        movzbl CPUINFO_spec_ctrl_flags(%rsp), %eax
+        and  $SCF_verw, %eax
+        or   %eax, %ecx
 
         pop  %r15
         pop  %r14
         pop  %r13
         pop  %r12
         pop  %rbp
-        mov  %rax,%cr2
-        cmpb $0,VCPU_vmx_launched(%rbx)
         pop  %rbx
         pop  %r11
         pop  %r10
@@ -108,7 +130,13 @@ UNLIKELY_END(realmode)
         pop  %rdx
         pop  %rsi
         pop  %rdi
-        je   .Lvmx_launch
+
+        jpe  .L_skip_verw
+        /* VERW clobbers ZF, but preserves all others, including SF. */
+        verw STK_REL(CPUINFO_verw_sel, CPUINFO_error_code)(%rsp)
+.L_skip_verw:
+
+        jns  .Lvmx_launch
 
 /*.Lvmx_resume:*/
         VMRESUME
diff --git a/xen/arch/x86/include/asm/asm_defns.h b/xen/arch/x86/include/asm/asm_defns.h
index baaaccb26e17..56ae26e54265 100644
--- a/xen/arch/x86/include/asm/asm_defns.h
+++ b/xen/arch/x86/include/asm/asm_defns.h
@@ -81,6 +81,14 @@ register unsigned long current_stack_pointer asm("rsp");
 
 #ifdef __ASSEMBLY__
 
+.macro BUILD_BUG_ON condstr, cond:vararg
+        .if \cond
+        .error "Condition \"\condstr\" not satisfied"
+        .endif
+.endm
+/* preprocessor macro to make error message more user friendly */
+#define BUILD_BUG_ON(cond) BUILD_BUG_ON #cond, cond
+
 #ifdef HAVE_AS_QUOTED_SYM
 #define SUBSECTION_LBL(tag)                        \
         .ifndef .L.tag;                            \
diff --git a/xen/arch/x86/include/asm/spec_ctrl_asm.h b/xen/arch/x86/include/asm/spec_ctrl_asm.h
index 6cb7c1b9491e..525745a06608 100644
--- a/xen/arch/x86/include/asm/spec_ctrl_asm.h
+++ b/xen/arch/x86/include/asm/spec_ctrl_asm.h
@@ -152,6 +152,13 @@
 #endif
 .endm
 
+/*
+ * Helper to improve the readability of stack displacements with %rsp in
+ * unusual positions.  Both @field and @top_of_stk should be constants from
+ * the same object.  @top_of_stk should be where %rsp is currently pointing.
+ */
+#define STK_REL(field, top_of_stk) ((field) - (top_of_stk))
+
 .macro DO_SPEC_CTRL_COND_VERW
 /*
  * Requires %rsp=cpuinfo
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 2fc4d9130a4d..0d336788989f 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -135,6 +135,7 @@ void __dummy__(void)
 #endif
 
     OFFSET(CPUINFO_guest_cpu_user_regs, struct cpu_info, guest_cpu_user_regs);
+    OFFSET(CPUINFO_error_code, struct cpu_info, guest_cpu_user_regs.error_code);
     OFFSET(CPUINFO_verw_sel, struct cpu_info, verw_sel);
     OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);
     OFFSET(CPUINFO_per_cpu_offset, struct cpu_info, per_cpu_offset);
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Perform VERW flushing later in exit paths

On parts vulnerable to RFDS, VERW's side effects are extended to scrub all
non-architectural entries in various Physical Register Files.  To remove all
of Xen's values, the VERW must be after popping the GPRs.

Rework SPEC_CTRL_COND_VERW to default to a CPUINFO_error_code %rsp position,
but with overrides for other contexts.  Identify that it clobbers eflags; this
is particularly relevant for the SYSRET path.

For the IST exit return to Xen, have the main SPEC_CTRL_EXIT_TO_XEN put a
shadow copy of spec_ctrl_flags, as GPRs can't be used at the point we want to
issue the VERW.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 0a666cf2cd99df6faf3eebc81a1fc286e4eca4c7)
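
A quick standalone sketch (simplified, with made-up values; not Xen code) of
why stashing the 32-bit "scf | sel << 16" value at eflags+4 makes the SCF
byte readable at eflags+4 and the selector word at eflags+6, given x86's
little-endian layout:

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        uint8_t  scf = 0x08;      /* made-up spec_ctrl_flags value */
        uint16_t sel = 0xe038;    /* made-up selector, standing in for
                                     __HYPERVISOR_DS32 */
        uint32_t shadow = scf | ((uint32_t)sel << 16);

        uint8_t stack[16] = { 0 };  /* pretend eflags lives at stack[0] */

        /* mov %ebx, UREGS_eflags + 4(%rsp) */
        memcpy(&stack[4], &shadow, sizeof(shadow));

        /* EFRAME_shadow_scf = eflags + 4, EFRAME_shadow_sel = eflags + 6 */
        uint16_t got_sel;
        memcpy(&got_sel, &stack[6], sizeof(got_sel));

        assert(stack[4] == scf);
        assert(got_sel == sel);
        return 0;
    }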

diff --git a/xen/arch/x86/include/asm/spec_ctrl_asm.h b/xen/arch/x86/include/asm/spec_ctrl_asm.h
index 525745a06608..13acebc75dff 100644
--- a/xen/arch/x86/include/asm/spec_ctrl_asm.h
+++ b/xen/arch/x86/include/asm/spec_ctrl_asm.h
@@ -159,16 +159,23 @@
  */
 #define STK_REL(field, top_of_stk) ((field) - (top_of_stk))
 
-.macro DO_SPEC_CTRL_COND_VERW
+.macro SPEC_CTRL_COND_VERW \
+    scf=STK_REL(CPUINFO_spec_ctrl_flags, CPUINFO_error_code), \
+    sel=STK_REL(CPUINFO_verw_sel,        CPUINFO_error_code)
 /*
- * Requires %rsp=cpuinfo
+ * Requires \scf and \sel as %rsp-relative expressions
+ * Clobbers eflags
+ *
+ * VERW needs to run after guest GPRs have been restored, where only %rsp is
+ * good to use.  Default to expecting %rsp pointing at CPUINFO_error_code.
+ * Contexts where this is not true must provide an alternative \scf and \sel.
  *
  * Issue a VERW for its flushing side effect, if indicated.  This is a Spectre
  * v1 gadget, but the IRET/VMEntry is serialising.
  */
-    testb $SCF_verw, CPUINFO_spec_ctrl_flags(%rsp)
+    testb $SCF_verw, \scf(%rsp)
     jz .L\@_verw_skip
-    verw CPUINFO_verw_sel(%rsp)
+    verw \sel(%rsp)
 .L\@_verw_skip:
 .endm
 
@@ -286,8 +293,6 @@
  */
     ALTERNATIVE "", DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_PV
 
-    DO_SPEC_CTRL_COND_VERW
-
     ALTERNATIVE "", DO_SPEC_CTRL_DIV, X86_FEATURE_SC_DIV
 .endm
 
@@ -367,7 +372,7 @@ UNLIKELY_DISPATCH_LABEL(\@_serialise):
  */
 .macro SPEC_CTRL_EXIT_TO_XEN
 /*
- * Requires %r12=ist_exit, %r14=stack_end
+ * Requires %r12=ist_exit, %r14=stack_end, %rsp=regs
  * Clobbers %rax, %rbx, %rcx, %rdx
  */
     movzbl STACK_CPUINFO_FIELD(spec_ctrl_flags)(%r14), %ebx
@@ -395,11 +400,18 @@ UNLIKELY_DISPATCH_LABEL(\@_serialise):
     test %r12, %r12
     jz .L\@_skip_ist_exit
 
-    /* Logically DO_SPEC_CTRL_COND_VERW but without the %rsp=cpuinfo dependency */
-    testb $SCF_verw, %bl
-    jz .L\@_skip_verw
-    verw STACK_CPUINFO_FIELD(verw_sel)(%r14)
-.L\@_skip_verw:
+    /*
+     * Stash SCF and verw_sel above eflags in the case of an IST_exit.  The
+     * VERW logic needs to run after guest GPRs have been restored; i.e. where
+     * we cannot use %r12 or %r14 for the purposes they have here.
+     *
+     * When the CPU pushed this exception frame, it zero-extended eflags.
+     * Therefore it is safe for the VERW logic to look at the stashed SCF
+     * outside of the ist_exit condition.  Also, this stashing won't influence
+     * any other restore_all_guest() paths.
+     */
+    or $(__HYPERVISOR_DS32 << 16), %ebx
+    mov %ebx, UREGS_eflags + 4(%rsp) /* EFRAME_shadow_scf/sel */
 
     ALTERNATIVE "", DO_SPEC_CTRL_DIV, X86_FEATURE_SC_DIV
 
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 0d336788989f..85c7d0c98967 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -55,14 +55,22 @@ void __dummy__(void)
      * EFRAME_* is for the entry/exit logic where %rsp is pointing at
      * UREGS_error_code and GPRs are still/already guest values.
      */
-#define OFFSET_EF(sym, mem)                                             \
+#define OFFSET_EF(sym, mem, ...)                                        \
     DEFINE(sym, offsetof(struct cpu_user_regs, mem) -                   \
-                offsetof(struct cpu_user_regs, error_code))
+                offsetof(struct cpu_user_regs, error_code) __VA_ARGS__)
 
     OFFSET_EF(EFRAME_entry_vector,    entry_vector);
     OFFSET_EF(EFRAME_rip,             rip);
     OFFSET_EF(EFRAME_cs,              cs);
     OFFSET_EF(EFRAME_eflags,          eflags);
+
+    /*
+     * These aren't real fields.  They're spare space, used by the IST
+     * exit-to-xen path.
+     */
+    OFFSET_EF(EFRAME_shadow_scf,      eflags, +4);
+    OFFSET_EF(EFRAME_shadow_sel,      eflags, +6);
+
     OFFSET_EF(EFRAME_rsp,             rsp);
     BLANK();
 
@@ -136,6 +144,7 @@ void __dummy__(void)
 
     OFFSET(CPUINFO_guest_cpu_user_regs, struct cpu_info, guest_cpu_user_regs);
     OFFSET(CPUINFO_error_code, struct cpu_info, guest_cpu_user_regs.error_code);
+    OFFSET(CPUINFO_rip, struct cpu_info, guest_cpu_user_regs.rip);
     OFFSET(CPUINFO_verw_sel, struct cpu_info, verw_sel);
     OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);
     OFFSET(CPUINFO_per_cpu_offset, struct cpu_info, per_cpu_offset);
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index cb473f08eebd..3bbe3a79a5b7 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -161,6 +161,12 @@ ENTRY(compat_restore_all_guest)
         SPEC_CTRL_EXIT_TO_PV    /* Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */
 
         RESTORE_ALL adj=8 compat=1
+
+        /* Account for ev/ec having already been popped off the stack. */
+        SPEC_CTRL_COND_VERW \
+            scf=STK_REL(CPUINFO_spec_ctrl_flags, CPUINFO_rip), \
+            sel=STK_REL(CPUINFO_verw_sel,        CPUINFO_rip)
+
 .Lft0:  iretq
         _ASM_PRE_EXTABLE(.Lft0, handle_exception)
 
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 968da9d727b1..2c7512130f49 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -214,6 +214,9 @@ restore_all_guest:
 #endif
 
         mov   EFRAME_rip(%rsp), %rcx
+
+        SPEC_CTRL_COND_VERW     /* Req: %rsp=eframe                    Clob: efl */
+
         cmpw  $FLAT_USER_CS32, EFRAME_cs(%rsp)
         mov   EFRAME_rsp(%rsp), %rsp
         je    1f
@@ -227,6 +230,9 @@ restore_all_guest:
 iret_exit_to_guest:
         andl  $~(X86_EFLAGS_IOPL | X86_EFLAGS_VM), EFRAME_eflags(%rsp)
         orl   $X86_EFLAGS_IF, EFRAME_eflags(%rsp)
+
+        SPEC_CTRL_COND_VERW     /* Req: %rsp=eframe                    Clob: efl */
+
         addq  $8,%rsp
 .Lft0:  iretq
         _ASM_PRE_EXTABLE(.Lft0, handle_exception)
@@ -679,9 +685,22 @@ UNLIKELY_START(ne, exit_cr3)
 UNLIKELY_END(exit_cr3)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
-        SPEC_CTRL_EXIT_TO_XEN     /* Req: %r12=ist_exit %r14=end, Clob: abcd */
+        SPEC_CTRL_EXIT_TO_XEN /* Req: %r12=ist_exit %r14=end %rsp=regs, Clob: abcd */
 
         RESTORE_ALL adj=8
+
+        /*
+         * When the CPU pushed this exception frame, it zero-extended eflags.
+         * For an IST exit, SPEC_CTRL_EXIT_TO_XEN stashed shadow copies of
+     * spec_ctrl_flags and verw_sel above eflags, as we can't use any GPRs,
+     * and we're at a random place on the stack, not in a CPUINFO block.
+         *
+         * Account for ev/ec having already been popped off the stack.
+         */
+        SPEC_CTRL_COND_VERW \
+            scf=STK_REL(EFRAME_shadow_scf, EFRAME_rip), \
+            sel=STK_REL(EFRAME_shadow_sel, EFRAME_rip)
+
         iretq
 
 ENTRY(common_interrupt)
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Rename VERW related options

VERW is going to be used for a 3rd purpose, and the existing nomenclature
didn't survive the Stale MMIO issues terribly well.

Rename the command line option from `md-clear=` to `verw=`.  This is more
consistent with other options which tend to be named based on what they're
doing, not which feature enumeration they use behind the scenes.  Retain
`md-clear=` as a deprecated alias.

Rename opt_md_clear_{pv,hvm} and opt_fb_clear_mmio to opt_verw_{pv,hvm,mmio},
which has a side effect of making spec_ctrl_init_domain() rather clearer to
follow.

No functional change.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f7603ca252e4226739eb3129a5290ee3da3f8ea4)
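
Illustrative command line fragments under the new spelling (the `verw=pv`
and `verw=hvm` sub-forms follow from the parse_boolean() handling in the
hunk below):

    spec-ctrl=verw=0          (disable VERW scrubbing for both PV and HVM)
    spec-ctrl=verw=no-hvm     (keep the PV default, disable it for HVM)
    spec-ctrl=md-clear=0      (deprecated alias, still accepted)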

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 582d6741d182..fbf16839249a 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2370,7 +2370,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
 
 ### spec-ctrl (x86)
 > `= List of [ <bool>, xen=<bool>, {pv,hvm}=<bool>,
->              {msr-sc,rsb,md-clear,ibpb-entry}=<bool>|{pv,hvm}=<bool>,
+>              {msr-sc,rsb,verw,ibpb-entry}=<bool>|{pv,hvm}=<bool>,
 >              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,psfd,
 >              eager-fpu,l1d-flush,branch-harden,srb-lock,
 >              unpriv-mmio,gds-mit,div-scrub}=<bool> ]`
@@ -2395,7 +2395,7 @@ in place for guests to use.
 
 Use of a positive boolean value for either of these options is invalid.
 
-The `pv=`, `hvm=`, `msr-sc=`, `rsb=`, `md-clear=` and `ibpb-entry=` options
+The `pv=`, `hvm=`, `msr-sc=`, `rsb=`, `verw=` and `ibpb-entry=` options
 offer fine grained control over the primitives by Xen.  These impact Xen's
 ability to protect itself, and/or Xen's ability to virtualise support for
 guests to use.
@@ -2412,11 +2412,12 @@ guests to use.
   guests and if disabled, guests will be unable to use IBRS/STIBP/SSBD/etc.
 * `rsb=` offers control over whether to overwrite the Return Stack Buffer /
   Return Address Stack on entry to Xen and on idle.
-* `md-clear=` offers control over whether to use VERW to flush
-  microarchitectural buffers on idle and exit from Xen.  *Note: For
-  compatibility with development versions of this fix, `mds=` is also accepted
-  on Xen 4.12 and earlier as an alias.  Consult vendor documentation in
-  preference to here.*
+* `verw=` offers control over whether to use VERW for its scrubbing side
+  effects at appropriate privilege transitions.  The exact side effects are
+  microarchitecture and microcode specific.  *Note: `md-clear=` is accepted as
+  a deprecated alias.  For compatibility with development versions of XSA-297,
+  `mds=` is also accepted on Xen 4.12 and earlier as an alias.  Consult vendor
+  documentation in preference to here.*
 * `ibpb-entry=` offers control over whether IBPB (Indirect Branch Prediction
   Barrier) is used on entry to Xen.  This is used by default on hardware
   vulnerable to Branch Type Confusion, and hardware vulnerable to Speculative
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index a965b6db28ba..c42d8cdc22d6 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -25,8 +25,8 @@ static bool __initdata opt_msr_sc_pv = true;
 static bool __initdata opt_msr_sc_hvm = true;
 static int8_t __initdata opt_rsb_pv = -1;
 static bool __initdata opt_rsb_hvm = true;
-static int8_t __ro_after_init opt_md_clear_pv = -1;
-static int8_t __ro_after_init opt_md_clear_hvm = -1;
+static int8_t __ro_after_init opt_verw_pv = -1;
+static int8_t __ro_after_init opt_verw_hvm = -1;
 
 static int8_t __ro_after_init opt_ibpb_entry_pv = -1;
 static int8_t __ro_after_init opt_ibpb_entry_hvm = -1;
@@ -66,7 +66,7 @@ static bool __initdata cpu_has_bug_mds; /* Any other M{LP,SB,FB}DS combination.
 
 static int8_t __initdata opt_srb_lock = -1;
 static bool __initdata opt_unpriv_mmio;
-static bool __ro_after_init opt_fb_clear_mmio;
+static bool __ro_after_init opt_verw_mmio;
 static int8_t __initdata opt_gds_mit = -1;
 static int8_t __initdata opt_div_scrub = -1;
 
@@ -108,8 +108,8 @@ static int __init cf_check parse_spec_ctrl(const char *s)
         disable_common:
             opt_rsb_pv = false;
             opt_rsb_hvm = false;
-            opt_md_clear_pv = 0;
-            opt_md_clear_hvm = 0;
+            opt_verw_pv = 0;
+            opt_verw_hvm = 0;
             opt_ibpb_entry_pv = 0;
             opt_ibpb_entry_hvm = 0;
             opt_ibpb_entry_dom0 = false;
@@ -140,14 +140,14 @@ static int __init cf_check parse_spec_ctrl(const char *s)
         {
             opt_msr_sc_pv = val;
             opt_rsb_pv = val;
-            opt_md_clear_pv = val;
+            opt_verw_pv = val;
             opt_ibpb_entry_pv = val;
         }
         else if ( (val = parse_boolean("hvm", s, ss)) >= 0 )
         {
             opt_msr_sc_hvm = val;
             opt_rsb_hvm = val;
-            opt_md_clear_hvm = val;
+            opt_verw_hvm = val;
             opt_ibpb_entry_hvm = val;
         }
         else if ( (val = parse_boolean("msr-sc", s, ss)) != -1 )
@@ -192,21 +192,22 @@ static int __init cf_check parse_spec_ctrl(const char *s)
                 break;
             }
         }
-        else if ( (val = parse_boolean("md-clear", s, ss)) != -1 )
+        else if ( (val = parse_boolean("verw", s, ss)) != -1 ||
+                  (val = parse_boolean("md-clear", s, ss)) != -1 )
         {
             switch ( val )
             {
             case 0:
             case 1:
-                opt_md_clear_pv = opt_md_clear_hvm = val;
+                opt_verw_pv = opt_verw_hvm = val;
                 break;
 
             case -2:
-                s += strlen("md-clear=");
+                s += (*s == 'v') ? strlen("verw=") : strlen("md-clear=");
                 if ( (val = parse_boolean("pv", s, ss)) >= 0 )
-                    opt_md_clear_pv = val;
+                    opt_verw_pv = val;
                 else if ( (val = parse_boolean("hvm", s, ss)) >= 0 )
-                    opt_md_clear_hvm = val;
+                    opt_verw_hvm = val;
                 else
             default:
                     rc = -EINVAL;
@@ -528,8 +529,8 @@ static void __init print_details(enum ind_thunk thunk)
            opt_srb_lock                              ? " SRB_LOCK+" : " SRB_LOCK-",
            opt_ibpb_ctxt_switch                      ? " IBPB-ctxt" : "",
            opt_l1d_flush                             ? " L1D_FLUSH" : "",
-           opt_md_clear_pv || opt_md_clear_hvm ||
-           opt_fb_clear_mmio                         ? " VERW"  : "",
+           opt_verw_pv || opt_verw_hvm ||
+           opt_verw_mmio                             ? " VERW"  : "",
            opt_div_scrub                             ? " DIV" : "",
            opt_branch_harden                         ? " BRANCH_HARDEN" : "");
 
@@ -550,13 +551,13 @@ static void __init print_details(enum ind_thunk thunk)
             boot_cpu_has(X86_FEATURE_SC_RSB_HVM) ||
             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM) ||
             amd_virt_spec_ctrl ||
-            opt_eager_fpu || opt_md_clear_hvm)       ? ""               : " None",
+            opt_eager_fpu || opt_verw_hvm)           ? ""               : " None",
            boot_cpu_has(X86_FEATURE_SC_MSR_HVM)      ? " MSR_SPEC_CTRL" : "",
            (boot_cpu_has(X86_FEATURE_SC_MSR_HVM) ||
             amd_virt_spec_ctrl)                      ? " MSR_VIRT_SPEC_CTRL" : "",
            boot_cpu_has(X86_FEATURE_SC_RSB_HVM)      ? " RSB"           : "",
            opt_eager_fpu                             ? " EAGER_FPU"     : "",
-           opt_md_clear_hvm                          ? " MD_CLEAR"      : "",
+           opt_verw_hvm                              ? " VERW"          : "",
            boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM)  ? " IBPB-entry"    : "");
 
 #endif
@@ -565,11 +566,11 @@ static void __init print_details(enum ind_thunk thunk)
            (boot_cpu_has(X86_FEATURE_SC_MSR_PV) ||
             boot_cpu_has(X86_FEATURE_SC_RSB_PV) ||
             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) ||
-            opt_eager_fpu || opt_md_clear_pv)        ? ""               : " None",
+            opt_eager_fpu || opt_verw_pv)            ? ""               : " None",
            boot_cpu_has(X86_FEATURE_SC_MSR_PV)       ? " MSR_SPEC_CTRL" : "",
            boot_cpu_has(X86_FEATURE_SC_RSB_PV)       ? " RSB"           : "",
            opt_eager_fpu                             ? " EAGER_FPU"     : "",
-           opt_md_clear_pv                           ? " MD_CLEAR"      : "",
+           opt_verw_pv                               ? " VERW"          : "",
            boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV)   ? " IBPB-entry"    : "");
 
     printk("  XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n",
@@ -1502,8 +1503,8 @@ void spec_ctrl_init_domain(struct domain *d)
 {
     bool pv = is_pv_domain(d);
 
-    bool verw = ((pv ? opt_md_clear_pv : opt_md_clear_hvm) ||
-                 (opt_fb_clear_mmio && is_iommu_enabled(d)));
+    bool verw = ((pv ? opt_verw_pv : opt_verw_hvm) ||
+                 (opt_verw_mmio && is_iommu_enabled(d)));
 
     bool ibpb = ((pv ? opt_ibpb_entry_pv : opt_ibpb_entry_hvm) &&
                  (d->domain_id != 0 || opt_ibpb_entry_dom0));
@@ -1866,19 +1867,20 @@ void __init init_speculation_mitigations(void)
      * the return-to-guest path.
      */
     if ( opt_unpriv_mmio )
-        opt_fb_clear_mmio = cpu_has_fb_clear;
+        opt_verw_mmio = cpu_has_fb_clear;
 
     /*
      * By default, enable PV and HVM mitigations on MDS-vulnerable hardware.
      * This will only be a token effort for MLPDS/MFBDS when HT is enabled,
      * but it is somewhat better than nothing.
      */
-    if ( opt_md_clear_pv == -1 )
-        opt_md_clear_pv = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                           boot_cpu_has(X86_FEATURE_MD_CLEAR));
-    if ( opt_md_clear_hvm == -1 )
-        opt_md_clear_hvm = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                            boot_cpu_has(X86_FEATURE_MD_CLEAR));
+    if ( opt_verw_pv == -1 )
+        opt_verw_pv = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
+                       cpu_has_md_clear);
+
+    if ( opt_verw_hvm == -1 )
+        opt_verw_hvm = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
+                        cpu_has_md_clear);
 
     /*
      * Enable MDS/MMIO defences as applicable.  The Idle blocks need using if
@@ -1891,12 +1893,12 @@ void __init init_speculation_mitigations(void)
      * MDS mitigations.  L1D_FLUSH is not safe for MMIO mitigations.)
      *
      * After calculating the appropriate idle setting, simplify
-     * opt_md_clear_hvm to mean just "should we VERW on the way into HVM
+     * opt_verw_hvm to mean just "should we VERW on the way into HVM
      * guests", so spec_ctrl_init_domain() can calculate suitable settings.
      */
-    if ( opt_md_clear_pv || opt_md_clear_hvm || opt_fb_clear_mmio )
+    if ( opt_verw_pv || opt_verw_hvm || opt_verw_mmio )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
-    opt_md_clear_hvm &= !cpu_has_skip_l1dfl && !opt_l1d_flush;
+    opt_verw_hvm &= !cpu_has_skip_l1dfl && !opt_l1d_flush;
 
     /*
      * Warn the user if they are on MLPDS/MFBDS-vulnerable hardware with HT
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: VERW-handling adjustments

... before we add yet more complexity to this logic.  Mostly expanded
comments, but with three minor changes.

1) Introduce cpu_has_useful_md_clear to simplify later logic in this patch and
   future ones.

2) We only ever need SC_VERW_IDLE when SMT is active.  If SMT isn't active,
   then there's no re-partition of pipeline resources based on thread-idleness
   to worry about.

3) The logic to adjust HVM VERW based on L1D_FLUSH is unmaintainable and, as
   it turns out, wrong.  SKIP_L1DFL is just a hint bit, whereas opt_l1d_flush
   is the relevant decision of whether to use L1D_FLUSH based on
   susceptibility and user preference.

   Rewrite the logic so it can be followed, and incorporate the fact that when
   FB_CLEAR is visible, L1D_FLUSH isn't a safe substitution.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 1eb91a8a06230b4b64228c9a380194f8cfe6c5e2)
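
The resulting decision, restated as a condensed standalone C sketch
(illustrative only; the names mirror the variables in the hunk below):

    #include <stdbool.h>

    static bool want_verw_idle(bool md_clear, bool bug_mds,
                               bool bug_msbds_only, bool verw_pv,
                               bool verw_hvm, bool verw_mmio, bool smt)
    {
        /* MD_CLEAR counts only while its scrubbing side effects exist. */
        bool useful_md_clear = md_clear && (bug_mds || bug_msbds_only);

        /* Idle scrubbing only matters when SMT repartitions resources. */
        return ((useful_md_clear && (verw_pv || verw_hvm)) || verw_mmio)
               && smt;
    }

    /*
     * Separately, L1D_FLUSH may substitute for VERW on the HVM path only
     * when MD_CLEAR is the sole *_CLEAR visible:
     *
     *   if ( l1d_flush && md_clear && !fb_clear )
     *       verw_hvm = false;
     */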

diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index c42d8cdc22d6..a4afcd8570e2 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -1519,7 +1519,7 @@ void __init init_speculation_mitigations(void)
 {
     enum ind_thunk thunk = THUNK_DEFAULT;
     bool has_spec_ctrl, ibrs = false, hw_smt_enabled;
-    bool cpu_has_bug_taa, retpoline_safe;
+    bool cpu_has_bug_taa, cpu_has_useful_md_clear, retpoline_safe;
 
     hw_smt_enabled = check_smt_enabled();
 
@@ -1855,50 +1855,97 @@ void __init init_speculation_mitigations(void)
             "enabled.  Please assess your configuration and choose an\n"
             "explicit 'smt=<bool>' setting.  See XSA-273.\n");
 
+    /*
+     * A brief summary of VERW-related changes.
+     *
+     * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-analysis-microarchitectural-data-sampling.html
+     * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
+     *
+     * Relevant ucodes:
+     *
+     * - May 2019, for MDS.  Introduces the MD_CLEAR CPUID bit and VERW side
+     *   effects to scrub Store/Load/Fill buffers as applicable.  MD_CLEAR
+     *   exists architecturally, even when the side effects have been removed.
+     *
+     *   Use VERW to scrub on return-to-guest.  Parts with L1D_FLUSH to
+     *   mitigate L1TF have the same side effect, so no need to do both.
+     *
+     *   Various Atoms suffer from Store-buffer sampling only.  Store buffers
+     *   are statically partitioned between non-idle threads, so scrubbing is
+     *   wanted when going idle too.
+     *
+     *   Load ports and Fill buffers are competitively shared between threads.
+     *   SMT must be disabled for VERW scrubbing to be fully effective.
+     *
+     * - November 2019, for TAA.  Extended VERW side effects to TSX-enabled
+     *   MDS_NO parts.
+     *
+     * - February 2022, for Client TSX de-feature.  Removed VERW side effects
+     *   from Client CPUs only.
+     *
+     * - May 2022, for MMIO Stale Data.  (Re)introduced Fill Buffer scrubbing
+     *   on all MMIO-affected parts which didn't already have it for MDS
+     *   reasons, enumerating FB_CLEAR on those parts only.
+     *
+     *   If FB_CLEAR is enumerated, L1D_FLUSH does not have the same scrubbing
+     *   side effects as VERW and cannot be used in its place.
+     */
     mds_calculations();
 
     /*
-     * Parts which enumerate FB_CLEAR are those which are post-MDS_NO and have
-     * reintroduced the VERW fill buffer flushing side effect because of a
-     * susceptibility to FBSDP.
+     * Parts which enumerate FB_CLEAR are those with now-updated microcode
+     * which weren't susceptible to the original MFBDS (and therefore didn't
+     * have Fill Buffer scrubbing side effects to begin with, or were Client
+     * MDS_NO non-TAA_NO parts where the scrubbing was removed), but have had
+     * the scrubbing reintroduced because of a susceptibility to FBSDP.
      *
      * If unprivileged guests have (or will have) MMIO mappings, we can
      * mitigate cross-domain leakage of fill buffer data by issuing VERW on
-     * the return-to-guest path.
+     * the return-to-guest path.  This is only a token effort if SMT is
+     * active.
      */
     if ( opt_unpriv_mmio )
         opt_verw_mmio = cpu_has_fb_clear;
 
     /*
-     * By default, enable PV and HVM mitigations on MDS-vulnerable hardware.
-     * This will only be a token effort for MLPDS/MFBDS when HT is enabled,
-     * but it is somewhat better than nothing.
+     * MD_CLEAR is enumerated architecturally forevermore, even after the
+     * scrubbing side effects have been removed.  Create ourselves a version
+     * which expresses whether we think MD_CLEAR is having any useful side
+     * effect.
+     */
+    cpu_has_useful_md_clear = (cpu_has_md_clear &&
+                               (cpu_has_bug_mds || cpu_has_bug_msbds_only));
+
+    /*
+     * By default, use VERW scrubbing on applicable hardware, if we think it's
+     * going to have an effect.  This will only be a token effort for
+     * MLPDS/MFBDS when SMT is enabled.
      */
     if ( opt_verw_pv == -1 )
-        opt_verw_pv = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                       cpu_has_md_clear);
+        opt_verw_pv = cpu_has_useful_md_clear;
 
     if ( opt_verw_hvm == -1 )
-        opt_verw_hvm = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                        cpu_has_md_clear);
+        opt_verw_hvm = cpu_has_useful_md_clear;
 
     /*
-     * Enable MDS/MMIO defences as applicable.  The Idle blocks need using if
-     * either the PV or HVM MDS defences are used, or if we may give MMIO
-     * access to untrusted guests.
-     *
-     * HVM is more complicated.  The MD_CLEAR microcode extends L1D_FLUSH with
-     * equivalent semantics to avoid needing to perform both flushes on the
-     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH (for
-     * MDS mitigations.  L1D_FLUSH is not safe for MMIO mitigations.)
-     *
-     * After calculating the appropriate idle setting, simplify
-     * opt_verw_hvm to mean just "should we VERW on the way into HVM
-     * guests", so spec_ctrl_init_domain() can calculate suitable settings.
+     * If SMT is active, and we're protecting against MDS or MMIO stale data,
+     * we need to scrub before going idle as well as on return to guest.
+     * Various pipeline resources are repartitioned amongst non-idle threads.
      */
-    if ( opt_verw_pv || opt_verw_hvm || opt_verw_mmio )
+    if ( ((cpu_has_useful_md_clear && (opt_verw_pv || opt_verw_hvm)) ||
+          opt_verw_mmio) && hw_smt_enabled )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
-    opt_verw_hvm &= !cpu_has_skip_l1dfl && !opt_l1d_flush;
+
+    /*
+     * After calculating the appropriate idle setting, simplify opt_verw_hvm
+     * to mean just "should we VERW on the way into HVM guests", so
+     * spec_ctrl_init_domain() can calculate suitable settings.
+     *
+     * It is only safe to use L1D_FLUSH in place of VERW when MD_CLEAR is the
+     * only *_CLEAR we can see.
+     */
+    if ( opt_l1d_flush && cpu_has_md_clear && !cpu_has_fb_clear )
+        opt_verw_hvm = false;
 
     /*
      * Warn the user if they are on MLPDS/MFBDS-vulnerable hardware with HT
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Mitigate Register File Data Sampling

RFDS affects Atom cores, also branded E-cores, between the Goldmont and
Gracemont microarchitectures.  This includes Alder Lake and Raptor Lake hybrid
client systems which have a mix of Gracemont and other types of cores.

Two new bits have been defined; RFDS_CLEAR to indicate VERW has more side
effects, and RFDS_NO to indicate that the system is unaffected.  Plenty of
unaffected CPUs won't be getting RFDS_NO retrofitted in microcode, so we
synthesise it.  Alder Lake and Raptor Lake Xeon-E's are unaffected due to
their platform configuration, and we must use the Hybrid CPUID bit to
distinguish them from their non-Xeon counterparts.

Like MD_CLEAR and FB_CLEAR, RFDS_CLEAR needs OR-ing across a resource pool, so
set it in the max policies and reflect the host setting in default.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit fb5b6f6744713410c74cfc12b7176c108e3c9a31)
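
A minimal standalone sketch (illustrative only) of consuming the two new
MSR_ARCH_CAPS bits, using the bit positions from the msr-index.h hunk below:

    #include <stdbool.h>
    #include <stdint.h>

    #define ARCH_CAPS_RFDS_NO    (1ULL << 27)
    #define ARCH_CAPS_RFDS_CLEAR (1ULL << 28)

    static bool want_rfds_verw(uint64_t caps)
    {
        if ( caps & ARCH_CAPS_RFDS_NO )
            return false;   /* Hardware states it is unaffected. */

        /* Updated ucode extends VERW to scrub the register files. */
        return caps & ARCH_CAPS_RFDS_CLEAR;
    }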

diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
index 7370f1b56ef9..52e451a806c1 100644
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -172,7 +172,7 @@ static const char *const str_7d0[32] =
     [ 8] = "avx512-vp2intersect", [ 9] = "srbds-ctrl",
     [10] = "md-clear",            [11] = "rtm-always-abort",
     /* 12 */                [13] = "tsx-force-abort",
-    [14] = "serialize",
+    [14] = "serialize",     [15] = "hybrid",
     [16] = "tsxldtrk",
     [18] = "pconfig",
     [20] = "cet-ibt",
@@ -245,7 +245,8 @@ static const char *const str_m10Al[32] =
     [20] = "bhi-no",              [21] = "xapic-status",
     /* 22 */                      [23] = "ovrclk-status",
     [24] = "pbrsb-no",            [25] = "gds-ctrl",
-    [26] = "gds-no",
+    [26] = "gds-no",              [27] = "rfds-no",
+    [28] = "rfds-clear",
 };
 
 static const char *const str_m10Ah[32] =
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index c7c5e99b7b4c..12e621b97de6 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -451,6 +451,7 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
          */
         __set_bit(X86_FEATURE_MD_CLEAR, fs);
         __set_bit(X86_FEATURE_FB_CLEAR, fs);
+        __set_bit(X86_FEATURE_RFDS_CLEAR, fs);
 
         /*
          * The Gather Data Sampling microcode mitigation (August 2023) has an
@@ -500,6 +501,10 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
         if ( cpu_has_fb_clear )
             __set_bit(X86_FEATURE_FB_CLEAR, fs);
 
+        __clear_bit(X86_FEATURE_RFDS_CLEAR, fs);
+        if ( cpu_has_rfds_clear )
+            __set_bit(X86_FEATURE_RFDS_CLEAR, fs);
+
         /*
          * The Gather Data Sampling microcode mitigation (August 2023) has an
          * adverse performance impact on the CLWB instruction on SKX/CLX/CPX.
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index 76ef2aeb1de6..3c57f55de075 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -181,6 +181,7 @@ static inline bool boot_cpu_has(unsigned int feat)
 #define cpu_has_rtm_always_abort boot_cpu_has(X86_FEATURE_RTM_ALWAYS_ABORT)
 #define cpu_has_tsx_force_abort boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)
 #define cpu_has_serialize       boot_cpu_has(X86_FEATURE_SERIALIZE)
+#define cpu_has_hybrid          boot_cpu_has(X86_FEATURE_HYBRID)
 #define cpu_has_avx512_fp16     boot_cpu_has(X86_FEATURE_AVX512_FP16)
 #define cpu_has_arch_caps       boot_cpu_has(X86_FEATURE_ARCH_CAPS)
 
@@ -208,6 +209,8 @@ static inline bool boot_cpu_has(unsigned int feat)
 #define cpu_has_rrsba           boot_cpu_has(X86_FEATURE_RRSBA)
 #define cpu_has_gds_ctrl        boot_cpu_has(X86_FEATURE_GDS_CTRL)
 #define cpu_has_gds_no          boot_cpu_has(X86_FEATURE_GDS_NO)
+#define cpu_has_rfds_no         boot_cpu_has(X86_FEATURE_RFDS_NO)
+#define cpu_has_rfds_clear      boot_cpu_has(X86_FEATURE_RFDS_CLEAR)
 
 /* Synthesized. */
 #define cpu_has_arch_perfmon    boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index 82a81bd0a232..85ef28a612e0 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -89,6 +89,8 @@
 #define  ARCH_CAPS_PBRSB_NO                 (_AC(1, ULL) << 24)
 #define  ARCH_CAPS_GDS_CTRL                 (_AC(1, ULL) << 25)
 #define  ARCH_CAPS_GDS_NO                   (_AC(1, ULL) << 26)
+#define  ARCH_CAPS_RFDS_NO                  (_AC(1, ULL) << 27)
+#define  ARCH_CAPS_RFDS_CLEAR               (_AC(1, ULL) << 28)
 
 #define MSR_FLUSH_CMD                       0x0000010b
 #define  FLUSH_CMD_L1D                      (_AC(1, ULL) <<  0)
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index a4afcd8570e2..8165379fed94 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -12,6 +12,7 @@
 
 #include <asm/amd.h>
 #include <asm/hvm/svm/svm.h>
+#include <asm/intel-family.h>
 #include <asm/microcode.h>
 #include <asm/msr.h>
 #include <asm/pv/domain.h>
@@ -435,7 +436,7 @@ static void __init print_details(enum ind_thunk thunk)
      * Hardware read-only information, stating immunity to certain issues, or
      * suggestions of which mitigation to use.
      */
-    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (caps & ARCH_CAPS_RDCL_NO)                        ? " RDCL_NO"        : "",
            (caps & ARCH_CAPS_EIBRS)                          ? " EIBRS"          : "",
            (caps & ARCH_CAPS_RSBA)                           ? " RSBA"           : "",
@@ -451,6 +452,7 @@ static void __init print_details(enum ind_thunk thunk)
            (caps & ARCH_CAPS_FB_CLEAR)                       ? " FB_CLEAR"       : "",
            (caps & ARCH_CAPS_PBRSB_NO)                       ? " PBRSB_NO"       : "",
            (caps & ARCH_CAPS_GDS_NO)                         ? " GDS_NO"         : "",
+           (caps & ARCH_CAPS_RFDS_NO)                        ? " RFDS_NO"        : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_ALWAYS))    ? " IBRS_ALWAYS"    : "",
            (e8b  & cpufeat_mask(X86_FEATURE_STIBP_ALWAYS))   ? " STIBP_ALWAYS"   : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_FAST))      ? " IBRS_FAST"      : "",
@@ -461,7 +463,7 @@ static void __init print_details(enum ind_thunk thunk)
            (e21a & cpufeat_mask(X86_FEATURE_SRSO_NO))        ? " SRSO_NO"        : "");
 
     /* Hardware features which need driving to mitigate issues. */
-    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (e8b  & cpufeat_mask(X86_FEATURE_IBPB)) ||
            (_7d0 & cpufeat_mask(X86_FEATURE_IBRSB))          ? " IBPB"           : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS)) ||
@@ -479,6 +481,7 @@ static void __init print_details(enum ind_thunk thunk)
            (caps & ARCH_CAPS_TSX_CTRL)                       ? " TSX_CTRL"       : "",
            (caps & ARCH_CAPS_FB_CLEAR_CTRL)                  ? " FB_CLEAR_CTRL"  : "",
            (caps & ARCH_CAPS_GDS_CTRL)                       ? " GDS_CTRL"       : "",
+           (caps & ARCH_CAPS_RFDS_CLEAR)                     ? " RFDS_CLEAR"     : "",
            (e21a & cpufeat_mask(X86_FEATURE_SBPB))           ? " SBPB"           : "");
 
     /* Compiled-in support which pertains to mitigations. */
@@ -1347,6 +1350,83 @@ static __init void mds_calculations(void)
     }
 }
 
+/*
+ * Register File Data Sampling affects Atom cores from the Goldmont to
+ * Gracemont microarchitectures.  The March 2024 microcode adds RFDS_NO to
+ * some but not all unaffected parts, and RFDS_CLEAR to affected parts still
+ * in support.
+ *
+ * Alder Lake and Raptor Lake client CPUs have a mix of P cores
+ * (Golden/Raptor Cove, not vulnerable) and E cores (Gracemont,
+ * vulnerable), and both enumerate RFDS_CLEAR.
+ *
+ * Both exist in a Xeon SKU, which has the E cores (Gracemont) disabled by
+ * platform configuration, and enumerate RFDS_NO.
+ *
+ * With older parts, or with out-of-date microcode, synthesise RFDS_NO when
+ * safe to do so.
+ *
+ * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/register-file-data-sampling.html
+ */
+static void __init rfds_calculations(void)
+{
+    /* RFDS is only known to affect Intel Family 6 processors at this time. */
+    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+         boot_cpu_data.x86 != 6 )
+        return;
+
+    /*
+     * If RFDS_NO or RFDS_CLEAR are visible, we've either got suitable
+     * microcode, or an RFDS-aware hypervisor is levelling us in a pool.
+     */
+    if ( cpu_has_rfds_no || cpu_has_rfds_clear )
+        return;
+
+    /* If we're virtualised, don't attempt to synthesise RFDS_NO. */
+    if ( cpu_has_hypervisor )
+        return;
+
+    /*
+     * Not all CPUs are expected to get a microcode update enumerating one of
+     * RFDS_{NO,CLEAR}, or we might have out-of-date microcode.
+     */
+    switch ( boot_cpu_data.x86_model )
+    {
+    case INTEL_FAM6_ALDERLAKE:
+    case INTEL_FAM6_RAPTORLAKE:
+        /*
+         * Alder Lake and Raptor Lake might be a client SKU (with the
+         * Gracemont cores active, and therefore vulnerable) or might be a
+         * server SKU (with the Gracemont cores disabled, and therefore not
+         * vulnerable).
+         *
+         * See if the CPU identifies as hybrid to distinguish the two cases.
+         */
+        if ( !cpu_has_hybrid )
+            break;
+        fallthrough;
+    case INTEL_FAM6_ALDERLAKE_L:
+    case INTEL_FAM6_RAPTORLAKE_P:
+    case INTEL_FAM6_RAPTORLAKE_S:
+
+    case INTEL_FAM6_ATOM_GOLDMONT:      /* Apollo Lake */
+    case INTEL_FAM6_ATOM_GOLDMONT_D:    /* Denverton */
+    case INTEL_FAM6_ATOM_GOLDMONT_PLUS: /* Gemini Lake */
+    case INTEL_FAM6_ATOM_TREMONT_D:     /* Snow Ridge / Parker Ridge */
+    case INTEL_FAM6_ATOM_TREMONT:       /* Elkhart Lake */
+    case INTEL_FAM6_ATOM_TREMONT_L:     /* Jasper Lake */
+    case INTEL_FAM6_ATOM_GRACEMONT:     /* Alder Lake N */
+        return;
+    }
+
+    /*
+     * We appear to be on an unaffected CPU which didn't enumerate RFDS_NO,
+     * perhaps because of its age or because of out-of-date microcode.
+     * Synthesise it.
+     */
+    setup_force_cpu_cap(X86_FEATURE_RFDS_NO);
+}
+
 static bool __init cpu_has_gds(void)
 {
     /*
@@ -1860,6 +1940,7 @@ void __init init_speculation_mitigations(void)
      *
      * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-analysis-microarchitectural-data-sampling.html
      * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
+     * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/register-file-data-sampling.html
      *
      * Relevant ucodes:
      *
@@ -1889,8 +1970,12 @@ void __init init_speculation_mitigations(void)
      *
      *   If FB_CLEAR is enumerated, L1D_FLUSH does not have the same scrubbing
      *   side effects as VERW and cannot be used in its place.
+     *
+     * - March 2024, for RFDS.  Enumerate RFDS_CLEAR to mean that VERW now
+     *   scrubs non-architectural entries from certain register files.
      */
     mds_calculations();
+    rfds_calculations();
 
     /*
      * Parts which enumerate FB_CLEAR are those with now-updated microcode
@@ -1922,15 +2007,19 @@ void __init init_speculation_mitigations(void)
      * MLPDS/MFBDS when SMT is enabled.
      */
     if ( opt_verw_pv == -1 )
-        opt_verw_pv = cpu_has_useful_md_clear;
+        opt_verw_pv = cpu_has_useful_md_clear || cpu_has_rfds_clear;
 
     if ( opt_verw_hvm == -1 )
-        opt_verw_hvm = cpu_has_useful_md_clear;
+        opt_verw_hvm = cpu_has_useful_md_clear || cpu_has_rfds_clear;
 
     /*
      * If SMT is active, and we're protecting against MDS or MMIO stale data,
      * we need to scrub before going idle as well as on return to guest.
      * Various pipeline resources are repartitioned amongst non-idle threads.
+     *
+     * We don't need to scrub on idle for RFDS.  There are no affected cores
+     * which support SMT, despite there being affected cores in hybrid systems
+     * which have SMT elsewhere in the platform.
      */
     if ( ((cpu_has_useful_md_clear && (opt_verw_pv || opt_verw_hvm)) ||
           opt_verw_mmio) && hw_smt_enabled )
@@ -1944,7 +2033,8 @@ void __init init_speculation_mitigations(void)
      * It is only safe to use L1D_FLUSH in place of VERW when MD_CLEAR is the
      * only *_CLEAR we can see.
      */
-    if ( opt_l1d_flush && cpu_has_md_clear && !cpu_has_fb_clear )
+    if ( opt_l1d_flush && cpu_has_md_clear && !cpu_has_fb_clear &&
+         !cpu_has_rfds_clear )
         opt_verw_hvm = false;
 
     /*
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index 337aaa9c770b..8e17ef670fff 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -266,6 +266,7 @@ XEN_CPUFEATURE(MD_CLEAR,      9*32+10) /*!A VERW clears microarchitectural buffe
 XEN_CPUFEATURE(RTM_ALWAYS_ABORT, 9*32+11) /*! June 2021 TSX defeaturing in microcode. */
 XEN_CPUFEATURE(TSX_FORCE_ABORT, 9*32+13) /* MSR_TSX_FORCE_ABORT.RTM_ABORT */
 XEN_CPUFEATURE(SERIALIZE,     9*32+14) /*A  SERIALIZE insn */
+XEN_CPUFEATURE(HYBRID,        9*32+15) /*   Heterogeneous platform */
 XEN_CPUFEATURE(TSXLDTRK,      9*32+16) /*a  TSX load tracking suspend/resume insns */
 XEN_CPUFEATURE(CET_IBT,       9*32+20) /*   CET - Indirect Branch Tracking */
 XEN_CPUFEATURE(AVX512_FP16,   9*32+23) /*A  AVX512 FP16 instructions */
@@ -338,6 +339,8 @@ XEN_CPUFEATURE(OVRCLK_STATUS,      16*32+23) /*   MSR_OVERCLOCKING_STATUS */
 XEN_CPUFEATURE(PBRSB_NO,           16*32+24) /*A  No Post-Barrier RSB predictions */
 XEN_CPUFEATURE(GDS_CTRL,           16*32+25) /*   MCU_OPT_CTRL.GDS_MIT_{DIS,LOCK} */
 XEN_CPUFEATURE(GDS_NO,             16*32+26) /*A  No Gather Data Sampling */
+XEN_CPUFEATURE(RFDS_NO,            16*32+27) /*A  No Register File Data Sampling */
+XEN_CPUFEATURE(RFDS_CLEAR,         16*32+28) /*!A Register File(s) cleared by VERW */
 
 /* Intel-defined CPU features, MSR_ARCH_CAPS 0x10a.edx, word 17 */
 
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: VERW-handling adjustments

... before we add yet more complexity to this logic.  Mostly expanded
comments, but with three minor changes.

1) Introduce cpu_has_useful_md_clear to simplify later logic in this patch and
   future ones.

2) We only ever need SC_VERW_IDLE when SMT is active.  If SMT isn't active,
   then there's no re-partition of pipeline resources based on thread-idleness
   to worry about.

3) The logic to adjust HVM VERW based on L1D_FLUSH is unmaintainable and, as
   it turns out, wrong.  SKIP_L1DFL is just a hint bit, whereas opt_l1d_flush
   is the relevant decision of whether to use L1D_FLUSH based on
   susceptibility and user preference.

   Rewrite the logic so it can be followed, and incorporate the fact that when
   FB_CLEAR is visible, L1D_FLUSH isn't a safe substitution.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 9cede5361d14..b31b13146fb1 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -1519,7 +1519,7 @@ void __init init_speculation_mitigations(void)
 {
     enum ind_thunk thunk = THUNK_DEFAULT;
     bool has_spec_ctrl, ibrs = false, hw_smt_enabled;
-    bool cpu_has_bug_taa, retpoline_safe;
+    bool cpu_has_bug_taa, cpu_has_useful_md_clear, retpoline_safe;
 
     hw_smt_enabled = check_smt_enabled();
 
@@ -1855,50 +1855,97 @@ void __init init_speculation_mitigations(void)
             "enabled.  Please assess your configuration and choose an\n"
             "explicit 'smt=<bool>' setting.  See XSA-273.\n");
 
+    /*
+     * A brief summary of VERW-related changes.
+     *
+     * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-analysis-microarchitectural-data-sampling.html
+     * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
+     *
+     * Relevant ucodes:
+     *
+     * - May 2019, for MDS.  Introduces the MD_CLEAR CPUID bit and VERW side
+     *   effects to scrub Store/Load/Fill buffers as applicable.  MD_CLEAR
+     *   exists architecturally, even when the side effects have been removed.
+     *
+     *   Use VERW to scrub on return-to-guest.  Parts with L1D_FLUSH to
+     *   mitigate L1TF have the same side effect, so no need to do both.
+     *
+     *   Various Atoms suffer from Store-buffer sampling only.  Store buffers
+     *   are statically partitioned between non-idle threads, so scrubbing is
+     *   wanted when going idle too.
+     *
+     *   Load ports and Fill buffers are competitively shared between threads.
+     *   SMT must be disabled for VERW scrubbing to be fully effective.
+     *
+     * - November 2019, for TAA.  Extended VERW side effects to TSX-enabled
+     *   MDS_NO parts.
+     *
+     * - February 2022, for Client TSX de-feature.  Removed VERW side effects
+     *   from Client CPUs only.
+     *
+     * - May 2022, for MMIO Stale Data.  (Re)introduced Fill Buffer scrubbing
+     *   on all MMIO-affected parts which didn't already have it for MDS
+     *   reasons, enumerating FB_CLEAR on those parts only.
+     *
+     *   If FB_CLEAR is enumerated, L1D_FLUSH does not have the same scrubbing
+     *   side effects as VERW and cannot be used in its place.
+     */
     mds_calculations();
 
     /*
-     * Parts which enumerate FB_CLEAR are those which are post-MDS_NO and have
-     * reintroduced the VERW fill buffer flushing side effect because of a
-     * susceptibility to FBSDP.
+     * Parts which enumerate FB_CLEAR are those with now-updated microcode
+     * which weren't susceptible to the original MFBDS (and therefore didn't
+     * have Fill Buffer scrubbing side effects to begin with, or were Client
+     * MDS_NO non-TAA_NO parts where the scrubbing was removed), but have had
+     * the scrubbing reintroduced because of a susceptibility to FBSDP.
      *
      * If unprivileged guests have (or will have) MMIO mappings, we can
      * mitigate cross-domain leakage of fill buffer data by issuing VERW on
-     * the return-to-guest path.
+     * the return-to-guest path.  This is only a token effort if SMT is
+     * active.
      */
     if ( opt_unpriv_mmio )
         opt_verw_mmio = cpu_has_fb_clear;
 
     /*
-     * By default, enable PV and HVM mitigations on MDS-vulnerable hardware.
-     * This will only be a token effort for MLPDS/MFBDS when HT is enabled,
-     * but it is somewhat better than nothing.
+     * MD_CLEAR is enumerated architecturally forevermore, even after the
+     * scrubbing side effects have been removed.  Create ourselves a version
+     * which expresses whether we think MD_CLEAR is having any useful side
+     * effect.
+     */
+    cpu_has_useful_md_clear = (cpu_has_md_clear &&
+                               (cpu_has_bug_mds || cpu_has_bug_msbds_only));
+
+    /*
+     * By default, use VERW scrubbing on applicable hardware, if we think it's
+     * going to have an effect.  This will only be a token effort for
+     * MLPDS/MFBDS when SMT is enabled.
      */
     if ( opt_verw_pv == -1 )
-        opt_verw_pv = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                       cpu_has_md_clear);
+        opt_verw_pv = cpu_has_useful_md_clear;
 
     if ( opt_verw_hvm == -1 )
-        opt_verw_hvm = ((cpu_has_bug_mds || cpu_has_bug_msbds_only) &&
-                        cpu_has_md_clear);
+        opt_verw_hvm = cpu_has_useful_md_clear;
 
     /*
-     * Enable MDS/MMIO defences as applicable.  The Idle blocks need using if
-     * either the PV or HVM MDS defences are used, or if we may give MMIO
-     * access to untrusted guests.
-     *
-     * HVM is more complicated.  The MD_CLEAR microcode extends L1D_FLUSH with
-     * equivalent semantics to avoid needing to perform both flushes on the
-     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH (for
-     * MDS mitigations.  L1D_FLUSH is not safe for MMIO mitigations.)
-     *
-     * After calculating the appropriate idle setting, simplify
-     * opt_verw_hvm to mean just "should we VERW on the way into HVM
-     * guests", so spec_ctrl_init_domain() can calculate suitable settings.
+     * If SMT is active, and we're protecting against MDS or MMIO stale data,
+     * we need to scrub before going idle as well as on return to guest.
+     * Various pipeline resources are repartitioned amongst non-idle threads.
      */
-    if ( opt_verw_pv || opt_verw_hvm || opt_verw_mmio )
+    if ( ((cpu_has_useful_md_clear && (opt_verw_pv || opt_verw_hvm)) ||
+          opt_verw_mmio) && hw_smt_enabled )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
-    opt_verw_hvm &= !cpu_has_skip_l1dfl && !opt_l1d_flush;
+
+    /*
+     * After calculating the appropriate idle setting, simplify opt_verw_hvm
+     * to mean just "should we VERW on the way into HVM guests", so
+     * spec_ctrl_init_domain() can calculate suitable settings.
+     *
+     * It is only safe to use L1D_FLUSH in place of VERW when MD_CLEAR is the
+     * only *_CLEAR we can see.
+     */
+    if ( opt_l1d_flush && cpu_has_md_clear && !cpu_has_fb_clear )
+        opt_verw_hvm = false;
 
     /*
      * Warn the user if they are on MLPDS/MFBDS-vulnerable hardware with HT
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Mitigate Register File Data Sampling

RFDS affects Atom cores, also branded E-cores, between the Goldmont and
Gracemont microarchitectures.  This includes Alder Lake and Raptor Lake hybrid
client systems which have a mix of Gracemont and other types of cores.

Two new bits have been defined; RFDS_CLEAR to indicate VERW has more side
effects, and RFDS_NO to indicate that the system is unaffected.  Plenty of
unaffected CPUs won't be getting RFDS_NO retrofitted in microcode, so we
synthesise it.  Alder Lake and Raptor Lake Xeon-E's are unaffected due to
their platform configuration, and we must use the Hybrid CPUID bit to
distinguish them from their non-Xeon counterparts.

Like MD_CLEAR and FB_CLEAR, RFDS_CLEAR needs OR-ing across a resource pool, so
set it in the max policies and reflect the host setting in default.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
index 46abdb914092..51efbff579e6 100644
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -172,7 +172,7 @@ static const char *const str_7d0[32] =
     [ 8] = "avx512-vp2intersect", [ 9] = "srbds-ctrl",
     [10] = "md-clear",            [11] = "rtm-always-abort",
     /* 12 */                [13] = "tsx-force-abort",
-    [14] = "serialize",
+    [14] = "serialize",     [15] = "hybrid",
     [16] = "tsxldtrk",
     [18] = "pconfig",
     [20] = "cet-ibt",
@@ -251,7 +251,8 @@ static const char *const str_m10Al[32] =
     [20] = "bhi-no",              [21] = "xapic-status",
     /* 22 */                      [23] = "ovrclk-status",
     [24] = "pbrsb-no",            [25] = "gds-ctrl",
-    [26] = "gds-no",
+    [26] = "gds-no",              [27] = "rfds-no",
+    [28] = "rfds-clear",
 };
 
 static const char *const str_m10Ah[32] =
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 2c6f03057b43..2acc27632f9a 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -451,6 +451,7 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
          */
         __set_bit(X86_FEATURE_MD_CLEAR, fs);
         __set_bit(X86_FEATURE_FB_CLEAR, fs);
+        __set_bit(X86_FEATURE_RFDS_CLEAR, fs);
 
         /*
          * The Gather Data Sampling microcode mitigation (August 2023) has an
@@ -510,6 +511,10 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
         if ( cpu_has_fb_clear )
             __set_bit(X86_FEATURE_FB_CLEAR, fs);
 
+        __clear_bit(X86_FEATURE_RFDS_CLEAR, fs);
+        if ( cpu_has_rfds_clear )
+            __set_bit(X86_FEATURE_RFDS_CLEAR, fs);
+
         /*
          * The Gather Data Sampling microcode mitigation (August 2023) has an
          * adverse performance impact on the CLWB instruction on SKX/CLX/CPX.
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index ad24d0fa888b..9bc553681f4a 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -182,6 +182,7 @@ static inline bool boot_cpu_has(unsigned int feat)
 #define cpu_has_rtm_always_abort boot_cpu_has(X86_FEATURE_RTM_ALWAYS_ABORT)
 #define cpu_has_tsx_force_abort boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)
 #define cpu_has_serialize       boot_cpu_has(X86_FEATURE_SERIALIZE)
+#define cpu_has_hybrid          boot_cpu_has(X86_FEATURE_HYBRID)
 #define cpu_has_avx512_fp16     boot_cpu_has(X86_FEATURE_AVX512_FP16)
 #define cpu_has_arch_caps       boot_cpu_has(X86_FEATURE_ARCH_CAPS)
 
@@ -213,6 +214,8 @@ static inline bool boot_cpu_has(unsigned int feat)
 #define cpu_has_rrsba           boot_cpu_has(X86_FEATURE_RRSBA)
 #define cpu_has_gds_ctrl        boot_cpu_has(X86_FEATURE_GDS_CTRL)
 #define cpu_has_gds_no          boot_cpu_has(X86_FEATURE_GDS_NO)
+#define cpu_has_rfds_no         boot_cpu_has(X86_FEATURE_RFDS_NO)
+#define cpu_has_rfds_clear      boot_cpu_has(X86_FEATURE_RFDS_CLEAR)
 
 /* Synthesized. */
 #define cpu_has_arch_perfmon    boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index fc82afc246fa..92dd9fa4962c 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -89,6 +89,8 @@
 #define  ARCH_CAPS_PBRSB_NO                 (_AC(1, ULL) << 24)
 #define  ARCH_CAPS_GDS_CTRL                 (_AC(1, ULL) << 25)
 #define  ARCH_CAPS_GDS_NO                   (_AC(1, ULL) << 26)
+#define  ARCH_CAPS_RFDS_NO                  (_AC(1, ULL) << 27)
+#define  ARCH_CAPS_RFDS_CLEAR               (_AC(1, ULL) << 28)
 
 #define MSR_FLUSH_CMD                       0x0000010b
 #define  FLUSH_CMD_L1D                      (_AC(1, ULL) <<  0)
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index b31b13146fb1..0638d9980a61 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -12,6 +12,7 @@
 
 #include <asm/amd.h>
 #include <asm/hvm/svm/svm.h>
+#include <asm/intel-family.h>
 #include <asm/microcode.h>
 #include <asm/msr.h>
 #include <asm/pv/domain.h>
@@ -435,7 +436,7 @@ static void __init print_details(enum ind_thunk thunk)
      * Hardware read-only information, stating immunity to certain issues, or
      * suggestions of which mitigation to use.
      */
-    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (caps & ARCH_CAPS_RDCL_NO)                        ? " RDCL_NO"        : "",
            (caps & ARCH_CAPS_EIBRS)                          ? " EIBRS"          : "",
            (caps & ARCH_CAPS_RSBA)                           ? " RSBA"           : "",
@@ -451,6 +452,7 @@ static void __init print_details(enum ind_thunk thunk)
            (caps & ARCH_CAPS_FB_CLEAR)                       ? " FB_CLEAR"       : "",
            (caps & ARCH_CAPS_PBRSB_NO)                       ? " PBRSB_NO"       : "",
            (caps & ARCH_CAPS_GDS_NO)                         ? " GDS_NO"         : "",
+           (caps & ARCH_CAPS_RFDS_NO)                        ? " RFDS_NO"        : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_ALWAYS))    ? " IBRS_ALWAYS"    : "",
            (e8b  & cpufeat_mask(X86_FEATURE_STIBP_ALWAYS))   ? " STIBP_ALWAYS"   : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_FAST))      ? " IBRS_FAST"      : "",
@@ -461,7 +463,7 @@ static void __init print_details(enum ind_thunk thunk)
            (e21a & cpufeat_mask(X86_FEATURE_SRSO_NO))        ? " SRSO_NO"        : "");
 
     /* Hardware features which need driving to mitigate issues. */
-    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (e8b  & cpufeat_mask(X86_FEATURE_IBPB)) ||
            (_7d0 & cpufeat_mask(X86_FEATURE_IBRSB))          ? " IBPB"           : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS)) ||
@@ -479,6 +481,7 @@ static void __init print_details(enum ind_thunk thunk)
            (caps & ARCH_CAPS_TSX_CTRL)                       ? " TSX_CTRL"       : "",
            (caps & ARCH_CAPS_FB_CLEAR_CTRL)                  ? " FB_CLEAR_CTRL"  : "",
            (caps & ARCH_CAPS_GDS_CTRL)                       ? " GDS_CTRL"       : "",
+           (caps & ARCH_CAPS_RFDS_CLEAR)                     ? " RFDS_CLEAR"     : "",
            (e21a & cpufeat_mask(X86_FEATURE_SBPB))           ? " SBPB"           : "");
 
     /* Compiled-in support which pertains to mitigations. */
@@ -1347,6 +1350,83 @@ static __init void mds_calculations(void)
     }
 }
 
+/*
+ * Register File Data Sampling affects Atom cores from the Goldmont to
+ * Gracemont microarchitectures.  The March 2024 microcode adds RFDS_NO to
+ * some but not all unaffected parts, and RFDS_CLEAR to affected parts still
+ * in support.
+ *
+ * Alder Lake and Raptor Lake client CPUs have a mix of P cores
+ * (Golden/Raptor Cove, not vulnerable) and E cores (Gracemont,
+ * vulnerable), and both enumerate RFDS_CLEAR.
+ *
+ * Both also exist in Xeon SKUs, which have the E cores (Gracemont)
+ * disabled by platform configuration, and which enumerate RFDS_NO.
+ *
+ * With older parts, or with out-of-date microcode, synthesise RFDS_NO when
+ * safe to do so.
+ *
+ * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/register-file-data-sampling.html
+ */
+static void __init rfds_calculations(void)
+{
+    /* RFDS is only known to affect Intel Family 6 processors at this time. */
+    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+         boot_cpu_data.x86 != 6 )
+        return;
+
+    /*
+     * If RFDS_NO or RFDS_CLEAR are visible, we've either got suitable
+     * microcode, or an RFDS-aware hypervisor is levelling us in a pool.
+     */
+    if ( cpu_has_rfds_no || cpu_has_rfds_clear )
+        return;
+
+    /* If we're virtualised, don't attempt to synthesise RFDS_NO. */
+    if ( cpu_has_hypervisor )
+        return;
+
+    /*
+     * Not all CPUs are expected to get a microcode update enumerating one of
+     * RFDS_{NO,CLEAR}, or we might have out-of-date microcode.
+     */
+    switch ( boot_cpu_data.x86_model )
+    {
+    case INTEL_FAM6_ALDERLAKE:
+    case INTEL_FAM6_RAPTORLAKE:
+        /*
+         * An Alder Lake or Raptor Lake part might be a client SKU (with
+         * the Gracemont cores active, and therefore vulnerable) or a
+         * server SKU (with the Gracemont cores disabled, and therefore
+         * not vulnerable).
+         *
+         * See if the CPU identifies as hybrid to distinguish the two cases.
+         */
+        if ( !cpu_has_hybrid )
+            break;
+        fallthrough;
+    case INTEL_FAM6_ALDERLAKE_L:
+    case INTEL_FAM6_RAPTORLAKE_P:
+    case INTEL_FAM6_RAPTORLAKE_S:
+
+    case INTEL_FAM6_ATOM_GOLDMONT:      /* Apollo Lake */
+    case INTEL_FAM6_ATOM_GOLDMONT_D:    /* Denverton */
+    case INTEL_FAM6_ATOM_GOLDMONT_PLUS: /* Gemini Lake */
+    case INTEL_FAM6_ATOM_TREMONT_D:     /* Snow Ridge / Parker Ridge */
+    case INTEL_FAM6_ATOM_TREMONT:       /* Elkhart Lake */
+    case INTEL_FAM6_ATOM_TREMONT_L:     /* Jasper Lake */
+    case INTEL_FAM6_ATOM_GRACEMONT:     /* Alder Lake N */
+        return;
+    }
+
+    /*
+     * We appear to be on an unaffected CPU which didn't enumerate RFDS_NO,
+     * perhaps because of its age or because of out-of-date microcode.
+     * Synthesise it.
+     */
+    setup_force_cpu_cap(X86_FEATURE_RFDS_NO);
+}
+
 static bool __init cpu_has_gds(void)
 {
     /*
@@ -1860,6 +1940,7 @@ void __init init_speculation_mitigations(void)
      *
      * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-analysis-microarchitectural-data-sampling.html
      * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
+     * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/register-file-data-sampling.html
      *
      * Relevant ucodes:
      *
@@ -1889,8 +1970,12 @@ void __init init_speculation_mitigations(void)
      *
      *   If FB_CLEAR is enumerated, L1D_FLUSH does not have the same scrubbing
      *   side effects as VERW and cannot be used in its place.
+     *
+     * - March 2024, for RFDS.  Enumerate RFDS_CLEAR to mean that VERW now
+     *   scrubs non-architectural entries from certain register files.
      */
     mds_calculations();
+    rfds_calculations();
 
     /*
      * Parts which enumerate FB_CLEAR are those with now-updated microcode
@@ -1922,15 +2007,19 @@ void __init init_speculation_mitigations(void)
      * MLPDS/MFBDS when SMT is enabled.
      */
     if ( opt_verw_pv == -1 )
-        opt_verw_pv = cpu_has_useful_md_clear;
+        opt_verw_pv = cpu_has_useful_md_clear || cpu_has_rfds_clear;
 
     if ( opt_verw_hvm == -1 )
-        opt_verw_hvm = cpu_has_useful_md_clear;
+        opt_verw_hvm = cpu_has_useful_md_clear || cpu_has_rfds_clear;
 
     /*
      * If SMT is active, and we're protecting against MDS or MMIO stale data,
      * we need to scrub before going idle as well as on return to guest.
      * Various pipeline resources are repartitioned amongst non-idle threads.
+     *
+     * We don't need to scrub on idle for RFDS.  There are no affected cores
+     * which support SMT, despite there being affected cores in hybrid systems
+     * which have SMT elsewhere in the platform.
      */
     if ( ((cpu_has_useful_md_clear && (opt_verw_pv || opt_verw_hvm)) ||
           opt_verw_mmio) && hw_smt_enabled )
@@ -1944,7 +2033,8 @@ void __init init_speculation_mitigations(void)
      * It is only safe to use L1D_FLUSH in place of VERW when MD_CLEAR is the
      * only *_CLEAR we can see.
      */
-    if ( opt_l1d_flush && cpu_has_md_clear && !cpu_has_fb_clear )
+    if ( opt_l1d_flush && cpu_has_md_clear && !cpu_has_fb_clear &&
+         !cpu_has_rfds_clear )
         opt_verw_hvm = false;
 
     /*
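
The scrubbing primitive repositioned by the hunks above is VERW itself.
Architecturally, VERW only tests whether a segment selector is writable; the
updated microcode overloads it with buffer and register file scrubbing side
effects.  A minimal sketch of the idiom, comparable to what OS and hypervisor
exit paths execute (the selector value is illustrative; any valid writable
data-segment selector works):

    #include <stdint.h>

    static inline void verw_scrub(void)
    {
        /*
         * 0x10 stands in for the OS's writable data-segment selector
         * (a __KERNEL_DS-style constant).  With the March 2024 microcode,
         * VERW additionally scrubs the affected register files.  "cc" is
         * clobbered because VERW writes ZF.
         */
        static const uint16_t ds = 0x10;

        asm volatile ( "verw %[ds]" :: [ds] "m" (ds) : "cc" );
    }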
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index 0374cec3a2af..eb9f552948be 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -266,6 +266,7 @@ XEN_CPUFEATURE(MD_CLEAR,      9*32+10) /*!A VERW clears microarchitectural buffe
 XEN_CPUFEATURE(RTM_ALWAYS_ABORT, 9*32+11) /*! June 2021 TSX defeaturing in microcode. */
 XEN_CPUFEATURE(TSX_FORCE_ABORT, 9*32+13) /* MSR_TSX_FORCE_ABORT.RTM_ABORT */
 XEN_CPUFEATURE(SERIALIZE,     9*32+14) /*A  SERIALIZE insn */
+XEN_CPUFEATURE(HYBRID,        9*32+15) /*   Heterogeneous platform */
 XEN_CPUFEATURE(TSXLDTRK,      9*32+16) /*a  TSX load tracking suspend/resume insns */
 XEN_CPUFEATURE(CET_IBT,       9*32+20) /*   CET - Indirect Branch Tracking */
 XEN_CPUFEATURE(AVX512_FP16,   9*32+23) /*A  AVX512 FP16 instructions */
@@ -343,6 +344,8 @@ XEN_CPUFEATURE(OVRCLK_STATUS,      16*32+23) /*   MSR_OVERCLOCKING_STATUS */
 XEN_CPUFEATURE(PBRSB_NO,           16*32+24) /*A  No Post-Barrier RSB predictions */
 XEN_CPUFEATURE(GDS_CTRL,           16*32+25) /*   MCU_OPT_CTRL.GDS_MIT_{DIS,LOCK} */
 XEN_CPUFEATURE(GDS_NO,             16*32+26) /*A  No Gather Data Sampling */
+XEN_CPUFEATURE(RFDS_NO,            16*32+27) /*A  No Register File Data Sampling */
+XEN_CPUFEATURE(RFDS_CLEAR,         16*32+28) /*!A Register File(s) cleared by VERW */
 
 /* Intel-defined CPU features, MSR_ARCH_CAPS 0x10a.edx, word 17 */