Xen Security Advisory 404 v2 (CVE-2022-21123,CVE-2022-21125,CVE-2022-21166) - x86: MMIO Stale Data vulnerabilities

Xen.org security team posted 1 patch 1 year, 10 months ago
Failed in applying to current master (apply log)
Xen Security Advisory 404 v2 (CVE-2022-21123,CVE-2022-21125,CVE-2022-21166) - x86: MMIO Stale Data vulnerabilities
Posted by Xen.org security team 1 year, 10 months ago
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

 Xen Security Advisory CVE-2022-21123,CVE-2022-21125,CVE-2022-21166 / XSA-404
                                   version 2

                 x86: MMIO Stale Data vulnerabilities

UPDATES IN VERSION 2
====================

Correct one CVE.  The title for version 1 gave CVE-2022-21124 which was
incorrect and should have been CVE-2022-21125.

Patches are now reviewed.  Backports are available.

ISSUE DESCRIPTION
=================

This issue is related to the SRBDS, TAA and MDS vulnerabilities.  Please
see:

  https://xenbits.xen.org/xsa/advisory-320.html (SRBDS)
  https://xenbits.xen.org/xsa/advisory-305.html (TAA)
  https://xenbits.xen.org/xsa/advisory-297.html (MDS)

Please see Intel's whitepaper:

  https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html

IMPACT
======

An attacker might be able to directly read or infer data from other
security contexts in the system.  This can include data belonging to
other VMs, or to Xen itself.  The degree to which an attacker can obtain
data depends on the CPU, and the system configuration.

VULNERABLE SYSTEMS
==================

Systems running all versions of Xen are affected.

Only x86 processors are vulnerable.  Processors from other manufacturers
(e.g. ARM) are not believed to be vulnerable.

Only Intel based processors are affected.  Processors from other x86
manufacturers (e.g. AMD) are not believed to be vulnerable.

Please consult the Intel Security Advisory for details on the affected
processors and configurations.

Per Xen's support statement, PCI passthrough should be to trusted
domains because the overall system security depends on factors outside
of Xen's control.

As such, Xen, in a supported configuration, is not vulnerable to
DRPW/SBDR.

MITIGATION
==========

All mitigations depend on functionality added in the IPU 2022.1 (May
2022) microcode release from Intel.  Consult your dom0 OS vendor.

To the best of the security team's understanding, the summary is as
follows:

Server CPUs (Xeon EP/EX, Scalable, and some Atom servers), excluding
Xeon E3 (which use the client CPU design), are potentially vulnerable to
DRPW (CVE-2022-21166).

Client CPUs (inc Xeon E3) are, furthermore, potentially vulnerable to
SBDR (CVE-2022-21123) and SBDS (CVE-2022-21125).

SBDS only affects CPUs vulnerable to MDS.  On these CPUs, there are
previously undiscovered leakage channels.  There is no change to the
existing MDS mitigations.

DRPW and SBDR only affects configurations where less privileged domains
have MMIO mappings of buggy endpoints.  Consult your hardware vendor.

In configurations where less privileged domains have MMIO access to
buggy endpoints, `spec-ctrl=unpriv-mmio` can be enabled which will cause
Xen to mitigate cross-domain fill buffer leakage, and extend SRBDS
protections to protect RNG data from leakage.

RESOLUTION
==========

Applying the appropriate attached patches and enabling the newly
introduced command line option, if appropriate, mitigates these issues.

Note that patches for released versions are generally prepared to
apply to the stable branches, and may not apply cleanly to the most
recent release tarball.  Downstreams are encouraged to update to the
tip of the stable branch before applying these patches.

xsa404/xsa404-?.patch           xen-unstable
xsa404/xsa404-4.16-?.patch      Xen 4.16.x
xsa404/xsa404-4.15-?.patch      Xen 4.15.x
xsa404/xsa404-4.14-?.patch      Xen 4.14.x
xsa404/xsa404-4.13-?.patch      Xen 4.13.x

$ sha256sum xsa404*/*
51a812b3e37fb5067aff94d7e587c3fed0de4fcc89e694c7b7dbf1ef2d7e2acc  xsa404/xsa404-1.patch
99d9657cd811f5ed86949bd44777b6bfbb4356fea70795edaa9c7ede341603a0  xsa404/xsa404-2.patch
7e61db8f1741a9e2e9e68e7221cc532f4d17c4d0b2e02ce9ba4468ce187b7b57  xsa404/xsa404-3.patch
be78110d460db361be29f5e5f4b4608bbd25d2032c5f14eed05fd10e66e99e87  xsa404/xsa404-4.13-1.patch
7734bc21a04eb0cea30564bd0855ecc969b7b427a250b5ea6efc6fab46483b70  xsa404/xsa404-4.13-2.patch
6abbdcf5308c033ab7b59c6c75514e29aa14f06c61ef807e2d0c80695af1cace  xsa404/xsa404-4.13-3.patch
ccff36c3615d0068ade29e1d25abd6112b9e90490a5b0ef3d189b27aa53976b2  xsa404/xsa404-4.14-1.patch
ac446bed9d33d84e0b20e4898ce1424f3ed7ed4b05c3c559045a377a9a044b0c  xsa404/xsa404-4.14-2.patch
0ca7801e0442dd304d62538a0861fe459b08dc367530d2142405d602930e1dab  xsa404/xsa404-4.14-3.patch
a26036a136c10810de88960704e6922a40b483a49c8b1821a6e265cae968bfc2  xsa404/xsa404-4.15-1.patch
25616a8665b96b965fbc0b799fb8cd17a360b4add71c6e6e504859cfd35f19ce  xsa404/xsa404-4.15-2.patch
a4c3608210f62e453f9c983ebc1a3b0846ca3a52ba32ee13143561710b4c4118  xsa404/xsa404-4.15-3.patch
a18c04cfdacf7dbb518216ac85047a5851c1f64c62d64e234f8ed19b6905ba60  xsa404/xsa404-4.16-1.patch
d22af75e0bc42e249a37bd91165b426c7146f69dfd6c4de4a06d6ed0b3e5e713  xsa404/xsa404-4.16-2.patch
b04603668f61fbd40e2effaaeb7b3d9c555a8d8a4667208ae0ae42baf323230a  xsa404/xsa404-4.16-3.patch
$

In addition, the backports have already been pushed to xen.git.  They are
available in the following branches:

staging      8c24b70fedcb52633b2370f834d8a2be3f7fa38e
staging-4.16 2e82446cb252f6c8ac697e81f4155872c69afde4
staging-4.15 a3faf632606e54437146dbcac2c9bbb89b9a4007
staging-4.14 c5f774eaeeca195ef85b47713f0b21220c4b41e6
staging-4.13 87ff11354f0dc0d6e77e1695e6c1e14aa1382cdc

NOTE CONCERNING CVE-2022-21127 / Update to SRBDS
================================================

An issue was discovered with the SRBDS microcode mitigation.  A
microcode update was released as part of Intel's IPU 2022.1 in May 2022.

Updating microcode is sufficient to fix the issue, with no extra actions
required on Xen's behalf.  Consult your dom0 OS vendor or OEM for
updated microcode.

NOTE CONCERNING CVE-2022-21180 / Undefined MMIO Hang
====================================================

A related issue was discovered.  See:

  https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/undefined-mmio-hang.html

Xen is not vulnerable to UMH in supported configurations.

The only mitigation is to avoid passing impacted devices through to
untrusted guests.

NOTE CONCERNING LACK OF EMBARGO
===============================

The discoverer did not authorise us to predisclose.
-----BEGIN PGP SIGNATURE-----

iQFABAEBCAAqFiEEI+MiLBRfRHX6gGCng/4UyVfoK9kFAmKrVbAMHHBncEB4ZW4u
b3JnAAoJEIP+FMlX6CvZ2AcH/jWGiu0jpWMkQw/3U4DUu2a77PcC9jLH8NONesB7
SGfdhIMNqmStUI5VJf54ccDIrZSLQxvNVWWxXyQPhZXWhSPf5xE2uYK1qUL+Za8c
kOIJr0Drzffr2Bmu3NnBCRdQDkmXl2GDgqig4YWK/+BOlOO+YxBGdyoE0mBOXMo4
+cQHHvYa16kZVuwxyS0mZxhKFo3JQZaKqh2DEzKZUWm3w8n3NKEYG8S00sttZfjs
dS8rNXEu+yrmPjsJ+hFfJw8MfoETE6yGI47C89dFTN9Q0KedEYM28oD6ClMUC+ks
kwnFAk561m4VUoTqkSv82PeJfS9Sp5D6yO4CDdC05Eyc9gA=
=K9Tq
-----END PGP SIGNATURE-----
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Make VERW flushing runtime conditional

Currently, VERW flushing to mitigate MDS is boot time conditional per domain
type.  However, to provide mitigations for DRPW (CVE-2022-21166), we need to
conditionally use VERW based on the trustworthiness of the guest, and the
devices passed through.

Remove the PV/HVM alternatives and instead issue a VERW on the return-to-guest
path depending on the SCF_verw bit in cpuinfo spec_ctrl_flags.

Introduce spec_ctrl_init_domain() and d->arch.verw to calculate the VERW
disposition at domain creation time, and context switch the SCF_verw bit.

For now, VERW flushing is used and controlled exactly as before, but later
patches will add per-domain cases too.

No change in behaviour.

This is part of XSA-404.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 0d1d98d715b0..266a11ab584e 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2282,9 +2282,8 @@ in place for guests to use.
 Use of a positive boolean value for either of these options is invalid.
 
 The booleans `pv=`, `hvm=`, `msr-sc=`, `rsb=` and `md-clear=` offer fine
-grained control over the alternative blocks used by Xen.  These impact Xen's
-ability to protect itself, and Xen's ability to virtualise support for guests
-to use.
+grained control over the primitives by Xen.  These impact Xen's ability to
+protect itself, and Xen's ability to virtualise support for guests to use.
 
 * `pv=` and `hvm=` offer control over all suboptions for PV and HVM guests
   respectively.
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index a72cc9552ad6..9eddeaa20bd5 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -864,6 +864,8 @@ int arch_domain_create(struct domain *d,
 
     d->arch.msr_relaxed = config->arch.misc_flags & XEN_X86_MSR_RELAXED;
 
+    spec_ctrl_init_domain(d);
+
     return 0;
 
  fail:
@@ -2018,14 +2020,15 @@ static void __context_switch(void)
 void context_switch(struct vcpu *prev, struct vcpu *next)
 {
     unsigned int cpu = smp_processor_id();
+    struct cpu_info *info = get_cpu_info();
     const struct domain *prevd = prev->domain, *nextd = next->domain;
     unsigned int dirty_cpu = read_atomic(&next->dirty_cpu);
 
     ASSERT(prev != next);
     ASSERT(local_irq_is_enabled());
 
-    get_cpu_info()->use_pv_cr3 = false;
-    get_cpu_info()->xen_cr3 = 0;
+    info->use_pv_cr3 = false;
+    info->xen_cr3 = 0;
 
     if ( unlikely(dirty_cpu != cpu) && dirty_cpu != VCPU_CPU_CLEAN )
     {
@@ -2089,6 +2092,11 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
                 *last_id = next_id;
             }
         }
+
+        /* Update the top-of-stack block with the VERW disposition. */
+        info->spec_ctrl_flags &= ~SCF_verw;
+        if ( nextd->arch.verw )
+            info->spec_ctrl_flags |= SCF_verw;
     }
 
     sched_context_switched(prev, next);
diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S
index 49651f3c435a..5f5de45a1309 100644
--- a/xen/arch/x86/hvm/vmx/entry.S
+++ b/xen/arch/x86/hvm/vmx/entry.S
@@ -87,7 +87,7 @@ UNLIKELY_END(realmode)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
         /* SPEC_CTRL_EXIT_TO_VMX   Req: %rsp=regs/cpuinfo              Clob:    */
-        ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)), X86_FEATURE_SC_VERW_HVM
+        DO_SPEC_CTRL_COND_VERW
 
         mov  VCPU_hvm_guest_cr2(%rbx),%rax
 
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index ff3157d52d13..bd45a144ee78 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -35,8 +35,7 @@ XEN_CPUFEATURE(SC_RSB_HVM,        X86_SYNTH(19)) /* RSB overwrite needed for HVM
 XEN_CPUFEATURE(XEN_SELFSNOOP,     X86_SYNTH(20)) /* SELFSNOOP gets used by Xen itself */
 XEN_CPUFEATURE(SC_MSR_IDLE,       X86_SYNTH(21)) /* (SC_MSR_PV || SC_MSR_HVM) && default_xen_spec_ctrl */
 XEN_CPUFEATURE(XEN_LBR,           X86_SYNTH(22)) /* Xen uses MSR_DEBUGCTL.LBR */
-XEN_CPUFEATURE(SC_VERW_PV,        X86_SYNTH(23)) /* VERW used by Xen for PV */
-XEN_CPUFEATURE(SC_VERW_HVM,       X86_SYNTH(24)) /* VERW used by Xen for HVM */
+/* Bits 23,24 unused. */
 XEN_CPUFEATURE(SC_VERW_IDLE,      X86_SYNTH(25)) /* VERW used by Xen for idle */
 XEN_CPUFEATURE(XEN_SHSTK,         X86_SYNTH(26)) /* Xen uses CET Shadow Stacks */
 XEN_CPUFEATURE(XEN_IBT,           X86_SYNTH(27)) /* Xen uses CET Indirect Branch Tracking */
diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index 75389e962a5e..ad01ee68e12c 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -324,6 +324,9 @@ struct arch_domain
     uint32_t pci_cf8;
     uint8_t cmos_idx;
 
+    /* Use VERW on return-to-guest for its flushing side effect. */
+    bool verw;
+
     union {
         struct pv_domain pv;
         struct hvm_domain hvm;
diff --git a/xen/arch/x86/include/asm/spec_ctrl.h b/xen/arch/x86/include/asm/spec_ctrl.h
index f76029523610..751355f471f4 100644
--- a/xen/arch/x86/include/asm/spec_ctrl.h
+++ b/xen/arch/x86/include/asm/spec_ctrl.h
@@ -24,6 +24,7 @@
 #define SCF_use_shadow (1 << 0)
 #define SCF_ist_wrmsr  (1 << 1)
 #define SCF_ist_rsb    (1 << 2)
+#define SCF_verw       (1 << 3)
 
 #ifndef __ASSEMBLY__
 
@@ -32,6 +33,7 @@
 #include <asm/msr-index.h>
 
 void init_speculation_mitigations(void);
+void spec_ctrl_init_domain(struct domain *d);
 
 extern bool opt_ibpb;
 extern bool opt_ssbd;
diff --git a/xen/arch/x86/include/asm/spec_ctrl_asm.h b/xen/arch/x86/include/asm/spec_ctrl_asm.h
index 02b3b18ce69f..5a590bac44aa 100644
--- a/xen/arch/x86/include/asm/spec_ctrl_asm.h
+++ b/xen/arch/x86/include/asm/spec_ctrl_asm.h
@@ -136,6 +136,19 @@
 #endif
 .endm
 
+.macro DO_SPEC_CTRL_COND_VERW
+/*
+ * Requires %rsp=cpuinfo
+ *
+ * Issue a VERW for its flushing side effect, if indicated.  This is a Spectre
+ * v1 gadget, but the IRET/VMEntry is serialising.
+ */
+    testb $SCF_verw, CPUINFO_spec_ctrl_flags(%rsp)
+    jz .L\@_verw_skip
+    verw CPUINFO_verw_sel(%rsp)
+.L\@_verw_skip:
+.endm
+
 .macro DO_SPEC_CTRL_ENTRY maybexen:req
 /*
  * Requires %rsp=regs (also cpuinfo if !maybexen)
@@ -231,8 +244,7 @@
 #define SPEC_CTRL_EXIT_TO_PV                                            \
     ALTERNATIVE "",                                                     \
         DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_PV;              \
-    ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)),           \
-        X86_FEATURE_SC_VERW_PV
+    DO_SPEC_CTRL_COND_VERW
 
 /*
  * Use in IST interrupt/exception context.  May interrupt Xen or PV context.
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 1408e4c7abd0..92eb4ecd3d09 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -36,8 +36,8 @@ static bool __initdata opt_msr_sc_pv = true;
 static bool __initdata opt_msr_sc_hvm = true;
 static int8_t __initdata opt_rsb_pv = -1;
 static bool __initdata opt_rsb_hvm = true;
-static int8_t __initdata opt_md_clear_pv = -1;
-static int8_t __initdata opt_md_clear_hvm = -1;
+static int8_t __ro_after_init opt_md_clear_pv = -1;
+static int8_t __ro_after_init opt_md_clear_hvm = -1;
 
 /* Cmdline controls for Xen's speculative settings. */
 static enum ind_thunk {
@@ -933,6 +933,13 @@ static __init void mds_calculations(uint64_t caps)
     }
 }
 
+void spec_ctrl_init_domain(struct domain *d)
+{
+    bool pv = is_pv_domain(d);
+
+    d->arch.verw = pv ? opt_md_clear_pv : opt_md_clear_hvm;
+}
+
 void __init init_speculation_mitigations(void)
 {
     enum ind_thunk thunk = THUNK_DEFAULT;
@@ -1197,21 +1204,20 @@ void __init init_speculation_mitigations(void)
                             boot_cpu_has(X86_FEATURE_MD_CLEAR));
 
     /*
-     * Enable MDS defences as applicable.  The PV blocks need using all the
-     * time, and the Idle blocks need using if either PV or HVM defences are
-     * used.
+     * Enable MDS defences as applicable.  The Idle blocks need using if
+     * either PV or HVM defences are used.
      *
      * HVM is more complicated.  The MD_CLEAR microcode extends L1D_FLUSH with
-     * equivelent semantics to avoid needing to perform both flushes on the
-     * HVM path.  The HVM blocks don't need activating if our hypervisor told
-     * us it was handling L1D_FLUSH, or we are using L1D_FLUSH ourselves.
+     * equivalent semantics to avoid needing to perform both flushes on the
+     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH.
+     *
+     * After calculating the appropriate idle setting, simplify
+     * opt_md_clear_hvm to mean just "should we VERW on the way into HVM
+     * guests", so spec_ctrl_init_domain() can calculate suitable settings.
      */
-    if ( opt_md_clear_pv )
-        setup_force_cpu_cap(X86_FEATURE_SC_VERW_PV);
     if ( opt_md_clear_pv || opt_md_clear_hvm )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
-    if ( opt_md_clear_hvm && !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush )
-        setup_force_cpu_cap(X86_FEATURE_SC_VERW_HVM);
+    opt_md_clear_hvm &= !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush;
 
     /*
      * Warn the user if they are on MLPDS/MFBDS-vulnerable hardware with HT
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Enumeration for MMIO Stale Data controls

The three *_NO bits indicate non-susceptibility to the SSDP, FBSDP and PSDP
data movement primitives.

FB_CLEAR indicates that the VERW instruction has re-gained it's Fill Buffer
flushing side effect.  This is only enumerated on parts where VERW had
previously lost it's flushing side effect due to the MDS/TAA vulnerabilities
being fixed in hardware.

FB_CLEAR_CTRL is available on a subset of FB_CLEAR parts where the Fill Buffer
clearing side effect of VERW can be turned off for performance reasons.

This is part of XSA-404.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index 6c250bfcadad..ea47f68d0558 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -71,6 +71,11 @@
 #define  ARCH_CAPS_IF_PSCHANGE_MC_NO        (_AC(1, ULL) <<  6)
 #define  ARCH_CAPS_TSX_CTRL                 (_AC(1, ULL) <<  7)
 #define  ARCH_CAPS_TAA_NO                   (_AC(1, ULL) <<  8)
+#define  ARCH_CAPS_SBDR_SSDP_NO             (_AC(1, ULL) << 13)
+#define  ARCH_CAPS_FBSDP_NO                 (_AC(1, ULL) << 14)
+#define  ARCH_CAPS_PSDP_NO                  (_AC(1, ULL) << 15)
+#define  ARCH_CAPS_FB_CLEAR                 (_AC(1, ULL) << 17)
+#define  ARCH_CAPS_FB_CLEAR_CTRL            (_AC(1, ULL) << 18)
 #define  ARCH_CAPS_RRSBA                    (_AC(1, ULL) << 19)
 #define  ARCH_CAPS_BHI_NO                   (_AC(1, ULL) << 20)
 
@@ -90,6 +95,7 @@
 #define  MCU_OPT_CTRL_RNGDS_MITG_DIS        (_AC(1, ULL) <<  0)
 #define  MCU_OPT_CTRL_RTM_ALLOW             (_AC(1, ULL) <<  1)
 #define  MCU_OPT_CTRL_RTM_LOCKED            (_AC(1, ULL) <<  2)
+#define  MCU_OPT_CTRL_FB_CLEAR_DIS          (_AC(1, ULL) <<  3)
 
 #define MSR_RTIT_OUTPUT_BASE                0x00000560
 #define MSR_RTIT_OUTPUT_MASK                0x00000561
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 92eb4ecd3d09..2ec3126cf060 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -323,7 +323,7 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
      * Hardware read-only information, stating immunity to certain issues, or
      * suggestions of which mitigation to use.
      */
-    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (caps & ARCH_CAPS_RDCL_NO)                        ? " RDCL_NO"        : "",
            (caps & ARCH_CAPS_IBRS_ALL)                       ? " IBRS_ALL"       : "",
            (caps & ARCH_CAPS_RSBA)                           ? " RSBA"           : "",
@@ -332,13 +332,16 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
            (caps & ARCH_CAPS_SSB_NO)                         ? " SSB_NO"         : "",
            (caps & ARCH_CAPS_MDS_NO)                         ? " MDS_NO"         : "",
            (caps & ARCH_CAPS_TAA_NO)                         ? " TAA_NO"         : "",
+           (caps & ARCH_CAPS_SBDR_SSDP_NO)                   ? " SBDR_SSDP_NO"   : "",
+           (caps & ARCH_CAPS_FBSDP_NO)                       ? " FBSDP_NO"       : "",
+           (caps & ARCH_CAPS_PSDP_NO)                        ? " PSDP_NO"        : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_ALWAYS))    ? " IBRS_ALWAYS"    : "",
            (e8b  & cpufeat_mask(X86_FEATURE_STIBP_ALWAYS))   ? " STIBP_ALWAYS"   : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_FAST))      ? " IBRS_FAST"      : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_SAME_MODE)) ? " IBRS_SAME_MODE" : "");
 
     /* Hardware features which need driving to mitigate issues. */
-    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (e8b  & cpufeat_mask(X86_FEATURE_IBPB)) ||
            (_7d0 & cpufeat_mask(X86_FEATURE_IBRSB))          ? " IBPB"           : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS)) ||
@@ -353,7 +356,9 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
            (_7d0 & cpufeat_mask(X86_FEATURE_MD_CLEAR))       ? " MD_CLEAR"       : "",
            (_7d0 & cpufeat_mask(X86_FEATURE_SRBDS_CTRL))     ? " SRBDS_CTRL"     : "",
            (e8b  & cpufeat_mask(X86_FEATURE_VIRT_SSBD))      ? " VIRT_SSBD"      : "",
-           (caps & ARCH_CAPS_TSX_CTRL)                       ? " TSX_CTRL"       : "");
+           (caps & ARCH_CAPS_TSX_CTRL)                       ? " TSX_CTRL"       : "",
+           (caps & ARCH_CAPS_FB_CLEAR)                       ? " FB_CLEAR"       : "",
+           (caps & ARCH_CAPS_FB_CLEAR_CTRL)                  ? " FB_CLEAR_CTRL"  : "");
 
     /* Compiled-in support which pertains to mitigations. */
     if ( IS_ENABLED(CONFIG_INDIRECT_THUNK) || IS_ENABLED(CONFIG_SHADOW_PAGING) )
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Add spec-ctrl=unpriv-mmio

Per Xen's support statement, PCI passthrough should be to trusted domains
because the overall system security depends on factors outside of Xen's
control.

As such, Xen, in a supported configuration, is not vulnerable to DRPW/SBDR.

However, users who have risk assessed their configuration may be happy with
the risk of DoS, but unhappy with the risk of cross-domain data leakage.  Such
users should enable this option.

On CPUs vulnerable to MDS, the existing mitigations are the best we can do to
mitigate MMIO cross-domain data leakage.

On CPUs fixed to MDS but vulnerable MMIO stale data leakage, this option:

 * On CPUs susceptible to FBSDP, mitigates cross-domain fill buffer leakage
   using FB_CLEAR.
 * On CPUs susceptible to SBDR, mitigates RNG data recovery by engaging the
   srb-lock, previously used to mitigate SRBDS.

Both mitigations require microcode from IPU 2022.1, May 2022.

This is part of XSA-404.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
Backporting note: For Xen 4.7 and earlier with bool_t not aliasing bool, the
ARCH_CAPS_FB_CLEAR hunk needs !!

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 266a11ab584e..a92b7d228cae 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2259,7 +2259,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
 ### spec-ctrl (x86)
 > `= List of [ <bool>, xen=<bool>, {pv,hvm,msr-sc,rsb,md-clear}=<bool>,
 >              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,eager-fpu,
->              l1d-flush,branch-harden,srb-lock}=<bool> ]`
+>              l1d-flush,branch-harden,srb-lock,unpriv-mmio}=<bool> ]`
 
 Controls for speculative execution sidechannel mitigations.  By default, Xen
 will pick the most appropriate mitigations based on compiled in support,
@@ -2338,8 +2338,16 @@ Xen will enable this mitigation.
 On hardware supporting SRBDS_CTRL, the `srb-lock=` option can be used to force
 or prevent Xen from protect the Special Register Buffer from leaking stale
 data. By default, Xen will enable this mitigation, except on parts where MDS
-is fixed and TAA is fixed/mitigated (in which case, there is believed to be no
-way for an attacker to obtain the stale data).
+is fixed and TAA is fixed/mitigated and there are no unprivileged MMIO
+mappings (in which case, there is believed to be no way for an attacker to
+obtain stale data).
+
+The `unpriv-mmio=` boolean indicates whether the system has (or will have)
+less than fully privileged domains granted access to MMIO devices.  By
+default, this option is disabled.  If enabled, Xen will use the `FB_CLEAR`
+and/or `SRBDS_CTRL` functionality available in the Intel May 2022 microcode
+release to mitigate cross-domain leakage of data via the MMIO Stale Data
+vulnerabilities.
 
 ### sync_console
 > `= <boolean>`
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 2ec3126cf060..1f275ad1fb5d 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -67,6 +67,8 @@ static bool __initdata cpu_has_bug_msbds_only; /* => minimal HT impact. */
 static bool __initdata cpu_has_bug_mds; /* Any other M{LP,SB,FB}DS combination. */
 
 static int8_t __initdata opt_srb_lock = -1;
+static bool __initdata opt_unpriv_mmio;
+static bool __ro_after_init opt_fb_clear_mmio;
 
 static int __init cf_check parse_spec_ctrl(const char *s)
 {
@@ -184,6 +186,8 @@ static int __init cf_check parse_spec_ctrl(const char *s)
             opt_branch_harden = val;
         else if ( (val = parse_boolean("srb-lock", s, ss)) >= 0 )
             opt_srb_lock = val;
+        else if ( (val = parse_boolean("unpriv-mmio", s, ss)) >= 0 )
+            opt_unpriv_mmio = val;
         else
             rc = -EINVAL;
 
@@ -392,7 +396,8 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
            opt_srb_lock                              ? " SRB_LOCK+" : " SRB_LOCK-",
            opt_ibpb                                  ? " IBPB"  : "",
            opt_l1d_flush                             ? " L1D_FLUSH" : "",
-           opt_md_clear_pv || opt_md_clear_hvm       ? " VERW"  : "",
+           opt_md_clear_pv || opt_md_clear_hvm ||
+           opt_fb_clear_mmio                         ? " VERW"  : "",
            opt_branch_harden                         ? " BRANCH_HARDEN" : "");
 
     /* L1TF diagnostics, printed if vulnerable or PV shadowing is in use. */
@@ -942,7 +947,9 @@ void spec_ctrl_init_domain(struct domain *d)
 {
     bool pv = is_pv_domain(d);
 
-    d->arch.verw = pv ? opt_md_clear_pv : opt_md_clear_hvm;
+    d->arch.verw =
+        (pv ? opt_md_clear_pv : opt_md_clear_hvm) ||
+        (opt_fb_clear_mmio && is_iommu_enabled(d));
 }
 
 void __init init_speculation_mitigations(void)
@@ -1197,6 +1204,18 @@ void __init init_speculation_mitigations(void)
     mds_calculations(caps);
 
     /*
+     * Parts which enumerate FB_CLEAR are those which are post-MDS_NO and have
+     * reintroduced the VERW fill buffer flushing side effect because of a
+     * susceptibility to FBSDP.
+     *
+     * If unprivileged guests have (or will have) MMIO mappings, we can
+     * mitigate cross-domain leakage of fill buffer data by issuing VERW on
+     * the return-to-guest path.
+     */
+    if ( opt_unpriv_mmio )
+        opt_fb_clear_mmio = caps & ARCH_CAPS_FB_CLEAR;
+
+    /*
      * By default, enable PV and HVM mitigations on MDS-vulnerable hardware.
      * This will only be a token effort for MLPDS/MFBDS when HT is enabled,
      * but it is somewhat better than nothing.
@@ -1209,18 +1228,20 @@ void __init init_speculation_mitigations(void)
                             boot_cpu_has(X86_FEATURE_MD_CLEAR));
 
     /*
-     * Enable MDS defences as applicable.  The Idle blocks need using if
-     * either PV or HVM defences are used.
+     * Enable MDS/MMIO defences as applicable.  The Idle blocks need using if
+     * either the PV or HVM MDS defences are used, or if we may give MMIO
+     * access to untrusted guests.
      *
      * HVM is more complicated.  The MD_CLEAR microcode extends L1D_FLUSH with
      * equivalent semantics to avoid needing to perform both flushes on the
-     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH.
+     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH (for
+     * MDS mitigations.  L1D_FLUSH is not safe for MMIO mitigations.)
      *
      * After calculating the appropriate idle setting, simplify
      * opt_md_clear_hvm to mean just "should we VERW on the way into HVM
      * guests", so spec_ctrl_init_domain() can calculate suitable settings.
      */
-    if ( opt_md_clear_pv || opt_md_clear_hvm )
+    if ( opt_md_clear_pv || opt_md_clear_hvm || opt_fb_clear_mmio )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
     opt_md_clear_hvm &= !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush;
 
@@ -1285,14 +1306,19 @@ void __init init_speculation_mitigations(void)
      * On some SRBDS-affected hardware, it may be safe to relax srb-lock by
      * default.
      *
-     * On parts which enumerate MDS_NO and not TAA_NO, TSX is the only known
-     * way to access the Fill Buffer.  If TSX isn't available (inc. SKU
-     * reasons on some models), or TSX is explicitly disabled, then there is
-     * no need for the extra overhead to protect RDRAND/RDSEED.
+     * All parts with SRBDS_CTRL suffer SSDP, the mechanism by which stale RNG
+     * data becomes available to other contexts.  To recover the data, an
+     * attacker needs to use:
+     *  - SBDS (MDS or TAA to sample the cores fill buffer)
+     *  - SBDR (Architecturally retrieve stale transaction buffer contents)
+     *  - DRPW (Architecturally latch stale fill buffer data)
+     *
+     * On MDS_NO parts, and with TAA_NO or TSX unavailable/disabled, and there
+     * is no unprivileged MMIO access, the RNG data doesn't need protecting.
      */
     if ( cpu_has_srbds_ctrl )
     {
-        if ( opt_srb_lock == -1 &&
+        if ( opt_srb_lock == -1 && !opt_unpriv_mmio &&
              (caps & (ARCH_CAPS_MDS_NO|ARCH_CAPS_TAA_NO)) == ARCH_CAPS_MDS_NO &&
              (!cpu_has_hle || ((caps & ARCH_CAPS_TSX_CTRL) && rtm_disabled)) )
             opt_srb_lock = 0;
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Make VERW flushing runtime conditional

Currently, VERW flushing to mitigate MDS is boot time conditional per domain
type.  However, to provide mitigations for DRPW (CVE-2022-21166), we need to
conditionally use VERW based on the trustworthiness of the guest, and the
devices passed through.

Remove the PV/HVM alternatives and instead issue a VERW on the return-to-guest
path depending on the SCF_verw bit in cpuinfo spec_ctrl_flags.

Introduce spec_ctrl_init_domain() and d->arch.verw to calculate the VERW
disposition at domain creation time, and context switch the SCF_verw bit.

For now, VERW flushing is used and controlled exactly as before, but later
patches will add per-domain cases too.

No change in behaviour.

This is part of XSA-404.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index eead69ada2c2..e8bdf30fa46c 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2058,9 +2058,8 @@ in place for guests to use.
 Use of a positive boolean value for either of these options is invalid.
 
 The booleans `pv=`, `hvm=`, `msr-sc=`, `rsb=` and `md-clear=` offer fine
-grained control over the alternative blocks used by Xen.  These impact Xen's
-ability to protect itself, and Xen's ability to virtualise support for guests
-to use.
+grained control over the primitives by Xen.  These impact Xen's ability to
+protect itself, and Xen's ability to virtualise support for guests to use.
 
 * `pv=` and `hvm=` offer control over all suboptions for PV and HVM guests
   respectively.
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 820cb0f90558..fe95b25a034e 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -651,6 +651,8 @@ int arch_domain_create(struct domain *d,
 
     domain_cpu_policy_changed(d);
 
+    spec_ctrl_init_domain(d);
+
     return 0;
 
  fail:
@@ -1746,14 +1748,15 @@ static void __context_switch(void)
 void context_switch(struct vcpu *prev, struct vcpu *next)
 {
     unsigned int cpu = smp_processor_id();
+    struct cpu_info *info = get_cpu_info();
     const struct domain *prevd = prev->domain, *nextd = next->domain;
     unsigned int dirty_cpu = next->dirty_cpu;
 
     ASSERT(prev != next);
     ASSERT(local_irq_is_enabled());
 
-    get_cpu_info()->use_pv_cr3 = false;
-    get_cpu_info()->xen_cr3 = 0;
+    info->use_pv_cr3 = false;
+    info->xen_cr3 = 0;
 
     if ( unlikely(dirty_cpu != cpu) && dirty_cpu != VCPU_CPU_CLEAN )
     {
@@ -1816,6 +1819,11 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
                 *last_id = next_id;
             }
         }
+
+        /* Update the top-of-stack block with the VERW disposition. */
+        info->spec_ctrl_flags &= ~SCF_verw;
+        if ( nextd->arch.verw )
+            info->spec_ctrl_flags |= SCF_verw;
     }
 
     sched_context_switched(prev, next);
diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S
index 27c8c5ca4943..62ed0d854df1 100644
--- a/xen/arch/x86/hvm/vmx/entry.S
+++ b/xen/arch/x86/hvm/vmx/entry.S
@@ -81,6 +81,7 @@ UNLIKELY_END(realmode)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
         SPEC_CTRL_EXIT_TO_HVM   /* Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */
+        DO_SPEC_CTRL_COND_VERW
 
         mov  VCPU_hvm_guest_cr2(%rbx),%rax
 
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 7447d4a8e5b5..38e1f1098210 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -35,8 +35,8 @@ static bool __initdata opt_msr_sc_pv = true;
 static bool __initdata opt_msr_sc_hvm = true;
 static bool __initdata opt_rsb_pv = true;
 static bool __initdata opt_rsb_hvm = true;
-static int8_t __initdata opt_md_clear_pv = -1;
-static int8_t __initdata opt_md_clear_hvm = -1;
+static int8_t __read_mostly opt_md_clear_pv = -1;
+static int8_t __read_mostly opt_md_clear_hvm = -1;
 
 /* Cmdline controls for Xen's speculative settings. */
 static enum ind_thunk {
@@ -878,6 +878,13 @@ static __init void mds_calculations(uint64_t caps)
     }
 }
 
+void spec_ctrl_init_domain(struct domain *d)
+{
+    bool pv = is_pv_domain(d);
+
+    d->arch.verw = pv ? opt_md_clear_pv : opt_md_clear_hvm;
+}
+
 void __init init_speculation_mitigations(void)
 {
     enum ind_thunk thunk = THUNK_DEFAULT;
@@ -1078,21 +1085,20 @@ void __init init_speculation_mitigations(void)
                             boot_cpu_has(X86_FEATURE_MD_CLEAR));
 
     /*
-     * Enable MDS defences as applicable.  The PV blocks need using all the
-     * time, and the Idle blocks need using if either PV or HVM defences are
-     * used.
+     * Enable MDS defences as applicable.  The Idle blocks need using if
+     * either PV or HVM defences are used.
      *
      * HVM is more complicated.  The MD_CLEAR microcode extends L1D_FLUSH with
-     * equivelent semantics to avoid needing to perform both flushes on the
-     * HVM path.  The HVM blocks don't need activating if our hypervisor told
-     * us it was handling L1D_FLUSH, or we are using L1D_FLUSH ourselves.
+     * equivalent semantics to avoid needing to perform both flushes on the
+     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH.
+     *
+     * After calculating the appropriate idle setting, simplify
+     * opt_md_clear_hvm to mean just "should we VERW on the way into HVM
+     * guests", so spec_ctrl_init_domain() can calculate suitable settings.
      */
-    if ( opt_md_clear_pv )
-        setup_force_cpu_cap(X86_FEATURE_SC_VERW_PV);
     if ( opt_md_clear_pv || opt_md_clear_hvm )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
-    if ( opt_md_clear_hvm && !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush )
-        setup_force_cpu_cap(X86_FEATURE_SC_VERW_HVM);
+    opt_md_clear_hvm &= !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush;
 
     /*
      * Warn the user if they are on MLPDS/MFBDS-vulnerable hardware with HT
diff --git a/xen/include/asm-x86/cpufeatures.h b/xen/include/asm-x86/cpufeatures.h
index a8222e978cd9..bcba926bda41 100644
--- a/xen/include/asm-x86/cpufeatures.h
+++ b/xen/include/asm-x86/cpufeatures.h
@@ -35,8 +35,7 @@ XEN_CPUFEATURE(SC_RSB_HVM,        X86_SYNTH(19)) /* RSB overwrite needed for HVM
 XEN_CPUFEATURE(XEN_SELFSNOOP,     X86_SYNTH(20)) /* SELFSNOOP gets used by Xen itself */
 XEN_CPUFEATURE(SC_MSR_IDLE,       X86_SYNTH(21)) /* (SC_MSR_PV || SC_MSR_HVM) && default_xen_spec_ctrl */
 XEN_CPUFEATURE(XEN_LBR,           X86_SYNTH(22)) /* Xen uses MSR_DEBUGCTL.LBR */
-XEN_CPUFEATURE(SC_VERW_PV,        X86_SYNTH(23)) /* VERW used by Xen for PV */
-XEN_CPUFEATURE(SC_VERW_HVM,       X86_SYNTH(24)) /* VERW used by Xen for HVM */
+/* Bits 23,24 unused. */
 XEN_CPUFEATURE(SC_VERW_IDLE,      X86_SYNTH(25)) /* VERW used by Xen for idle */
 
 /* Bug words follow the synthetic words. */
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 309b56e2d6b7..71d1ca243b32 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -295,6 +295,9 @@ struct arch_domain
     uint32_t pci_cf8;
     uint8_t cmos_idx;
 
+    /* Use VERW on return-to-guest for its flushing side effect. */
+    bool verw;
+
     union {
         struct pv_domain pv;
         struct hvm_domain hvm;
diff --git a/xen/include/asm-x86/spec_ctrl.h b/xen/include/asm-x86/spec_ctrl.h
index b252bb863111..157a2c67d89c 100644
--- a/xen/include/asm-x86/spec_ctrl.h
+++ b/xen/include/asm-x86/spec_ctrl.h
@@ -24,6 +24,7 @@
 #define SCF_use_shadow (1 << 0)
 #define SCF_ist_wrmsr  (1 << 1)
 #define SCF_ist_rsb    (1 << 2)
+#define SCF_verw       (1 << 3)
 
 #ifndef __ASSEMBLY__
 
@@ -32,6 +33,7 @@
 #include <asm/msr-index.h>
 
 void init_speculation_mitigations(void);
+void spec_ctrl_init_domain(struct domain *d);
 
 extern bool opt_ibpb;
 extern bool opt_ssbd;
diff --git a/xen/include/asm-x86/spec_ctrl_asm.h b/xen/include/asm-x86/spec_ctrl_asm.h
index c60093b090b5..4a3777cc5227 100644
--- a/xen/include/asm-x86/spec_ctrl_asm.h
+++ b/xen/include/asm-x86/spec_ctrl_asm.h
@@ -141,6 +141,19 @@
     wrmsr
 .endm
 
+.macro DO_SPEC_CTRL_COND_VERW
+/*
+ * Requires %rsp=cpuinfo
+ *
+ * Issue a VERW for its flushing side effect, if indicated.  This is a Spectre
+ * v1 gadget, but the IRET/VMEntry is serialising.
+ */
+    testb $SCF_verw, CPUINFO_spec_ctrl_flags(%rsp)
+    jz .L\@_verw_skip
+    verw CPUINFO_verw_sel(%rsp)
+.L\@_verw_skip:
+.endm
+
 .macro DO_SPEC_CTRL_ENTRY maybexen:req
 /*
  * Requires %rsp=regs (also cpuinfo if !maybexen)
@@ -242,15 +255,12 @@
 #define SPEC_CTRL_EXIT_TO_PV                                            \
     ALTERNATIVE "",                                                     \
         DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_PV;              \
-    ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)),           \
-        X86_FEATURE_SC_VERW_PV
+    DO_SPEC_CTRL_COND_VERW
 
 /* Use when exiting to HVM guest context. */
 #define SPEC_CTRL_EXIT_TO_HVM                                           \
     ALTERNATIVE "",                                                     \
         DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_HVM;             \
-    ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)),           \
-        X86_FEATURE_SC_VERW_HVM
 
 /*
  * Use in IST interrupt/exception context.  May interrupt Xen or PV context.
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Enumeration for MMIO Stale Data controls

The three *_NO bits indicate non-susceptibility to the SSDP, FBSDP and PSDP
data movement primitives.

FB_CLEAR indicates that the VERW instruction has re-gained it's Fill Buffer
flushing side effect.  This is only enumerated on parts where VERW had
previously lost it's flushing side effect due to the MDS/TAA vulnerabilities
being fixed in hardware.

FB_CLEAR_CTRL is available on a subset of FB_CLEAR parts where the Fill Buffer
clearing side effect of VERW can be turned off for performance reasons.

This is part of XSA-404.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 38e1f1098210..fd36927ba1cb 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -318,7 +318,7 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
     printk("Speculative mitigation facilities:\n");
 
     /* Hardware features which pertain to speculative mitigations. */
-    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (_7d0 & cpufeat_mask(X86_FEATURE_IBRSB)) ? " IBRS/IBPB" : "",
            (_7d0 & cpufeat_mask(X86_FEATURE_STIBP)) ? " STIBP"     : "",
            (_7d0 & cpufeat_mask(X86_FEATURE_L1D_FLUSH)) ? " L1D_FLUSH" : "",
@@ -333,7 +333,12 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
            (caps & ARCH_CAPS_SSB_NO)                ? " SSB_NO"    : "",
            (caps & ARCH_CAPS_MDS_NO)                ? " MDS_NO"    : "",
            (caps & ARCH_CAPS_TSX_CTRL)              ? " TSX_CTRL"  : "",
-           (caps & ARCH_CAPS_TAA_NO)                ? " TAA_NO"    : "");
+           (caps & ARCH_CAPS_TAA_NO)                ? " TAA_NO"    : "",
+           (caps & ARCH_CAPS_SBDR_SSDP_NO)          ? " SBDR_SSDP_NO" : "",
+           (caps & ARCH_CAPS_FBSDP_NO)              ? " FBSDP_NO"  : "",
+           (caps & ARCH_CAPS_PSDP_NO)               ? " PSDP_NO"   : "",
+           (caps & ARCH_CAPS_FB_CLEAR)              ? " FB_CLEAR"  : "",
+           (caps & ARCH_CAPS_FB_CLEAR_CTRL)         ? " FB_CLEAR_CTRL" : "");
 
     /* Compiled-in support which pertains to mitigations. */
     if ( IS_ENABLED(CONFIG_INDIRECT_THUNK) || IS_ENABLED(CONFIG_SHADOW_PAGING) )
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index ba9e90af210b..2a80660d849d 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -55,6 +55,11 @@
 #define ARCH_CAPS_IF_PSCHANGE_MC_NO	(_AC(1, ULL) << 6)
 #define ARCH_CAPS_TSX_CTRL		(_AC(1, ULL) << 7)
 #define ARCH_CAPS_TAA_NO		(_AC(1, ULL) << 8)
+#define ARCH_CAPS_SBDR_SSDP_NO		(_AC(1, ULL) << 13)
+#define ARCH_CAPS_FBSDP_NO		(_AC(1, ULL) << 14)
+#define ARCH_CAPS_PSDP_NO		(_AC(1, ULL) << 15)
+#define ARCH_CAPS_FB_CLEAR		(_AC(1, ULL) << 17)
+#define ARCH_CAPS_FB_CLEAR_CTRL		(_AC(1, ULL) << 18)
 
 #define MSR_FLUSH_CMD			0x0000010b
 #define FLUSH_CMD_L1D			(_AC(1, ULL) << 0)
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Add spec-ctrl=unpriv-mmio

Per Xen's support statement, PCI passthrough should be to trusted domains
because the overall system security depends on factors outside of Xen's
control.

As such, Xen, in a supported configuration, is not vulnerable to DRPW/SBDR.

However, users who have risk assessed their configuration may be happy with
the risk of DoS, but unhappy with the risk of cross-domain data leakage.  Such
users should enable this option.

On CPUs vulnerable to MDS, the existing mitigations are the best we can do to
mitigate MMIO cross-domain data leakage.

On CPUs fixed to MDS but vulnerable MMIO stale data leakage, this option:

 * On CPUs susceptible to FBSDP, mitigates cross-domain fill buffer leakage
   using FB_CLEAR.
 * On CPUs susceptible to SBDR, mitigates RNG data recovery by engaging the
   srb-lock, previously used to mitigate SRBDS.

Both mitigations require microcode from IPU 2022.1, May 2022.

This is part of XSA-404.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
Backporting note: For Xen 4.7 and earlier with bool_t not aliasing bool, the
ARCH_CAPS_FB_CLEAR hunk needs !!

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index e8bdf30fa46c..022cb01da762 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2035,7 +2035,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
 ### spec-ctrl (x86)
 > `= List of [ <bool>, xen=<bool>, {pv,hvm,msr-sc,rsb,md-clear}=<bool>,
 >              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,eager-fpu,
->              l1d-flush,branch-harden,srb-lock}=<bool> ]`
+>              l1d-flush,branch-harden,srb-lock,unpriv-mmio}=<bool> ]`
 
 Controls for speculative execution sidechannel mitigations.  By default, Xen
 will pick the most appropriate mitigations based on compiled in support,
@@ -2114,8 +2114,16 @@ Xen will enable this mitigation.
 On hardware supporting SRBDS_CTRL, the `srb-lock=` option can be used to force
 or prevent Xen from protect the Special Register Buffer from leaking stale
 data. By default, Xen will enable this mitigation, except on parts where MDS
-is fixed and TAA is fixed/mitigated (in which case, there is believed to be no
-way for an attacker to obtain the stale data).
+is fixed and TAA is fixed/mitigated and there are no unprivileged MMIO
+mappings (in which case, there is believed to be no way for an attacker to
+obtain stale data).
+
+The `unpriv-mmio=` boolean indicates whether the system has (or will have)
+less than fully privileged domains granted access to MMIO devices.  By
+default, this option is disabled.  If enabled, Xen will use the `FB_CLEAR`
+and/or `SRBDS_CTRL` functionality available in the Intel May 2022 microcode
+release to mitigate cross-domain leakage of data via the MMIO Stale Data
+vulnerabilities.
 
 ### sync_console
 > `= <boolean>`
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index fd36927ba1cb..d4ba9412067b 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -67,6 +67,8 @@ static bool __initdata cpu_has_bug_mds; /* Any other M{LP,SB,FB}DS combination.
 
 static int8_t __initdata opt_srb_lock = -1;
 uint64_t __read_mostly default_xen_mcu_opt_ctrl;
+static bool __initdata opt_unpriv_mmio;
+static bool __read_mostly opt_fb_clear_mmio;
 
 static int __init parse_spec_ctrl(const char *s)
 {
@@ -184,6 +186,8 @@ static int __init parse_spec_ctrl(const char *s)
             opt_branch_harden = val;
         else if ( (val = parse_boolean("srb-lock", s, ss)) >= 0 )
             opt_srb_lock = val;
+        else if ( (val = parse_boolean("unpriv-mmio", s, ss)) >= 0 )
+            opt_unpriv_mmio = val;
         else
             rc = -EINVAL;
 
@@ -367,7 +371,8 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
            opt_srb_lock                              ? " SRB_LOCK+" : " SRB_LOCK-",
            opt_ibpb                                  ? " IBPB"  : "",
            opt_l1d_flush                             ? " L1D_FLUSH" : "",
-           opt_md_clear_pv || opt_md_clear_hvm       ? " VERW"  : "",
+           opt_md_clear_pv || opt_md_clear_hvm ||
+           opt_fb_clear_mmio                         ? " VERW"  : "",
            opt_branch_harden                         ? " BRANCH_HARDEN" : "");
 
     /* L1TF diagnostics, printed if vulnerable or PV shadowing is in use. */
@@ -887,7 +892,9 @@ void spec_ctrl_init_domain(struct domain *d)
 {
     bool pv = is_pv_domain(d);
 
-    d->arch.verw = pv ? opt_md_clear_pv : opt_md_clear_hvm;
+    d->arch.verw =
+        (pv ? opt_md_clear_pv : opt_md_clear_hvm) ||
+        (opt_fb_clear_mmio && is_iommu_enabled(d));
 }
 
 void __init init_speculation_mitigations(void)
@@ -1078,6 +1085,18 @@ void __init init_speculation_mitigations(void)
     mds_calculations(caps);
 
     /*
+     * Parts which enumerate FB_CLEAR are those which are post-MDS_NO and have
+     * reintroduced the VERW fill buffer flushing side effect because of a
+     * susceptibility to FBSDP.
+     *
+     * If unprivileged guests have (or will have) MMIO mappings, we can
+     * mitigate cross-domain leakage of fill buffer data by issuing VERW on
+     * the return-to-guest path.
+     */
+    if ( opt_unpriv_mmio )
+        opt_fb_clear_mmio = caps & ARCH_CAPS_FB_CLEAR;
+
+    /*
      * By default, enable PV and HVM mitigations on MDS-vulnerable hardware.
      * This will only be a token effort for MLPDS/MFBDS when HT is enabled,
      * but it is somewhat better than nothing.
@@ -1090,18 +1109,20 @@ void __init init_speculation_mitigations(void)
                             boot_cpu_has(X86_FEATURE_MD_CLEAR));
 
     /*
-     * Enable MDS defences as applicable.  The Idle blocks need using if
-     * either PV or HVM defences are used.
+     * Enable MDS/MMIO defences as applicable.  The Idle blocks need using if
+     * either the PV or HVM MDS defences are used, or if we may give MMIO
+     * access to untrusted guests.
      *
      * HVM is more complicated.  The MD_CLEAR microcode extends L1D_FLUSH with
      * equivalent semantics to avoid needing to perform both flushes on the
-     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH.
+     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH (for
+     * MDS mitigations.  L1D_FLUSH is not safe for MMIO mitigations.)
      *
      * After calculating the appropriate idle setting, simplify
      * opt_md_clear_hvm to mean just "should we VERW on the way into HVM
      * guests", so spec_ctrl_init_domain() can calculate suitable settings.
      */
-    if ( opt_md_clear_pv || opt_md_clear_hvm )
+    if ( opt_md_clear_pv || opt_md_clear_hvm || opt_fb_clear_mmio )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
     opt_md_clear_hvm &= !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush;
 
@@ -1170,12 +1191,18 @@ void __init init_speculation_mitigations(void)
          * On some SRBDS-affected hardware, it may be safe to relax srb-lock
          * by default.
          *
-         * On parts which enumerate MDS_NO and not TAA_NO, TSX is the only way
-         * to access the Fill Buffer.  If TSX isn't available (inc. SKU
-         * reasons on some models), or TSX is explicitly disabled, then there
-         * is no need for the extra overhead to protect RDRAND/RDSEED.
+         * All parts with SRBDS_CTRL suffer SSDP, the mechanism by which stale
+         * RNG data becomes available to other contexts.  To recover the data,
+         * an attacker needs to use:
+         *  - SBDS (MDS or TAA to sample the cores fill buffer)
+         *  - SBDR (Architecturally retrieve stale transaction buffer contents)
+         *  - DRPW (Architecturally latch stale fill buffer data)
+         *
+         * On MDS_NO parts, and with TAA_NO or TSX unavailable/disabled, and
+         * there is no unprivileged MMIO access, the RNG data doesn't need
+         * protecting.
          */
-        if ( opt_srb_lock == -1 &&
+        if ( opt_srb_lock == -1 && !opt_unpriv_mmio &&
              (caps & (ARCH_CAPS_MDS_NO|ARCH_CAPS_TAA_NO)) == ARCH_CAPS_MDS_NO &&
              (!cpu_has_hle || ((caps & ARCH_CAPS_TSX_CTRL) && opt_tsx == 0)) )
             opt_srb_lock = 0;
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Make VERW flushing runtime conditional

Currently, VERW flushing to mitigate MDS is boot time conditional per domain
type.  However, to provide mitigations for DRPW (CVE-2022-21166), we need to
conditionally use VERW based on the trustworthiness of the guest, and the
devices passed through.

Remove the PV/HVM alternatives and instead issue a VERW on the return-to-guest
path depending on the SCF_verw bit in cpuinfo spec_ctrl_flags.

Introduce spec_ctrl_init_domain() and d->arch.verw to calculate the VERW
disposition at domain creation time, and context switch the SCF_verw bit.

For now, VERW flushing is used and controlled exactly as before, but later
patches will add per-domain cases too.

No change in behaviour.

This is part of XSA-404.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 5467ae7168ff..ad85785e14b3 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2129,9 +2129,8 @@ in place for guests to use.
 Use of a positive boolean value for either of these options is invalid.
 
 The booleans `pv=`, `hvm=`, `msr-sc=`, `rsb=` and `md-clear=` offer fine
-grained control over the alternative blocks used by Xen.  These impact Xen's
-ability to protect itself, and Xen's ability to virtualise support for guests
-to use.
+grained control over the primitives by Xen.  These impact Xen's ability to
+protect itself, and Xen's ability to virtualise support for guests to use.
 
 * `pv=` and `hvm=` offer control over all suboptions for PV and HVM guests
   respectively.
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 3da81ebf1d41..5ea5ef6ba037 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -651,6 +651,8 @@ int arch_domain_create(struct domain *d,
 
     domain_cpu_policy_changed(d);
 
+    spec_ctrl_init_domain(d);
+
     return 0;
 
  fail:
@@ -1763,14 +1765,15 @@ static void __context_switch(void)
 void context_switch(struct vcpu *prev, struct vcpu *next)
 {
     unsigned int cpu = smp_processor_id();
+    struct cpu_info *info = get_cpu_info();
     const struct domain *prevd = prev->domain, *nextd = next->domain;
     unsigned int dirty_cpu = read_atomic(&next->dirty_cpu);
 
     ASSERT(prev != next);
     ASSERT(local_irq_is_enabled());
 
-    get_cpu_info()->use_pv_cr3 = false;
-    get_cpu_info()->xen_cr3 = 0;
+    info->use_pv_cr3 = false;
+    info->xen_cr3 = 0;
 
     if ( unlikely(dirty_cpu != cpu) && dirty_cpu != VCPU_CPU_CLEAN )
     {
@@ -1834,6 +1837,11 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
                 *last_id = next_id;
             }
         }
+
+        /* Update the top-of-stack block with the VERW disposition. */
+        info->spec_ctrl_flags &= ~SCF_verw;
+        if ( nextd->arch.verw )
+            info->spec_ctrl_flags |= SCF_verw;
     }
 
     sched_context_switched(prev, next);
diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S
index 49651f3c435a..5f5de45a1309 100644
--- a/xen/arch/x86/hvm/vmx/entry.S
+++ b/xen/arch/x86/hvm/vmx/entry.S
@@ -87,7 +87,7 @@ UNLIKELY_END(realmode)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
         /* SPEC_CTRL_EXIT_TO_VMX   Req: %rsp=regs/cpuinfo              Clob:    */
-        ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)), X86_FEATURE_SC_VERW_HVM
+        DO_SPEC_CTRL_COND_VERW
 
         mov  VCPU_hvm_guest_cr2(%rbx),%rax
 
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 1e226102d399..b4efc940aa2b 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -36,8 +36,8 @@ static bool __initdata opt_msr_sc_pv = true;
 static bool __initdata opt_msr_sc_hvm = true;
 static bool __initdata opt_rsb_pv = true;
 static bool __initdata opt_rsb_hvm = true;
-static int8_t __initdata opt_md_clear_pv = -1;
-static int8_t __initdata opt_md_clear_hvm = -1;
+static int8_t __read_mostly opt_md_clear_pv = -1;
+static int8_t __read_mostly opt_md_clear_hvm = -1;
 
 /* Cmdline controls for Xen's speculative settings. */
 static enum ind_thunk {
@@ -903,6 +903,13 @@ static __init void mds_calculations(uint64_t caps)
     }
 }
 
+void spec_ctrl_init_domain(struct domain *d)
+{
+    bool pv = is_pv_domain(d);
+
+    d->arch.verw = pv ? opt_md_clear_pv : opt_md_clear_hvm;
+}
+
 void __init init_speculation_mitigations(void)
 {
     enum ind_thunk thunk = THUNK_DEFAULT;
@@ -1148,21 +1155,20 @@ void __init init_speculation_mitigations(void)
                             boot_cpu_has(X86_FEATURE_MD_CLEAR));
 
     /*
-     * Enable MDS defences as applicable.  The PV blocks need using all the
-     * time, and the Idle blocks need using if either PV or HVM defences are
-     * used.
+     * Enable MDS defences as applicable.  The Idle blocks need using if
+     * either PV or HVM defences are used.
      *
      * HVM is more complicated.  The MD_CLEAR microcode extends L1D_FLUSH with
-     * equivelent semantics to avoid needing to perform both flushes on the
-     * HVM path.  The HVM blocks don't need activating if our hypervisor told
-     * us it was handling L1D_FLUSH, or we are using L1D_FLUSH ourselves.
+     * equivalent semantics to avoid needing to perform both flushes on the
+     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH.
+     *
+     * After calculating the appropriate idle setting, simplify
+     * opt_md_clear_hvm to mean just "should we VERW on the way into HVM
+     * guests", so spec_ctrl_init_domain() can calculate suitable settings.
      */
-    if ( opt_md_clear_pv )
-        setup_force_cpu_cap(X86_FEATURE_SC_VERW_PV);
     if ( opt_md_clear_pv || opt_md_clear_hvm )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
-    if ( opt_md_clear_hvm && !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush )
-        setup_force_cpu_cap(X86_FEATURE_SC_VERW_HVM);
+    opt_md_clear_hvm &= !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush;
 
     /*
      * Warn the user if they are on MLPDS/MFBDS-vulnerable hardware with HT
diff --git a/xen/include/asm-x86/cpufeatures.h b/xen/include/asm-x86/cpufeatures.h
index 09f619459bc7..9eaab7a2a1fa 100644
--- a/xen/include/asm-x86/cpufeatures.h
+++ b/xen/include/asm-x86/cpufeatures.h
@@ -35,8 +35,7 @@ XEN_CPUFEATURE(SC_RSB_HVM,        X86_SYNTH(19)) /* RSB overwrite needed for HVM
 XEN_CPUFEATURE(XEN_SELFSNOOP,     X86_SYNTH(20)) /* SELFSNOOP gets used by Xen itself */
 XEN_CPUFEATURE(SC_MSR_IDLE,       X86_SYNTH(21)) /* (SC_MSR_PV || SC_MSR_HVM) && default_xen_spec_ctrl */
 XEN_CPUFEATURE(XEN_LBR,           X86_SYNTH(22)) /* Xen uses MSR_DEBUGCTL.LBR */
-XEN_CPUFEATURE(SC_VERW_PV,        X86_SYNTH(23)) /* VERW used by Xen for PV */
-XEN_CPUFEATURE(SC_VERW_HVM,       X86_SYNTH(24)) /* VERW used by Xen for HVM */
+/* Bits 23,24 unused. */
 XEN_CPUFEATURE(SC_VERW_IDLE,      X86_SYNTH(25)) /* VERW used by Xen for idle */
 XEN_CPUFEATURE(XEN_SHSTK,         X86_SYNTH(26)) /* Xen uses CET Shadow Stacks */
 XEN_CPUFEATURE(XEN_IBT,           X86_SYNTH(27)) /* Xen uses CET Indirect Branch Tracking */
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 0db551bff344..4ee76bba45da 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -308,6 +308,9 @@ struct arch_domain
     uint32_t pci_cf8;
     uint8_t cmos_idx;
 
+    /* Use VERW on return-to-guest for its flushing side effect. */
+    bool verw;
+
     union {
         struct pv_domain pv;
         struct hvm_domain hvm;
diff --git a/xen/include/asm-x86/spec_ctrl.h b/xen/include/asm-x86/spec_ctrl.h
index 9caecddfec96..68f6c46c470c 100644
--- a/xen/include/asm-x86/spec_ctrl.h
+++ b/xen/include/asm-x86/spec_ctrl.h
@@ -24,6 +24,7 @@
 #define SCF_use_shadow (1 << 0)
 #define SCF_ist_wrmsr  (1 << 1)
 #define SCF_ist_rsb    (1 << 2)
+#define SCF_verw       (1 << 3)
 
 #ifndef __ASSEMBLY__
 
@@ -32,6 +33,7 @@
 #include <asm/msr-index.h>
 
 void init_speculation_mitigations(void);
+void spec_ctrl_init_domain(struct domain *d);
 
 extern bool opt_ibpb;
 extern bool opt_ssbd;
diff --git a/xen/include/asm-x86/spec_ctrl_asm.h b/xen/include/asm-x86/spec_ctrl_asm.h
index 02b3b18ce69f..5a590bac44aa 100644
--- a/xen/include/asm-x86/spec_ctrl_asm.h
+++ b/xen/include/asm-x86/spec_ctrl_asm.h
@@ -136,6 +136,19 @@
 #endif
 .endm
 
+.macro DO_SPEC_CTRL_COND_VERW
+/*
+ * Requires %rsp=cpuinfo
+ *
+ * Issue a VERW for its flushing side effect, if indicated.  This is a Spectre
+ * v1 gadget, but the IRET/VMEntry is serialising.
+ */
+    testb $SCF_verw, CPUINFO_spec_ctrl_flags(%rsp)
+    jz .L\@_verw_skip
+    verw CPUINFO_verw_sel(%rsp)
+.L\@_verw_skip:
+.endm
+
 .macro DO_SPEC_CTRL_ENTRY maybexen:req
 /*
  * Requires %rsp=regs (also cpuinfo if !maybexen)
@@ -231,8 +244,7 @@
 #define SPEC_CTRL_EXIT_TO_PV                                            \
     ALTERNATIVE "",                                                     \
         DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_PV;              \
-    ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)),           \
-        X86_FEATURE_SC_VERW_PV
+    DO_SPEC_CTRL_COND_VERW
 
 /*
  * Use in IST interrupt/exception context.  May interrupt Xen or PV context.
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Enumeration for MMIO Stale Data controls

The three *_NO bits indicate non-susceptibility to the SSDP, FBSDP and PSDP
data movement primitives.

FB_CLEAR indicates that the VERW instruction has re-gained it's Fill Buffer
flushing side effect.  This is only enumerated on parts where VERW had
previously lost it's flushing side effect due to the MDS/TAA vulnerabilities
being fixed in hardware.

FB_CLEAR_CTRL is available on a subset of FB_CLEAR parts where the Fill Buffer
clearing side effect of VERW can be turned off for performance reasons.

This is part of XSA-404.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index b4efc940aa2b..38e0cc2847e0 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -323,7 +323,7 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
      * Hardware read-only information, stating immunity to certain issues, or
      * suggestions of which mitigation to use.
      */
-    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (caps & ARCH_CAPS_RDCL_NO)                        ? " RDCL_NO"        : "",
            (caps & ARCH_CAPS_IBRS_ALL)                       ? " IBRS_ALL"       : "",
            (caps & ARCH_CAPS_RSBA)                           ? " RSBA"           : "",
@@ -332,13 +332,16 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
            (caps & ARCH_CAPS_SSB_NO)                         ? " SSB_NO"         : "",
            (caps & ARCH_CAPS_MDS_NO)                         ? " MDS_NO"         : "",
            (caps & ARCH_CAPS_TAA_NO)                         ? " TAA_NO"         : "",
+           (caps & ARCH_CAPS_SBDR_SSDP_NO)                   ? " SBDR_SSDP_NO"   : "",
+           (caps & ARCH_CAPS_FBSDP_NO)                       ? " FBSDP_NO"       : "",
+           (caps & ARCH_CAPS_PSDP_NO)                        ? " PSDP_NO"        : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_ALWAYS))    ? " IBRS_ALWAYS"    : "",
            (e8b  & cpufeat_mask(X86_FEATURE_STIBP_ALWAYS))   ? " STIBP_ALWAYS"   : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_FAST))      ? " IBRS_FAST"      : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_SAME_MODE)) ? " IBRS_SAME_MODE" : "");
 
     /* Hardware features which need driving to mitigate issues. */
-    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (e8b  & cpufeat_mask(X86_FEATURE_IBPB)) ||
            (_7d0 & cpufeat_mask(X86_FEATURE_IBRSB))          ? " IBPB"           : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS)) ||
@@ -353,7 +356,9 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
            (_7d0 & cpufeat_mask(X86_FEATURE_MD_CLEAR))       ? " MD_CLEAR"       : "",
            (_7d0 & cpufeat_mask(X86_FEATURE_SRBDS_CTRL))     ? " SRBDS_CTRL"     : "",
            (e8b  & cpufeat_mask(X86_FEATURE_VIRT_SSBD))      ? " VIRT_SSBD"      : "",
-           (caps & ARCH_CAPS_TSX_CTRL)                       ? " TSX_CTRL"       : "");
+           (caps & ARCH_CAPS_TSX_CTRL)                       ? " TSX_CTRL"       : "",
+           (caps & ARCH_CAPS_FB_CLEAR)                       ? " FB_CLEAR"       : "",
+           (caps & ARCH_CAPS_FB_CLEAR_CTRL)                  ? " FB_CLEAR_CTRL"  : "");
 
     /* Compiled-in support which pertains to mitigations. */
     if ( IS_ENABLED(CONFIG_INDIRECT_THUNK) || IS_ENABLED(CONFIG_SHADOW_PAGING) )
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index 7a39d94b9a70..c8670eab8ef5 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -56,6 +56,11 @@
 #define  ARCH_CAPS_IF_PSCHANGE_MC_NO        (_AC(1, ULL) <<  6)
 #define  ARCH_CAPS_TSX_CTRL                 (_AC(1, ULL) <<  7)
 #define  ARCH_CAPS_TAA_NO                   (_AC(1, ULL) <<  8)
+#define  ARCH_CAPS_SBDR_SSDP_NO             (_AC(1, ULL) << 13)
+#define  ARCH_CAPS_FBSDP_NO                 (_AC(1, ULL) << 14)
+#define  ARCH_CAPS_PSDP_NO                  (_AC(1, ULL) << 15)
+#define  ARCH_CAPS_FB_CLEAR                 (_AC(1, ULL) << 17)
+#define  ARCH_CAPS_FB_CLEAR_CTRL            (_AC(1, ULL) << 18)
 
 #define MSR_FLUSH_CMD                       0x0000010b
 #define  FLUSH_CMD_L1D                      (_AC(1, ULL) <<  0)
@@ -73,6 +78,7 @@
 #define  MCU_OPT_CTRL_RNGDS_MITG_DIS        (_AC(1, ULL) <<  0)
 #define  MCU_OPT_CTRL_RTM_ALLOW             (_AC(1, ULL) <<  1)
 #define  MCU_OPT_CTRL_RTM_LOCKED            (_AC(1, ULL) <<  2)
+#define  MCU_OPT_CTRL_FB_CLEAR_DIS          (_AC(1, ULL) <<  3)
 
 #define MSR_RTIT_OUTPUT_BASE                0x00000560
 #define MSR_RTIT_OUTPUT_MASK                0x00000561
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Add spec-ctrl=unpriv-mmio

Per Xen's support statement, PCI passthrough should be to trusted domains
because the overall system security depends on factors outside of Xen's
control.

As such, Xen, in a supported configuration, is not vulnerable to DRPW/SBDR.

However, users who have risk assessed their configuration may be happy with
the risk of DoS, but unhappy with the risk of cross-domain data leakage.  Such
users should enable this option.

On CPUs vulnerable to MDS, the existing mitigations are the best we can do to
mitigate MMIO cross-domain data leakage.

On CPUs fixed to MDS but vulnerable MMIO stale data leakage, this option:

 * On CPUs susceptible to FBSDP, mitigates cross-domain fill buffer leakage
   using FB_CLEAR.
 * On CPUs susceptible to SBDR, mitigates RNG data recovery by engaging the
   srb-lock, previously used to mitigate SRBDS.

Both mitigations require microcode from IPU 2022.1, May 2022.

This is part of XSA-404.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
Backporting note: For Xen 4.7 and earlier with bool_t not aliasing bool, the
ARCH_CAPS_FB_CLEAR hunk needs !!

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index ad85785e14b3..d1d5852cdd84 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2106,7 +2106,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
 ### spec-ctrl (x86)
 > `= List of [ <bool>, xen=<bool>, {pv,hvm,msr-sc,rsb,md-clear}=<bool>,
 >              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,eager-fpu,
->              l1d-flush,branch-harden,srb-lock}=<bool> ]`
+>              l1d-flush,branch-harden,srb-lock,unpriv-mmio}=<bool> ]`
 
 Controls for speculative execution sidechannel mitigations.  By default, Xen
 will pick the most appropriate mitigations based on compiled in support,
@@ -2185,8 +2185,16 @@ Xen will enable this mitigation.
 On hardware supporting SRBDS_CTRL, the `srb-lock=` option can be used to force
 or prevent Xen from protect the Special Register Buffer from leaking stale
 data. By default, Xen will enable this mitigation, except on parts where MDS
-is fixed and TAA is fixed/mitigated (in which case, there is believed to be no
-way for an attacker to obtain the stale data).
+is fixed and TAA is fixed/mitigated and there are no unprivileged MMIO
+mappings (in which case, there is believed to be no way for an attacker to
+obtain stale data).
+
+The `unpriv-mmio=` boolean indicates whether the system has (or will have)
+less than fully privileged domains granted access to MMIO devices.  By
+default, this option is disabled.  If enabled, Xen will use the `FB_CLEAR`
+and/or `SRBDS_CTRL` functionality available in the Intel May 2022 microcode
+release to mitigate cross-domain leakage of data via the MMIO Stale Data
+vulnerabilities.
 
 ### sync_console
 > `= <boolean>`
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 38e0cc2847e0..83b856fa9158 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -67,6 +67,8 @@ static bool __initdata cpu_has_bug_msbds_only; /* => minimal HT impact. */
 static bool __initdata cpu_has_bug_mds; /* Any other M{LP,SB,FB}DS combination. */
 
 static int8_t __initdata opt_srb_lock = -1;
+static bool __initdata opt_unpriv_mmio;
+static bool __read_mostly opt_fb_clear_mmio;
 
 static int __init parse_spec_ctrl(const char *s)
 {
@@ -184,6 +186,8 @@ static int __init parse_spec_ctrl(const char *s)
             opt_branch_harden = val;
         else if ( (val = parse_boolean("srb-lock", s, ss)) >= 0 )
             opt_srb_lock = val;
+        else if ( (val = parse_boolean("unpriv-mmio", s, ss)) >= 0 )
+            opt_unpriv_mmio = val;
         else
             rc = -EINVAL;
 
@@ -392,7 +396,8 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
            opt_srb_lock                              ? " SRB_LOCK+" : " SRB_LOCK-",
            opt_ibpb                                  ? " IBPB"  : "",
            opt_l1d_flush                             ? " L1D_FLUSH" : "",
-           opt_md_clear_pv || opt_md_clear_hvm       ? " VERW"  : "",
+           opt_md_clear_pv || opt_md_clear_hvm ||
+           opt_fb_clear_mmio                         ? " VERW"  : "",
            opt_branch_harden                         ? " BRANCH_HARDEN" : "");
 
     /* L1TF diagnostics, printed if vulnerable or PV shadowing is in use. */
@@ -912,7 +917,9 @@ void spec_ctrl_init_domain(struct domain *d)
 {
     bool pv = is_pv_domain(d);
 
-    d->arch.verw = pv ? opt_md_clear_pv : opt_md_clear_hvm;
+    d->arch.verw =
+        (pv ? opt_md_clear_pv : opt_md_clear_hvm) ||
+        (opt_fb_clear_mmio && is_iommu_enabled(d));
 }
 
 void __init init_speculation_mitigations(void)
@@ -1148,6 +1155,18 @@ void __init init_speculation_mitigations(void)
     mds_calculations(caps);
 
     /*
+     * Parts which enumerate FB_CLEAR are those which are post-MDS_NO and have
+     * reintroduced the VERW fill buffer flushing side effect because of a
+     * susceptibility to FBSDP.
+     *
+     * If unprivileged guests have (or will have) MMIO mappings, we can
+     * mitigate cross-domain leakage of fill buffer data by issuing VERW on
+     * the return-to-guest path.
+     */
+    if ( opt_unpriv_mmio )
+        opt_fb_clear_mmio = caps & ARCH_CAPS_FB_CLEAR;
+
+    /*
      * By default, enable PV and HVM mitigations on MDS-vulnerable hardware.
      * This will only be a token effort for MLPDS/MFBDS when HT is enabled,
      * but it is somewhat better than nothing.
@@ -1160,18 +1179,20 @@ void __init init_speculation_mitigations(void)
                             boot_cpu_has(X86_FEATURE_MD_CLEAR));
 
     /*
-     * Enable MDS defences as applicable.  The Idle blocks need using if
-     * either PV or HVM defences are used.
+     * Enable MDS/MMIO defences as applicable.  The Idle blocks need using if
+     * either the PV or HVM MDS defences are used, or if we may give MMIO
+     * access to untrusted guests.
      *
      * HVM is more complicated.  The MD_CLEAR microcode extends L1D_FLUSH with
      * equivalent semantics to avoid needing to perform both flushes on the
-     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH.
+     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH (for
+     * MDS mitigations.  L1D_FLUSH is not safe for MMIO mitigations.)
      *
      * After calculating the appropriate idle setting, simplify
      * opt_md_clear_hvm to mean just "should we VERW on the way into HVM
      * guests", so spec_ctrl_init_domain() can calculate suitable settings.
      */
-    if ( opt_md_clear_pv || opt_md_clear_hvm )
+    if ( opt_md_clear_pv || opt_md_clear_hvm || opt_fb_clear_mmio )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
     opt_md_clear_hvm &= !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush;
 
@@ -1236,14 +1257,19 @@ void __init init_speculation_mitigations(void)
      * On some SRBDS-affected hardware, it may be safe to relax srb-lock by
      * default.
      *
-     * On parts which enumerate MDS_NO and not TAA_NO, TSX is the only known
-     * way to access the Fill Buffer.  If TSX isn't available (inc. SKU
-     * reasons on some models), or TSX is explicitly disabled, then there is
-     * no need for the extra overhead to protect RDRAND/RDSEED.
+     * All parts with SRBDS_CTRL suffer SSDP, the mechanism by which stale RNG
+     * data becomes available to other contexts.  To recover the data, an
+     * attacker needs to use:
+     *  - SBDS (MDS or TAA to sample the cores fill buffer)
+     *  - SBDR (Architecturally retrieve stale transaction buffer contents)
+     *  - DRPW (Architecturally latch stale fill buffer data)
+     *
+     * On MDS_NO parts, and with TAA_NO or TSX unavailable/disabled, and there
+     * is no unprivileged MMIO access, the RNG data doesn't need protecting.
      */
     if ( cpu_has_srbds_ctrl )
     {
-        if ( opt_srb_lock == -1 &&
+        if ( opt_srb_lock == -1 && !opt_unpriv_mmio &&
              (caps & (ARCH_CAPS_MDS_NO|ARCH_CAPS_TAA_NO)) == ARCH_CAPS_MDS_NO &&
              (!cpu_has_hle || ((caps & ARCH_CAPS_TSX_CTRL) && rtm_disabled)) )
             opt_srb_lock = 0;
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Make VERW flushing runtime conditional

Currently, VERW flushing to mitigate MDS is boot time conditional per domain
type.  However, to provide mitigations for DRPW (CVE-2022-21166), we need to
conditionally use VERW based on the trustworthiness of the guest, and the
devices passed through.

Remove the PV/HVM alternatives and instead issue a VERW on the return-to-guest
path depending on the SCF_verw bit in cpuinfo spec_ctrl_flags.

Introduce spec_ctrl_init_domain() and d->arch.verw to calculate the VERW
disposition at domain creation time, and context switch the SCF_verw bit.

For now, VERW flushing is used and controlled exactly as before, but later
patches will add per-domain cases too.

No change in behaviour.

This is part of XSA-404.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 1cab26fef61f..e4c820e17053 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2194,9 +2194,8 @@ in place for guests to use.
 Use of a positive boolean value for either of these options is invalid.
 
 The booleans `pv=`, `hvm=`, `msr-sc=`, `rsb=` and `md-clear=` offer fine
-grained control over the alternative blocks used by Xen.  These impact Xen's
-ability to protect itself, and Xen's ability to virtualise support for guests
-to use.
+grained control over the primitives by Xen.  These impact Xen's ability to
+protect itself, and Xen's ability to virtualise support for guests to use.
 
 * `pv=` and `hvm=` offer control over all suboptions for PV and HVM guests
   respectively.
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index b21272988006..4a61e951facf 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -861,6 +861,8 @@ int arch_domain_create(struct domain *d,
 
     d->arch.msr_relaxed = config->arch.misc_flags & XEN_X86_MSR_RELAXED;
 
+    spec_ctrl_init_domain(d);
+
     return 0;
 
  fail:
@@ -1994,14 +1996,15 @@ static void __context_switch(void)
 void context_switch(struct vcpu *prev, struct vcpu *next)
 {
     unsigned int cpu = smp_processor_id();
+    struct cpu_info *info = get_cpu_info();
     const struct domain *prevd = prev->domain, *nextd = next->domain;
     unsigned int dirty_cpu = read_atomic(&next->dirty_cpu);
 
     ASSERT(prev != next);
     ASSERT(local_irq_is_enabled());
 
-    get_cpu_info()->use_pv_cr3 = false;
-    get_cpu_info()->xen_cr3 = 0;
+    info->use_pv_cr3 = false;
+    info->xen_cr3 = 0;
 
     if ( unlikely(dirty_cpu != cpu) && dirty_cpu != VCPU_CPU_CLEAN )
     {
@@ -2065,6 +2068,11 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
                 *last_id = next_id;
             }
         }
+
+        /* Update the top-of-stack block with the VERW disposition. */
+        info->spec_ctrl_flags &= ~SCF_verw;
+        if ( nextd->arch.verw )
+            info->spec_ctrl_flags |= SCF_verw;
     }
 
     sched_context_switched(prev, next);
diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S
index 49651f3c435a..5f5de45a1309 100644
--- a/xen/arch/x86/hvm/vmx/entry.S
+++ b/xen/arch/x86/hvm/vmx/entry.S
@@ -87,7 +87,7 @@ UNLIKELY_END(realmode)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
         /* SPEC_CTRL_EXIT_TO_VMX   Req: %rsp=regs/cpuinfo              Clob:    */
-        ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)), X86_FEATURE_SC_VERW_HVM
+        DO_SPEC_CTRL_COND_VERW
 
         mov  VCPU_hvm_guest_cr2(%rbx),%rax
 
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 1e226102d399..b4efc940aa2b 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -36,8 +36,8 @@ static bool __initdata opt_msr_sc_pv = true;
 static bool __initdata opt_msr_sc_hvm = true;
 static bool __initdata opt_rsb_pv = true;
 static bool __initdata opt_rsb_hvm = true;
-static int8_t __initdata opt_md_clear_pv = -1;
-static int8_t __initdata opt_md_clear_hvm = -1;
+static int8_t __read_mostly opt_md_clear_pv = -1;
+static int8_t __read_mostly opt_md_clear_hvm = -1;
 
 /* Cmdline controls for Xen's speculative settings. */
 static enum ind_thunk {
@@ -903,6 +903,13 @@ static __init void mds_calculations(uint64_t caps)
     }
 }
 
+void spec_ctrl_init_domain(struct domain *d)
+{
+    bool pv = is_pv_domain(d);
+
+    d->arch.verw = pv ? opt_md_clear_pv : opt_md_clear_hvm;
+}
+
 void __init init_speculation_mitigations(void)
 {
     enum ind_thunk thunk = THUNK_DEFAULT;
@@ -1148,21 +1155,20 @@ void __init init_speculation_mitigations(void)
                             boot_cpu_has(X86_FEATURE_MD_CLEAR));
 
     /*
-     * Enable MDS defences as applicable.  The PV blocks need using all the
-     * time, and the Idle blocks need using if either PV or HVM defences are
-     * used.
+     * Enable MDS defences as applicable.  The Idle blocks need using if
+     * either PV or HVM defences are used.
      *
      * HVM is more complicated.  The MD_CLEAR microcode extends L1D_FLUSH with
-     * equivelent semantics to avoid needing to perform both flushes on the
-     * HVM path.  The HVM blocks don't need activating if our hypervisor told
-     * us it was handling L1D_FLUSH, or we are using L1D_FLUSH ourselves.
+     * equivalent semantics to avoid needing to perform both flushes on the
+     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH.
+     *
+     * After calculating the appropriate idle setting, simplify
+     * opt_md_clear_hvm to mean just "should we VERW on the way into HVM
+     * guests", so spec_ctrl_init_domain() can calculate suitable settings.
      */
-    if ( opt_md_clear_pv )
-        setup_force_cpu_cap(X86_FEATURE_SC_VERW_PV);
     if ( opt_md_clear_pv || opt_md_clear_hvm )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
-    if ( opt_md_clear_hvm && !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush )
-        setup_force_cpu_cap(X86_FEATURE_SC_VERW_HVM);
+    opt_md_clear_hvm &= !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush;
 
     /*
      * Warn the user if they are on MLPDS/MFBDS-vulnerable hardware with HT
diff --git a/xen/include/asm-x86/cpufeatures.h b/xen/include/asm-x86/cpufeatures.h
index 09f619459bc7..9eaab7a2a1fa 100644
--- a/xen/include/asm-x86/cpufeatures.h
+++ b/xen/include/asm-x86/cpufeatures.h
@@ -35,8 +35,7 @@ XEN_CPUFEATURE(SC_RSB_HVM,        X86_SYNTH(19)) /* RSB overwrite needed for HVM
 XEN_CPUFEATURE(XEN_SELFSNOOP,     X86_SYNTH(20)) /* SELFSNOOP gets used by Xen itself */
 XEN_CPUFEATURE(SC_MSR_IDLE,       X86_SYNTH(21)) /* (SC_MSR_PV || SC_MSR_HVM) && default_xen_spec_ctrl */
 XEN_CPUFEATURE(XEN_LBR,           X86_SYNTH(22)) /* Xen uses MSR_DEBUGCTL.LBR */
-XEN_CPUFEATURE(SC_VERW_PV,        X86_SYNTH(23)) /* VERW used by Xen for PV */
-XEN_CPUFEATURE(SC_VERW_HVM,       X86_SYNTH(24)) /* VERW used by Xen for HVM */
+/* Bits 23,24 unused. */
 XEN_CPUFEATURE(SC_VERW_IDLE,      X86_SYNTH(25)) /* VERW used by Xen for idle */
 XEN_CPUFEATURE(XEN_SHSTK,         X86_SYNTH(26)) /* Xen uses CET Shadow Stacks */
 XEN_CPUFEATURE(XEN_IBT,           X86_SYNTH(27)) /* Xen uses CET Indirect Branch Tracking */
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 7213d184b016..d0df7f83aa0c 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -319,6 +319,9 @@ struct arch_domain
     uint32_t pci_cf8;
     uint8_t cmos_idx;
 
+    /* Use VERW on return-to-guest for its flushing side effect. */
+    bool verw;
+
     union {
         struct pv_domain pv;
         struct hvm_domain hvm;
diff --git a/xen/include/asm-x86/spec_ctrl.h b/xen/include/asm-x86/spec_ctrl.h
index 9caecddfec96..68f6c46c470c 100644
--- a/xen/include/asm-x86/spec_ctrl.h
+++ b/xen/include/asm-x86/spec_ctrl.h
@@ -24,6 +24,7 @@
 #define SCF_use_shadow (1 << 0)
 #define SCF_ist_wrmsr  (1 << 1)
 #define SCF_ist_rsb    (1 << 2)
+#define SCF_verw       (1 << 3)
 
 #ifndef __ASSEMBLY__
 
@@ -32,6 +33,7 @@
 #include <asm/msr-index.h>
 
 void init_speculation_mitigations(void);
+void spec_ctrl_init_domain(struct domain *d);
 
 extern bool opt_ibpb;
 extern bool opt_ssbd;
diff --git a/xen/include/asm-x86/spec_ctrl_asm.h b/xen/include/asm-x86/spec_ctrl_asm.h
index 02b3b18ce69f..5a590bac44aa 100644
--- a/xen/include/asm-x86/spec_ctrl_asm.h
+++ b/xen/include/asm-x86/spec_ctrl_asm.h
@@ -136,6 +136,19 @@
 #endif
 .endm
 
+.macro DO_SPEC_CTRL_COND_VERW
+/*
+ * Requires %rsp=cpuinfo
+ *
+ * Issue a VERW for its flushing side effect, if indicated.  This is a Spectre
+ * v1 gadget, but the IRET/VMEntry is serialising.
+ */
+    testb $SCF_verw, CPUINFO_spec_ctrl_flags(%rsp)
+    jz .L\@_verw_skip
+    verw CPUINFO_verw_sel(%rsp)
+.L\@_verw_skip:
+.endm
+
 .macro DO_SPEC_CTRL_ENTRY maybexen:req
 /*
  * Requires %rsp=regs (also cpuinfo if !maybexen)
@@ -231,8 +244,7 @@
 #define SPEC_CTRL_EXIT_TO_PV                                            \
     ALTERNATIVE "",                                                     \
         DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_PV;              \
-    ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)),           \
-        X86_FEATURE_SC_VERW_PV
+    DO_SPEC_CTRL_COND_VERW
 
 /*
  * Use in IST interrupt/exception context.  May interrupt Xen or PV context.
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Enumeration for MMIO Stale Data controls

The three *_NO bits indicate non-susceptibility to the SSDP, FBSDP and PSDP
data movement primitives.

FB_CLEAR indicates that the VERW instruction has re-gained it's Fill Buffer
flushing side effect.  This is only enumerated on parts where VERW had
previously lost it's flushing side effect due to the MDS/TAA vulnerabilities
being fixed in hardware.

FB_CLEAR_CTRL is available on a subset of FB_CLEAR parts where the Fill Buffer
clearing side effect of VERW can be turned off for performance reasons.

This is part of XSA-404.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index b4efc940aa2b..38e0cc2847e0 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -323,7 +323,7 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
      * Hardware read-only information, stating immunity to certain issues, or
      * suggestions of which mitigation to use.
      */
-    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (caps & ARCH_CAPS_RDCL_NO)                        ? " RDCL_NO"        : "",
            (caps & ARCH_CAPS_IBRS_ALL)                       ? " IBRS_ALL"       : "",
            (caps & ARCH_CAPS_RSBA)                           ? " RSBA"           : "",
@@ -332,13 +332,16 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
            (caps & ARCH_CAPS_SSB_NO)                         ? " SSB_NO"         : "",
            (caps & ARCH_CAPS_MDS_NO)                         ? " MDS_NO"         : "",
            (caps & ARCH_CAPS_TAA_NO)                         ? " TAA_NO"         : "",
+           (caps & ARCH_CAPS_SBDR_SSDP_NO)                   ? " SBDR_SSDP_NO"   : "",
+           (caps & ARCH_CAPS_FBSDP_NO)                       ? " FBSDP_NO"       : "",
+           (caps & ARCH_CAPS_PSDP_NO)                        ? " PSDP_NO"        : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_ALWAYS))    ? " IBRS_ALWAYS"    : "",
            (e8b  & cpufeat_mask(X86_FEATURE_STIBP_ALWAYS))   ? " STIBP_ALWAYS"   : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_FAST))      ? " IBRS_FAST"      : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_SAME_MODE)) ? " IBRS_SAME_MODE" : "");
 
     /* Hardware features which need driving to mitigate issues. */
-    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (e8b  & cpufeat_mask(X86_FEATURE_IBPB)) ||
            (_7d0 & cpufeat_mask(X86_FEATURE_IBRSB))          ? " IBPB"           : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS)) ||
@@ -353,7 +356,9 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
            (_7d0 & cpufeat_mask(X86_FEATURE_MD_CLEAR))       ? " MD_CLEAR"       : "",
            (_7d0 & cpufeat_mask(X86_FEATURE_SRBDS_CTRL))     ? " SRBDS_CTRL"     : "",
            (e8b  & cpufeat_mask(X86_FEATURE_VIRT_SSBD))      ? " VIRT_SSBD"      : "",
-           (caps & ARCH_CAPS_TSX_CTRL)                       ? " TSX_CTRL"       : "");
+           (caps & ARCH_CAPS_TSX_CTRL)                       ? " TSX_CTRL"       : "",
+           (caps & ARCH_CAPS_FB_CLEAR)                       ? " FB_CLEAR"       : "",
+           (caps & ARCH_CAPS_FB_CLEAR_CTRL)                  ? " FB_CLEAR_CTRL"  : "");
 
     /* Compiled-in support which pertains to mitigations. */
     if ( IS_ENABLED(CONFIG_INDIRECT_THUNK) || IS_ENABLED(CONFIG_SHADOW_PAGING) )
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index 947778105fb6..1e743461e91d 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -59,6 +59,11 @@
 #define  ARCH_CAPS_IF_PSCHANGE_MC_NO        (_AC(1, ULL) <<  6)
 #define  ARCH_CAPS_TSX_CTRL                 (_AC(1, ULL) <<  7)
 #define  ARCH_CAPS_TAA_NO                   (_AC(1, ULL) <<  8)
+#define  ARCH_CAPS_SBDR_SSDP_NO             (_AC(1, ULL) << 13)
+#define  ARCH_CAPS_FBSDP_NO                 (_AC(1, ULL) << 14)
+#define  ARCH_CAPS_PSDP_NO                  (_AC(1, ULL) << 15)
+#define  ARCH_CAPS_FB_CLEAR                 (_AC(1, ULL) << 17)
+#define  ARCH_CAPS_FB_CLEAR_CTRL            (_AC(1, ULL) << 18)
 
 #define MSR_FLUSH_CMD                       0x0000010b
 #define  FLUSH_CMD_L1D                      (_AC(1, ULL) <<  0)
@@ -76,6 +81,7 @@
 #define  MCU_OPT_CTRL_RNGDS_MITG_DIS        (_AC(1, ULL) <<  0)
 #define  MCU_OPT_CTRL_RTM_ALLOW             (_AC(1, ULL) <<  1)
 #define  MCU_OPT_CTRL_RTM_LOCKED            (_AC(1, ULL) <<  2)
+#define  MCU_OPT_CTRL_FB_CLEAR_DIS          (_AC(1, ULL) <<  3)
 
 #define MSR_RTIT_OUTPUT_BASE                0x00000560
 #define MSR_RTIT_OUTPUT_MASK                0x00000561
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Add spec-ctrl=unpriv-mmio

Per Xen's support statement, PCI passthrough should be to trusted domains
because the overall system security depends on factors outside of Xen's
control.

As such, Xen, in a supported configuration, is not vulnerable to DRPW/SBDR.

However, users who have risk assessed their configuration may be happy with
the risk of DoS, but unhappy with the risk of cross-domain data leakage.  Such
users should enable this option.

On CPUs vulnerable to MDS, the existing mitigations are the best we can do to
mitigate MMIO cross-domain data leakage.

On CPUs fixed to MDS but vulnerable MMIO stale data leakage, this option:

 * On CPUs susceptible to FBSDP, mitigates cross-domain fill buffer leakage
   using FB_CLEAR.
 * On CPUs susceptible to SBDR, mitigates RNG data recovery by engaging the
   srb-lock, previously used to mitigate SRBDS.

Both mitigations require microcode from IPU 2022.1, May 2022.

This is part of XSA-404.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
Backporting note: For Xen 4.7 and earlier with bool_t not aliasing bool, the
ARCH_CAPS_FB_CLEAR hunk needs !!

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index e4c820e17053..e17a835ed254 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2171,7 +2171,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
 ### spec-ctrl (x86)
 > `= List of [ <bool>, xen=<bool>, {pv,hvm,msr-sc,rsb,md-clear}=<bool>,
 >              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,eager-fpu,
->              l1d-flush,branch-harden,srb-lock}=<bool> ]`
+>              l1d-flush,branch-harden,srb-lock,unpriv-mmio}=<bool> ]`
 
 Controls for speculative execution sidechannel mitigations.  By default, Xen
 will pick the most appropriate mitigations based on compiled in support,
@@ -2250,8 +2250,16 @@ Xen will enable this mitigation.
 On hardware supporting SRBDS_CTRL, the `srb-lock=` option can be used to force
 or prevent Xen from protect the Special Register Buffer from leaking stale
 data. By default, Xen will enable this mitigation, except on parts where MDS
-is fixed and TAA is fixed/mitigated (in which case, there is believed to be no
-way for an attacker to obtain the stale data).
+is fixed and TAA is fixed/mitigated and there are no unprivileged MMIO
+mappings (in which case, there is believed to be no way for an attacker to
+obtain stale data).
+
+The `unpriv-mmio=` boolean indicates whether the system has (or will have)
+less than fully privileged domains granted access to MMIO devices.  By
+default, this option is disabled.  If enabled, Xen will use the `FB_CLEAR`
+and/or `SRBDS_CTRL` functionality available in the Intel May 2022 microcode
+release to mitigate cross-domain leakage of data via the MMIO Stale Data
+vulnerabilities.
 
 ### sync_console
 > `= <boolean>`
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 38e0cc2847e0..83b856fa9158 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -67,6 +67,8 @@ static bool __initdata cpu_has_bug_msbds_only; /* => minimal HT impact. */
 static bool __initdata cpu_has_bug_mds; /* Any other M{LP,SB,FB}DS combination. */
 
 static int8_t __initdata opt_srb_lock = -1;
+static bool __initdata opt_unpriv_mmio;
+static bool __read_mostly opt_fb_clear_mmio;
 
 static int __init parse_spec_ctrl(const char *s)
 {
@@ -184,6 +186,8 @@ static int __init parse_spec_ctrl(const char *s)
             opt_branch_harden = val;
         else if ( (val = parse_boolean("srb-lock", s, ss)) >= 0 )
             opt_srb_lock = val;
+        else if ( (val = parse_boolean("unpriv-mmio", s, ss)) >= 0 )
+            opt_unpriv_mmio = val;
         else
             rc = -EINVAL;
 
@@ -392,7 +396,8 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
            opt_srb_lock                              ? " SRB_LOCK+" : " SRB_LOCK-",
            opt_ibpb                                  ? " IBPB"  : "",
            opt_l1d_flush                             ? " L1D_FLUSH" : "",
-           opt_md_clear_pv || opt_md_clear_hvm       ? " VERW"  : "",
+           opt_md_clear_pv || opt_md_clear_hvm ||
+           opt_fb_clear_mmio                         ? " VERW"  : "",
            opt_branch_harden                         ? " BRANCH_HARDEN" : "");
 
     /* L1TF diagnostics, printed if vulnerable or PV shadowing is in use. */
@@ -912,7 +917,9 @@ void spec_ctrl_init_domain(struct domain *d)
 {
     bool pv = is_pv_domain(d);
 
-    d->arch.verw = pv ? opt_md_clear_pv : opt_md_clear_hvm;
+    d->arch.verw =
+        (pv ? opt_md_clear_pv : opt_md_clear_hvm) ||
+        (opt_fb_clear_mmio && is_iommu_enabled(d));
 }
 
 void __init init_speculation_mitigations(void)
@@ -1148,6 +1155,18 @@ void __init init_speculation_mitigations(void)
     mds_calculations(caps);
 
     /*
+     * Parts which enumerate FB_CLEAR are those which are post-MDS_NO and have
+     * reintroduced the VERW fill buffer flushing side effect because of a
+     * susceptibility to FBSDP.
+     *
+     * If unprivileged guests have (or will have) MMIO mappings, we can
+     * mitigate cross-domain leakage of fill buffer data by issuing VERW on
+     * the return-to-guest path.
+     */
+    if ( opt_unpriv_mmio )
+        opt_fb_clear_mmio = caps & ARCH_CAPS_FB_CLEAR;
+
+    /*
      * By default, enable PV and HVM mitigations on MDS-vulnerable hardware.
      * This will only be a token effort for MLPDS/MFBDS when HT is enabled,
      * but it is somewhat better than nothing.
@@ -1160,18 +1179,20 @@ void __init init_speculation_mitigations(void)
                             boot_cpu_has(X86_FEATURE_MD_CLEAR));
 
     /*
-     * Enable MDS defences as applicable.  The Idle blocks need using if
-     * either PV or HVM defences are used.
+     * Enable MDS/MMIO defences as applicable.  The Idle blocks need using if
+     * either the PV or HVM MDS defences are used, or if we may give MMIO
+     * access to untrusted guests.
      *
      * HVM is more complicated.  The MD_CLEAR microcode extends L1D_FLUSH with
      * equivalent semantics to avoid needing to perform both flushes on the
-     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH.
+     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH (for
+     * MDS mitigations.  L1D_FLUSH is not safe for MMIO mitigations.)
      *
      * After calculating the appropriate idle setting, simplify
      * opt_md_clear_hvm to mean just "should we VERW on the way into HVM
      * guests", so spec_ctrl_init_domain() can calculate suitable settings.
      */
-    if ( opt_md_clear_pv || opt_md_clear_hvm )
+    if ( opt_md_clear_pv || opt_md_clear_hvm || opt_fb_clear_mmio )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
     opt_md_clear_hvm &= !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush;
 
@@ -1236,14 +1257,19 @@ void __init init_speculation_mitigations(void)
      * On some SRBDS-affected hardware, it may be safe to relax srb-lock by
      * default.
      *
-     * On parts which enumerate MDS_NO and not TAA_NO, TSX is the only known
-     * way to access the Fill Buffer.  If TSX isn't available (inc. SKU
-     * reasons on some models), or TSX is explicitly disabled, then there is
-     * no need for the extra overhead to protect RDRAND/RDSEED.
+     * All parts with SRBDS_CTRL suffer SSDP, the mechanism by which stale RNG
+     * data becomes available to other contexts.  To recover the data, an
+     * attacker needs to use:
+     *  - SBDS (MDS or TAA to sample the cores fill buffer)
+     *  - SBDR (Architecturally retrieve stale transaction buffer contents)
+     *  - DRPW (Architecturally latch stale fill buffer data)
+     *
+     * On MDS_NO parts, and with TAA_NO or TSX unavailable/disabled, and there
+     * is no unprivileged MMIO access, the RNG data doesn't need protecting.
      */
     if ( cpu_has_srbds_ctrl )
     {
-        if ( opt_srb_lock == -1 &&
+        if ( opt_srb_lock == -1 && !opt_unpriv_mmio &&
              (caps & (ARCH_CAPS_MDS_NO|ARCH_CAPS_TAA_NO)) == ARCH_CAPS_MDS_NO &&
              (!cpu_has_hle || ((caps & ARCH_CAPS_TSX_CTRL) && rtm_disabled)) )
             opt_srb_lock = 0;
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Make VERW flushing runtime conditional

Currently, VERW flushing to mitigate MDS is boot time conditional per domain
type.  However, to provide mitigations for DRPW (CVE-2022-21166), we need to
conditionally use VERW based on the trustworthiness of the guest, and the
devices passed through.

Remove the PV/HVM alternatives and instead issue a VERW on the return-to-guest
path depending on the SCF_verw bit in cpuinfo spec_ctrl_flags.

Introduce spec_ctrl_init_domain() and d->arch.verw to calculate the VERW
disposition at domain creation time, and context switch the SCF_verw bit.

For now, VERW flushing is used and controlled exactly as before, but later
patches will add per-domain cases too.

No change in behaviour.

This is part of XSA-404.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 1d08fb7e9aa6..d5cb09f86541 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2258,9 +2258,8 @@ in place for guests to use.
 Use of a positive boolean value for either of these options is invalid.
 
 The booleans `pv=`, `hvm=`, `msr-sc=`, `rsb=` and `md-clear=` offer fine
-grained control over the alternative blocks used by Xen.  These impact Xen's
-ability to protect itself, and Xen's ability to virtualise support for guests
-to use.
+grained control over the primitives by Xen.  These impact Xen's ability to
+protect itself, and Xen's ability to virtualise support for guests to use.
 
 * `pv=` and `hvm=` offer control over all suboptions for PV and HVM guests
   respectively.
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index ef1812dc1402..1fe6644a71ae 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -863,6 +863,8 @@ int arch_domain_create(struct domain *d,
 
     d->arch.msr_relaxed = config->arch.misc_flags & XEN_X86_MSR_RELAXED;
 
+    spec_ctrl_init_domain(d);
+
     return 0;
 
  fail:
@@ -2017,14 +2019,15 @@ static void __context_switch(void)
 void context_switch(struct vcpu *prev, struct vcpu *next)
 {
     unsigned int cpu = smp_processor_id();
+    struct cpu_info *info = get_cpu_info();
     const struct domain *prevd = prev->domain, *nextd = next->domain;
     unsigned int dirty_cpu = read_atomic(&next->dirty_cpu);
 
     ASSERT(prev != next);
     ASSERT(local_irq_is_enabled());
 
-    get_cpu_info()->use_pv_cr3 = false;
-    get_cpu_info()->xen_cr3 = 0;
+    info->use_pv_cr3 = false;
+    info->xen_cr3 = 0;
 
     if ( unlikely(dirty_cpu != cpu) && dirty_cpu != VCPU_CPU_CLEAN )
     {
@@ -2088,6 +2091,11 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
                 *last_id = next_id;
             }
         }
+
+        /* Update the top-of-stack block with the VERW disposition. */
+        info->spec_ctrl_flags &= ~SCF_verw;
+        if ( nextd->arch.verw )
+            info->spec_ctrl_flags |= SCF_verw;
     }
 
     sched_context_switched(prev, next);
diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S
index 49651f3c435a..5f5de45a1309 100644
--- a/xen/arch/x86/hvm/vmx/entry.S
+++ b/xen/arch/x86/hvm/vmx/entry.S
@@ -87,7 +87,7 @@ UNLIKELY_END(realmode)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
         /* SPEC_CTRL_EXIT_TO_VMX   Req: %rsp=regs/cpuinfo              Clob:    */
-        ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)), X86_FEATURE_SC_VERW_HVM
+        DO_SPEC_CTRL_COND_VERW
 
         mov  VCPU_hvm_guest_cr2(%rbx),%rax
 
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index c19464da70ce..21730aa03071 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -36,8 +36,8 @@ static bool __initdata opt_msr_sc_pv = true;
 static bool __initdata opt_msr_sc_hvm = true;
 static int8_t __initdata opt_rsb_pv = -1;
 static bool __initdata opt_rsb_hvm = true;
-static int8_t __initdata opt_md_clear_pv = -1;
-static int8_t __initdata opt_md_clear_hvm = -1;
+static int8_t __read_mostly opt_md_clear_pv = -1;
+static int8_t __read_mostly opt_md_clear_hvm = -1;
 
 /* Cmdline controls for Xen's speculative settings. */
 static enum ind_thunk {
@@ -932,6 +932,13 @@ static __init void mds_calculations(uint64_t caps)
     }
 }
 
+void spec_ctrl_init_domain(struct domain *d)
+{
+    bool pv = is_pv_domain(d);
+
+    d->arch.verw = pv ? opt_md_clear_pv : opt_md_clear_hvm;
+}
+
 void __init init_speculation_mitigations(void)
 {
     enum ind_thunk thunk = THUNK_DEFAULT;
@@ -1196,21 +1203,20 @@ void __init init_speculation_mitigations(void)
                             boot_cpu_has(X86_FEATURE_MD_CLEAR));
 
     /*
-     * Enable MDS defences as applicable.  The PV blocks need using all the
-     * time, and the Idle blocks need using if either PV or HVM defences are
-     * used.
+     * Enable MDS defences as applicable.  The Idle blocks need using if
+     * either PV or HVM defences are used.
      *
      * HVM is more complicated.  The MD_CLEAR microcode extends L1D_FLUSH with
-     * equivelent semantics to avoid needing to perform both flushes on the
-     * HVM path.  The HVM blocks don't need activating if our hypervisor told
-     * us it was handling L1D_FLUSH, or we are using L1D_FLUSH ourselves.
+     * equivalent semantics to avoid needing to perform both flushes on the
+     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH.
+     *
+     * After calculating the appropriate idle setting, simplify
+     * opt_md_clear_hvm to mean just "should we VERW on the way into HVM
+     * guests", so spec_ctrl_init_domain() can calculate suitable settings.
      */
-    if ( opt_md_clear_pv )
-        setup_force_cpu_cap(X86_FEATURE_SC_VERW_PV);
     if ( opt_md_clear_pv || opt_md_clear_hvm )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
-    if ( opt_md_clear_hvm && !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush )
-        setup_force_cpu_cap(X86_FEATURE_SC_VERW_HVM);
+    opt_md_clear_hvm &= !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush;
 
     /*
      * Warn the user if they are on MLPDS/MFBDS-vulnerable hardware with HT
diff --git a/xen/include/asm-x86/cpufeatures.h b/xen/include/asm-x86/cpufeatures.h
index ff3157d52d13..bd45a144ee78 100644
--- a/xen/include/asm-x86/cpufeatures.h
+++ b/xen/include/asm-x86/cpufeatures.h
@@ -35,8 +35,7 @@ XEN_CPUFEATURE(SC_RSB_HVM,        X86_SYNTH(19)) /* RSB overwrite needed for HVM
 XEN_CPUFEATURE(XEN_SELFSNOOP,     X86_SYNTH(20)) /* SELFSNOOP gets used by Xen itself */
 XEN_CPUFEATURE(SC_MSR_IDLE,       X86_SYNTH(21)) /* (SC_MSR_PV || SC_MSR_HVM) && default_xen_spec_ctrl */
 XEN_CPUFEATURE(XEN_LBR,           X86_SYNTH(22)) /* Xen uses MSR_DEBUGCTL.LBR */
-XEN_CPUFEATURE(SC_VERW_PV,        X86_SYNTH(23)) /* VERW used by Xen for PV */
-XEN_CPUFEATURE(SC_VERW_HVM,       X86_SYNTH(24)) /* VERW used by Xen for HVM */
+/* Bits 23,24 unused. */
 XEN_CPUFEATURE(SC_VERW_IDLE,      X86_SYNTH(25)) /* VERW used by Xen for idle */
 XEN_CPUFEATURE(XEN_SHSTK,         X86_SYNTH(26)) /* Xen uses CET Shadow Stacks */
 XEN_CPUFEATURE(XEN_IBT,           X86_SYNTH(27)) /* Xen uses CET Indirect Branch Tracking */
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 92d54de0b9a1..2398a1d99da9 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -319,6 +319,9 @@ struct arch_domain
     uint32_t pci_cf8;
     uint8_t cmos_idx;
 
+    /* Use VERW on return-to-guest for its flushing side effect. */
+    bool verw;
+
     union {
         struct pv_domain pv;
         struct hvm_domain hvm;
diff --git a/xen/include/asm-x86/spec_ctrl.h b/xen/include/asm-x86/spec_ctrl.h
index f76029523610..751355f471f4 100644
--- a/xen/include/asm-x86/spec_ctrl.h
+++ b/xen/include/asm-x86/spec_ctrl.h
@@ -24,6 +24,7 @@
 #define SCF_use_shadow (1 << 0)
 #define SCF_ist_wrmsr  (1 << 1)
 #define SCF_ist_rsb    (1 << 2)
+#define SCF_verw       (1 << 3)
 
 #ifndef __ASSEMBLY__
 
@@ -32,6 +33,7 @@
 #include <asm/msr-index.h>
 
 void init_speculation_mitigations(void);
+void spec_ctrl_init_domain(struct domain *d);
 
 extern bool opt_ibpb;
 extern bool opt_ssbd;
diff --git a/xen/include/asm-x86/spec_ctrl_asm.h b/xen/include/asm-x86/spec_ctrl_asm.h
index 02b3b18ce69f..5a590bac44aa 100644
--- a/xen/include/asm-x86/spec_ctrl_asm.h
+++ b/xen/include/asm-x86/spec_ctrl_asm.h
@@ -136,6 +136,19 @@
 #endif
 .endm
 
+.macro DO_SPEC_CTRL_COND_VERW
+/*
+ * Requires %rsp=cpuinfo
+ *
+ * Issue a VERW for its flushing side effect, if indicated.  This is a Spectre
+ * v1 gadget, but the IRET/VMEntry is serialising.
+ */
+    testb $SCF_verw, CPUINFO_spec_ctrl_flags(%rsp)
+    jz .L\@_verw_skip
+    verw CPUINFO_verw_sel(%rsp)
+.L\@_verw_skip:
+.endm
+
 .macro DO_SPEC_CTRL_ENTRY maybexen:req
 /*
  * Requires %rsp=regs (also cpuinfo if !maybexen)
@@ -231,8 +244,7 @@
 #define SPEC_CTRL_EXIT_TO_PV                                            \
     ALTERNATIVE "",                                                     \
         DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_PV;              \
-    ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)),           \
-        X86_FEATURE_SC_VERW_PV
+    DO_SPEC_CTRL_COND_VERW
 
 /*
  * Use in IST interrupt/exception context.  May interrupt Xen or PV context.
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Enumeration for MMIO Stale Data controls

The three *_NO bits indicate non-susceptibility to the SSDP, FBSDP and PSDP
data movement primitives.

FB_CLEAR indicates that the VERW instruction has re-gained it's Fill Buffer
flushing side effect.  This is only enumerated on parts where VERW had
previously lost it's flushing side effect due to the MDS/TAA vulnerabilities
being fixed in hardware.

FB_CLEAR_CTRL is available on a subset of FB_CLEAR parts where the Fill Buffer
clearing side effect of VERW can be turned off for performance reasons.

This is part of XSA-404.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 21730aa03071..d285538bde9f 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -323,7 +323,7 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
      * Hardware read-only information, stating immunity to certain issues, or
      * suggestions of which mitigation to use.
      */
-    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (caps & ARCH_CAPS_RDCL_NO)                        ? " RDCL_NO"        : "",
            (caps & ARCH_CAPS_IBRS_ALL)                       ? " IBRS_ALL"       : "",
            (caps & ARCH_CAPS_RSBA)                           ? " RSBA"           : "",
@@ -332,13 +332,16 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
            (caps & ARCH_CAPS_SSB_NO)                         ? " SSB_NO"         : "",
            (caps & ARCH_CAPS_MDS_NO)                         ? " MDS_NO"         : "",
            (caps & ARCH_CAPS_TAA_NO)                         ? " TAA_NO"         : "",
+           (caps & ARCH_CAPS_SBDR_SSDP_NO)                   ? " SBDR_SSDP_NO"   : "",
+           (caps & ARCH_CAPS_FBSDP_NO)                       ? " FBSDP_NO"       : "",
+           (caps & ARCH_CAPS_PSDP_NO)                        ? " PSDP_NO"        : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_ALWAYS))    ? " IBRS_ALWAYS"    : "",
            (e8b  & cpufeat_mask(X86_FEATURE_STIBP_ALWAYS))   ? " STIBP_ALWAYS"   : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_FAST))      ? " IBRS_FAST"      : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS_SAME_MODE)) ? " IBRS_SAME_MODE" : "");
 
     /* Hardware features which need driving to mitigate issues. */
-    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s\n",
+    printk("  Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s\n",
            (e8b  & cpufeat_mask(X86_FEATURE_IBPB)) ||
            (_7d0 & cpufeat_mask(X86_FEATURE_IBRSB))          ? " IBPB"           : "",
            (e8b  & cpufeat_mask(X86_FEATURE_IBRS)) ||
@@ -353,7 +356,9 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
            (_7d0 & cpufeat_mask(X86_FEATURE_MD_CLEAR))       ? " MD_CLEAR"       : "",
            (_7d0 & cpufeat_mask(X86_FEATURE_SRBDS_CTRL))     ? " SRBDS_CTRL"     : "",
            (e8b  & cpufeat_mask(X86_FEATURE_VIRT_SSBD))      ? " VIRT_SSBD"      : "",
-           (caps & ARCH_CAPS_TSX_CTRL)                       ? " TSX_CTRL"       : "");
+           (caps & ARCH_CAPS_TSX_CTRL)                       ? " TSX_CTRL"       : "",
+           (caps & ARCH_CAPS_FB_CLEAR)                       ? " FB_CLEAR"       : "",
+           (caps & ARCH_CAPS_FB_CLEAR_CTRL)                  ? " FB_CLEAR_CTRL"  : "");
 
     /* Compiled-in support which pertains to mitigations. */
     if ( IS_ENABLED(CONFIG_INDIRECT_THUNK) || IS_ENABLED(CONFIG_SHADOW_PAGING) )
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index 31964b88af7a..72bc32ba04ff 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -66,6 +66,11 @@
 #define  ARCH_CAPS_IF_PSCHANGE_MC_NO        (_AC(1, ULL) <<  6)
 #define  ARCH_CAPS_TSX_CTRL                 (_AC(1, ULL) <<  7)
 #define  ARCH_CAPS_TAA_NO                   (_AC(1, ULL) <<  8)
+#define  ARCH_CAPS_SBDR_SSDP_NO             (_AC(1, ULL) << 13)
+#define  ARCH_CAPS_FBSDP_NO                 (_AC(1, ULL) << 14)
+#define  ARCH_CAPS_PSDP_NO                  (_AC(1, ULL) << 15)
+#define  ARCH_CAPS_FB_CLEAR                 (_AC(1, ULL) << 17)
+#define  ARCH_CAPS_FB_CLEAR_CTRL            (_AC(1, ULL) << 18)
 
 #define MSR_FLUSH_CMD                       0x0000010b
 #define  FLUSH_CMD_L1D                      (_AC(1, ULL) <<  0)
@@ -83,6 +88,7 @@
 #define  MCU_OPT_CTRL_RNGDS_MITG_DIS        (_AC(1, ULL) <<  0)
 #define  MCU_OPT_CTRL_RTM_ALLOW             (_AC(1, ULL) <<  1)
 #define  MCU_OPT_CTRL_RTM_LOCKED            (_AC(1, ULL) <<  2)
+#define  MCU_OPT_CTRL_FB_CLEAR_DIS          (_AC(1, ULL) <<  3)
 
 #define MSR_RTIT_OUTPUT_BASE                0x00000560
 #define MSR_RTIT_OUTPUT_MASK                0x00000561
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Add spec-ctrl=unpriv-mmio

Per Xen's support statement, PCI passthrough should be to trusted domains
because the overall system security depends on factors outside of Xen's
control.

As such, Xen, in a supported configuration, is not vulnerable to DRPW/SBDR.

However, users who have risk assessed their configuration may be happy with
the risk of DoS, but unhappy with the risk of cross-domain data leakage.  Such
users should enable this option.

On CPUs vulnerable to MDS, the existing mitigations are the best we can do to
mitigate MMIO cross-domain data leakage.

On CPUs fixed to MDS but vulnerable MMIO stale data leakage, this option:

 * On CPUs susceptible to FBSDP, mitigates cross-domain fill buffer leakage
   using FB_CLEAR.
 * On CPUs susceptible to SBDR, mitigates RNG data recovery by engaging the
   srb-lock, previously used to mitigate SRBDS.

Both mitigations require microcode from IPU 2022.1, May 2022.

This is part of XSA-404.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
Backporting note: For Xen 4.7 and earlier with bool_t not aliasing bool, the
ARCH_CAPS_FB_CLEAR hunk needs !!

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index d5cb09f86541..a642e43476a2 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2235,7 +2235,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
 ### spec-ctrl (x86)
 > `= List of [ <bool>, xen=<bool>, {pv,hvm,msr-sc,rsb,md-clear}=<bool>,
 >              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,eager-fpu,
->              l1d-flush,branch-harden,srb-lock}=<bool> ]`
+>              l1d-flush,branch-harden,srb-lock,unpriv-mmio}=<bool> ]`
 
 Controls for speculative execution sidechannel mitigations.  By default, Xen
 will pick the most appropriate mitigations based on compiled in support,
@@ -2314,8 +2314,16 @@ Xen will enable this mitigation.
 On hardware supporting SRBDS_CTRL, the `srb-lock=` option can be used to force
 or prevent Xen from protect the Special Register Buffer from leaking stale
 data. By default, Xen will enable this mitigation, except on parts where MDS
-is fixed and TAA is fixed/mitigated (in which case, there is believed to be no
-way for an attacker to obtain the stale data).
+is fixed and TAA is fixed/mitigated and there are no unprivileged MMIO
+mappings (in which case, there is believed to be no way for an attacker to
+obtain stale data).
+
+The `unpriv-mmio=` boolean indicates whether the system has (or will have)
+less than fully privileged domains granted access to MMIO devices.  By
+default, this option is disabled.  If enabled, Xen will use the `FB_CLEAR`
+and/or `SRBDS_CTRL` functionality available in the Intel May 2022 microcode
+release to mitigate cross-domain leakage of data via the MMIO Stale Data
+vulnerabilities.
 
 ### sync_console
 > `= <boolean>`
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index d285538bde9f..099113ba41e6 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -67,6 +67,8 @@ static bool __initdata cpu_has_bug_msbds_only; /* => minimal HT impact. */
 static bool __initdata cpu_has_bug_mds; /* Any other M{LP,SB,FB}DS combination. */
 
 static int8_t __initdata opt_srb_lock = -1;
+static bool __initdata opt_unpriv_mmio;
+static bool __read_mostly opt_fb_clear_mmio;
 
 static int __init parse_spec_ctrl(const char *s)
 {
@@ -184,6 +186,8 @@ static int __init parse_spec_ctrl(const char *s)
             opt_branch_harden = val;
         else if ( (val = parse_boolean("srb-lock", s, ss)) >= 0 )
             opt_srb_lock = val;
+        else if ( (val = parse_boolean("unpriv-mmio", s, ss)) >= 0 )
+            opt_unpriv_mmio = val;
         else
             rc = -EINVAL;
 
@@ -392,7 +396,8 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
            opt_srb_lock                              ? " SRB_LOCK+" : " SRB_LOCK-",
            opt_ibpb                                  ? " IBPB"  : "",
            opt_l1d_flush                             ? " L1D_FLUSH" : "",
-           opt_md_clear_pv || opt_md_clear_hvm       ? " VERW"  : "",
+           opt_md_clear_pv || opt_md_clear_hvm ||
+           opt_fb_clear_mmio                         ? " VERW"  : "",
            opt_branch_harden                         ? " BRANCH_HARDEN" : "");
 
     /* L1TF diagnostics, printed if vulnerable or PV shadowing is in use. */
@@ -941,7 +946,9 @@ void spec_ctrl_init_domain(struct domain *d)
 {
     bool pv = is_pv_domain(d);
 
-    d->arch.verw = pv ? opt_md_clear_pv : opt_md_clear_hvm;
+    d->arch.verw =
+        (pv ? opt_md_clear_pv : opt_md_clear_hvm) ||
+        (opt_fb_clear_mmio && is_iommu_enabled(d));
 }
 
 void __init init_speculation_mitigations(void)
@@ -1196,6 +1203,18 @@ void __init init_speculation_mitigations(void)
     mds_calculations(caps);
 
     /*
+     * Parts which enumerate FB_CLEAR are those which are post-MDS_NO and have
+     * reintroduced the VERW fill buffer flushing side effect because of a
+     * susceptibility to FBSDP.
+     *
+     * If unprivileged guests have (or will have) MMIO mappings, we can
+     * mitigate cross-domain leakage of fill buffer data by issuing VERW on
+     * the return-to-guest path.
+     */
+    if ( opt_unpriv_mmio )
+        opt_fb_clear_mmio = caps & ARCH_CAPS_FB_CLEAR;
+
+    /*
      * By default, enable PV and HVM mitigations on MDS-vulnerable hardware.
      * This will only be a token effort for MLPDS/MFBDS when HT is enabled,
      * but it is somewhat better than nothing.
@@ -1208,18 +1227,20 @@ void __init init_speculation_mitigations(void)
                             boot_cpu_has(X86_FEATURE_MD_CLEAR));
 
     /*
-     * Enable MDS defences as applicable.  The Idle blocks need using if
-     * either PV or HVM defences are used.
+     * Enable MDS/MMIO defences as applicable.  The Idle blocks need using if
+     * either the PV or HVM MDS defences are used, or if we may give MMIO
+     * access to untrusted guests.
      *
      * HVM is more complicated.  The MD_CLEAR microcode extends L1D_FLUSH with
      * equivalent semantics to avoid needing to perform both flushes on the
-     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH.
+     * HVM path.  Therefore, we don't need VERW in addition to L1D_FLUSH (for
+     * MDS mitigations.  L1D_FLUSH is not safe for MMIO mitigations.)
      *
      * After calculating the appropriate idle setting, simplify
      * opt_md_clear_hvm to mean just "should we VERW on the way into HVM
      * guests", so spec_ctrl_init_domain() can calculate suitable settings.
      */
-    if ( opt_md_clear_pv || opt_md_clear_hvm )
+    if ( opt_md_clear_pv || opt_md_clear_hvm || opt_fb_clear_mmio )
         setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
     opt_md_clear_hvm &= !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush;
 
@@ -1284,14 +1305,19 @@ void __init init_speculation_mitigations(void)
      * On some SRBDS-affected hardware, it may be safe to relax srb-lock by
      * default.
      *
-     * On parts which enumerate MDS_NO and not TAA_NO, TSX is the only known
-     * way to access the Fill Buffer.  If TSX isn't available (inc. SKU
-     * reasons on some models), or TSX is explicitly disabled, then there is
-     * no need for the extra overhead to protect RDRAND/RDSEED.
+     * All parts with SRBDS_CTRL suffer SSDP, the mechanism by which stale RNG
+     * data becomes available to other contexts.  To recover the data, an
+     * attacker needs to use:
+     *  - SBDS (MDS or TAA to sample the cores fill buffer)
+     *  - SBDR (Architecturally retrieve stale transaction buffer contents)
+     *  - DRPW (Architecturally latch stale fill buffer data)
+     *
+     * On MDS_NO parts, and with TAA_NO or TSX unavailable/disabled, and there
+     * is no unprivileged MMIO access, the RNG data doesn't need protecting.
      */
     if ( cpu_has_srbds_ctrl )
     {
-        if ( opt_srb_lock == -1 &&
+        if ( opt_srb_lock == -1 && !opt_unpriv_mmio &&
              (caps & (ARCH_CAPS_MDS_NO|ARCH_CAPS_TAA_NO)) == ARCH_CAPS_MDS_NO &&
              (!cpu_has_hle || ((caps & ARCH_CAPS_TSX_CTRL) && rtm_disabled)) )
             opt_srb_lock = 0;