Xen Security Advisory 469 v2 (CVE-2024-28956) - x86: Indirect Target Selection

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

            Xen Security Advisory CVE-2024-28956 / XSA-469
                               version 2

                    x86: Indirect Target Selection

UPDATES IN VERSION 2
====================

State the CVE.

ISSUE DESCRIPTION
=================

Researchers at VU Amsterdam have released Training Solo, detailing
several speculative attacks which bypass current protections.

One issue, which Intel have named Indirect Target Selection, is a bug in
the hardware support for prediction-domain isolation.  The mitigation
for this involves both microcode and software changes in Xen.

For more details, see:
  https://vusec.net/projects/training-solo
  https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/indirect-target-selection.html

Another issue discussed in the Training Solo paper pertains to
classic-BPF.  Xen does not have any capability similar to BPF filters,
so is not believed to be affected by this issue.

IMPACT
======

An attacker might be able to infer the contents of arbitrary host
memory, including memory assigned to other guests.

VULNERABLE SYSTEMS
==================

Systems running all versions of Xen are affected.

Only Intel x86 CPUs are potentially affected.  CPUs from other
manufacturers are not known to be affected.

The affected Intel CPUs are believed to be those which have eIBRS
(hardware Spectre-v2 mitigations, starting with Cascade Lake) up to but
not including those which have BHI_CTRL (Alder Lake and Sapphire
Rapids).  See the Intel whitepaper for full details.
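
As a rough triage sketch only (not from the advisory; bit positions
taken from Intel's public enumeration and worth verifying against the
whitepaper), eIBRS is advertised via IA32_ARCH_CAPABILITIES.IBRS_ALL
and BHI_CTRL via CPUID.(EAX=7,ECX=2):EDX bit 4:

    #include <stdbool.h>
    #include <stdint.h>

    #define ARCH_CAPS_IBRS_ALL      (1ULL << 1)  /* eIBRS */
    #define CPUID_7_2_EDX_BHI_CTRL  (1u << 4)

    /* Potentially affected if eIBRS is enumerated but BHI_CTRL is not.
     * A first pass over the range described above; not a substitute
     * for Intel's model-specific affected-CPU list. */
    static bool maybe_its_affected(uint64_t arch_caps, uint32_t leaf7s2_edx)
    {
        bool eibrs    = arch_caps & ARCH_CAPS_IBRS_ALL;
        bool bhi_ctrl = leaf7s2_edx & CPUID_7_2_EDX_BHI_CTRL;

        return eibrs && !bhi_ctrl;
    }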

MITIGATION
==========

There are no mitigations.

RESOLUTION
==========

Intel are producing microcode to address part of this issue, by
extending IBPB.  This was released in IPU 2025.1 (February 2025).
Consult your dom0 OS vendor and/or hardware vendor for updated
microcode.

In addition to the microcode, changes are required to Xen to address the
other part of the issue.  These involve recompiling Xen using Return
Thunks (-mfunction-return).  Support for Return Thunks is available in
GCC 8 and Clang 15.  Therefore, the Xen patches further rely on a
toolchain at least this new.
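
As an illustration of what the recompilation changes (a sketch based
on our reading of the GCC/Clang documentation; Xen's actual build
options may differ), -mfunction-return=thunk-extern replaces each RET
with a jump to an externally-provided __x86_return_thunk, which can
then be placed at an ITS-safe address:

    /* demo.c, compiled with: gcc -O2 -mfunction-return=thunk-extern -c demo.c */
    int add(int a, int b)
    {
        return a + b;   /* emitted approximately as:
                         *     leal  (%rdi,%rsi), %eax
                         *     jmp   __x86_return_thunk   # instead of ret
                         */
    }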

Note that patches for released versions are generally prepared to
apply to the stable branches, and may not apply cleanly to the most
recent release tarball.  Downstreams are encouraged to update to the
tip of the stable branch before applying these patches.

xsa469/xsa469-??.patch           xen-unstable
xsa469/xsa469-4.20-??.patch      Xen 4.20.x
xsa469/xsa469-4.19-??.patch      Xen 4.19.x
xsa469/xsa469-4.18-??.patch      Xen 4.18.x
xsa469/xsa469-4.17-??.patch      Xen 4.17.x

$ sha256sum xsa469*/*
24648f43282a4c4917f49a828f0889139ffd08710884c2d5d2a149213f889316  xsa469/xsa469-01.patch
f9e1e999e5121b71263ccac9e5dbc7f2422751cd8776526b59a3a535245ee271  xsa469/xsa469-02.patch
3aebd84d8c9374e8db4f28a1bd0265ff46d60a142e8cc983c0eae99a3caa4459  xsa469/xsa469-03.patch
1f836801a543d00a4e79daff0438932c7d7b5e9ee76dead124d290e805adc0f2  xsa469/xsa469-4.17-01.patch
aaa7b6967b3ff8b279675edafbccff4e1882202251e8f37c1aaab8e16f4ca8c9  xsa469/xsa469-4.17-02.patch
cf9a224628f218a7afdf7a275da270b8f469342133d3239c19a8b05adb8d9d9e  xsa469/xsa469-4.17-03.patch
5179885f9452f932b9829e942a6942d7a1bba6820c383b0e0ac7ade34f6f444c  xsa469/xsa469-4.17-04.patch
7d44272960d7ecdc920a14eb19c686c6808fe980bb85c3a467b5e343f137d927  xsa469/xsa469-4.17-05.patch
3984f20f3b84a579a4828da8c60268f55f1ff2dfc030d1790efb52a6ad659cd9  xsa469/xsa469-4.17-06.patch
8ec20c0943def1764c858c76699991aa0b7af76a8773be8219d62c240cf0b294  xsa469/xsa469-4.17-07.patch
00943be479ac428a496dec248c33e9ae9a0002e437a50c46a43d15146ddcccd8  xsa469/xsa469-4.18-01.patch
b5e9ab3ba33c7a5ad2948521a71d73b8e587d4b86fb73b3ce06557343753b097  xsa469/xsa469-4.18-02.patch
cf9a224628f218a7afdf7a275da270b8f469342133d3239c19a8b05adb8d9d9e  xsa469/xsa469-4.18-03.patch
5179885f9452f932b9829e942a6942d7a1bba6820c383b0e0ac7ade34f6f444c  xsa469/xsa469-4.18-04.patch
d7271118c516ea2b57b82afbaa67e44ebc0bee9067925850671a47ba7e6916fc  xsa469/xsa469-4.18-05.patch
99aa42f3b4d492c21d3027f74abe20183cdd411fdf94d62c10c423e6516db5aa  xsa469/xsa469-4.18-06.patch
893e3b721fe00623a8d75414f40f11033a4a3999d368a141eec8652d9733a77e  xsa469/xsa469-4.18-07.patch
a5a958452c4b3adc820692349103892ec3ffc3195c9b4e1fc55547d2ec7aec57  xsa469/xsa469-4.19-01.patch
2f3a3dadc69026ffa0a093ae34c239023e2336b3f9ec90a1ddc5df7b103c158c  xsa469/xsa469-4.19-02.patch
3aebd84d8c9374e8db4f28a1bd0265ff46d60a142e8cc983c0eae99a3caa4459  xsa469/xsa469-4.19-03.patch
bbdb2fcda3d2fe36ceaaf235700158f5a4f08c8983732265725cc7ac811c9bd2  xsa469/xsa469-4.19-04.patch
fd01af8b8cc66b6a1f0e2a7383b4c320cbb1513b7f23a7bd84fbb3e626751122  xsa469/xsa469-4.19-05.patch
1ed51e005c7f07a99793bfc97680bc02e8661412d88a853594f40a02599e03c2  xsa469/xsa469-4.19-06.patch
4dda661661f23b48f62caeaadd1b02ef79b2d7b08e61ab44c2eacdd2171f63b8  xsa469/xsa469-4.19-07.patch
27ce09ab9bb9460bec08761b2f739ce5adb292dc13664dfa425e42e5b8db8731  xsa469/xsa469-4.20-01.patch
f9e1e999e5121b71263ccac9e5dbc7f2422751cd8776526b59a3a535245ee271  xsa469/xsa469-4.20-02.patch
3aebd84d8c9374e8db4f28a1bd0265ff46d60a142e8cc983c0eae99a3caa4459  xsa469/xsa469-4.20-03.patch
bbdb2fcda3d2fe36ceaaf235700158f5a4f08c8983732265725cc7ac811c9bd2  xsa469/xsa469-4.20-04.patch
656031e56bbf8c8d0e8bd294c1e8857dd80850ae3cc68efbf42564f2b9efdb1f  xsa469/xsa469-4.20-05.patch
3343a834e0931b56490a50cc9fc7488a13a98a00923e4399020f628b3aab0220  xsa469/xsa469-4.20-06.patch
465d916cac02bb4ae1edf9db34fe5d2426ac1681a1078d331fbdf449834692c4  xsa469/xsa469-4.20-07.patch
97335c870407928039464307611c0b9687fd7cdfde5124ec5441a4fddfaa4165  xsa469/xsa469-04.patch
226771d9359b3c341c0978a145c29478d4c6ca196ec67d48c3df6115aee27940  xsa469/xsa469-05.patch
17eae8965871975569be4228943d5136eb65951d9c9513d2b0fe743e8cf8e4ef  xsa469/xsa469-06.patch
fa8b7ed035e242a3881431150ee86965f26520dab4432e3f7ac9862063491f72  xsa469/xsa469-07.patch
$
-----BEGIN PGP SIGNATURE-----

iQFABAEBCAAqFiEEI+MiLBRfRHX6gGCng/4UyVfoK9kFAmgiLJAMHHBncEB4ZW4u
b3JnAAoJEIP+FMlX6CvZ7uUIAIrV83bfhcHAnZvWo5CGOtuyuwQQuobYKqUcC3fq
/LGKVwmrQVCR6qm2YeI8TIhJfet8OiqutteaVknAJlegVR8uTIznIr7CzjmzgoR5
ojnw0Ae0rQ5rhzSeMvBD1VsbheBI7nmLR9dorMs+Rp9MiVzboT81XOpWDTMHisjI
4LpN+TB+yIB76hQQ4divDnqFKxA1SFQWrmn2bcr3v5cJ1cctvLOic/67WO9Uto92
5tG7iV68cStski0wMvjO1isalCTKSaU7eDjTe1ar+/SwhE4T0RKfcpim3xcc5ZVk
cH/xKcK4vbJUKIxjqnDrEdT3uGk1vdVxJfNAOCpfZc5EwiI=
=A2YZ
-----END PGP SIGNATURE-----
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/alternative: Support replacements when a feature is not present

Use the top bit of a->cpuid to express inverted polarity.  This requires
stripping the top bit back out when performing the sanity checks.

Despite only being used once, create a replace boolean to express the decision
more clearly in _apply_alternatives().

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
index 43b009888c02..4d1bbe7313fd 100644
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -228,6 +228,8 @@ static int init_or_livepatch _apply_alternatives(struct alt_instr *start,
         uint8_t *repl = ALT_REPL_PTR(a);
         uint8_t buf[MAX_PATCH_LEN];
         unsigned int total_len = a->orig_len + a->pad_len;
+        unsigned int feat = a->cpuid & ~ALT_FLAG_NOT;
+        bool inv = a->cpuid & ALT_FLAG_NOT, replace;
 
         if ( a->repl_len > total_len )
         {
@@ -245,11 +247,11 @@ static int init_or_livepatch _apply_alternatives(struct alt_instr *start,
             return -ENOSPC;
         }
 
-        if ( a->cpuid >= NCAPINTS * 32 )
+        if ( feat >= NCAPINTS * 32 )
         {
              printk(XENLOG_ERR
                    "Alt for %ps, feature %#x outside of featureset range %#x\n",
-                   ALT_ORIG_PTR(a), a->cpuid, NCAPINTS * 32);
+                   ALT_ORIG_PTR(a), feat, NCAPINTS * 32);
             return -ERANGE;
         }
 
@@ -271,8 +273,14 @@ static int init_or_livepatch _apply_alternatives(struct alt_instr *start,
         if ( a->priv )
             continue;
 
+        /*
+         * Should a replacement be performed?  Most replacements have positive
+         * polarity, but we support negative polarity too.
+         */
+        replace = boot_cpu_has(feat) ^ inv;
+
         /* If there is no replacement to make, see about optimising the nops. */
-        if ( !boot_cpu_has(a->cpuid) )
+        if ( !replace )
         {
             /* Origin site site already touched?  Don't nop anything. */
             if ( base->priv )
diff --git a/xen/arch/x86/include/asm/alternative-asm.h b/xen/arch/x86/include/asm/alternative-asm.h
index 22da9f89f15a..3eb0f4e8a02a 100644
--- a/xen/arch/x86/include/asm/alternative-asm.h
+++ b/xen/arch/x86/include/asm/alternative-asm.h
@@ -12,7 +12,7 @@
  * instruction. See apply_alternatives().
  */
 .macro altinstruction_entry orig, repl, feature, orig_len, repl_len, pad_len
-    .if \feature >= NCAPINTS * 32
+    .if ((\feature) & ~ALT_FLAG_NOT) >= NCAPINTS * 32
         .error "alternative feature outside of featureset range"
     .endif
     .long \orig - .
diff --git a/xen/arch/x86/include/asm/alternative.h b/xen/arch/x86/include/asm/alternative.h
index 29c3d724b07f..b9ea49bd1c3d 100644
--- a/xen/arch/x86/include/asm/alternative.h
+++ b/xen/arch/x86/include/asm/alternative.h
@@ -1,6 +1,13 @@
 #ifndef __X86_ALTERNATIVE_H__
 #define __X86_ALTERNATIVE_H__
 
+/*
+ * Common to both C and ASM.  Express a replacement when a feature is not
+ * available.
+ */
+#define ALT_FLAG_NOT (1 << 15)
+#define ALT_NOT(x) (ALT_FLAG_NOT | (x))
+
 #ifdef __ASSEMBLY__
 #include <asm/alternative-asm.h>
 #else
@@ -14,7 +21,7 @@
 struct __packed alt_instr {
     int32_t  orig_offset;   /* original instruction */
     int32_t  repl_offset;   /* offset to replacement instruction */
-    uint16_t cpuid;         /* cpuid bit set for replacement */
+    uint16_t cpuid;         /* cpuid bit set for replacement (top bit is polarity) */
     uint8_t  orig_len;      /* length of original instruction */
     uint8_t  repl_len;      /* length of new instruction */
     uint8_t  pad_len;       /* length of build-time padding */
@@ -61,7 +68,7 @@ extern void alternative_instructions(void);
                     alt_repl_len(n2)) "-" alt_orig_len)
 
 #define ALTINSTR_ENTRY(feature, num)                                    \
-        " .if " STR(feature) " >= " STR(NCAPINTS * 32) "\n"             \
+        " .if (" STR(feature & ~ALT_FLAG_NOT) ") >= " STR(NCAPINTS * 32) "\n" \
         " .error \"alternative feature outside of featureset range\"\n" \
         " .endif\n"                                                     \
         " .long .LXEN%=_orig_s - .\n"             /* label           */ \

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/guest: Remove use of the Xen hypercall_page

In order to protect against ITS, Xen needs to start using return thunks.
Therefore the advice in XSA-466 becomes relevant, and the hypercall_page needs
to be removed.

Implement early_hypercall(), with infrastructure to figure out the correct
instruction on first use.  Use ALTERNATIVE()s to result in inline hypercalls,
including the ALT_NOT() form so we only need a single synthetic feature bit.

No overall change.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/xen/arch/x86/guest/xen/Makefile b/xen/arch/x86/guest/xen/Makefile
index 26fb4b1007c0..8b3250aa8886 100644
--- a/xen/arch/x86/guest/xen/Makefile
+++ b/xen/arch/x86/guest/xen/Makefile
@@ -1,4 +1,4 @@
-obj-y += hypercall_page.o
+obj-bin-y += hypercall.init.o
 obj-y += xen.o
 
 obj-bin-$(CONFIG_PVH_GUEST) += pvh-boot.init.o
diff --git a/xen/arch/x86/guest/xen/hypercall.S b/xen/arch/x86/guest/xen/hypercall.S
new file mode 100644
index 000000000000..05e429794cc4
--- /dev/null
+++ b/xen/arch/x86/guest/xen/hypercall.S
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#include <xen/linkage.h>
+
+        .section .init.text, "ax", @progbits
+
+        /*
+         * Used during early boot, before alternatives have run and inlined
+         * the appropriate instruction.  Called using the hypercall ABI.
+         */
+FUNC(early_hypercall)
+        cmpb    $0, early_hypercall_insn(%rip)
+        jl      .L_setup
+        je      1f
+
+        vmmcall
+        ret
+
+1:      vmcall
+        ret
+
+.L_setup:
+        /*
+         * When setting up the first time around, all registers need
+         * preserving.  Save the non-callee-saved ones.
+         */
+        push    %r11
+        push    %r10
+        push    %r9
+        push    %r8
+        push    %rdi
+        push    %rsi
+        push    %rdx
+        push    %rcx
+        push    %rax
+
+        call    early_hypercall_setup
+
+        pop     %rax
+        pop     %rcx
+        pop     %rdx
+        pop     %rsi
+        pop     %rdi
+        pop     %r8
+        pop     %r9
+        pop     %r10
+        pop     %r11
+
+        jmp     early_hypercall
+END(early_hypercall)
diff --git a/xen/arch/x86/guest/xen/hypercall_page.S b/xen/arch/x86/guest/xen/hypercall_page.S
deleted file mode 100644
index 7ab55fc1f6e6..000000000000
--- a/xen/arch/x86/guest/xen/hypercall_page.S
+++ /dev/null
@@ -1,76 +0,0 @@
-#include <asm/page.h>
-#include <asm/asm_defns.h>
-#include <public/xen.h>
-
-        .section ".text.page_aligned", "ax", @progbits
-
-DATA(hypercall_page, PAGE_SIZE)
-         /* Poisoned with `ret` for safety before hypercalls are set up. */
-        .fill PAGE_SIZE, 1, 0xc3
-END(hypercall_page)
-
-/*
- * Identify a specific hypercall in the hypercall page
- * @param name Hypercall name.
- */
-#define DECLARE_HYPERCALL(name)                                                 \
-        .globl HYPERCALL_ ## name;                                              \
-        .type  HYPERCALL_ ## name, STT_FUNC;                                    \
-        .size  HYPERCALL_ ## name, 32;                                          \
-        .set   HYPERCALL_ ## name, hypercall_page + __HYPERVISOR_ ## name * 32
-
-DECLARE_HYPERCALL(set_trap_table)
-DECLARE_HYPERCALL(mmu_update)
-DECLARE_HYPERCALL(set_gdt)
-DECLARE_HYPERCALL(stack_switch)
-DECLARE_HYPERCALL(set_callbacks)
-DECLARE_HYPERCALL(fpu_taskswitch)
-DECLARE_HYPERCALL(sched_op_compat)
-DECLARE_HYPERCALL(platform_op)
-DECLARE_HYPERCALL(set_debugreg)
-DECLARE_HYPERCALL(get_debugreg)
-DECLARE_HYPERCALL(update_descriptor)
-DECLARE_HYPERCALL(memory_op)
-DECLARE_HYPERCALL(multicall)
-DECLARE_HYPERCALL(update_va_mapping)
-DECLARE_HYPERCALL(set_timer_op)
-DECLARE_HYPERCALL(event_channel_op_compat)
-DECLARE_HYPERCALL(xen_version)
-DECLARE_HYPERCALL(console_io)
-DECLARE_HYPERCALL(physdev_op_compat)
-DECLARE_HYPERCALL(grant_table_op)
-DECLARE_HYPERCALL(vm_assist)
-DECLARE_HYPERCALL(update_va_mapping_otherdomain)
-DECLARE_HYPERCALL(iret)
-DECLARE_HYPERCALL(vcpu_op)
-DECLARE_HYPERCALL(set_segment_base)
-DECLARE_HYPERCALL(mmuext_op)
-DECLARE_HYPERCALL(xsm_op)
-DECLARE_HYPERCALL(nmi_op)
-DECLARE_HYPERCALL(sched_op)
-DECLARE_HYPERCALL(callback_op)
-DECLARE_HYPERCALL(xenoprof_op)
-DECLARE_HYPERCALL(event_channel_op)
-DECLARE_HYPERCALL(physdev_op)
-DECLARE_HYPERCALL(hvm_op)
-DECLARE_HYPERCALL(sysctl)
-DECLARE_HYPERCALL(domctl)
-DECLARE_HYPERCALL(kexec_op)
-DECLARE_HYPERCALL(argo_op)
-DECLARE_HYPERCALL(xenpmu_op)
-
-DECLARE_HYPERCALL(arch_0)
-DECLARE_HYPERCALL(arch_1)
-DECLARE_HYPERCALL(arch_2)
-DECLARE_HYPERCALL(arch_3)
-DECLARE_HYPERCALL(arch_4)
-DECLARE_HYPERCALL(arch_5)
-DECLARE_HYPERCALL(arch_6)
-DECLARE_HYPERCALL(arch_7)
-
-/*
- * Local variables:
- * tab-width: 8
- * indent-tabs-mode: nil
- * End:
- */
diff --git a/xen/arch/x86/guest/xen/xen.c b/xen/arch/x86/guest/xen/xen.c
index c17f5c447ab6..5c9f393c7514 100644
--- a/xen/arch/x86/guest/xen/xen.c
+++ b/xen/arch/x86/guest/xen/xen.c
@@ -26,7 +26,6 @@
 bool __read_mostly xen_guest;
 
 uint32_t __read_mostly xen_cpuid_base;
-extern char hypercall_page[];
 static struct rangeset *mem;
 
 DEFINE_PER_CPU(unsigned int, vcpu_id);
@@ -35,6 +34,50 @@ static struct vcpu_info *vcpu_info;
 static unsigned long vcpu_info_mapped[BITS_TO_LONGS(NR_CPUS)];
 DEFINE_PER_CPU(struct vcpu_info *, vcpu_info);
 
+/*
+ * Which instruction to use for early hypercalls:
+ *   < 0 setup
+ *     0 vmcall
+ *   > 0 vmmcall
+ */
+int8_t __initdata early_hypercall_insn = -1;
+
+/*
+ * Called once during the first hypercall to figure out which instruction to
+ * use.  Error handling options are limited.
+ */
+void asmlinkage __init early_hypercall_setup(void)
+{
+    BUG_ON(early_hypercall_insn != -1);
+
+    if ( !boot_cpu_data.x86_vendor )
+    {
+        unsigned int eax, ebx, ecx, edx;
+
+        cpuid(0, &eax, &ebx, &ecx, &edx);
+
+        boot_cpu_data.x86_vendor = x86_cpuid_lookup_vendor(ebx, ecx, edx);
+    }
+
+    switch ( boot_cpu_data.x86_vendor )
+    {
+    case X86_VENDOR_INTEL:
+    case X86_VENDOR_CENTAUR:
+    case X86_VENDOR_SHANGHAI:
+        early_hypercall_insn = 0;
+        setup_force_cpu_cap(X86_FEATURE_USE_VMCALL);
+        break;
+
+    case X86_VENDOR_AMD:
+    case X86_VENDOR_HYGON:
+        early_hypercall_insn = 1;
+        break;
+
+    default:
+        BUG();
+    }
+}
+
 static void __init find_xen_leaves(void)
 {
     uint32_t eax, ebx, ecx, edx, base;
@@ -337,9 +380,6 @@ const struct hypervisor_ops *__init xg_probe(void)
     if ( !xen_cpuid_base )
         return NULL;
 
-    /* Fill the hypercall page. */
-    wrmsrl(cpuid_ebx(xen_cpuid_base + 2), __pa(hypercall_page));
-
     xen_guest = true;
 
     return &ops;
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index ba3df174b76e..9e3ed21c026d 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -42,6 +42,7 @@ XEN_CPUFEATURE(XEN_SHSTK,         X86_SYNTH(26)) /* Xen uses CET Shadow Stacks *
 XEN_CPUFEATURE(XEN_IBT,           X86_SYNTH(27)) /* Xen uses CET Indirect Branch Tracking */
 XEN_CPUFEATURE(IBPB_ENTRY_PV,     X86_SYNTH(28)) /* MSR_PRED_CMD used by Xen for PV */
 XEN_CPUFEATURE(IBPB_ENTRY_HVM,    X86_SYNTH(29)) /* MSR_PRED_CMD used by Xen for HVM */
+XEN_CPUFEATURE(USE_VMCALL,        X86_SYNTH(30)) /* Use VMCALL instead of VMMCALL */
 
 /* Bug words follow the synthetic words. */
 #define X86_NR_BUG 1
diff --git a/xen/arch/x86/include/asm/guest/xen-hcall.h b/xen/arch/x86/include/asm/guest/xen-hcall.h
index 665b472d05ac..96004dec9909 100644
--- a/xen/arch/x86/include/asm/guest/xen-hcall.h
+++ b/xen/arch/x86/include/asm/guest/xen-hcall.h
@@ -30,9 +30,11 @@
     ({                                                                  \
         long res, tmp__;                                                \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__) ASM_CALL_CONSTRAINT              \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1))                                          \
             : "memory" );                                               \
         (type)res;                                                      \
@@ -42,10 +44,12 @@
     ({                                                                  \
         long res, tmp__;                                                \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__), "=S" (tmp__)                    \
               ASM_CALL_CONSTRAINT                                       \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1)), "2" ((long)(a2))                        \
             : "memory" );                                               \
         (type)res;                                                      \
@@ -55,10 +59,12 @@
     ({                                                                  \
         long res, tmp__;                                                \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__), "=S" (tmp__), "=d" (tmp__)      \
               ASM_CALL_CONSTRAINT                                       \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1)), "2" ((long)(a2)), "3" ((long)(a3))      \
             : "memory" );                                               \
         (type)res;                                                      \
@@ -69,10 +75,12 @@
         long res, tmp__;                                                \
         register long _a4 asm ("r10") = ((long)(a4));                   \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__), "=S" (tmp__), "=d" (tmp__),     \
               "=&r" (tmp__) ASM_CALL_CONSTRAINT                         \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1)), "2" ((long)(a2)), "3" ((long)(a3)),     \
               "4" (_a4)                                                 \
             : "memory" );                                               \
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/thunk: (Mis)align __x86_indirect_thunk_* to mitigate ITS

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.

Arrange for __x86_indirect_thunk_* to always be in the second half.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/indirect-thunk.S b/xen/arch/x86/indirect-thunk.S
index fd5493c22b16..c4b978d67b8e 100644
--- a/xen/arch/x86/indirect-thunk.S
+++ b/xen/arch/x86/indirect-thunk.S
@@ -11,6 +11,10 @@
 
 #include <asm/asm_defns.h>
 
+/* Alignment is dealt with explicitly here; override the respective macro. */
+#undef SYM_ALIGN
+#define SYM_ALIGN(align...)
+
 .macro IND_THUNK_RETPOLINE reg:req
         call 1f
         int3
@@ -35,6 +39,16 @@
 .macro GEN_INDIRECT_THUNK reg:req
         .section .text.__x86_indirect_thunk_\reg, "ax", @progbits
 
+        /*
+         * The Indirect Target Selection speculative vulnerability means that
+         * indirect branches (including RETs) are unsafe when in the first
+         * half of a cacheline.  Arrange for them to be in the second half.
+         *
+         * Align to 64, then skip 32.
+         */
+        .balign 64
+        .fill 32, 1, 0xcc
+
 FUNC(__x86_indirect_thunk_\reg)
         ALTERNATIVE_2 __stringify(IND_THUNK_RETPOLINE \reg),              \
         __stringify(IND_THUNK_LFENCE \reg), X86_FEATURE_IND_THUNK_LFENCE, \
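
The property this patch establishes can be stated as a simple
predicate (an illustrative sketch, not Xen code): after ".balign 64"
plus a 32-byte fill, the thunk entry point always satisfies it.

    #include <stdbool.h>
    #include <stdint.h>

    /* True if an instruction at addr starts in the second half of its
     * 64-byte cacheline -- the ITS-safe position for indirect branches
     * and RETs on affected CPUs. */
    static bool its_safe_position(uintptr_t addr)
    {
        return (addr & 0x3f) >= 32;
    }
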
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/alternative: Support replacements when a feature is not present

Use the top bit of a->cpuid to express inverted polarity.  This requires
stripping the top bit back out when performing the sanity checks.

Despite only being used once, create a replace boolean to express the decision
more clearly in _apply_alternatives().

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
index 8356414be701..6eee6e501ad9 100644
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -209,10 +209,12 @@ static void init_or_livepatch _apply_alternatives(struct alt_instr *start,
         uint8_t *repl = ALT_REPL_PTR(a);
         uint8_t buf[MAX_PATCH_LEN];
         unsigned int total_len = a->orig_len + a->pad_len;
+        unsigned int feat = a->cpuid & ~ALT_FLAG_NOT;
+        bool inv = a->cpuid & ALT_FLAG_NOT, replace;
 
         BUG_ON(a->repl_len > total_len);
         BUG_ON(total_len > sizeof(buf));
-        BUG_ON(a->cpuid >= NCAPINTS * 32);
+        BUG_ON(feat >= NCAPINTS * 32);
 
         /*
          * Detect sequences of alt_instr's patching the same origin site, and
@@ -235,8 +237,14 @@ static void init_or_livepatch _apply_alternatives(struct alt_instr *start,
             continue;
         }
 
+        /*
+         * Should a replacement be performed?  Most replacements have positive
+         * polarity, but we support negative polarity too.
+         */
+        replace = boot_cpu_has(feat) ^ inv;
+
         /* If there is no replacement to make, see about optimising the nops. */
-        if ( !boot_cpu_has(a->cpuid) )
+        if ( !replace )
         {
             /* Origin site site already touched?  Don't nop anything. */
             if ( base->priv )
diff --git a/xen/arch/x86/include/asm/alternative.h b/xen/arch/x86/include/asm/alternative.h
index ee4cbe92cc2b..388e595786e0 100644
--- a/xen/arch/x86/include/asm/alternative.h
+++ b/xen/arch/x86/include/asm/alternative.h
@@ -1,6 +1,13 @@
 #ifndef __X86_ALTERNATIVE_H__
 #define __X86_ALTERNATIVE_H__
 
+/*
+ * Common to both C and ASM.  Express a replacement when a feature is not
+ * available.
+ */
+#define ALT_FLAG_NOT (1 << 15)
+#define ALT_NOT(x) (ALT_FLAG_NOT | (x))
+
 #ifdef __ASSEMBLY__
 #include <asm/alternative-asm.h>
 #else
@@ -11,7 +18,7 @@
 struct __packed alt_instr {
     int32_t  orig_offset;   /* original instruction */
     int32_t  repl_offset;   /* offset to replacement instruction */
-    uint16_t cpuid;         /* cpuid bit set for replacement */
+    uint16_t cpuid;         /* cpuid bit set for replacement (top bit is polarity) */
     uint8_t  orig_len;      /* length of original instruction */
     uint8_t  repl_len;      /* length of new instruction */
     uint8_t  pad_len;       /* length of build-time padding */

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/guest: Remove use of the Xen hypercall_page

In order to protect against ITS, Xen needs to start using return thunks.
Therefore the advice in XSA-466 becomes relevant, and the hypercall_page needs
to be removed.

Implement early_hypercall(), with infrastructure to figure out the correct
instruction on first use.  Use ALTERNATIVE()s to result in inline hypercalls,
including the ALT_NOT() form so we only need a single synthetic feature bit.

No overall change.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/xen/arch/x86/guest/xen/Makefile b/xen/arch/x86/guest/xen/Makefile
index 26fb4b1007c0..8b3250aa8886 100644
--- a/xen/arch/x86/guest/xen/Makefile
+++ b/xen/arch/x86/guest/xen/Makefile
@@ -1,4 +1,4 @@
-obj-y += hypercall_page.o
+obj-bin-y += hypercall.init.o
 obj-y += xen.o
 
 obj-bin-$(CONFIG_PVH_GUEST) += pvh-boot.init.o
diff --git a/xen/arch/x86/guest/xen/hypercall.S b/xen/arch/x86/guest/xen/hypercall.S
new file mode 100644
index 000000000000..47ab685cf848
--- /dev/null
+++ b/xen/arch/x86/guest/xen/hypercall.S
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#include <asm/asm_defns.h>
+
+        .section .init.text, "ax", @progbits
+
+        /*
+         * Used during early boot, before alternatives have run and inlined
+         * the appropriate instruction.  Called using the hypercall ABI.
+         */
+ENTRY(early_hypercall)
+        cmpb    $0, early_hypercall_insn(%rip)
+        jl      .L_setup
+        je      1f
+
+        vmmcall
+        ret
+
+1:      vmcall
+        ret
+
+.L_setup:
+        /*
+         * When setting up the first time around, all registers need
+         * preserving.  Save the non-callee-saved ones.
+         */
+        push    %r11
+        push    %r10
+        push    %r9
+        push    %r8
+        push    %rdi
+        push    %rsi
+        push    %rdx
+        push    %rcx
+        push    %rax
+
+        call    early_hypercall_setup
+
+        pop     %rax
+        pop     %rcx
+        pop     %rdx
+        pop     %rsi
+        pop     %rdi
+        pop     %r8
+        pop     %r9
+        pop     %r10
+        pop     %r11
+
+        jmp     early_hypercall
+
+        .type early_hypercall, @function
+        .size early_hypercall, . - early_hypercall
diff --git a/xen/arch/x86/guest/xen/hypercall_page.S b/xen/arch/x86/guest/xen/hypercall_page.S
deleted file mode 100644
index 9958d02cfd5b..000000000000
--- a/xen/arch/x86/guest/xen/hypercall_page.S
+++ /dev/null
@@ -1,78 +0,0 @@
-#include <asm/page.h>
-#include <asm/asm_defns.h>
-#include <public/xen.h>
-
-        .section ".text.page_aligned", "ax", @progbits
-        .p2align PAGE_SHIFT
-
-GLOBAL(hypercall_page)
-         /* Poisoned with `ret` for safety before hypercalls are set up. */
-        .fill PAGE_SIZE, 1, 0xc3
-        .type hypercall_page, STT_OBJECT
-        .size hypercall_page, PAGE_SIZE
-
-/*
- * Identify a specific hypercall in the hypercall page
- * @param name Hypercall name.
- */
-#define DECLARE_HYPERCALL(name)                                                 \
-        .globl HYPERCALL_ ## name;                                              \
-        .type  HYPERCALL_ ## name, STT_FUNC;                                    \
-        .size  HYPERCALL_ ## name, 32;                                          \
-        .set   HYPERCALL_ ## name, hypercall_page + __HYPERVISOR_ ## name * 32
-
-DECLARE_HYPERCALL(set_trap_table)
-DECLARE_HYPERCALL(mmu_update)
-DECLARE_HYPERCALL(set_gdt)
-DECLARE_HYPERCALL(stack_switch)
-DECLARE_HYPERCALL(set_callbacks)
-DECLARE_HYPERCALL(fpu_taskswitch)
-DECLARE_HYPERCALL(sched_op_compat)
-DECLARE_HYPERCALL(platform_op)
-DECLARE_HYPERCALL(set_debugreg)
-DECLARE_HYPERCALL(get_debugreg)
-DECLARE_HYPERCALL(update_descriptor)
-DECLARE_HYPERCALL(memory_op)
-DECLARE_HYPERCALL(multicall)
-DECLARE_HYPERCALL(update_va_mapping)
-DECLARE_HYPERCALL(set_timer_op)
-DECLARE_HYPERCALL(event_channel_op_compat)
-DECLARE_HYPERCALL(xen_version)
-DECLARE_HYPERCALL(console_io)
-DECLARE_HYPERCALL(physdev_op_compat)
-DECLARE_HYPERCALL(grant_table_op)
-DECLARE_HYPERCALL(vm_assist)
-DECLARE_HYPERCALL(update_va_mapping_otherdomain)
-DECLARE_HYPERCALL(iret)
-DECLARE_HYPERCALL(vcpu_op)
-DECLARE_HYPERCALL(set_segment_base)
-DECLARE_HYPERCALL(mmuext_op)
-DECLARE_HYPERCALL(xsm_op)
-DECLARE_HYPERCALL(nmi_op)
-DECLARE_HYPERCALL(sched_op)
-DECLARE_HYPERCALL(callback_op)
-DECLARE_HYPERCALL(xenoprof_op)
-DECLARE_HYPERCALL(event_channel_op)
-DECLARE_HYPERCALL(physdev_op)
-DECLARE_HYPERCALL(hvm_op)
-DECLARE_HYPERCALL(sysctl)
-DECLARE_HYPERCALL(domctl)
-DECLARE_HYPERCALL(kexec_op)
-DECLARE_HYPERCALL(argo_op)
-DECLARE_HYPERCALL(xenpmu_op)
-
-DECLARE_HYPERCALL(arch_0)
-DECLARE_HYPERCALL(arch_1)
-DECLARE_HYPERCALL(arch_2)
-DECLARE_HYPERCALL(arch_3)
-DECLARE_HYPERCALL(arch_4)
-DECLARE_HYPERCALL(arch_5)
-DECLARE_HYPERCALL(arch_6)
-DECLARE_HYPERCALL(arch_7)
-
-/*
- * Local variables:
- * tab-width: 8
- * indent-tabs-mode: nil
- * End:
- */
diff --git a/xen/arch/x86/guest/xen/xen.c b/xen/arch/x86/guest/xen/xen.c
index c4cb16df38b3..0d1e6d06586b 100644
--- a/xen/arch/x86/guest/xen/xen.c
+++ b/xen/arch/x86/guest/xen/xen.c
@@ -38,7 +38,6 @@
 bool __read_mostly xen_guest;
 
 uint32_t __read_mostly xen_cpuid_base;
-extern char hypercall_page[];
 static struct rangeset *mem;
 
 DEFINE_PER_CPU(unsigned int, vcpu_id);
@@ -47,6 +46,50 @@ static struct vcpu_info *vcpu_info;
 static unsigned long vcpu_info_mapped[BITS_TO_LONGS(NR_CPUS)];
 DEFINE_PER_CPU(struct vcpu_info *, vcpu_info);
 
+/*
+ * Which instruction to use for early hypercalls:
+ *   < 0 setup
+ *     0 vmcall
+ *   > 0 vmmcall
+ */
+int8_t __initdata early_hypercall_insn = -1;
+
+/*
+ * Called once during the first hypercall to figure out which instruction to
+ * use.  Error handling options are limited.
+ */
+void __init early_hypercall_setup(void)
+{
+    BUG_ON(early_hypercall_insn != -1);
+
+    if ( !boot_cpu_data.x86_vendor )
+    {
+        unsigned int eax, ebx, ecx, edx;
+
+        cpuid(0, &eax, &ebx, &ecx, &edx);
+
+        boot_cpu_data.x86_vendor = x86_cpuid_lookup_vendor(ebx, ecx, edx);
+    }
+
+    switch ( boot_cpu_data.x86_vendor )
+    {
+    case X86_VENDOR_INTEL:
+    case X86_VENDOR_CENTAUR:
+    case X86_VENDOR_SHANGHAI:
+        early_hypercall_insn = 0;
+        setup_force_cpu_cap(X86_FEATURE_USE_VMCALL);
+        break;
+
+    case X86_VENDOR_AMD:
+    case X86_VENDOR_HYGON:
+        early_hypercall_insn = 1;
+        break;
+
+    default:
+        BUG();
+    }
+}
+
 static void __init find_xen_leaves(void)
 {
     uint32_t eax, ebx, ecx, edx, base;
@@ -349,9 +392,6 @@ const struct hypervisor_ops *__init xg_probe(void)
     if ( !xen_cpuid_base )
         return NULL;
 
-    /* Fill the hypercall page. */
-    wrmsrl(cpuid_ebx(xen_cpuid_base + 2), __pa(hypercall_page));
-
     xen_guest = true;
 
     return &ops;
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index ba3df174b76e..9e3ed21c026d 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -42,6 +42,7 @@ XEN_CPUFEATURE(XEN_SHSTK,         X86_SYNTH(26)) /* Xen uses CET Shadow Stacks *
 XEN_CPUFEATURE(XEN_IBT,           X86_SYNTH(27)) /* Xen uses CET Indirect Branch Tracking */
 XEN_CPUFEATURE(IBPB_ENTRY_PV,     X86_SYNTH(28)) /* MSR_PRED_CMD used by Xen for PV */
 XEN_CPUFEATURE(IBPB_ENTRY_HVM,    X86_SYNTH(29)) /* MSR_PRED_CMD used by Xen for HVM */
+XEN_CPUFEATURE(USE_VMCALL,        X86_SYNTH(30)) /* Use VMCALL instead of VMMCALL */
 
 /* Bug words follow the synthetic words. */
 #define X86_NR_BUG 1
diff --git a/xen/arch/x86/include/asm/guest/xen-hcall.h b/xen/arch/x86/include/asm/guest/xen-hcall.h
index 03d5868a9efd..a7e90adbafb7 100644
--- a/xen/arch/x86/include/asm/guest/xen-hcall.h
+++ b/xen/arch/x86/include/asm/guest/xen-hcall.h
@@ -41,9 +41,11 @@
     ({                                                                  \
         long res, tmp__;                                                \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__) ASM_CALL_CONSTRAINT              \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1))                                          \
             : "memory" );                                               \
         (type)res;                                                      \
@@ -53,10 +55,12 @@
     ({                                                                  \
         long res, tmp__;                                                \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__), "=S" (tmp__)                    \
               ASM_CALL_CONSTRAINT                                       \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1)), "2" ((long)(a2))                        \
             : "memory" );                                               \
         (type)res;                                                      \
@@ -66,10 +70,12 @@
     ({                                                                  \
         long res, tmp__;                                                \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__), "=S" (tmp__), "=d" (tmp__)      \
               ASM_CALL_CONSTRAINT                                       \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1)), "2" ((long)(a2)), "3" ((long)(a3))      \
             : "memory" );                                               \
         (type)res;                                                      \
@@ -80,10 +86,12 @@
         long res, tmp__;                                                \
         register long _a4 asm ("r10") = ((long)(a4));                   \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__), "=S" (tmp__), "=d" (tmp__),     \
               "=&r" (tmp__) ASM_CALL_CONSTRAINT                         \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1)), "2" ((long)(a2)), "3" ((long)(a3)),     \
               "4" (_a4)                                                 \
             : "memory" );                                               \
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/thunk: (Mis)align __x86_indirect_thunk_* to mitigate ITS

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.

Arrange for __x86_indirect_thunk_* to always be in the second half.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/indirect-thunk.S b/xen/arch/x86/indirect-thunk.S
index de6aef606832..e7ef104d3bd3 100644
--- a/xen/arch/x86/indirect-thunk.S
+++ b/xen/arch/x86/indirect-thunk.S
@@ -35,6 +35,16 @@
 .macro GEN_INDIRECT_THUNK reg:req
         .section .text.__x86_indirect_thunk_\reg, "ax", @progbits
 
+        /*
+         * The Indirect Target Selection speculative vulnerability means that
+         * indirect branches (including RETs) are unsafe when in the first
+         * half of a cacheline.  Arrange for them to be in the second half.
+         *
+         * Align to 64, then skip 32.
+         */
+        .balign 64
+        .fill 32, 1, 0xcc
+
 ENTRY(__x86_indirect_thunk_\reg)
         ALTERNATIVE_2 __stringify(IND_THUNK_RETPOLINE \reg),              \
         __stringify(IND_THUNK_LFENCE \reg), X86_FEATURE_IND_THUNK_LFENCE, \
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/thunk: (Mis)align the RETs in clear_bhb_loops() to mitigate ITS

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.

clear_bhb_loops() has a precise layout of branches.  The alignment for
performance causes the RETs to always be in an unsafe position, and converting
those to return thunks changes the branching pattern.  While such a conversion
is believed to be safe, clear_bhb_loops() is also a performance-relevant
fastpath, so (mis)align the RETs to be in a safe position.

No functional change.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/xen/arch/x86/bhb-thunk.S b/xen/arch/x86/bhb-thunk.S
index 7e866784f784..05f1043df7d0 100644
--- a/xen/arch/x86/bhb-thunk.S
+++ b/xen/arch/x86/bhb-thunk.S
@@ -52,7 +52,12 @@ ENTRY(clear_bhb_tsx)
  *   ret
  *
  * The CALL/RETs are necessary to prevent the Loop Stream Detector from
- * interfering.  The alignment is for performance and not safety.
+ * interfering.
+ *
+ * The .balign's are for performance, but they cause the RETs to be in unsafe
+ * positions with respect to Indirect Target Selection.  The .skips are to
+ * move the RETs into ITS-safe positions, rather than using the slowpath
+ * through __x86_return_thunk.
  *
  * The "short" sequence (5 and 5) is for CPUs prior to Alder Lake / Sapphire
  * Rapids (i.e. Cores prior to Golden Cove and/or Gracemont).
@@ -68,12 +73,14 @@ ENTRY(clear_bhb_loops)
         jmp     5f
         int3
 
-        .align 64
+        .balign 64
+        .skip   32 - (.Lr1 - 1f), 0xcc
 1:      call    2f
-        ret
+.Lr1:   ret
         int3
 
-        .align 64
+        .balign 64
+        .skip   32 - 18 /* (.Lr2 - 2f) but Clang IAS doesn't like this */, 0xcc
 2:      ALTERNATIVE "mov $5, %eax", "mov $7, %eax", X86_SPEC_BHB_LOOPS_LONG
 
 3:      jmp     4f
@@ -85,7 +92,7 @@ ENTRY(clear_bhb_loops)
         sub     $1, %ecx
         jnz     1b
 
-        ret
+.Lr2:   ret
 5:
         /*
          * The Intel sequence has an LFENCE here.  The purpose is to ensure
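
The .skip arithmetic above can be sanity-checked with a small model
(an illustrative sketch; assumes the RET lies within the first 32
bytes of its block, as in the sequences being patched):

    #include <assert.h>
    #include <stdint.h>

    /* Models ".balign 64; .skip 32 - (ret - blk), 0xcc; blk: ... ret:".
     * base is the .balign'd address and ret_off the RET's offset from
     * the block label; the pad lands the RET at byte 32 of the line,
     * the start of the ITS-safe second half. */
    static void check_skip(uintptr_t base, uintptr_t ret_off)
    {
        uintptr_t pad = 32 - ret_off;          /* the .skip expression */
        uintptr_t ret = base + pad + ret_off;  /* final RET address */

        assert((base & 0x3f) == 0);            /* from .balign 64 */
        assert((ret & 0x3f) == 32);            /* ITS-safe position */
    }
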
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/stubs: Introduce place_ret() to abstract away raw 0xc3's

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.
This means it's not safe for logic using the stubs to write raw 0xc3's.

Introduce place_ret() which, for now, writes a raw 0xc3 but will contain
additional logic when return thunks are in use.

stub_selftest() doesn't strictly need to be converted as they only run on
boot, but doing so gets us a partial test of place_ret() too.

No functional change.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/tools/tests/x86_emulator/x86-emulate.h b/tools/tests/x86_emulator/x86-emulate.h
index 58760f096d3e..65ae95e624e6 100644
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -67,6 +67,12 @@
 
 #define is_canonical_address(x) (((int64_t)(x) >> 47) == ((int64_t)(x) >> 63))
 
+static inline void *place_ret(void *ptr)
+{
+    *(uint8_t *)ptr = 0xc3;
+    return ptr + 1;
+}
+
 extern uint32_t mxcsr_mask;
 extern struct cpu_policy cp;
 
diff --git a/xen/arch/x86/Makefile b/xen/arch/x86/Makefile
index 6a070a8cf8da..882408ccaf9a 100644
--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -10,9 +10,7 @@ obj-$(CONFIG_XENOPROF) += oprofile/
 obj-$(CONFIG_PV) += pv/
 obj-y += x86_64/
 
-alternative-y := alternative.init.o
-alternative-$(CONFIG_LIVEPATCH) :=
-obj-bin-y += $(alternative-y)
+obj-y += alternative.o
 obj-y += apic.o
 obj-y += bhb-thunk.o
 obj-y += bitops.o
@@ -40,7 +38,7 @@ obj-y += hypercall.o
 obj-y += i387.o
 obj-y += i8259.o
 obj-y += io_apic.o
-obj-$(CONFIG_LIVEPATCH) += alternative.o livepatch.o
+obj-$(CONFIG_LIVEPATCH) += livepatch.o
 obj-y += msi.o
 obj-y += msr.o
 obj-$(CONFIG_INDIRECT_THUNK) += indirect-thunk.o
diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
index 6eee6e501ad9..88082f68a982 100644
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -149,6 +149,20 @@ void init_or_livepatch add_nops(void *insns, unsigned int len)
     }
 }
 
+/*
+ * Place a return at @ptr.  @ptr must be in the writable alias of a stub.
+ *
+ * Returns the next position to write into the stub.
+ */
+void *place_ret(void *ptr)
+{
+    uint8_t *p = ptr;
+
+    *p++ = 0xc3;
+
+    return p;
+}
+
 /*
  * text_poke - Update instructions on a live kernel or non-executed code.
  * @addr: address to modify
diff --git a/xen/arch/x86/extable.c b/xen/arch/x86/extable.c
index f05c16def68b..adfd439d3a50 100644
--- a/xen/arch/x86/extable.c
+++ b/xen/arch/x86/extable.c
@@ -135,20 +135,20 @@ search_exception_table(const struct cpu_user_regs *regs, unsigned long *stub_ra)
 int __init cf_check stub_selftest(void)
 {
     static const struct {
-        uint8_t opc[8];
+        uint8_t opc[7];
         uint64_t rax;
         union stub_exception_token res;
     } tests[] __initconst = {
 #define endbr64 0xf3, 0x0f, 0x1e, 0xfa
-        { .opc = { endbr64, 0x0f, 0xb9, 0xc3, 0xc3 }, /* ud1 */
+        { .opc = { endbr64, 0x0f, 0xb9, 0x90 }, /* ud1 */
           .res.fields.trapnr = TRAP_invalid_op },
-        { .opc = { endbr64, 0x90, 0x02, 0x00, 0xc3 }, /* nop; add (%rax),%al */
+        { .opc = { endbr64, 0x90, 0x02, 0x00 }, /* nop; add (%rax),%al */
           .rax = 0x0123456789abcdef,
           .res.fields.trapnr = TRAP_gp_fault },
-        { .opc = { endbr64, 0x02, 0x04, 0x04, 0xc3 }, /* add (%rsp,%rax),%al */
+        { .opc = { endbr64, 0x02, 0x04, 0x04 }, /* add (%rsp,%rax),%al */
           .rax = 0xfedcba9876543210,
           .res.fields.trapnr = TRAP_stack_error },
-        { .opc = { endbr64, 0xcc, 0xc3, 0xc3, 0xc3 }, /* int3 */
+        { .opc = { endbr64, 0xcc, 0x90, 0x90 }, /* int3 */
           .res.fields.trapnr = TRAP_int3 },
 #undef endbr64
     };
@@ -167,6 +167,7 @@ int __init cf_check stub_selftest(void)
 
         memset(ptr, 0xcc, STUB_BUF_SIZE / 2);
         memcpy(ptr, tests[i].opc, ARRAY_SIZE(tests[i].opc));
+        place_ret(ptr + ARRAY_SIZE(tests[i].opc));
         unmap_domain_page(ptr);
 
         asm volatile ( "INDIRECT_CALL %[stb]\n"
diff --git a/xen/arch/x86/include/asm/alternative.h b/xen/arch/x86/include/asm/alternative.h
index 388e595786e0..28e69e988ac0 100644
--- a/xen/arch/x86/include/asm/alternative.h
+++ b/xen/arch/x86/include/asm/alternative.h
@@ -30,6 +30,8 @@ struct __packed alt_instr {
 #define ALT_REPL_PTR(a)     __ALT_PTR(a, repl_offset)
 
 extern void add_nops(void *insns, unsigned int len);
+void *place_ret(void *ptr);
+
 /* Similar to alternative_instructions except it can be run with IRQs enabled. */
 extern void apply_alternatives(struct alt_instr *start, struct alt_instr *end);
 extern void alternative_instructions(void);
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index f216c2641205..28ea7cc580a9 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -88,7 +88,6 @@ static io_emul_stub_t *io_emul_stub_setup(struct priv_op_ctxt *ctxt, u8 opcode,
         0x41, 0x5c, /* pop %r12  */
         0x5d,       /* pop %rbp  */
         0x5b,       /* pop %rbx  */
-        0xc3,       /* ret       */
     };
 
     const struct stubs *this_stubs = &this_cpu(stubs);
@@ -138,11 +137,13 @@ static io_emul_stub_t *io_emul_stub_setup(struct priv_op_ctxt *ctxt, u8 opcode,
 
     APPEND_CALL(save_guest_gprs);
     APPEND_BUFF(epilogue);
+    p = place_ret(p);
 
     /* Build-time best effort attempt to catch problems. */
     BUILD_BUG_ON(STUB_BUF_SIZE / 2 <
                  (sizeof(prologue) + sizeof(epilogue) + 10 /* 2x call */ +
-                  MAX(3 /* default stub */, IOEMUL_QUIRK_STUB_BYTES)));
+                  MAX(3 /* default stub */, IOEMUL_QUIRK_STUB_BYTES) +
+                  1 /* ret */));
     /* Runtime confirmation that we haven't clobbered an adjacent stub. */
     BUG_ON(STUB_BUF_SIZE / 2 < (p - ctxt->io_emul_stub));
 
diff --git a/xen/arch/x86/x86_emulate/x86_emulate.c b/xen/arch/x86/x86_emulate/x86_emulate.c
index 995670cbc8e9..b5eca13410cd 100644
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -1533,36 +1533,42 @@ static inline bool fpu_check_write(void)
 
 #define emulate_fpu_insn_memdst(opc, ext, arg)                          \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     /* ModRM: mod=0, reg=ext, rm=0, i.e. a (%rax) operand */            \
     insn_bytes = 2;                                                     \
-    memcpy(get_stub(stub),                                              \
-           ((uint8_t[]){ opc, ((ext) & 7) << 3, 0xc3 }), 3);            \
+    memcpy(_p, ((uint8_t[]){ opc, ((ext) & 7) << 3 }), 2); _p += 2;     \
+    place_ret(_p);                                                      \
     invoke_stub("", "", "+m" (arg) : "a" (&(arg)));                     \
     put_stub(stub);                                                     \
 } while (0)
 
 #define emulate_fpu_insn_memsrc(opc, ext, arg)                          \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     /* ModRM: mod=0, reg=ext, rm=0, i.e. a (%rax) operand */            \
-    memcpy(get_stub(stub),                                              \
-           ((uint8_t[]){ opc, ((ext) & 7) << 3, 0xc3 }), 3);            \
+    memcpy(_p, ((uint8_t[]){ opc, ((ext) & 7) << 3 }), 2); _p += 2;     \
+    place_ret(_p);                                                      \
     invoke_stub("", "", "=m" (dummy) : "m" (arg), "a" (&(arg)));        \
     put_stub(stub);                                                     \
 } while (0)
 
 #define emulate_fpu_insn_stub(bytes...)                                 \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     unsigned int nr_ = sizeof((uint8_t[]){ bytes });                    \
-    memcpy(get_stub(stub), ((uint8_t[]){ bytes, 0xc3 }), nr_ + 1);      \
+    memcpy(_p, ((uint8_t[]){ bytes }), nr_); _p += nr_;                 \
+    place_ret(_p);                                                      \
     invoke_stub("", "", "=m" (dummy) : "i" (0));                        \
     put_stub(stub);                                                     \
 } while (0)
 
 #define emulate_fpu_insn_stub_eflags(bytes...)                          \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     unsigned int nr_ = sizeof((uint8_t[]){ bytes });                    \
     unsigned long tmp_;                                                 \
-    memcpy(get_stub(stub), ((uint8_t[]){ bytes, 0xc3 }), nr_ + 1);      \
+    memcpy(_p, ((uint8_t[]){ bytes }), nr_); _p += nr_;                 \
+    place_ret(_p);                                                      \
     invoke_stub(_PRE_EFLAGS("[eflags]", "[mask]", "[tmp]"),             \
                 _POST_EFLAGS("[eflags]", "[mask]", "[tmp]"),            \
                 [eflags] "+g" (_regs.eflags), [tmp] "=&r" (tmp_)        \
@@ -3852,7 +3858,7 @@ x86_emulate(
         stb[3] = 0x91;
         stb[4] = evex.opmsk << 3;
         insn_bytes = 5;
-        stb[5] = 0xc3;
+        place_ret(&stb[5]);
 
         invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
 
@@ -6751,7 +6757,7 @@ x86_emulate(
             evex.lr = 0;
         opc[1] = (modrm & 0x38) | 0xc0;
         insn_bytes = EVEX_PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_EVEX(opc, evex);
         invoke_stub("", "", "=g" (dummy) : "a" (src.val));
@@ -6816,7 +6822,7 @@ x86_emulate(
             insn_bytes = PFX_BYTES + 2;
             copy_REX_VEX(opc, rex_prefix, vex);
         }
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         ea.reg = decode_gpr(&_regs, modrm_reg);
         invoke_stub("", "", "=a" (*ea.reg) : "c" (mmvalp), "m" (*mmvalp));
@@ -6884,7 +6890,7 @@ x86_emulate(
             insn_bytes = PFX_BYTES + 2;
             copy_REX_VEX(opc, rex_prefix, vex);
         }
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         _regs.eflags &= ~EFLAGS_MASK;
         invoke_stub("",
@@ -7113,7 +7119,7 @@ x86_emulate(
         opc[1] = modrm & 0xc7;
         insn_bytes = PFX_BYTES + 2;
     simd_0f_to_gpr:
-        opc[insn_bytes - PFX_BYTES] = 0xc3;
+        place_ret(&opc[insn_bytes - PFX_BYTES]);
 
         generate_exception_if(ea.type != OP_REG, EXC_UD);
 
@@ -7510,7 +7516,7 @@ x86_emulate(
             vex.w = 0;
         opc[1] = modrm & 0x38;
         insn_bytes = PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", "+m" (src.val) : "a" (&src.val));
@@ -7538,7 +7544,7 @@ x86_emulate(
             evex.w = 0;
         opc[1] = modrm & 0x38;
         insn_bytes = EVEX_PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_EVEX(opc, evex);
         invoke_stub("", "", "+m" (src.val) : "a" (&src.val));
@@ -7733,7 +7739,7 @@ x86_emulate(
 #endif /* X86EMUL_NO_SIMD */
 
     simd_0f_reg_only:
-        opc[insn_bytes - PFX_BYTES] = 0xc3;
+        place_ret(&opc[insn_bytes - PFX_BYTES]);
 
         copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", [dummy_out] "=g" (dummy) : [dummy_in] "i" (0) );
@@ -8058,7 +8064,7 @@ x86_emulate(
         if ( !mode_64bit() )
             vex.w = 0;
         opc[1] = modrm & 0xf8;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_VEX(opc, vex);
         ea.reg = decode_gpr(&_regs, modrm_rm);
@@ -8101,7 +8107,7 @@ x86_emulate(
         if ( !mode_64bit() )
             vex.w = 0;
         opc[1] = modrm & 0xc7;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_VEX(opc, vex);
         invoke_stub("", "", "=a" (dst.val) : [dummy] "i" (0));
@@ -8131,7 +8137,7 @@ x86_emulate(
         opc = init_prefixes(stub);
         opc[0] = b;
         opc[1] = modrm;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_VEX(opc, vex);
         _regs.eflags &= ~EFLAGS_MASK;
@@ -9027,7 +9033,7 @@ x86_emulate(
         if ( !mode_64bit() )
             vex.w = 0;
         opc[1] = modrm & 0xc7;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", "=a" (ea.val) : [dummy] "i" (0));
@@ -9145,7 +9151,7 @@ x86_emulate(
             opc[1] &= 0x38;
         }
         insn_bytes = PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
         if ( vex.opcx == vex_none )
         {
             /* Cover for extra prefix byte. */
@@ -9424,7 +9430,7 @@ x86_emulate(
         pvex->b = !mode_64bit() || (vex.reg >> 3);
         opc[1] = 0xc0 | (~vex.reg & 7);
         pvex->reg = 0xf;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=a" (ea.val) : [dummy] "i" (0));
         put_stub(stub);
@@ -9684,7 +9690,7 @@ x86_emulate(
             evex.w = 0;
         opc[1] = modrm & 0xf8;
         insn_bytes = EVEX_PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_EVEX(opc, evex);
         invoke_stub("", "", "=g" (dummy) : "a" (src.val));
@@ -9783,7 +9789,7 @@ x86_emulate(
         pvex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pvex->reg = 0xf;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
 
@@ -9853,7 +9859,7 @@ x86_emulate(
         pvex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pvex->reg = 0xf;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "+m" (*mmvalp) : "a" (mmvalp));
 
@@ -9909,7 +9915,7 @@ x86_emulate(
         pevex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pevex->RX = 1;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
 
@@ -9974,7 +9980,7 @@ x86_emulate(
         pevex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pevex->RX = 1;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "+m" (*mmvalp) : "a" (mmvalp));
 
@@ -9988,7 +9994,7 @@ x86_emulate(
         opc[2] = 0x90;
         /* Use (%rax) as source. */
         opc[3] = evex.opmsk << 3;
-        opc[4] = 0xc3;
+        place_ret(&opc[4]);
 
         invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
         put_stub(stub);
@@ -10082,7 +10088,7 @@ x86_emulate(
         pevex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pevex->RX = 1;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
 
@@ -10160,7 +10166,7 @@ x86_emulate(
         opc[2] = 0x90;
         /* Use (%rax) as source. */
         opc[3] = evex.opmsk << 3;
-        opc[4] = 0xc3;
+        place_ret(&opc[4]);
 
         invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
         put_stub(stub);
@@ -10229,7 +10235,7 @@ x86_emulate(
         pevex->r = !mode_64bit() || !(state->sib_index & 0x08);
         pevex->R = !mode_64bit() || !(state->sib_index & 0x10);
         pevex->RX = 1;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=m" (index) : "a" (&index));
         put_stub(stub);
@@ -10404,7 +10410,7 @@ x86_emulate(
         pvex->reg = 0xf; /* rAX */
         buf[3] = b;
         buf[4] = 0x09; /* reg=rCX r/m=(%rCX) */
-        buf[5] = 0xc3;
+        place_ret(&buf[5]);
 
         src.reg = decode_vex_gpr(vex.reg, &_regs, ctxt);
         emulate_stub([dst] "=&c" (dst.val), "[dst]" (&src.val), "a" (*src.reg));
@@ -10438,7 +10444,7 @@ x86_emulate(
         pvex->reg = 0xf; /* rAX */
         buf[3] = b;
         buf[4] = (modrm & 0x38) | 0x01; /* r/m=(%rCX) */
-        buf[5] = 0xc3;
+        place_ret(&buf[5]);
 
         dst.reg = decode_vex_gpr(vex.reg, &_regs, ctxt);
         emulate_stub("=&a" (dst.val), "c" (&src.val));
@@ -10670,7 +10676,7 @@ x86_emulate(
             evex.w = vex.w = 0;
         opc[1] = modrm & 0x38;
         opc[2] = imm1;
-        opc[3] = 0xc3;
+        place_ret(&opc[3]);
         if ( vex.opcx == vex_none )
         {
             /* Cover for extra prefix byte. */
@@ -10837,7 +10843,7 @@ x86_emulate(
             insn_bytes = PFX_BYTES + 3;
             copy_VEX(opc, vex);
         }
-        opc[3] = 0xc3;
+        place_ret(&opc[3]);
 
         /* Latch MXCSR - we may need to restore it below. */
         invoke_stub("stmxcsr %[mxcsr]", "",
@@ -11065,7 +11071,7 @@ x86_emulate(
         }
         opc[2] = imm1;
         insn_bytes = PFX_BYTES + 3;
-        opc[3] = 0xc3;
+        place_ret(&opc[3]);
         if ( vex.opcx == vex_none )
         {
             /* Cover for extra prefix byte. */
@@ -11225,7 +11231,7 @@ x86_emulate(
         pxop->reg = 0xf; /* rAX */
         buf[3] = b;
         buf[4] = (modrm & 0x38) | 0x01; /* r/m=(%rCX) */
-        buf[5] = 0xc3;
+        place_ret(&buf[5]);
 
         dst.reg = decode_vex_gpr(vex.reg, &_regs, ctxt);
         emulate_stub([dst] "=&a" (dst.val), "c" (&src.val));
@@ -11334,7 +11340,7 @@ x86_emulate(
         buf[3] = b;
         buf[4] = 0x09; /* reg=rCX r/m=(%rCX) */
         *(uint32_t *)(buf + 5) = imm1;
-        buf[9] = 0xc3;
+        place_ret(&buf[9]);
 
         emulate_stub([dst] "=&c" (dst.val), "[dst]" (&src.val));
 
@@ -11401,12 +11407,12 @@ x86_emulate(
             BUG();
         if ( evex_encoded() )
         {
-            opc[insn_bytes - EVEX_PFX_BYTES] = 0xc3;
+            place_ret(&opc[insn_bytes - EVEX_PFX_BYTES]);
             copy_EVEX(opc, evex);
         }
         else
         {
-            opc[insn_bytes - PFX_BYTES] = 0xc3;
+            place_ret(&opc[insn_bytes - PFX_BYTES]);
             copy_REX_VEX(opc, rex_prefix, vex);
         }
 
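The hunks above all follow the same pattern: a hand-written 0xc3 terminating
an emulation stub is replaced by a call to place_ret().  As a rough,
self-contained illustration of that pattern (not Xen code; the toy
place_ret() below writes a raw 0xc3, matching the pre-thunk behaviour, and
the opcode bytes are arbitrary):

#include <stdint.h>
#include <stdio.h>

/* Toy stand-in for Xen's place_ret(): write a return sequence at @ptr
 * and hand back the next write position. */
static void *place_ret(void *ptr)
{
    uint8_t *p = ptr;

    *p++ = 0xc3;                     /* ret */
    return p;
}

int main(void)
{
    uint8_t stub[16];
    uint8_t *p = stub;

    *p++ = 0x0f;                     /* e.g. some two-byte opcode */
    *p++ = 0xa2;
    p = place_ret(p);                /* was: *p++ = 0xc3; */

    printf("stub is %d bytes\n", (int)(p - stub));
    return 0;
}

Returning the next write position is what lets callers account for a larger
return sequence than a single byte, which matters once return thunks are in
use.
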
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/thunk: Build Xen with Return Thunks

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.

In order to mitigate this, build with return thunks and arrange for
__x86_return_thunk to be (mis)aligned in the same manner as
__x86_indirect_thunk_* so the RET instruction is placed in a safe location.

place_ret() needs to conditionally emit JMP __x86_return_thunk instead of RET.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
index 471cfd8a80ef..370558756f0d 100644
--- a/xen/arch/x86/Kconfig
+++ b/xen/arch/x86/Kconfig
@@ -35,9 +35,14 @@ config ARCH_DEFCONFIG
 	default "arch/x86/configs/x86_64_defconfig"
 
 config CC_HAS_INDIRECT_THUNK
+	# GCC >= 8 or Clang >= 6
 	def_bool $(cc-option,-mindirect-branch-register) || \
 	         $(cc-option,-mretpoline-external-thunk)
 
+config CC_HAS_RETURN_THUNK
+	# GCC >= 8 or Clang >= 15
+	def_bool $(cc-option,-mfunction-return=thunk-extern)
+
 config HAS_AS_CET_SS
 	# binutils >= 2.29 or LLVM >= 6
 	def_bool $(as-instr,wrssq %rax$(comma)0;setssbsy)
diff --git a/xen/arch/x86/Makefile b/xen/arch/x86/Makefile
index 882408ccaf9a..22293d969b38 100644
--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -42,6 +42,7 @@ obj-$(CONFIG_LIVEPATCH) += livepatch.o
 obj-y += msi.o
 obj-y += msr.o
 obj-$(CONFIG_INDIRECT_THUNK) += indirect-thunk.o
+obj-$(CONFIG_RETURN_THUNK) += indirect-thunk.o
 obj-$(CONFIG_PV) += ioport_emulate.o
 obj-y += irq.o
 obj-$(CONFIG_KEXEC) += machine_kexec.o
diff --git a/xen/arch/x86/acpi/wakeup_prot.S b/xen/arch/x86/acpi/wakeup_prot.S
index 3855ff1ddb94..cd4682b9f9c5 100644
--- a/xen/arch/x86/acpi/wakeup_prot.S
+++ b/xen/arch/x86/acpi/wakeup_prot.S
@@ -131,7 +131,7 @@ ENTRY(s3_resume)
         pop     %r12
         pop     %rbx
         pop     %rbp
-        ret
+        RET
 
 .data
         .align 16
diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
index 88082f68a982..151df73583f1 100644
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -149,16 +149,45 @@ void init_or_livepatch add_nops(void *insns, unsigned int len)
     }
 }
 
+void nocall __x86_return_thunk(void);
+
 /*
  * Place a return at @ptr.  @ptr must be in the writable alias of a stub.
  *
+ * When CONFIG_RETURN_THUNK is active, this may be a JMP __x86_return_thunk
+ * instead, depending on the safety of @ptr with respect to Indirect Target
+ * Selection.
+ *
  * Returns the next position to write into the stub.
  */
 void *place_ret(void *ptr)
 {
+    unsigned long addr = (unsigned long)ptr;
     uint8_t *p = ptr;
 
-    *p++ = 0xc3;
+    /*
+     * When Return Thunks are used, if a RET would be unsafe at this location
+     * with respect to Indirect Target Selection (i.e. if addr is in the first
+     * half of a cacheline), insert a JMP __x86_return_thunk instead.
+     *
+     * The displacement needs to be relative to the executable alias of the
+     * stub, not to @ptr which is the writeable alias.
+     */
+    if ( IS_ENABLED(CONFIG_RETURN_THUNK) && !(addr & 0x20) )
+    {
+        long stub_va = (this_cpu(stubs.addr) & PAGE_MASK) + (addr & ~PAGE_MASK);
+        long disp = (long)__x86_return_thunk - (stub_va + 5);
+
+        BUG_ON((int32_t)disp != disp);
+
+        *p++ = 0xe9;
+        *(int32_t *)p = disp;
+        p += 4;
+    }
+    else
+    {
+        *p++ = 0xc3;
+    }
 
     return p;
 }
diff --git a/xen/arch/x86/arch.mk b/xen/arch/x86/arch.mk
index 227d439a4523..379317c01337 100644
--- a/xen/arch/x86/arch.mk
+++ b/xen/arch/x86/arch.mk
@@ -50,6 +50,9 @@ CFLAGS-$(CONFIG_CC_IS_GCC) += -fno-jump-tables
 CFLAGS-$(CONFIG_CC_IS_CLANG) += -mretpoline-external-thunk
 endif
 
+# Compile with return thunk support if selected.
+CFLAGS-$(CONFIG_RETURN_THUNK) += -mfunction-return=thunk-extern
+
 ifdef CONFIG_XEN_IBT
 # Force -fno-jump-tables to work around
 #   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104816
diff --git a/xen/arch/x86/bhb-thunk.S b/xen/arch/x86/bhb-thunk.S
index 05f1043df7d0..472da481dd42 100644
--- a/xen/arch/x86/bhb-thunk.S
+++ b/xen/arch/x86/bhb-thunk.S
@@ -23,7 +23,7 @@ ENTRY(clear_bhb_tsx)
 0:      .byte 0xc6, 0xf8, 0             /* xabort $0 */
         int3
 1:
-        ret
+        RET
 
         .size clear_bhb_tsx, . - clear_bhb_tsx
         .type clear_bhb_tsx, @function
diff --git a/xen/arch/x86/clear_page.S b/xen/arch/x86/clear_page.S
index d9d524c79ecd..ea70bd91676e 100644
--- a/xen/arch/x86/clear_page.S
+++ b/xen/arch/x86/clear_page.S
@@ -1,5 +1,6 @@
         .file __FILE__
 
+#include <asm/asm_defns.h>
 #include <asm/page.h>
 
 ENTRY(clear_page_sse2)
@@ -15,4 +16,4 @@ ENTRY(clear_page_sse2)
         jnz     0b
 
         sfence
-        ret
+        RET
diff --git a/xen/arch/x86/copy_page.S b/xen/arch/x86/copy_page.S
index 2da81126c5fa..bb79c5fc79db 100644
--- a/xen/arch/x86/copy_page.S
+++ b/xen/arch/x86/copy_page.S
@@ -1,5 +1,6 @@
         .file __FILE__
 
+#include <asm/asm_defns.h>
 #include <asm/page.h>
 
 #define src_reg %rsi
@@ -40,4 +41,4 @@ ENTRY(copy_page_sse2)
         movnti  tmp4_reg, 3*WORD_SIZE(dst_reg)
 
         sfence
-        ret
+        RET
diff --git a/xen/arch/x86/efi/check.c b/xen/arch/x86/efi/check.c
index 9e473faad3c9..23ba30abf330 100644
--- a/xen/arch/x86/efi/check.c
+++ b/xen/arch/x86/efi/check.c
@@ -3,6 +3,9 @@ int __attribute__((__ms_abi__)) test(int i)
     return i;
 }
 
+/* In case -mfunction-return is in use. */
+void __x86_return_thunk(void) {};
+
 /*
  * Populate an array with "addresses" of relocatable and absolute values.
  * This is to probe ld for (a) emitting base relocations at all and (b) not
diff --git a/xen/arch/x86/include/asm/asm-defns.h b/xen/arch/x86/include/asm/asm-defns.h
index 7e22fcb9c06b..a9ca0d05ec99 100644
--- a/xen/arch/x86/include/asm/asm-defns.h
+++ b/xen/arch/x86/include/asm/asm-defns.h
@@ -47,6 +47,12 @@
     .endif
 .endm
 
+#ifdef CONFIG_RETURN_THUNK
+# define RET jmp __x86_return_thunk
+#else
+# define RET ret
+#endif
+
 #ifdef CONFIG_XEN_IBT
 # define ENDBR64 endbr64
 #else
diff --git a/xen/arch/x86/indirect-thunk.S b/xen/arch/x86/indirect-thunk.S
index e7ef104d3bd3..239cf7dc770b 100644
--- a/xen/arch/x86/indirect-thunk.S
+++ b/xen/arch/x86/indirect-thunk.S
@@ -11,6 +11,9 @@
 
 #include <asm/asm_defns.h>
 
+
+#ifdef CONFIG_INDIRECT_THUNK
+
 .macro IND_THUNK_RETPOLINE reg:req
         call 1f
         int3
@@ -60,3 +63,27 @@ ENTRY(__x86_indirect_thunk_\reg)
 .irp reg, ax, cx, dx, bx, bp, si, di, 8, 9, 10, 11, 12, 13, 14, 15
         GEN_INDIRECT_THUNK reg=r\reg
 .endr
+
+#endif /* CONFIG_INDIRECT_THUNK */
+
+#ifdef CONFIG_RETURN_THUNK
+        .section .text.entry.__x86_return_thunk, "ax", @progbits
+
+        /*
+         * The Indirect Target Selection speculative vulnerability means that
+         * indirect branches (including RETs) are unsafe when in the first
+         * half of a cacheline.  Arrange for them to be in the second half.
+         *
+         * Align to 64, then skip 32.
+         */
+        .balign 64
+        .fill 32, 1, 0xcc
+
+ENTRY(__x86_return_thunk)
+        ret
+        int3 /* Halt straight-line speculation */
+
+        .size __x86_return_thunk, . - __x86_return_thunk
+        .type __x86_return_thunk, @function
+
+#endif /* CONFIG_RETURN_THUNK */
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index 28ea7cc580a9..872a89db769c 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -143,7 +143,7 @@ static io_emul_stub_t *io_emul_stub_setup(struct priv_op_ctxt *ctxt, u8 opcode,
     BUILD_BUG_ON(STUB_BUF_SIZE / 2 <
                  (sizeof(prologue) + sizeof(epilogue) + 10 /* 2x call */ +
                   MAX(3 /* default stub */, IOEMUL_QUIRK_STUB_BYTES) +
-                  1 /* ret */));
+                  (IS_ENABLED(CONFIG_RETURN_THUNK) ? 5 : 1) /* ret */));
     /* Runtime confirmation that we haven't clobbered an adjacent stub. */
     BUG_ON(STUB_BUF_SIZE / 2 < (p - ctxt->io_emul_stub));
 
diff --git a/xen/arch/x86/pv/gpr_switch.S b/xen/arch/x86/pv/gpr_switch.S
index e7f5bfcd2d03..d90435be882e 100644
--- a/xen/arch/x86/pv/gpr_switch.S
+++ b/xen/arch/x86/pv/gpr_switch.S
@@ -26,7 +26,7 @@ ENTRY(load_guest_gprs)
         movq  UREGS_r15(%rdi), %r15
         movq  UREGS_rcx(%rdi), %rcx
         movq  UREGS_rdi(%rdi), %rdi
-        ret
+        RET
 
         .size load_guest_gprs, . - load_guest_gprs
         .type load_guest_gprs, STT_FUNC
@@ -51,7 +51,7 @@ ENTRY(save_guest_gprs)
         movq  %rbx, UREGS_rbx(%rdi)
         movq  %rdx, UREGS_rdx(%rdi)
         movq  %rcx, UREGS_rcx(%rdi)
-        ret
+        RET
 
         .size save_guest_gprs, . - save_guest_gprs
         .type save_guest_gprs, STT_FUNC
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index b1e47a849e87..2f777e8a7e75 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -575,6 +575,9 @@ static void __init print_details(enum ind_thunk thunk)
 #ifdef CONFIG_INDIRECT_THUNK
                " INDIRECT_THUNK"
 #endif
+#ifdef CONFIG_RETURN_THUNK
+               " RETURN_THUNK"
+#endif
 #ifdef CONFIG_SHADOW_PAGING
                " SHADOW_PAGING"
 #endif
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index ff462a92e003..57e54dc75fc2 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -183,7 +183,7 @@ ENTRY(cr4_pv32_restore)
         mov   %rax, %cr4
         mov   %rax, (%rdx)
         pop   %rdx
-        ret
+        RET
 0:
 #ifndef NDEBUG
         /* Check that _all_ of the bits intended to be set actually are. */
@@ -202,7 +202,7 @@ ENTRY(cr4_pv32_restore)
 #endif
         pop   %rdx
         xor   %eax, %eax
-        ret
+        RET
 
 ENTRY(compat_syscall)
         /* Fix up reported %cs/%ss for compat domains. */
@@ -329,7 +329,7 @@ __UNLIKELY_END(compat_bounce_null_selector)
         xor   %eax, %eax
         mov   %ax,  TRAPBOUNCE_cs(%rdx)
         mov   %al,  TRAPBOUNCE_flags(%rdx)
-        ret
+        RET
 
 .section .fixup,"ax"
 .Lfx13:
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 7bb0cc708a76..fd63ee2e4c1f 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -598,7 +598,7 @@ __UNLIKELY_END(create_bounce_frame_bad_bounce_ip)
         xor   %eax, %eax
         mov   %rax, TRAPBOUNCE_eip(%rdx)
         mov   %al,  TRAPBOUNCE_flags(%rdx)
-        ret
+        RET
 
         .pushsection .fixup, "ax", @progbits
         # Numeric tags below represent the intended overall %rsi adjustment.
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index 8930e14fc40e..b66e708ebf69 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -86,6 +86,7 @@ SECTIONS
        . = ALIGN(PAGE_SIZE);
        _stextentry = .;
        *(.text.entry)
+       *(.text.entry.*)
        . = ALIGN(PAGE_SIZE);
        _etextentry = .;
 
diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index cd7385153823..c82bee92f4ca 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -112,6 +112,17 @@ config INDIRECT_THUNK
 	  When enabled, indirect branches are implemented using a new construct
 	  called "retpoline" that prevents speculation.
 
+config RETURN_THUNK
+	bool "Out-of-line Returns"
+	depends on CC_HAS_RETURN_THUNK
+	default INDIRECT_THUNK
+	help
+	  Compile Xen with out-of-line returns.
+
+	  This allows Xen to mitigate a variety of speculative vulnerabilities
+	  by choosing a hardware-dependent instruction sequence to implement
+	  function returns safely.
+
 config SPECULATIVE_HARDEN_ARRAY
 	bool "Speculative Array Hardening"
 	default y
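When CONFIG_RETURN_THUNK is active and the RET would start in the first half
of a cacheline (bit 5 of the address clear), place_ret() above emits a 5-byte
JMP rel32 to __x86_return_thunk instead.  A minimal sketch of the displacement
arithmetic, using invented addresses (the real code recovers the executable
stub address from this_cpu(stubs.addr)):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical addresses, for illustration only. */
    unsigned long stubs_exec  = 0xffff82d040400020UL; /* executable alias */
    unsigned long write_alias = 0xffff82d040a00020UL; /* writable alias (@ptr) */
    unsigned long thunk       = 0xffff82d040200040UL; /* __x86_return_thunk */

    /* Page of the executable alias + page offset of the write pointer. */
    long stub_va = (long)((stubs_exec & ~0xfffUL) | (write_alias & 0xfffUL));
    long disp = (long)thunk - (stub_va + 5);  /* rel32, relative to JMP end */

    assert((int32_t)disp == disp);            /* must fit in 32 bits */
    printf("emit: e9 %08x\n", (unsigned int)(int32_t)disp);
    return 0;
}
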
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Synthesise ITS_NO to guests on unaffected hardware

Synthesise ITS_NO for guests when running on hardware which is not affected
by ITS.  It is easier to express feature word 17 in terms of word 16 +
[32, 64), as that's how the layout is given in documentation.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index a6b8af12964c..d9aedfc25ab0 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -164,6 +164,7 @@
 #define cpu_has_gds_no          boot_cpu_has(X86_FEATURE_GDS_NO)
 #define cpu_has_rfds_no         boot_cpu_has(X86_FEATURE_RFDS_NO)
 #define cpu_has_rfds_clear      boot_cpu_has(X86_FEATURE_RFDS_CLEAR)
+#define cpu_has_its_no          boot_cpu_has(X86_FEATURE_ITS_NO)
 
 /* Synthesized. */
 #define cpu_has_arch_perfmon    boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 2f777e8a7e75..559ee90b44dc 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -1766,6 +1766,90 @@ static void __init bhi_calculations(void)
     }
 }
 
+/*
+ * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/indirect-target-selection.html
+ */
+static void __init its_calculations(void)
+{
+    /*
+     * Indirect Target Selection is a Branch Prediction bug whereby certain
+     * indirect branches (including RETs) get predicted using a direct branch
+     * target, rather than a suitable indirect target, bypassing hardware
+     * isolation protections.
+     *
+     * ITS affects Core (but not Atom) processors starting from the
+     * introduction of eIBRS, up to but not including Golden Cove cores
+     * (checked here with BHI_CTRL).
+     *
+     * The ITS_NO feature is not expected to be enumerated by hardware, and is
+     * only for VMMs to synthesise for guests.
+     *
+     * ITS comes in 3 flavours:
+     *
+     *   1) Across-IBPB.  Indirect branches after the IBPB can be controlled
+     *      by direct targets which existed prior to the IBPB.  This is
+     *      addressed in the IPU 2025.1 microcode drop, and has no other
+     *      software interaction.
+     *
+     *   2) Guest/Host.  Indirect branches in the VMM can be controlled by
+     *      direct targets from the guest.  This applies equally to PV guests
+     *      (Ring3) and HVM guests (VMX), and applies to all Skylake-uarch
+     *      cores with eIBRS.
+     *
+     *   3) Intra-mode.  Indirect branches in the VMM can be controlled by
+     *      other execution in the same mode.
+     */
+
+    /*
+     * If we can see ITS_NO, or we're virtualised, do nothing.  We are, or
+     * may migrate to, somewhere unsafe.
+     */
+    if ( cpu_has_its_no || cpu_has_hypervisor )
+        return;
+
+    /* ITS is only known to affect Intel processors at this time. */
+    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL )
+        return;
+
+    /*
+     * ITS does not exist on:
+     *  - non-Family 6 CPUs
+     *  - those without eIBRS
+     *  - those with BHI_CTRL
+     * but we still need to synthesise ITS_NO.
+     */
+    if ( boot_cpu_data.x86 != 6 || !cpu_has_eibrs ||
+         boot_cpu_has(X86_FEATURE_BHI_CTRL) )
+        goto synthesise;
+
+    switch ( boot_cpu_data.x86_model )
+    {
+        /* These Skylake-uarch cores suffer cases #2 and #3. */
+    case INTEL_FAM6_SKYLAKE_X:
+    case INTEL_FAM6_KABYLAKE_L:
+    case INTEL_FAM6_KABYLAKE:
+    case INTEL_FAM6_COMETLAKE:
+    case INTEL_FAM6_COMETLAKE_L:
+        return;
+
+        /* These Sunny/Willow/Cypress Cove cores suffer case #3. */
+    case INTEL_FAM6_ICELAKE_X:
+    case INTEL_FAM6_ICELAKE_D:
+    case INTEL_FAM6_ICELAKE_L:
+    case INTEL_FAM6_TIGERLAKE_L:
+    case INTEL_FAM6_TIGERLAKE:
+    case INTEL_FAM6_ROCKETLAKE:
+        return;
+
+    default:
+        break;
+    }
+
+    /* Platforms remaining are not believed to be vulnerable to ITS. */
+ synthesise:
+    setup_force_cpu_cap(X86_FEATURE_ITS_NO);
+}
+
 void spec_ctrl_init_domain(struct domain *d)
 {
     bool pv = is_pv_domain(d);
@@ -2316,6 +2400,8 @@ void __init init_speculation_mitigations(void)
 
     bhi_calculations();
 
+    its_calculations();
+
     print_details(thunk);
 
     /*
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index 0004fd4bf56e..99c4dc1ffd40 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -334,7 +334,8 @@ XEN_CPUFEATURE(GDS_NO,             16*32+26) /*A  No Gather Data Sampling */
 XEN_CPUFEATURE(RFDS_NO,            16*32+27) /*A  No Register File Data Sampling */
 XEN_CPUFEATURE(RFDS_CLEAR,         16*32+28) /*!A Register File(s) cleared by VERW */
 
-/* Intel-defined CPU features, MSR_ARCH_CAPS 0x10a.edx, word 17 */
+/* Intel-defined CPU features, MSR_ARCH_CAPS 0x10a.edx, word 17 (express in terms of word 16) */
+XEN_CPUFEATURE(ITS_NO,             16*32+62) /*!A No Indirect Target Selection */
 
 #endif /* XEN_CPUFEATURE */
 
diff --git a/xen/tools/gen-cpuid.py b/xen/tools/gen-cpuid.py
index 415b8b1d6881..05b1c13ec478 100755
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -51,7 +51,7 @@ def parse_definitions(state):
         r"\s+/\*([\w!]*) .*$")
 
     word_regex = re.compile(
-        r"^/\* .* word (\d*) \*/$")
+        r"^/\* .* word (\d*) .*\*/$")
     last_word = -1
 
     this = sys.modules[__name__]
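The cpufeatureset change above defines ITS_NO as 16*32+62 rather than
introducing a standalone word-17 base, matching the "word 16 + [32, 64)"
layout the commit message describes; the gen-cpuid.py regex is relaxed so
the trailing "(express in terms of word 16)" note doesn't break word
parsing.  A quick check of the bit arithmetic (a sketch, not Xen code):

#include <stdio.h>

int main(void)
{
    unsigned int pos = 16 * 32 + 62;   /* XEN_CPUFEATURE(ITS_NO, 16*32+62) */

    /* Feature words are 32 bits wide, so this is word 17, bit 30, i.e.
     * MSR_ARCH_CAPS bit 62 in the high (edx) half. */
    printf("word %u, bit %u\n", pos / 32, pos % 32);
    return 0;
}
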
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/alternative: Support replacements when a feature is not present

Use the top bit of a->cpuid to express inverted polarity.  This requires
stripping the top bit back out when performing the sanity checks.

Despite it only being used once, introduce a 'replace' boolean to express the
decision more clearly in _apply_alternatives().

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
index 91b60865bf3b..90bf9a92bff9 100644
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -197,6 +197,8 @@ static int init_or_livepatch _apply_alternatives(struct alt_instr *start,
         uint8_t *repl = ALT_REPL_PTR(a);
         uint8_t buf[MAX_PATCH_LEN];
         unsigned int total_len = a->orig_len + a->pad_len;
+        unsigned int feat = a->cpuid & ~ALT_FLAG_NOT;
+        bool inv = a->cpuid & ALT_FLAG_NOT, replace;
 
         if ( a->repl_len > total_len )
         {
@@ -214,11 +216,11 @@ static int init_or_livepatch _apply_alternatives(struct alt_instr *start,
             return -ENOSPC;
         }
 
-        if ( a->cpuid >= NCAPINTS * 32 )
+        if ( feat >= NCAPINTS * 32 )
         {
              printk(XENLOG_ERR
                    "Alt for %ps, feature %#x outside of featureset range %#x\n",
-                   ALT_ORIG_PTR(a), a->cpuid, NCAPINTS * 32);
+                   ALT_ORIG_PTR(a), feat, NCAPINTS * 32);
             return -ERANGE;
         }
 
@@ -243,8 +245,14 @@ static int init_or_livepatch _apply_alternatives(struct alt_instr *start,
             continue;
         }
 
+        /*
+         * Should a replacement be performed?  Most replacements have positive
+         * polarity, but we support negative polarity too.
+         */
+        replace = boot_cpu_has(feat) ^ inv;
+
         /* If there is no replacement to make, see about optimising the nops. */
-        if ( !boot_cpu_has(a->cpuid) )
+        if ( !replace )
         {
             /* Origin site already touched?  Don't nop anything. */
             if ( base->priv )
diff --git a/xen/arch/x86/include/asm/alternative.h b/xen/arch/x86/include/asm/alternative.h
index 69555d781ef9..89b7bdcb82e5 100644
--- a/xen/arch/x86/include/asm/alternative.h
+++ b/xen/arch/x86/include/asm/alternative.h
@@ -1,6 +1,13 @@
 #ifndef __X86_ALTERNATIVE_H__
 #define __X86_ALTERNATIVE_H__
 
+/*
+ * Common to both C and ASM.  Express a replacement when a feature is not
+ * available.
+ */
+#define ALT_FLAG_NOT (1 << 15)
+#define ALT_NOT(x) (ALT_FLAG_NOT | (x))
+
 #ifdef __ASSEMBLY__
 #include <asm/alternative-asm.h>
 #else
@@ -11,7 +18,7 @@
 struct __packed alt_instr {
     int32_t  orig_offset;   /* original instruction */
     int32_t  repl_offset;   /* offset to replacement instruction */
-    uint16_t cpuid;         /* cpuid bit set for replacement */
+    uint16_t cpuid;         /* cpuid bit set for replacement (top bit is polarity) */
     uint8_t  orig_len;      /* length of original instruction */
     uint8_t  repl_len;      /* length of new instruction */
     uint8_t  pad_len;       /* length of build-time padding */

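A minimal sketch (not Xen code) of the polarity decode this patch adds: the
top bit of the cpuid field is stripped for the range check and XORed back in
when deciding whether to patch:

#include <stdbool.h>
#include <stdio.h>

#define ALT_FLAG_NOT (1 << 15)
#define ALT_NOT(x)   (ALT_FLAG_NOT | (x))

/* Pretend feature test, standing in for boot_cpu_has(). */
static bool has_feature(unsigned int feat)
{
    return feat == 30;
}

int main(void)
{
    unsigned int cpuid = ALT_NOT(30);  /* replace when feature is absent */
    unsigned int feat  = cpuid & ~ALT_FLAG_NOT;
    bool inv = cpuid & ALT_FLAG_NOT;
    bool replace = has_feature(feat) ^ inv;

    /* Feature present + inverted polarity => no replacement. */
    printf("feat=%u inv=%u replace=%u\n", feat, inv, replace);
    return 0;
}
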
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/guest: Remove use of the Xen hypercall_page

In order to protect against ITS, Xen needs to start using return thunks.
Therefore the advice in XSA-466 becomes relevant, and the hypercall_page needs
to be removed.

Implement early_hypercall(), with infrastructure to figure out the correct
instruction on first use.  Use ALTERNATIVE()s to result in inline hypercalls,
including the ALT_NOT() form so we only need a single synthetic feature bit.

No overall change.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/xen/arch/x86/guest/xen/Makefile b/xen/arch/x86/guest/xen/Makefile
index 26fb4b1007c0..8b3250aa8886 100644
--- a/xen/arch/x86/guest/xen/Makefile
+++ b/xen/arch/x86/guest/xen/Makefile
@@ -1,4 +1,4 @@
-obj-y += hypercall_page.o
+obj-bin-y += hypercall.init.o
 obj-y += xen.o
 
 obj-bin-$(CONFIG_PVH_GUEST) += pvh-boot.init.o
diff --git a/xen/arch/x86/guest/xen/hypercall.S b/xen/arch/x86/guest/xen/hypercall.S
new file mode 100644
index 000000000000..47ab685cf848
--- /dev/null
+++ b/xen/arch/x86/guest/xen/hypercall.S
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#include <asm/asm_defns.h>
+
+        .section .init.text, "ax", @progbits
+
+        /*
+         * Used during early boot, before alternatives have run and inlined
+         * the appropriate instruction.  Called using the hypercall ABI.
+         */
+ENTRY(early_hypercall)
+        cmpb    $0, early_hypercall_insn(%rip)
+        jl      .L_setup
+        je      1f
+
+        vmmcall
+        ret
+
+1:      vmcall
+        ret
+
+.L_setup:
+        /*
+         * When setting up the first time around, all registers need
+         * preserving.  Save the non-callee-saved ones.
+         */
+        push    %r11
+        push    %r10
+        push    %r9
+        push    %r8
+        push    %rdi
+        push    %rsi
+        push    %rdx
+        push    %rcx
+        push    %rax
+
+        call    early_hypercall_setup
+
+        pop     %rax
+        pop     %rcx
+        pop     %rdx
+        pop     %rsi
+        pop     %rdi
+        pop     %r8
+        pop     %r9
+        pop     %r10
+        pop     %r11
+
+        jmp     early_hypercall
+
+        .type early_hypercall, @function
+        .size early_hypercall, . - early_hypercall
diff --git a/xen/arch/x86/guest/xen/hypercall_page.S b/xen/arch/x86/guest/xen/hypercall_page.S
deleted file mode 100644
index 9958d02cfd5b..000000000000
--- a/xen/arch/x86/guest/xen/hypercall_page.S
+++ /dev/null
@@ -1,78 +0,0 @@
-#include <asm/page.h>
-#include <asm/asm_defns.h>
-#include <public/xen.h>
-
-        .section ".text.page_aligned", "ax", @progbits
-        .p2align PAGE_SHIFT
-
-GLOBAL(hypercall_page)
-         /* Poisoned with `ret` for safety before hypercalls are set up. */
-        .fill PAGE_SIZE, 1, 0xc3
-        .type hypercall_page, STT_OBJECT
-        .size hypercall_page, PAGE_SIZE
-
-/*
- * Identify a specific hypercall in the hypercall page
- * @param name Hypercall name.
- */
-#define DECLARE_HYPERCALL(name)                                                 \
-        .globl HYPERCALL_ ## name;                                              \
-        .type  HYPERCALL_ ## name, STT_FUNC;                                    \
-        .size  HYPERCALL_ ## name, 32;                                          \
-        .set   HYPERCALL_ ## name, hypercall_page + __HYPERVISOR_ ## name * 32
-
-DECLARE_HYPERCALL(set_trap_table)
-DECLARE_HYPERCALL(mmu_update)
-DECLARE_HYPERCALL(set_gdt)
-DECLARE_HYPERCALL(stack_switch)
-DECLARE_HYPERCALL(set_callbacks)
-DECLARE_HYPERCALL(fpu_taskswitch)
-DECLARE_HYPERCALL(sched_op_compat)
-DECLARE_HYPERCALL(platform_op)
-DECLARE_HYPERCALL(set_debugreg)
-DECLARE_HYPERCALL(get_debugreg)
-DECLARE_HYPERCALL(update_descriptor)
-DECLARE_HYPERCALL(memory_op)
-DECLARE_HYPERCALL(multicall)
-DECLARE_HYPERCALL(update_va_mapping)
-DECLARE_HYPERCALL(set_timer_op)
-DECLARE_HYPERCALL(event_channel_op_compat)
-DECLARE_HYPERCALL(xen_version)
-DECLARE_HYPERCALL(console_io)
-DECLARE_HYPERCALL(physdev_op_compat)
-DECLARE_HYPERCALL(grant_table_op)
-DECLARE_HYPERCALL(vm_assist)
-DECLARE_HYPERCALL(update_va_mapping_otherdomain)
-DECLARE_HYPERCALL(iret)
-DECLARE_HYPERCALL(vcpu_op)
-DECLARE_HYPERCALL(set_segment_base)
-DECLARE_HYPERCALL(mmuext_op)
-DECLARE_HYPERCALL(xsm_op)
-DECLARE_HYPERCALL(nmi_op)
-DECLARE_HYPERCALL(sched_op)
-DECLARE_HYPERCALL(callback_op)
-DECLARE_HYPERCALL(xenoprof_op)
-DECLARE_HYPERCALL(event_channel_op)
-DECLARE_HYPERCALL(physdev_op)
-DECLARE_HYPERCALL(hvm_op)
-DECLARE_HYPERCALL(sysctl)
-DECLARE_HYPERCALL(domctl)
-DECLARE_HYPERCALL(kexec_op)
-DECLARE_HYPERCALL(argo_op)
-DECLARE_HYPERCALL(xenpmu_op)
-
-DECLARE_HYPERCALL(arch_0)
-DECLARE_HYPERCALL(arch_1)
-DECLARE_HYPERCALL(arch_2)
-DECLARE_HYPERCALL(arch_3)
-DECLARE_HYPERCALL(arch_4)
-DECLARE_HYPERCALL(arch_5)
-DECLARE_HYPERCALL(arch_6)
-DECLARE_HYPERCALL(arch_7)
-
-/*
- * Local variables:
- * tab-width: 8
- * indent-tabs-mode: nil
- * End:
- */
diff --git a/xen/arch/x86/guest/xen/xen.c b/xen/arch/x86/guest/xen/xen.c
index 44a689d4394b..306f88ef6add 100644
--- a/xen/arch/x86/guest/xen/xen.c
+++ b/xen/arch/x86/guest/xen/xen.c
@@ -26,7 +26,6 @@
 bool __read_mostly xen_guest;
 
 uint32_t __read_mostly xen_cpuid_base;
-extern char hypercall_page[];
 static struct rangeset *mem;
 
 DEFINE_PER_CPU(unsigned int, vcpu_id);
@@ -35,6 +34,50 @@ static struct vcpu_info *vcpu_info;
 static unsigned long vcpu_info_mapped[BITS_TO_LONGS(NR_CPUS)];
 DEFINE_PER_CPU(struct vcpu_info *, vcpu_info);
 
+/*
+ * Which instruction to use for early hypercalls:
+ *   < 0 setup
+ *     0 vmcall
+ *   > 0 vmmcall
+ */
+int8_t __initdata early_hypercall_insn = -1;
+
+/*
+ * Called once during the first hypercall to figure out which instruction to
+ * use.  Error handling options are limited.
+ */
+void __init early_hypercall_setup(void)
+{
+    BUG_ON(early_hypercall_insn != -1);
+
+    if ( !boot_cpu_data.x86_vendor )
+    {
+        unsigned int eax, ebx, ecx, edx;
+
+        cpuid(0, &eax, &ebx, &ecx, &edx);
+
+        boot_cpu_data.x86_vendor = x86_cpuid_lookup_vendor(ebx, ecx, edx);
+    }
+
+    switch ( boot_cpu_data.x86_vendor )
+    {
+    case X86_VENDOR_INTEL:
+    case X86_VENDOR_CENTAUR:
+    case X86_VENDOR_SHANGHAI:
+        early_hypercall_insn = 0;
+        setup_force_cpu_cap(X86_FEATURE_USE_VMCALL);
+        break;
+
+    case X86_VENDOR_AMD:
+    case X86_VENDOR_HYGON:
+        early_hypercall_insn = 1;
+        break;
+
+    default:
+        BUG();
+    }
+}
+
 static void __init find_xen_leaves(void)
 {
     uint32_t eax, ebx, ecx, edx, base;
@@ -337,9 +380,6 @@ const struct hypervisor_ops *__init xg_probe(void)
     if ( !xen_cpuid_base )
         return NULL;
 
-    /* Fill the hypercall page. */
-    wrmsrl(cpuid_ebx(xen_cpuid_base + 2), __pa(hypercall_page));
-
     xen_guest = true;
 
     return &ops;
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index ba3df174b76e..9e3ed21c026d 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -42,6 +42,7 @@ XEN_CPUFEATURE(XEN_SHSTK,         X86_SYNTH(26)) /* Xen uses CET Shadow Stacks *
 XEN_CPUFEATURE(XEN_IBT,           X86_SYNTH(27)) /* Xen uses CET Indirect Branch Tracking */
 XEN_CPUFEATURE(IBPB_ENTRY_PV,     X86_SYNTH(28)) /* MSR_PRED_CMD used by Xen for PV */
 XEN_CPUFEATURE(IBPB_ENTRY_HVM,    X86_SYNTH(29)) /* MSR_PRED_CMD used by Xen for HVM */
+XEN_CPUFEATURE(USE_VMCALL,        X86_SYNTH(30)) /* Use VMCALL instead of VMMCALL */
 
 /* Bug words follow the synthetic words. */
 #define X86_NR_BUG 1
diff --git a/xen/arch/x86/include/asm/guest/xen-hcall.h b/xen/arch/x86/include/asm/guest/xen-hcall.h
index 665b472d05ac..96004dec9909 100644
--- a/xen/arch/x86/include/asm/guest/xen-hcall.h
+++ b/xen/arch/x86/include/asm/guest/xen-hcall.h
@@ -30,9 +30,11 @@
     ({                                                                  \
         long res, tmp__;                                                \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__) ASM_CALL_CONSTRAINT              \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1))                                          \
             : "memory" );                                               \
         (type)res;                                                      \
@@ -42,10 +44,12 @@
     ({                                                                  \
         long res, tmp__;                                                \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__), "=S" (tmp__)                    \
               ASM_CALL_CONSTRAINT                                       \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1)), "2" ((long)(a2))                        \
             : "memory" );                                               \
         (type)res;                                                      \
@@ -55,10 +59,12 @@
     ({                                                                  \
         long res, tmp__;                                                \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__), "=S" (tmp__), "=d" (tmp__)      \
               ASM_CALL_CONSTRAINT                                       \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1)), "2" ((long)(a2)), "3" ((long)(a3))      \
             : "memory" );                                               \
         (type)res;                                                      \
@@ -69,10 +75,12 @@
         long res, tmp__;                                                \
         register long _a4 asm ("r10") = ((long)(a4));                   \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__), "=S" (tmp__), "=d" (tmp__),     \
               "=&r" (tmp__) ASM_CALL_CONSTRAINT                         \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1)), "2" ((long)(a2)), "3" ((long)(a3)),     \
               "4" (_a4)                                                 \
             : "memory" );                                               \
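Putting the pieces together: early_hypercall_insn acts as a tri-state latch,
with the assembly stub calling early_hypercall_setup() exactly once and then
retrying the hypercall.  A rough C rendering of that control flow (a sketch;
the strings stand in for the actual vmcall/vmmcall instructions):

#include <stdio.h>

/* < 0 setup needed, 0 vmcall (Intel-like), > 0 vmmcall (AMD-like). */
static signed char early_hypercall_insn = -1;

static void early_hypercall_setup(void)
{
    /* Vendor probe elided; pretend the CPUID vendor string said Intel. */
    early_hypercall_insn = 0;
}

static const char *early_hypercall(void)
{
    if ( early_hypercall_insn < 0 )
    {
        early_hypercall_setup();
        return early_hypercall(); /* mirrors the "jmp early_hypercall" retry */
    }

    return early_hypercall_insn == 0 ? "vmcall" : "vmmcall";
}

int main(void)
{
    printf("first hypercall uses: %s\n", early_hypercall());
    return 0;
}
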
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/thunk: (Mis)align __x86_indirect_thunk_* to mitigate ITS

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.

Arrange for __x86_indirect_thunk_* to always be in the second half.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/indirect-thunk.S b/xen/arch/x86/indirect-thunk.S
index de6aef606832..e7ef104d3bd3 100644
--- a/xen/arch/x86/indirect-thunk.S
+++ b/xen/arch/x86/indirect-thunk.S
@@ -35,6 +35,16 @@
 .macro GEN_INDIRECT_THUNK reg:req
         .section .text.__x86_indirect_thunk_\reg, "ax", @progbits
 
+        /*
+         * The Indirect Target Selection speculative vulnerability means that
+         * indirect branches (including RETs) are unsafe when in the first
+         * half of a cacheline.  Arrange for them to be in the second half.
+         *
+         * Align to 64, then skip 32.
+         */
+        .balign 64
+        .fill 32, 1, 0xcc
+
 ENTRY(__x86_indirect_thunk_\reg)
         ALTERNATIVE_2 __stringify(IND_THUNK_RETPOLINE \reg),              \
         __stringify(IND_THUNK_LFENCE \reg), X86_FEATURE_IND_THUNK_LFENCE, \
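The ".balign 64" plus ".fill 32, 1, 0xcc" pair guarantees the thunk entry
sits at byte 32 of its cacheline, i.e. in the ITS-safe second half.  A quick
sanity check of that layout rule (a sketch, with an arbitrary link address):

#include <assert.h>
#include <stdio.h>

int main(void)
{
    unsigned long base = 0x1234;                  /* arbitrary link address */
    unsigned long aligned = (base + 63) & ~63UL;  /* .balign 64 */
    unsigned long entry = aligned + 32;           /* .fill 32, 1, 0xcc */

    assert((entry & 0x3f) == 32);   /* offset 32 within the cacheline */
    assert(entry & 0x20);           /* second half => safe for a RET */
    printf("entry at cacheline offset %lu\n", entry & 0x3f);
    return 0;
}
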
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/thunk: (Mis)align the RETs in clear_bhb_loops() to mitigate ITS

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.

clear_bhb_loops() has a precise layout of branches.  The alignment for
performance causes the RETs to always be in an unsafe position, and converting
those to return thunks changes the branching pattern.  While such a conversion
is believed to be safe, clear_bhb_loops() is also a performance-relevant
fastpath, so (mis)align the RETs to be in a safe position.

No functional change.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/xen/arch/x86/bhb-thunk.S b/xen/arch/x86/bhb-thunk.S
index 7e866784f784..05f1043df7d0 100644
--- a/xen/arch/x86/bhb-thunk.S
+++ b/xen/arch/x86/bhb-thunk.S
@@ -52,7 +52,12 @@ ENTRY(clear_bhb_tsx)
  *   ret
  *
  * The CALL/RETs are necessary to prevent the Loop Stream Detector from
- * interfering.  The alignment is for performance and not safety.
+ * interfering.
+ *
+ * The .balign's are for performance, but they cause the RETs to be in unsafe
+ * positions with respect to Indirect Target Selection.  The .skips are to
+ * move the RETs into ITS-safe positions, rather than using the slowpath
+ * through __x86_return_thunk.
  *
  * The "short" sequence (5 and 5) is for CPUs prior to Alder Lake / Sapphire
  * Rapids (i.e. Cores prior to Golden Cove and/or Gracemont).
@@ -68,12 +73,14 @@ ENTRY(clear_bhb_loops)
         jmp     5f
         int3
 
-        .align 64
+        .balign 64
+        .skip   32 - (.Lr1 - 1f), 0xcc
 1:      call    2f
-        ret
+.Lr1:   ret
         int3
 
-        .align 64
+        .balign 64
+        .skip   32 - 18 /* (.Lr2 - 2f) but Clang IAS doesn't like this */, 0xcc
 2:      ALTERNATIVE "mov $5, %eax", "mov $7, %eax", X86_SPEC_BHB_LOOPS_LONG
 
 3:      jmp     4f
@@ -85,7 +92,7 @@ ENTRY(clear_bhb_loops)
         sub     $1, %ecx
         jnz     1b
 
-        ret
+.Lr2:   ret
 5:
         /*
          * The Intel sequence has an LFENCE here.  The purpose is to ensure
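The ".skip 32 - (.Lr1 - 1f), 0xcc" expression pads from the 64-byte boundary
so that the RET label itself, not the start of the block, lands at offset 32.
A sketch of the arithmetic (assuming "call rel32" is 5 bytes, as on x86):

#include <stdio.h>

int main(void)
{
    unsigned int call_len = 5;             /* .Lr1 - 1f: the "call 2f" */
    unsigned int pad = 32 - call_len;      /* .skip 32 - (.Lr1 - 1f), 0xcc */

    /* After .balign 64: pad bytes, then the call, then the ret. */
    printf("1: at offset %u, .Lr1 (ret) at offset %u\n", pad, pad + call_len);
    return 0;
}

The second .skip hardcodes 18 bytes for the same reason, as the in-line
comment notes, because Clang's integrated assembler rejects the symbolic
(.Lr2 - 2f) form there.
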
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/stubs: Introduce place_ret() to abstract away raw 0xc3's

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.
This means it's not safe for logic using the stubs to write raw 0xc3's.

Introduce place_ret() which, for now, writes a raw 0xc3 but will contain
additional logic when return thunks are in use.

stub_selftest() doesn't strictly need to be converted as it only runs at
boot, but doing so gets us a partial test of place_ret() too.

No functional change.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/tools/tests/x86_emulator/x86-emulate.h b/tools/tests/x86_emulator/x86-emulate.h
index 34f085511497..1483a5a88809 100644
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -68,6 +68,12 @@
 
 #define is_canonical_address(x) (((int64_t)(x) >> 47) == ((int64_t)(x) >> 63))
 
+static inline void *place_ret(void *ptr)
+{
+    *(uint8_t *)ptr = 0xc3;
+    return ptr + 1;
+}
+
 extern uint32_t mxcsr_mask;
 extern struct cpu_policy cp;
 
diff --git a/xen/arch/x86/Makefile b/xen/arch/x86/Makefile
index e8a4bf59c34f..7df877a09d3f 100644
--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -11,9 +11,7 @@ obj-$(CONFIG_PV) += pv/
 obj-y += x86_64/
 obj-y += x86_emulate/
 
-alternative-y := alternative.init.o
-alternative-$(CONFIG_LIVEPATCH) :=
-obj-bin-y += $(alternative-y)
+obj-y += alternative.o
 obj-y += apic.o
 obj-y += bhb-thunk.o
 obj-y += bitops.o
@@ -42,7 +40,7 @@ obj-y += hypercall.o
 obj-y += i387.o
 obj-y += i8259.o
 obj-y += io_apic.o
-obj-$(CONFIG_LIVEPATCH) += alternative.o livepatch.o
+obj-$(CONFIG_LIVEPATCH) += livepatch.o
 obj-y += msi.o
 obj-y += msr.o
 obj-$(CONFIG_INDIRECT_THUNK) += indirect-thunk.o
diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
index 90bf9a92bff9..be1957723472 100644
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -137,6 +137,20 @@ void init_or_livepatch add_nops(void *insns, unsigned int len)
     }
 }
 
+/*
+ * Place a return at @ptr.  @ptr must be in the writable alias of a stub.
+ *
+ * Returns the next position to write into the stub.
+ */
+void *place_ret(void *ptr)
+{
+    uint8_t *p = ptr;
+
+    *p++ = 0xc3;
+
+    return p;
+}
+
 /*
  * text_poke - Update instructions on a live kernel or non-executed code.
  * @addr: address to modify
diff --git a/xen/arch/x86/extable.c b/xen/arch/x86/extable.c
index 12cc9935d885..16bf2cc62b89 100644
--- a/xen/arch/x86/extable.c
+++ b/xen/arch/x86/extable.c
@@ -135,20 +135,20 @@ search_exception_table(const struct cpu_user_regs *regs, unsigned long *stub_ra)
 int __init cf_check stub_selftest(void)
 {
     static const struct {
-        uint8_t opc[8];
+        uint8_t opc[7];
         uint64_t rax;
         union stub_exception_token res;
     } tests[] __initconst = {
 #define endbr64 0xf3, 0x0f, 0x1e, 0xfa
-        { .opc = { endbr64, 0x0f, 0xb9, 0xc3, 0xc3 }, /* ud1 */
+        { .opc = { endbr64, 0x0f, 0xb9, 0x90 }, /* ud1 */
           .res.fields.trapnr = X86_EXC_UD },
-        { .opc = { endbr64, 0x90, 0x02, 0x00, 0xc3 }, /* nop; add (%rax),%al */
+        { .opc = { endbr64, 0x90, 0x02, 0x00 }, /* nop; add (%rax),%al */
           .rax = 0x0123456789abcdef,
           .res.fields.trapnr = X86_EXC_GP },
-        { .opc = { endbr64, 0x02, 0x04, 0x04, 0xc3 }, /* add (%rsp,%rax),%al */
+        { .opc = { endbr64, 0x02, 0x04, 0x04 }, /* add (%rsp,%rax),%al */
           .rax = 0xfedcba9876543210,
           .res.fields.trapnr = X86_EXC_SS },
-        { .opc = { endbr64, 0xcc, 0xc3, 0xc3, 0xc3 }, /* int3 */
+        { .opc = { endbr64, 0xcc, 0x90, 0x90 }, /* int3 */
           .res.fields.trapnr = X86_EXC_BP },
 #undef endbr64
     };
@@ -167,6 +167,7 @@ int __init cf_check stub_selftest(void)
 
         memset(ptr, 0xcc, STUB_BUF_SIZE / 2);
         memcpy(ptr, tests[i].opc, ARRAY_SIZE(tests[i].opc));
+        place_ret(ptr + ARRAY_SIZE(tests[i].opc));
         unmap_domain_page(ptr);
 
         asm volatile ( "INDIRECT_CALL %[stb]\n"
diff --git a/xen/arch/x86/include/asm/alternative.h b/xen/arch/x86/include/asm/alternative.h
index 89b7bdcb82e5..841a63ebf1b6 100644
--- a/xen/arch/x86/include/asm/alternative.h
+++ b/xen/arch/x86/include/asm/alternative.h
@@ -30,6 +30,8 @@ struct __packed alt_instr {
 #define ALT_REPL_PTR(a)     __ALT_PTR(a, repl_offset)
 
 extern void add_nops(void *insns, unsigned int len);
+void *place_ret(void *ptr);
+
 /* Similar to alternative_instructions except it can be run with IRQs enabled. */
 extern int apply_alternatives(struct alt_instr *start, struct alt_instr *end);
 extern void alternative_instructions(void);
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index 70150c272276..ff5d1c9f8634 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -76,7 +76,6 @@ static io_emul_stub_t *io_emul_stub_setup(struct priv_op_ctxt *ctxt, u8 opcode,
         0x41, 0x5c, /* pop %r12  */
         0x5d,       /* pop %rbp  */
         0x5b,       /* pop %rbx  */
-        0xc3,       /* ret       */
     };
 
     const struct stubs *this_stubs = &this_cpu(stubs);
@@ -126,11 +125,13 @@ static io_emul_stub_t *io_emul_stub_setup(struct priv_op_ctxt *ctxt, u8 opcode,
 
     APPEND_CALL(save_guest_gprs);
     APPEND_BUFF(epilogue);
+    p = place_ret(p);
 
     /* Build-time best effort attempt to catch problems. */
     BUILD_BUG_ON(STUB_BUF_SIZE / 2 <
                  (sizeof(prologue) + sizeof(epilogue) + 10 /* 2x call */ +
-                  MAX(3 /* default stub */, IOEMUL_QUIRK_STUB_BYTES)));
+                  MAX(3 /* default stub */, IOEMUL_QUIRK_STUB_BYTES) +
+                  1 /* ret */));
     /* Runtime confirmation that we haven't clobbered an adjacent stub. */
     BUG_ON(STUB_BUF_SIZE / 2 < (p - ctxt->io_emul_stub));
 
diff --git a/xen/arch/x86/x86_emulate/fpu.c b/xen/arch/x86/x86_emulate/fpu.c
index 480d87965705..03612d00a2ce 100644
--- a/xen/arch/x86/x86_emulate/fpu.c
+++ b/xen/arch/x86/x86_emulate/fpu.c
@@ -32,36 +32,42 @@ static inline bool fpu_check_write(void)
 
 #define emulate_fpu_insn_memdst(opc, ext, arg)                          \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     /* ModRM: mod=0, reg=ext, rm=0, i.e. a (%rax) operand */            \
     *insn_bytes = 2;                                                    \
-    memcpy(get_stub(stub),                                              \
-           ((uint8_t[]){ opc, ((ext) & 7) << 3, 0xc3 }), 3);            \
+    memcpy(_p, ((uint8_t[]){ opc, ((ext) & 7) << 3 }), 2); _p += 2;     \
+    place_ret(_p);                                                      \
     invoke_stub("", "", "+m" (arg) : "a" (&(arg)));                     \
     put_stub(stub);                                                     \
 } while (0)
 
 #define emulate_fpu_insn_memsrc(opc, ext, arg)                          \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     /* ModRM: mod=0, reg=ext, rm=0, i.e. a (%rax) operand */            \
-    memcpy(get_stub(stub),                                              \
-           ((uint8_t[]){ opc, ((ext) & 7) << 3, 0xc3 }), 3);            \
+    memcpy(_p, ((uint8_t[]){ opc, ((ext) & 7) << 3 }), 2); _p += 2;     \
+    place_ret(_p);                                                      \
     invoke_stub("", "", "=m" (dummy) : "m" (arg), "a" (&(arg)));        \
     put_stub(stub);                                                     \
 } while (0)
 
 #define emulate_fpu_insn_stub(bytes...)                                 \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     unsigned int nr_ = sizeof((uint8_t[]){ bytes });                    \
-    memcpy(get_stub(stub), ((uint8_t[]){ bytes, 0xc3 }), nr_ + 1);      \
+    memcpy(_p, ((uint8_t[]){ bytes }), nr_); _p += nr_;                 \
+    place_ret(_p);                                                      \
     invoke_stub("", "", "=m" (dummy) : "i" (0));                        \
     put_stub(stub);                                                     \
 } while (0)
 
 #define emulate_fpu_insn_stub_eflags(bytes...)                          \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     unsigned int nr_ = sizeof((uint8_t[]){ bytes });                    \
     unsigned long tmp_;                                                 \
-    memcpy(get_stub(stub), ((uint8_t[]){ bytes, 0xc3 }), nr_ + 1);      \
+    memcpy(_p, ((uint8_t[]){ bytes }), nr_); _p += nr_;                 \
+    place_ret(_p);                                                      \
     invoke_stub(_PRE_EFLAGS("[eflags]", "[mask]", "[tmp]"),             \
                 _POST_EFLAGS("[eflags]", "[mask]", "[tmp]"),            \
                 [eflags] "+g" (regs->eflags), [tmp] "=&r" (tmp_)        \
diff --git a/xen/arch/x86/x86_emulate/x86_emulate.c b/xen/arch/x86/x86_emulate/x86_emulate.c
index ef2598d4ca3f..dd5b65afe064 100644
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -1396,7 +1396,7 @@ x86_emulate(
         stb[3] = 0x91;
         stb[4] = evex.opmsk << 3;
         insn_bytes = 5;
-        stb[5] = 0xc3;
+        place_ret(&stb[5]);
 
         invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
 
@@ -3627,7 +3627,7 @@ x86_emulate(
         }
         opc[1] = (modrm & 0x38) | 0xc0;
         insn_bytes = EVEX_PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_EVEX(opc, evex);
         invoke_stub("", "", "=g" (dummy) : "a" (src.val));
@@ -3694,7 +3694,7 @@ x86_emulate(
             insn_bytes = PFX_BYTES + 2;
             copy_REX_VEX(opc, rex_prefix, vex);
         }
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         ea.reg = decode_gpr(&_regs, modrm_reg);
         invoke_stub("", "", "=a" (*ea.reg) : "c" (mmvalp), "m" (*mmvalp));
@@ -3768,7 +3768,7 @@ x86_emulate(
             insn_bytes = PFX_BYTES + 2;
             copy_REX_VEX(opc, rex_prefix, vex);
         }
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         _regs.eflags &= ~EFLAGS_MASK;
         invoke_stub("",
@@ -4004,7 +4004,7 @@ x86_emulate(
         opc[1] = modrm & 0xc7;
         insn_bytes = PFX_BYTES + 2;
     simd_0f_to_gpr:
-        opc[insn_bytes - PFX_BYTES] = 0xc3;
+        place_ret(&opc[insn_bytes - PFX_BYTES]);
 
         generate_exception_if(ea.type != OP_REG, X86_EXC_UD);
 
@@ -4401,7 +4401,7 @@ x86_emulate(
             vex.w = 0;
         opc[1] = modrm & 0x38;
         insn_bytes = PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", "+m" (src.val) : "a" (&src.val));
@@ -4438,7 +4438,7 @@ x86_emulate(
             evex.w = 0;
         opc[1] = modrm & 0x38;
         insn_bytes = EVEX_PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_EVEX(opc, evex);
         invoke_stub("", "", "+m" (src.val) : "a" (&src.val));
@@ -4633,7 +4633,7 @@ x86_emulate(
 #endif /* X86EMUL_NO_SIMD */
 
     simd_0f_reg_only:
-        opc[insn_bytes - PFX_BYTES] = 0xc3;
+        place_ret(&opc[insn_bytes - PFX_BYTES]);
 
         copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", [dummy_out] "=g" (dummy) : [dummy_in] "i" (0) );
@@ -4967,7 +4967,7 @@ x86_emulate(
         if ( !mode_64bit() )
             vex.w = 0;
         opc[1] = modrm & 0xf8;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_VEX(opc, vex);
         ea.reg = decode_gpr(&_regs, modrm_rm);
@@ -5010,7 +5010,7 @@ x86_emulate(
         if ( !mode_64bit() )
             vex.w = 0;
         opc[1] = modrm & 0xc7;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_VEX(opc, vex);
         invoke_stub("", "", "=a" (dst.val) : [dummy] "i" (0));
@@ -5040,7 +5040,7 @@ x86_emulate(
         opc = init_prefixes(stub);
         opc[0] = b;
         opc[1] = modrm;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_VEX(opc, vex);
         _regs.eflags &= ~EFLAGS_MASK;
@@ -5608,7 +5608,7 @@ x86_emulate(
         if ( !mode_64bit() )
             vex.w = 0;
         opc[1] = modrm & 0xc7;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", "=a" (ea.val) : [dummy] "i" (0));
@@ -5726,7 +5726,7 @@ x86_emulate(
             opc[1] &= 0x38;
         }
         insn_bytes = PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
         if ( vex.opcx == vex_none )
         {
             /* Cover for extra prefix byte. */
@@ -6006,7 +6006,7 @@ x86_emulate(
         pvex->b = !mode_64bit() || (vex.reg >> 3);
         opc[1] = 0xc0 | (~vex.reg & 7);
         pvex->reg = 0xf;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=a" (ea.val) : [dummy] "i" (0));
         put_stub(stub);
@@ -6290,7 +6290,7 @@ x86_emulate(
             evex.w = 0;
         opc[1] = modrm & 0xf8;
         insn_bytes = EVEX_PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_EVEX(opc, evex);
         invoke_stub("", "", "=g" (dummy) : "a" (src.val));
@@ -6389,7 +6389,7 @@ x86_emulate(
         pvex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pvex->reg = 0xf;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
 
@@ -6459,7 +6459,7 @@ x86_emulate(
         pvex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pvex->reg = 0xf;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "+m" (*mmvalp) : "a" (mmvalp));
 
@@ -6515,7 +6515,7 @@ x86_emulate(
         pevex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pevex->RX = 1;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
 
@@ -6580,7 +6580,7 @@ x86_emulate(
         pevex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pevex->RX = 1;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "+m" (*mmvalp) : "a" (mmvalp));
 
@@ -6594,7 +6594,7 @@ x86_emulate(
         opc[2] = 0x90;
         /* Use (%rax) as source. */
         opc[3] = evex.opmsk << 3;
-        opc[4] = 0xc3;
+        place_ret(&opc[4]);
 
         invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
         put_stub(stub);
@@ -6688,7 +6688,7 @@ x86_emulate(
         pevex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pevex->RX = 1;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
 
@@ -6766,7 +6766,7 @@ x86_emulate(
         opc[2] = 0x90;
         /* Use (%rax) as source. */
         opc[3] = evex.opmsk << 3;
-        opc[4] = 0xc3;
+        place_ret(&opc[4]);
 
         invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
         put_stub(stub);
@@ -6848,7 +6848,7 @@ x86_emulate(
         pevex->r = !mode_64bit() || !(state->sib_index & 0x08);
         pevex->R = !mode_64bit() || !(state->sib_index & 0x10);
         pevex->RX = 1;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=m" (index) : "a" (&index));
         put_stub(stub);
@@ -7026,7 +7026,7 @@ x86_emulate(
         pvex->reg = 0xf; /* rAX */
         buf[3] = b;
         buf[4] = 0x09; /* reg=rCX r/m=(%rCX) */
-        buf[5] = 0xc3;
+        place_ret(&buf[5]);
 
         src.reg = decode_vex_gpr(vex.reg, &_regs, ctxt);
         emulate_stub([dst] "=&c" (dst.val), "[dst]" (&src.val), "a" (*src.reg));
@@ -7062,7 +7062,7 @@ x86_emulate(
         pvex->reg = 0xf; /* rAX */
         buf[3] = b;
         buf[4] = (modrm & 0x38) | 0x01; /* r/m=(%rCX) */
-        buf[5] = 0xc3;
+        place_ret(&buf[5]);
 
         dst.reg = decode_vex_gpr(vex.reg, &_regs, ctxt);
         emulate_stub("=&a" (dst.val), "c" (&src.val));
@@ -7303,7 +7303,7 @@ x86_emulate(
             evex.w = vex.w = 0;
         opc[1] = modrm & 0x38;
         opc[2] = imm1;
-        opc[3] = 0xc3;
+        place_ret(&opc[3]);
         if ( vex.opcx == vex_none )
         {
             /* Cover for extra prefix byte. */
@@ -7470,7 +7470,7 @@ x86_emulate(
             insn_bytes = PFX_BYTES + 3;
             copy_VEX(opc, vex);
         }
-        opc[3] = 0xc3;
+        place_ret(&opc[3]);
 
         /* Latch MXCSR - we may need to restore it below. */
         invoke_stub("stmxcsr %[mxcsr]", "",
@@ -7716,7 +7716,7 @@ x86_emulate(
         }
         opc[2] = imm1;
         insn_bytes = PFX_BYTES + 3;
-        opc[3] = 0xc3;
+        place_ret(&opc[3]);
         if ( vex.opcx == vex_none )
         {
             /* Cover for extra prefix byte. */
@@ -8056,7 +8056,7 @@ x86_emulate(
         pxop->reg = 0xf; /* rAX */
         buf[3] = b;
         buf[4] = (modrm & 0x38) | 0x01; /* r/m=(%rCX) */
-        buf[5] = 0xc3;
+        place_ret(&buf[5]);
 
         dst.reg = decode_vex_gpr(vex.reg, &_regs, ctxt);
         emulate_stub([dst] "=&a" (dst.val), "c" (&src.val));
@@ -8165,7 +8165,7 @@ x86_emulate(
         buf[3] = b;
         buf[4] = 0x09; /* reg=rCX r/m=(%rCX) */
         *(uint32_t *)(buf + 5) = imm1;
-        buf[9] = 0xc3;
+        place_ret(&buf[9]);
 
         emulate_stub([dst] "=&c" (dst.val), "[dst]" (&src.val));
 
@@ -8255,12 +8255,12 @@ x86_emulate(
             BUG();
         if ( evex_encoded() )
         {
-            opc[insn_bytes - EVEX_PFX_BYTES] = 0xc3;
+            place_ret(&opc[insn_bytes - EVEX_PFX_BYTES]);
             copy_EVEX(opc, evex);
         }
         else
         {
-            opc[insn_bytes - PFX_BYTES] = 0xc3;
+            place_ret(&opc[insn_bytes - PFX_BYTES]);
             copy_REX_VEX(opc, rex_prefix, vex);
         }
 
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/thunk: Build Xen with Return Thunks

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.

In order to mitigate this, build with return thunks and arrange for
__x86_return_thunk to be (mis)aligned in the same manner as
__x86_indirect_thunk_* so the RET instruction is placed in a safe location.

place_ret() needs to conditionally emit JMP __x86_return_thunk instead of RET.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
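
A minimal standalone sketch (all addresses here are illustrative, not taken
from the patches) of the displacement arithmetic place_ret() ends up
performing when it must emit a JMP __x86_return_thunk (0xe9 + rel32) rather
than a bare RET: the rel32 is measured from the end of the 5-byte JMP at the
stub's executable alias, not from the writable alias being written to.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Demo addresses only; Xen derives these from this_cpu(stubs.addr). */
        uint64_t stub_va = 0xffff82d040601234ULL; /* executable alias */
        uint64_t thunk   = 0xffff82d040402020ULL; /* __x86_return_thunk */
        int64_t  disp    = (int64_t)(thunk - (stub_va + 5));

        if ( (int32_t)disp != disp )     /* mirrors the BUG_ON in the patch */
            return 1;

        /* Bytes written into the stub: E9, then rel32 little-endian. */
        printf("e9 rel32=%d\n", (int32_t)disp);
        return 0;
    }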

diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
index 1acdffc51c22..9611b8076197 100644
--- a/xen/arch/x86/Kconfig
+++ b/xen/arch/x86/Kconfig
@@ -35,9 +35,14 @@ config ARCH_DEFCONFIG
 	default "arch/x86/configs/x86_64_defconfig"
 
 config CC_HAS_INDIRECT_THUNK
+	# GCC >= 8 or Clang >= 6
 	def_bool $(cc-option,-mindirect-branch-register) || \
 	         $(cc-option,-mretpoline-external-thunk)
 
+config CC_HAS_RETURN_THUNK
+	# GCC >= 8 or Clang >= 15
+	def_bool $(cc-option,-mfunction-return=thunk-extern)
+
 config HAS_AS_CET_SS
 	# binutils >= 2.29 or LLVM >= 6
 	def_bool $(as-instr,wrssq %rax$(comma)0;setssbsy)
diff --git a/xen/arch/x86/Makefile b/xen/arch/x86/Makefile
index 7df877a09d3f..fc984d629edb 100644
--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -44,6 +44,7 @@ obj-$(CONFIG_LIVEPATCH) += livepatch.o
 obj-y += msi.o
 obj-y += msr.o
 obj-$(CONFIG_INDIRECT_THUNK) += indirect-thunk.o
+obj-$(CONFIG_RETURN_THUNK) += indirect-thunk.o
 obj-$(CONFIG_PV) += ioport_emulate.o
 obj-y += irq.o
 obj-$(CONFIG_KEXEC) += machine_kexec.o
diff --git a/xen/arch/x86/acpi/wakeup_prot.S b/xen/arch/x86/acpi/wakeup_prot.S
index 66f799339913..97bd676aaee2 100644
--- a/xen/arch/x86/acpi/wakeup_prot.S
+++ b/xen/arch/x86/acpi/wakeup_prot.S
@@ -133,7 +133,7 @@ ENTRY(s3_resume)
         pop     %r12
         pop     %rbx
         pop     %rbp
-        ret
+        RET
 
 .data
         .align 16
diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
index be1957723472..ccb9985a766e 100644
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -137,16 +137,45 @@ void init_or_livepatch add_nops(void *insns, unsigned int len)
     }
 }
 
+void nocall __x86_return_thunk(void);
+
 /*
  * Place a return at @ptr.  @ptr must be in the writable alias of a stub.
  *
+ * When CONFIG_RETURN_THUNK is active, this may be a JMP __x86_return_thunk
+ * instead, depending on the safety of @ptr with respect to Indirect Target
+ * Selection.
+ *
  * Returns the next position to write into the stub.
  */
 void *place_ret(void *ptr)
 {
+    unsigned long addr = (unsigned long)ptr;
     uint8_t *p = ptr;
 
-    *p++ = 0xc3;
+    /*
+     * When Return Thunks are used, if a RET would be unsafe at this location
+     * with respect to Indirect Target Selection (i.e. if addr is in the first
+     * half of a cacheline), insert a JMP __x86_return_thunk instead.
+     *
+     * The displacement needs to be relative to the executable alias of the
+     * stub, not to @ptr which is the writable alias.
+     */
+    if ( IS_ENABLED(CONFIG_RETURN_THUNK) && !(addr & 0x20) )
+    {
+        long stub_va = (this_cpu(stubs.addr) & PAGE_MASK) + (addr & ~PAGE_MASK);
+        long disp = (long)__x86_return_thunk - (stub_va + 5);
+
+        BUG_ON((int32_t)disp != disp);
+
+        *p++ = 0xe9;
+        *(int32_t *)p = disp;
+        p += 4;
+    }
+    else
+    {
+        *p++ = 0xc3;
+    }
 
     return p;
 }
diff --git a/xen/arch/x86/arch.mk b/xen/arch/x86/arch.mk
index 28217c9ace46..21a753b6394e 100644
--- a/xen/arch/x86/arch.mk
+++ b/xen/arch/x86/arch.mk
@@ -46,6 +46,9 @@ CFLAGS-$(CONFIG_CC_IS_GCC) += -fno-jump-tables
 CFLAGS-$(CONFIG_CC_IS_CLANG) += -mretpoline-external-thunk
 endif
 
+# Compile with return thunk support if selected.
+CFLAGS-$(CONFIG_RETURN_THUNK) += -mfunction-return=thunk-extern
+
 # Disable the addition of a .note.gnu.property section to object files when
 # livepatch support is enabled.  The contents of that section can change
 # depending on the instructions used, and livepatch-build-tools doesn't know
diff --git a/xen/arch/x86/bhb-thunk.S b/xen/arch/x86/bhb-thunk.S
index 05f1043df7d0..472da481dd42 100644
--- a/xen/arch/x86/bhb-thunk.S
+++ b/xen/arch/x86/bhb-thunk.S
@@ -23,7 +23,7 @@ ENTRY(clear_bhb_tsx)
 0:      .byte 0xc6, 0xf8, 0             /* xabort $0 */
         int3
 1:
-        ret
+        RET
 
         .size clear_bhb_tsx, . - clear_bhb_tsx
         .type clear_bhb_tsx, @function
diff --git a/xen/arch/x86/clear_page.S b/xen/arch/x86/clear_page.S
index 5b5622cc526f..fd4c8c0b2a28 100644
--- a/xen/arch/x86/clear_page.S
+++ b/xen/arch/x86/clear_page.S
@@ -1,5 +1,6 @@
         .file __FILE__
 
+#include <asm/asm_defns.h>
 #include <asm/page.h>
 
 ENTRY(clear_page_sse2)
@@ -15,7 +16,7 @@ ENTRY(clear_page_sse2)
         jnz     0b
 
         sfence
-        ret
+        RET
 
         .type clear_page_sse2, @function
         .size clear_page_sse2, . - clear_page_sse2
diff --git a/xen/arch/x86/copy_page.S b/xen/arch/x86/copy_page.S
index ddb6e0ebbb6e..184127c11d86 100644
--- a/xen/arch/x86/copy_page.S
+++ b/xen/arch/x86/copy_page.S
@@ -1,5 +1,6 @@
         .file __FILE__
 
+#include <asm/asm_defns.h>
 #include <asm/page.h>
 
 #define src_reg %rsi
@@ -40,7 +41,7 @@ ENTRY(copy_page_sse2)
         movnti  tmp4_reg, 3*WORD_SIZE(dst_reg)
 
         sfence
-        ret
+        RET
 
         .type copy_page_sse2, @function
         .size copy_page_sse2, . - copy_page_sse2
diff --git a/xen/arch/x86/efi/check.c b/xen/arch/x86/efi/check.c
index 9e473faad3c9..23ba30abf330 100644
--- a/xen/arch/x86/efi/check.c
+++ b/xen/arch/x86/efi/check.c
@@ -3,6 +3,9 @@ int __attribute__((__ms_abi__)) test(int i)
     return i;
 }
 
+/* In case -mfunction-return is in use. */
+void __x86_return_thunk(void) {};
+
 /*
  * Populate an array with "addresses" of relocatable and absolute values.
  * This is to probe ld for (a) emitting base relocations at all and (b) not
diff --git a/xen/arch/x86/include/asm/asm-defns.h b/xen/arch/x86/include/asm/asm-defns.h
index 32d6b4491063..97ebe21298a2 100644
--- a/xen/arch/x86/include/asm/asm-defns.h
+++ b/xen/arch/x86/include/asm/asm-defns.h
@@ -58,6 +58,12 @@
     .endif
 .endm
 
+#ifdef CONFIG_RETURN_THUNK
+# define RET jmp __x86_return_thunk
+#else
+# define RET ret
+#endif
+
 #ifdef CONFIG_XEN_IBT
 # define ENDBR64 endbr64
 #else
diff --git a/xen/arch/x86/indirect-thunk.S b/xen/arch/x86/indirect-thunk.S
index e7ef104d3bd3..239cf7dc770b 100644
--- a/xen/arch/x86/indirect-thunk.S
+++ b/xen/arch/x86/indirect-thunk.S
@@ -11,6 +11,9 @@
 
 #include <asm/asm_defns.h>
 
+
+#ifdef CONFIG_INDIRECT_THUNK
+
 .macro IND_THUNK_RETPOLINE reg:req
         call 1f
         int3
@@ -60,3 +63,27 @@ ENTRY(__x86_indirect_thunk_\reg)
 .irp reg, ax, cx, dx, bx, bp, si, di, 8, 9, 10, 11, 12, 13, 14, 15
         GEN_INDIRECT_THUNK reg=r\reg
 .endr
+
+#endif /* CONFIG_INDIRECT_THUNK */
+
+#ifdef CONFIG_RETURN_THUNK
+        .section .text.entry.__x86_return_thunk, "ax", @progbits
+
+        /*
+         * The Indirect Target Selection speculative vulnerability means that
+         * indirect branches (including RETs) are unsafe when in the first
+         * half of a cacheline.  Arrange for them to be in the second half.
+         *
+         * Align to 64, then skip 32.
+         */
+        .balign 64
+        .fill 32, 1, 0xcc
+
+ENTRY(__x86_return_thunk)
+        ret
+        int3 /* Halt straight-line speculation */
+
+        .size __x86_return_thunk, . - __x86_return_thunk
+        .type __x86_return_thunk, @function
+
+#endif /* CONFIG_RETURN_THUNK */
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index ff5d1c9f8634..295d847ea24c 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -131,7 +131,7 @@ static io_emul_stub_t *io_emul_stub_setup(struct priv_op_ctxt *ctxt, u8 opcode,
     BUILD_BUG_ON(STUB_BUF_SIZE / 2 <
                  (sizeof(prologue) + sizeof(epilogue) + 10 /* 2x call */ +
                   MAX(3 /* default stub */, IOEMUL_QUIRK_STUB_BYTES) +
-                  1 /* ret */));
+                  (IS_ENABLED(CONFIG_RETURN_THUNK) ? 5 : 1) /* ret */));
     /* Runtime confirmation that we haven't clobbered an adjacent stub. */
     BUG_ON(STUB_BUF_SIZE / 2 < (p - ctxt->io_emul_stub));
 
diff --git a/xen/arch/x86/pv/gpr_switch.S b/xen/arch/x86/pv/gpr_switch.S
index e7f5bfcd2d03..bf830a78f8ae 100644
--- a/xen/arch/x86/pv/gpr_switch.S
+++ b/xen/arch/x86/pv/gpr_switch.S
@@ -26,12 +26,11 @@ ENTRY(load_guest_gprs)
         movq  UREGS_r15(%rdi), %r15
         movq  UREGS_rcx(%rdi), %rcx
         movq  UREGS_rdi(%rdi), %rdi
-        ret
+        RET
 
         .size load_guest_gprs, . - load_guest_gprs
         .type load_guest_gprs, STT_FUNC
 
-
 /* Save guest GPRs.  Parameter on the stack above the return address. */
 ENTRY(save_guest_gprs)
         pushq %rdi
@@ -51,7 +50,7 @@ ENTRY(save_guest_gprs)
         movq  %rbx, UREGS_rbx(%rdi)
         movq  %rdx, UREGS_rdx(%rdi)
         movq  %rcx, UREGS_rcx(%rdi)
-        ret
+        RET
 
         .size save_guest_gprs, . - save_guest_gprs
         .type save_guest_gprs, STT_FUNC
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 51a66a144e6c..8daa28e1ea97 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -569,6 +569,9 @@ static void __init print_details(enum ind_thunk thunk)
 #ifdef CONFIG_INDIRECT_THUNK
                " INDIRECT_THUNK"
 #endif
+#ifdef CONFIG_RETURN_THUNK
+               " RETURN_THUNK"
+#endif
 #ifdef CONFIG_SHADOW_PAGING
                " SHADOW_PAGING"
 #endif
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index 3c544e9a14d6..dfa1152b6049 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -183,7 +183,7 @@ ENTRY(cr4_pv32_restore)
         mov   %rax, %cr4
         mov   %rax, (%rdx)
         pop   %rdx
-        ret
+        RET
 0:
 #ifndef NDEBUG
         /* Check that _all_ of the bits intended to be set actually are. */
@@ -202,7 +202,7 @@ ENTRY(cr4_pv32_restore)
 #endif
         pop   %rdx
         xor   %eax, %eax
-        ret
+        RET
 
 ENTRY(compat_syscall)
         /* Fix up reported %cs/%ss for compat domains. */
@@ -329,7 +329,7 @@ __UNLIKELY_END(compat_bounce_null_selector)
         xor   %eax, %eax
         mov   %ax,  TRAPBOUNCE_cs(%rdx)
         mov   %al,  TRAPBOUNCE_flags(%rdx)
-        ret
+        RET
 
 .section .fixup,"ax"
 .Lfx13:
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index df3f3b4ea723..ccf058ae5575 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -598,7 +598,7 @@ __UNLIKELY_END(create_bounce_frame_bad_bounce_ip)
         xor   %eax, %eax
         mov   %rax, TRAPBOUNCE_eip(%rdx)
         mov   %al,  TRAPBOUNCE_flags(%rdx)
-        ret
+        RET
 
         .pushsection .fixup, "ax", @progbits
         # Numeric tags below represent the intended overall %rsi adjustment.
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index 8930e14fc40e..b66e708ebf69 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -86,6 +86,7 @@ SECTIONS
        . = ALIGN(PAGE_SIZE);
        _stextentry = .;
        *(.text.entry)
+       *(.text.entry.*)
        . = ALIGN(PAGE_SIZE);
        _etextentry = .;
 
diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index 3361a6d89257..2c103c2c4736 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -127,6 +127,17 @@ config INDIRECT_THUNK
 	  When enabled, indirect branches are implemented using a new construct
 	  called "retpoline" that prevents speculation.
 
+config RETURN_THUNK
+	bool "Out-of-line Returns"
+	depends on CC_HAS_RETURN_THUNK
+	default INDIRECT_THUNK
+	help
+	  Compile Xen with out-of-line returns.
+
+	  This allows Xen to mitigate a variety of speculative vulnerabilities
+	  by choosing a hardware-dependent instruction sequence to implement
+	  function returns safely.
+
 config SPECULATIVE_HARDEN_ARRAY
 	bool "Speculative Array Hardening"
 	default y
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Synthesise ITS_NO to guests on unaffected hardware

It is easier to express feature word 17 in terms of word 16 + [32, 64) as
that's how the layout is given in documentation.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
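
As a quick sketch of that layout (standalone C, not part of the patches):
bit position 16*32 + 62 decomposes into feature word 17, bit 30, i.e.
MSR_ARCH_CAPS bit 62 as named in the documentation.

    #include <stdio.h>

    int main(void)
    {
        unsigned int pos  = 16 * 32 + 62;  /* ITS_NO as defined below */
        unsigned int word = pos / 32;      /* 17: ARCH_CAPS high word */
        unsigned int bit  = pos % 32;      /* 30: bit within that word */
        printf("ITS_NO: word %u, bit %u (ARCH_CAPS bit %u)\n",
               word, bit, pos - 16 * 32);
        return 0;
    }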

diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index 3c57f55de075..1048f16e0c64 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -211,6 +211,7 @@ static inline bool boot_cpu_has(unsigned int feat)
 #define cpu_has_gds_no          boot_cpu_has(X86_FEATURE_GDS_NO)
 #define cpu_has_rfds_no         boot_cpu_has(X86_FEATURE_RFDS_NO)
 #define cpu_has_rfds_clear      boot_cpu_has(X86_FEATURE_RFDS_CLEAR)
+#define cpu_has_its_no          boot_cpu_has(X86_FEATURE_ITS_NO)
 
 /* Synthesized. */
 #define cpu_has_arch_perfmon    boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 8daa28e1ea97..46681d10ae96 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -1781,6 +1781,90 @@ static void __init bhi_calculations(void)
     }
 }
 
+/*
+ * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/indirect-target-selection.html
+ */
+static void __init its_calculations(void)
+{
+    /*
+     * Indirect Target Selection is a Branch Prediction bug whereby certain
+     * indirect branches (including RETs) get predicted using a direct branch
+     * target, rather than a suitable indirect target, bypassing hardware
+     * isolation protections.
+     *
+     * ITS affects Core (but not Atom) processors starting from the
+     * introduction of eIBRS, up to but not including Golden Cove cores
+     * (checked here with BHI_CTRL).
+     *
+     * The ITS_NO feature is not expected to be enumerated by hardware, and is
+     * only for VMMs to synthesise for guests.
+     *
+     * ITS comes in 3 flavours:
+     *
+     *   1) Across-IBPB.  Indirect branches after the IBPB can be controlled
+     *      by direct targets which existed prior to the IBPB.  This is
+     *      addressed in the IPU 2025.1 microcode drop, and has no other
+     *      software interaction.
+     *
+     *   2) Guest/Host.  Indirect branches in the VMM can be controlled by
+     *      direct targets from the guest.  This applies equally to PV guests
+     *      (Ring3) and HVM guests (VMX), and applies to all Skylake-uarch
+     *      cores with eIBRS.
+     *
+     *   3) Intra-mode.  Indirect branches in the VMM can be controlled by
+     *      other execution in the same mode.
+     */
+
+    /*
+     * If we can see ITS_NO, or we're virtualised, do nothing.  We are, or
+     * may migrate to, somewhere unsafe.
+     */
+    if ( cpu_has_its_no || cpu_has_hypervisor )
+        return;
+
+    /* ITS is only known to affect Intel processors at this time. */
+    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL )
+        return;
+
+    /*
+     * ITS does not exist on:
+     *  - non-Family 6 CPUs
+     *  - those without eIBRS
+     *  - those with BHI_CTRL
+     * but we still need to synthesise ITS_NO.
+     */
+    if ( boot_cpu_data.x86 != 6 || !cpu_has_eibrs ||
+         boot_cpu_has(X86_FEATURE_BHI_CTRL) )
+        goto synthesise;
+
+    switch ( boot_cpu_data.x86_model )
+    {
+        /* These Skylake-uarch cores suffer cases #2 and #3. */
+    case INTEL_FAM6_SKYLAKE_X:
+    case INTEL_FAM6_KABYLAKE_L:
+    case INTEL_FAM6_KABYLAKE:
+    case INTEL_FAM6_COMETLAKE:
+    case INTEL_FAM6_COMETLAKE_L:
+        return;
+
+        /* These Sunny/Willow/Cypress Cove cores suffer case #3. */
+    case INTEL_FAM6_ICELAKE_X:
+    case INTEL_FAM6_ICELAKE_D:
+    case INTEL_FAM6_ICELAKE_L:
+    case INTEL_FAM6_TIGERLAKE_L:
+    case INTEL_FAM6_TIGERLAKE:
+    case INTEL_FAM6_ROCKETLAKE:
+        return;
+
+    default:
+        break;
+    }
+
+    /* Platforms remaining are not believed to be vulnerable to ITS. */
+ synthesise:
+    setup_force_cpu_cap(X86_FEATURE_ITS_NO);
+}
+
 void spec_ctrl_init_domain(struct domain *d)
 {
     bool pv = is_pv_domain(d);
@@ -2331,6 +2415,8 @@ void __init init_speculation_mitigations(void)
 
     bhi_calculations();
 
+    its_calculations();
+
     print_details(thunk);
 
     /*
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index 603a448cb3d2..6fa77774091d 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -344,7 +344,8 @@ XEN_CPUFEATURE(GDS_NO,             16*32+26) /*A  No Gather Data Sampling */
 XEN_CPUFEATURE(RFDS_NO,            16*32+27) /*A  No Register File Data Sampling */
 XEN_CPUFEATURE(RFDS_CLEAR,         16*32+28) /*!A Register File(s) cleared by VERW */
 
-/* Intel-defined CPU features, MSR_ARCH_CAPS 0x10a.edx, word 17 */
+/* Intel-defined CPU features, MSR_ARCH_CAPS 0x10a.edx, word 17 (express in terms of word 16) */
+XEN_CPUFEATURE(ITS_NO,             16*32+62) /*!A No Indirect Target Selection */
 
 #endif /* XEN_CPUFEATURE */
 
diff --git a/xen/tools/gen-cpuid.py b/xen/tools/gen-cpuid.py
index 415d644db519..61d236058a78 100755
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -51,7 +51,7 @@ def parse_definitions(state):
         r"\s+/\*([\w!]*) .*$")
 
     word_regex = re.compile(
-        r"^/\* .* word (\d*) \*/$")
+        r"^/\* .* word (\d*) .*\*/$")
     last_word = -1
 
     this = sys.modules[__name__]
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/alternative: Support replacements when a feature is not present

Use the top bit of a->cpuid to express inverted polarity.  This requires
stripping the top bit back out when performing the sanity checks.

Despite it only being used once, create a replace boolean to express the
decision more clearly in _apply_alternatives().

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
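
A minimal sketch of the resulting polarity logic (standalone C mirroring the
hunk below; should_replace() is a hypothetical name):

    #include <stdbool.h>
    #include <stdio.h>

    #define ALT_FLAG_NOT (1 << 15)
    #define ALT_NOT(x)   (ALT_FLAG_NOT | (x))

    static bool should_replace(unsigned int cpuid, bool cpu_has_feat)
    {
        bool inv = cpuid & ALT_FLAG_NOT;  /* top bit = inverted polarity */
        return cpu_has_feat ^ inv;        /* XOR flips the decision */
    }

    int main(void)
    {
        unsigned int feat = 30;                               /* arbitrary */
        printf("%d\n", should_replace(feat, true));           /* 1 */
        printf("%d\n", should_replace(feat, false));          /* 0 */
        printf("%d\n", should_replace(ALT_NOT(feat), true));  /* 0 */
        printf("%d\n", should_replace(ALT_NOT(feat), false)); /* 1 */
        return 0;
    }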

diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
index 1ba35cb9ede9..88c90044c20d 100644
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -197,6 +197,8 @@ static int init_or_livepatch _apply_alternatives(struct alt_instr *start,
         uint8_t *repl = ALT_REPL_PTR(a);
         uint8_t buf[MAX_PATCH_LEN];
         unsigned int total_len = a->orig_len + a->pad_len;
+        unsigned int feat = a->cpuid & ~ALT_FLAG_NOT;
+        bool inv = a->cpuid & ALT_FLAG_NOT, replace;
 
         if ( a->repl_len > total_len )
         {
@@ -214,11 +216,11 @@ static int init_or_livepatch _apply_alternatives(struct alt_instr *start,
             return -ENOSPC;
         }
 
-        if ( a->cpuid >= NCAPINTS * 32 )
+        if ( feat >= NCAPINTS * 32 )
         {
              printk(XENLOG_ERR
                    "Alt for %ps, feature %#x outside of featureset range %#x\n",
-                   ALT_ORIG_PTR(a), a->cpuid, NCAPINTS * 32);
+                   ALT_ORIG_PTR(a), feat, NCAPINTS * 32);
             return -ERANGE;
         }
 
@@ -243,8 +245,14 @@ static int init_or_livepatch _apply_alternatives(struct alt_instr *start,
             continue;
         }
 
+        /*
+         * Should a replacement be performed?  Most replacements have positive
+         * polarity, but we support negative polarity too.
+         */
+        replace = boot_cpu_has(feat) ^ inv;
+
         /* If there is no replacement to make, see about optimising the nops. */
-        if ( !boot_cpu_has(a->cpuid) )
+        if ( !replace )
         {
             /* Origin site already touched?  Don't nop anything. */
             if ( base->priv )
diff --git a/xen/arch/x86/include/asm/alternative.h b/xen/arch/x86/include/asm/alternative.h
index 69555d781ef9..89b7bdcb82e5 100644
--- a/xen/arch/x86/include/asm/alternative.h
+++ b/xen/arch/x86/include/asm/alternative.h
@@ -1,6 +1,13 @@
 #ifndef __X86_ALTERNATIVE_H__
 #define __X86_ALTERNATIVE_H__
 
+/*
+ * Common to both C and ASM.  Express a replacement when a feature is not
+ * available.
+ */
+#define ALT_FLAG_NOT (1 << 15)
+#define ALT_NOT(x) (ALT_FLAG_NOT | (x))
+
 #ifdef __ASSEMBLY__
 #include <asm/alternative-asm.h>
 #else
@@ -11,7 +18,7 @@
 struct __packed alt_instr {
     int32_t  orig_offset;   /* original instruction */
     int32_t  repl_offset;   /* offset to replacement instruction */
-    uint16_t cpuid;         /* cpuid bit set for replacement */
+    uint16_t cpuid;         /* cpuid bit set for replacement (top bit is polarity) */
     uint8_t  orig_len;      /* length of original instruction */
     uint8_t  repl_len;      /* length of new instruction */
     uint8_t  pad_len;       /* length of build-time padding */

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/guest: Remove use of the Xen hypercall_page

In order to protect against ITS, Xen needs to start using return thunks.
Therefore the advice in XSA-466 becomes relevant, and the hypercall_page needs
to be removed.

Implement early_hypercall(), with infrastructure to figure out the correct
instruction on first use.  Use ALTERNATIVE()s to inline the hypercall
instruction, including the ALT_NOT() form so we only need a single synthetic
feature bit.

No overall change.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
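
A rough sketch of the three-state dispatch that early_hypercall_insn encodes
(standalone C under assumed values; the real logic is in assembly and gets
patched out by alternatives later in boot): negative means not yet set up,
zero means VMCALL (Intel-like vendors), positive means VMMCALL (AMD-like).

    #include <stdint.h>
    #include <stdio.h>

    static int8_t early_hypercall_insn = -1;   /* matches the patch */

    static const char *early_hypercall_sketch(int vendor_is_amd)
    {
        if ( early_hypercall_insn < 0 )        /* first use: set up */
            early_hypercall_insn = vendor_is_amd ? 1 : 0;
        return early_hypercall_insn > 0 ? "vmmcall" : "vmcall";
    }

    int main(void)
    {
        printf("%s\n", early_hypercall_sketch(0));  /* vmcall */
        return 0;
    }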

diff --git a/xen/arch/x86/guest/xen/Makefile b/xen/arch/x86/guest/xen/Makefile
index 26fb4b1007c0..8b3250aa8886 100644
--- a/xen/arch/x86/guest/xen/Makefile
+++ b/xen/arch/x86/guest/xen/Makefile
@@ -1,4 +1,4 @@
-obj-y += hypercall_page.o
+obj-bin-y += hypercall.init.o
 obj-y += xen.o
 
 obj-bin-$(CONFIG_PVH_GUEST) += pvh-boot.init.o
diff --git a/xen/arch/x86/guest/xen/hypercall.S b/xen/arch/x86/guest/xen/hypercall.S
new file mode 100644
index 000000000000..05e429794cc4
--- /dev/null
+++ b/xen/arch/x86/guest/xen/hypercall.S
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#include <xen/linkage.h>
+
+        .section .init.text, "ax", @progbits
+
+        /*
+         * Used during early boot, before alternatives have run and inlined
+         * the appropriate instruction.  Called using the hypercall ABI.
+         */
+FUNC(early_hypercall)
+        cmpb    $0, early_hypercall_insn(%rip)
+        jl      .L_setup
+        je      1f
+
+        vmmcall
+        ret
+
+1:      vmcall
+        ret
+
+.L_setup:
+        /*
+         * When setting up the first time around, all registers need
+         * preserving.  Save the non-callee-saved ones.
+         */
+        push    %r11
+        push    %r10
+        push    %r9
+        push    %r8
+        push    %rdi
+        push    %rsi
+        push    %rdx
+        push    %rcx
+        push    %rax
+
+        call    early_hypercall_setup
+
+        pop     %rax
+        pop     %rcx
+        pop     %rdx
+        pop     %rsi
+        pop     %rdi
+        pop     %r8
+        pop     %r9
+        pop     %r10
+        pop     %r11
+
+        jmp     early_hypercall
+END(early_hypercall)
diff --git a/xen/arch/x86/guest/xen/hypercall_page.S b/xen/arch/x86/guest/xen/hypercall_page.S
deleted file mode 100644
index 7ab55fc1f6e6..000000000000
--- a/xen/arch/x86/guest/xen/hypercall_page.S
+++ /dev/null
@@ -1,76 +0,0 @@
-#include <asm/page.h>
-#include <asm/asm_defns.h>
-#include <public/xen.h>
-
-        .section ".text.page_aligned", "ax", @progbits
-
-DATA(hypercall_page, PAGE_SIZE)
-         /* Poisoned with `ret` for safety before hypercalls are set up. */
-        .fill PAGE_SIZE, 1, 0xc3
-END(hypercall_page)
-
-/*
- * Identify a specific hypercall in the hypercall page
- * @param name Hypercall name.
- */
-#define DECLARE_HYPERCALL(name)                                                 \
-        .globl HYPERCALL_ ## name;                                              \
-        .type  HYPERCALL_ ## name, STT_FUNC;                                    \
-        .size  HYPERCALL_ ## name, 32;                                          \
-        .set   HYPERCALL_ ## name, hypercall_page + __HYPERVISOR_ ## name * 32
-
-DECLARE_HYPERCALL(set_trap_table)
-DECLARE_HYPERCALL(mmu_update)
-DECLARE_HYPERCALL(set_gdt)
-DECLARE_HYPERCALL(stack_switch)
-DECLARE_HYPERCALL(set_callbacks)
-DECLARE_HYPERCALL(fpu_taskswitch)
-DECLARE_HYPERCALL(sched_op_compat)
-DECLARE_HYPERCALL(platform_op)
-DECLARE_HYPERCALL(set_debugreg)
-DECLARE_HYPERCALL(get_debugreg)
-DECLARE_HYPERCALL(update_descriptor)
-DECLARE_HYPERCALL(memory_op)
-DECLARE_HYPERCALL(multicall)
-DECLARE_HYPERCALL(update_va_mapping)
-DECLARE_HYPERCALL(set_timer_op)
-DECLARE_HYPERCALL(event_channel_op_compat)
-DECLARE_HYPERCALL(xen_version)
-DECLARE_HYPERCALL(console_io)
-DECLARE_HYPERCALL(physdev_op_compat)
-DECLARE_HYPERCALL(grant_table_op)
-DECLARE_HYPERCALL(vm_assist)
-DECLARE_HYPERCALL(update_va_mapping_otherdomain)
-DECLARE_HYPERCALL(iret)
-DECLARE_HYPERCALL(vcpu_op)
-DECLARE_HYPERCALL(set_segment_base)
-DECLARE_HYPERCALL(mmuext_op)
-DECLARE_HYPERCALL(xsm_op)
-DECLARE_HYPERCALL(nmi_op)
-DECLARE_HYPERCALL(sched_op)
-DECLARE_HYPERCALL(callback_op)
-DECLARE_HYPERCALL(xenoprof_op)
-DECLARE_HYPERCALL(event_channel_op)
-DECLARE_HYPERCALL(physdev_op)
-DECLARE_HYPERCALL(hvm_op)
-DECLARE_HYPERCALL(sysctl)
-DECLARE_HYPERCALL(domctl)
-DECLARE_HYPERCALL(kexec_op)
-DECLARE_HYPERCALL(argo_op)
-DECLARE_HYPERCALL(xenpmu_op)
-
-DECLARE_HYPERCALL(arch_0)
-DECLARE_HYPERCALL(arch_1)
-DECLARE_HYPERCALL(arch_2)
-DECLARE_HYPERCALL(arch_3)
-DECLARE_HYPERCALL(arch_4)
-DECLARE_HYPERCALL(arch_5)
-DECLARE_HYPERCALL(arch_6)
-DECLARE_HYPERCALL(arch_7)
-
-/*
- * Local variables:
- * tab-width: 8
- * indent-tabs-mode: nil
- * End:
- */
diff --git a/xen/arch/x86/guest/xen/xen.c b/xen/arch/x86/guest/xen/xen.c
index 7484b3f73ad3..2c30db05dfa7 100644
--- a/xen/arch/x86/guest/xen/xen.c
+++ b/xen/arch/x86/guest/xen/xen.c
@@ -26,7 +26,6 @@
 bool __read_mostly xen_guest;
 
 uint32_t __read_mostly xen_cpuid_base;
-extern char hypercall_page[];
 static struct rangeset *mem;
 
 DEFINE_PER_CPU(unsigned int, vcpu_id);
@@ -35,6 +34,50 @@ static struct vcpu_info *vcpu_info;
 static unsigned long vcpu_info_mapped[BITS_TO_LONGS(NR_CPUS)];
 DEFINE_PER_CPU(struct vcpu_info *, vcpu_info);
 
+/*
+ * Which instruction to use for early hypercalls:
+ *   < 0 setup
+ *     0 vmcall
+ *   > 0 vmmcall
+ */
+int8_t __initdata early_hypercall_insn = -1;
+
+/*
+ * Called once during the first hypercall to figure out which instruction to
+ * use.  Error handling options are limited.
+ */
+void asmlinkage __init early_hypercall_setup(void)
+{
+    BUG_ON(early_hypercall_insn != -1);
+
+    if ( !boot_cpu_data.x86_vendor )
+    {
+        unsigned int eax, ebx, ecx, edx;
+
+        cpuid(0, &eax, &ebx, &ecx, &edx);
+
+        boot_cpu_data.x86_vendor = x86_cpuid_lookup_vendor(ebx, ecx, edx);
+    }
+
+    switch ( boot_cpu_data.x86_vendor )
+    {
+    case X86_VENDOR_INTEL:
+    case X86_VENDOR_CENTAUR:
+    case X86_VENDOR_SHANGHAI:
+        early_hypercall_insn = 0;
+        setup_force_cpu_cap(X86_FEATURE_USE_VMCALL);
+        break;
+
+    case X86_VENDOR_AMD:
+    case X86_VENDOR_HYGON:
+        early_hypercall_insn = 1;
+        break;
+
+    default:
+        BUG();
+    }
+}
+
 static void __init find_xen_leaves(void)
 {
     uint32_t eax, ebx, ecx, edx, base;
@@ -337,9 +380,6 @@ const struct hypervisor_ops *__init xg_probe(void)
     if ( !xen_cpuid_base )
         return NULL;
 
-    /* Fill the hypercall page. */
-    wrmsrl(cpuid_ebx(xen_cpuid_base + 2), __pa(hypercall_page));
-
     xen_guest = true;
 
     return &ops;
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index ba3df174b76e..9e3ed21c026d 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -42,6 +42,7 @@ XEN_CPUFEATURE(XEN_SHSTK,         X86_SYNTH(26)) /* Xen uses CET Shadow Stacks *
 XEN_CPUFEATURE(XEN_IBT,           X86_SYNTH(27)) /* Xen uses CET Indirect Branch Tracking */
 XEN_CPUFEATURE(IBPB_ENTRY_PV,     X86_SYNTH(28)) /* MSR_PRED_CMD used by Xen for PV */
 XEN_CPUFEATURE(IBPB_ENTRY_HVM,    X86_SYNTH(29)) /* MSR_PRED_CMD used by Xen for HVM */
+XEN_CPUFEATURE(USE_VMCALL,        X86_SYNTH(30)) /* Use VMCALL instead of VMMCALL */
 
 /* Bug words follow the synthetic words. */
 #define X86_NR_BUG 1
diff --git a/xen/arch/x86/include/asm/guest/xen-hcall.h b/xen/arch/x86/include/asm/guest/xen-hcall.h
index 665b472d05ac..96004dec9909 100644
--- a/xen/arch/x86/include/asm/guest/xen-hcall.h
+++ b/xen/arch/x86/include/asm/guest/xen-hcall.h
@@ -30,9 +30,11 @@
     ({                                                                  \
         long res, tmp__;                                                \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__) ASM_CALL_CONSTRAINT              \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1))                                          \
             : "memory" );                                               \
         (type)res;                                                      \
@@ -42,10 +44,12 @@
     ({                                                                  \
         long res, tmp__;                                                \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__), "=S" (tmp__)                    \
               ASM_CALL_CONSTRAINT                                       \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1)), "2" ((long)(a2))                        \
             : "memory" );                                               \
         (type)res;                                                      \
@@ -55,10 +59,12 @@
     ({                                                                  \
         long res, tmp__;                                                \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__), "=S" (tmp__), "=d" (tmp__)      \
               ASM_CALL_CONSTRAINT                                       \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1)), "2" ((long)(a2)), "3" ((long)(a3))      \
             : "memory" );                                               \
         (type)res;                                                      \
@@ -69,10 +75,12 @@
         long res, tmp__;                                                \
         register long _a4 asm ("r10") = ((long)(a4));                   \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__), "=S" (tmp__), "=d" (tmp__),     \
               "=&r" (tmp__) ASM_CALL_CONSTRAINT                         \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1)), "2" ((long)(a2)), "3" ((long)(a3)),     \
               "4" (_a4)                                                 \
             : "memory" );                                               \
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/thunk: (Mis)align __x86_indirect_thunk_* to mitigate ITS

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.

Arrange for __x86_indirect_thunk_* to always be in the second half.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
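
The effect of the alignment can be checked with a few lines of arithmetic
(standalone C; the base address is an arbitrary example): after ".balign 64"
plus a 32-byte 0xcc fill, every thunk entry point sits at offset 32 within
its cacheline, i.e. in the ITS-safe second half.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t base  = 0xffff82d040400123ULL;       /* arbitrary */
        uint64_t entry = (base + 63) / 64 * 64 + 32;  /* balign 64; skip 32 */
        printf("offset in cacheline: %u\n",
               (unsigned int)(entry % 64));           /* 32: second half */
        return 0;
    }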

diff --git a/xen/arch/x86/indirect-thunk.S b/xen/arch/x86/indirect-thunk.S
index fd5493c22b16..c4b978d67b8e 100644
--- a/xen/arch/x86/indirect-thunk.S
+++ b/xen/arch/x86/indirect-thunk.S
@@ -11,6 +11,10 @@
 
 #include <asm/asm_defns.h>
 
+/* Alignment is dealt with explicitly here; override the respective macro. */
+#undef SYM_ALIGN
+#define SYM_ALIGN(align...)
+
 .macro IND_THUNK_RETPOLINE reg:req
         call 1f
         int3
@@ -35,6 +39,16 @@
 .macro GEN_INDIRECT_THUNK reg:req
         .section .text.__x86_indirect_thunk_\reg, "ax", @progbits
 
+        /*
+         * The Indirect Target Selection speculative vulnerability means that
+         * indirect branches (including RETs) are unsafe when in the first
+         * half of a cacheline.  Arrange for them to be in the second half.
+         *
+         * Align to 64, then skip 32.
+         */
+        .balign 64
+        .fill 32, 1, 0xcc
+
 FUNC(__x86_indirect_thunk_\reg)
         ALTERNATIVE_2 __stringify(IND_THUNK_RETPOLINE \reg),              \
         __stringify(IND_THUNK_LFENCE \reg), X86_FEATURE_IND_THUNK_LFENCE, \
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/thunk: (Mis)align the RETs in clear_bhb_loops() to mitigate ITS

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.

clear_bhb_loops() has a precise layout of branches.  The alignment for
performance causes the RETs to always be in an unsafe position, and converting
them to return thunks would change the branching pattern.  While such a
conversion is believed to be safe, clear_bhb_loops() is also a
performance-relevant fastpath, so (mis)align the RETs to be in a safe position.

No functional change.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
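
A small sketch of what the ".skip 32 - (label delta)" expressions compute
(standalone C with an assumed 5-byte body, as for the CALL preceding .Lr1):
the pad is sized so that the RET itself, rather than the start of the block,
lands at offset 32 of the cacheline.

    #include <stdio.h>

    int main(void)
    {
        unsigned int body = 5;           /* bytes between label and ret */
        unsigned int pad  = 32 - body;   /* the .skip expression */
        printf("ret at cacheline offset %u\n", pad + body);  /* 32 */
        return 0;
    }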

diff --git a/xen/arch/x86/bhb-thunk.S b/xen/arch/x86/bhb-thunk.S
index 678c00c5d06f..52625f4e2c17 100644
--- a/xen/arch/x86/bhb-thunk.S
+++ b/xen/arch/x86/bhb-thunk.S
@@ -50,7 +50,12 @@ END(clear_bhb_tsx)
  *   ret
  *
  * The CALL/RETs are necessary to prevent the Loop Stream Detector from
- * interfering.  The alignment is for performance and not safety.
+ * interfering.
+ *
+ * The .balign's are for performance, but they cause the RETs to be in unsafe
+ * positions with respect to Indirect Target Selection.  The .skips are to
+ * move the RETs into ITS-safe positions, rather than using the slowpath
+ * through __x86_return_thunk.
  *
  * The "short" sequence (5 and 5) is for CPUs prior to Alder Lake / Sapphire
  * Rapids (i.e. Cores prior to Golden Cove and/or Gracemont).
@@ -66,12 +71,14 @@ FUNC(clear_bhb_loops)
         jmp     5f
         int3
 
-        .align 64
+        .balign 64
+        .skip   32 - (.Lr1 - 1f), 0xcc
 1:      call    2f
-        ret
+.Lr1:   ret
         int3
 
-        .align 64
+        .balign 64
+        .skip   32 - 18 /* (.Lr2 - 2f) but Clang IAS doesn't like this */, 0xcc
 2:      ALTERNATIVE "mov $5, %eax", "mov $7, %eax", X86_SPEC_BHB_LOOPS_LONG
 
 3:      jmp     4f
@@ -83,7 +90,7 @@ FUNC(clear_bhb_loops)
         sub     $1, %ecx
         jnz     1b
 
-        ret
+.Lr2:   ret
 5:
         /*
          * The Intel sequence has an LFENCE here.  The purpose is to ensure
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/stubs: Introduce place_ret() to abstract away raw 0xc3's

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.
This means it's not safe for logic using the stubs to write raw 0xc3's.

Introduce place_ret() which, for now, writes a raw 0xc3 but will contain
additional logic when return thunks are in use.

stub_selftest() doesn't strictly need to be converted as it only runs at
boot, but doing so gets us a partial test of place_ret() too.

No functional change.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
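
The contract can be seen in miniature in the test-harness variant below
(x86-emulate.h): write a return at @ptr and hand back the next position to
write, so callers can append further bytes.  A hedged usage sketch:

    #include <stdint.h>
    #include <stdio.h>

    static inline void *place_ret(void *ptr)
    {
        *(uint8_t *)ptr = 0xc3;      /* plain ret, for now */
        return (uint8_t *)ptr + 1;
    }

    int main(void)
    {
        uint8_t stub[8] = { 0x0f, 0xb9, 0x90 };  /* ud1, as in the selftest */
        uint8_t *next = place_ret(&stub[3]);
        printf("next free offset: %d\n", (int)(next - stub));  /* 4 */
        return 0;
    }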

diff --git a/tools/tests/x86_emulator/x86-emulate.h b/tools/tests/x86_emulator/x86-emulate.h
index 8f8accfe3e70..946aaa9d660b 100644
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -68,6 +68,12 @@
 
 #define is_canonical_address(x) (((int64_t)(x) >> 47) == ((int64_t)(x) >> 63))
 
+static inline void *place_ret(void *ptr)
+{
+    *(uint8_t *)ptr = 0xc3;
+    return ptr + 1;
+}
+
 extern uint32_t mxcsr_mask;
 extern struct cpu_policy cp;
 
diff --git a/xen/arch/x86/Makefile b/xen/arch/x86/Makefile
index c1e64278ce85..a7e5a82689de 100644
--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -11,9 +11,7 @@ obj-$(CONFIG_PV) += pv/
 obj-y += x86_64/
 obj-y += x86_emulate/
 
-alternative-y := alternative.init.o
-alternative-$(CONFIG_LIVEPATCH) :=
-obj-bin-y += $(alternative-y)
+obj-y += alternative.o
 obj-y += apic.o
 obj-y += bhb-thunk.o
 obj-y += bitops.o
@@ -41,7 +39,7 @@ obj-y += hypercall.o
 obj-y += i387.o
 obj-y += i8259.o
 obj-y += io_apic.o
-obj-$(CONFIG_LIVEPATCH) += alternative.o livepatch.o
+obj-$(CONFIG_LIVEPATCH) += livepatch.o
 obj-y += msi.o
 obj-y += msr.o
 obj-$(CONFIG_INDIRECT_THUNK) += indirect-thunk.o
diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
index 88c90044c20d..ec451d962c10 100644
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -137,6 +137,20 @@ void init_or_livepatch add_nops(void *insns, unsigned int len)
     }
 }
 
+/*
+ * Place a return at @ptr.  @ptr must be in the writable alias of a stub.
+ *
+ * Returns the next position to write into the stub.
+ */
+void *place_ret(void *ptr)
+{
+    uint8_t *p = ptr;
+
+    *p++ = 0xc3;
+
+    return p;
+}
+
 /*
  * text_poke - Update instructions on a live kernel or non-executed code.
  * @addr: address to modify
diff --git a/xen/arch/x86/extable.c b/xen/arch/x86/extable.c
index 705cf9eb94ca..1572efa69a00 100644
--- a/xen/arch/x86/extable.c
+++ b/xen/arch/x86/extable.c
@@ -151,20 +151,20 @@ search_exception_table(const struct cpu_user_regs *regs, unsigned long *stub_ra)
 int __init cf_check stub_selftest(void)
 {
     static const struct {
-        uint8_t opc[8];
+        uint8_t opc[7];
         uint64_t rax;
         union stub_exception_token res;
     } tests[] __initconst = {
 #define endbr64 0xf3, 0x0f, 0x1e, 0xfa
-        { .opc = { endbr64, 0x0f, 0xb9, 0xc3, 0xc3 }, /* ud1 */
+        { .opc = { endbr64, 0x0f, 0xb9, 0x90 }, /* ud1 */
           .res.fields.trapnr = X86_EXC_UD },
-        { .opc = { endbr64, 0x90, 0x02, 0x00, 0xc3 }, /* nop; add (%rax),%al */
+        { .opc = { endbr64, 0x90, 0x02, 0x00 }, /* nop; add (%rax),%al */
           .rax = 0x0123456789abcdef,
           .res.fields.trapnr = X86_EXC_GP },
-        { .opc = { endbr64, 0x02, 0x04, 0x04, 0xc3 }, /* add (%rsp,%rax),%al */
+        { .opc = { endbr64, 0x02, 0x04, 0x04 }, /* add (%rsp,%rax),%al */
           .rax = 0xfedcba9876543210UL,
           .res.fields.trapnr = X86_EXC_SS },
-        { .opc = { endbr64, 0xcc, 0xc3, 0xc3, 0xc3 }, /* int3 */
+        { .opc = { endbr64, 0xcc, 0x90, 0x90 }, /* int3 */
           .res.fields.trapnr = X86_EXC_BP },
 #undef endbr64
     };
@@ -183,6 +183,7 @@ int __init cf_check stub_selftest(void)
 
         memset(ptr, 0xcc, STUB_BUF_SIZE / 2);
         memcpy(ptr, tests[i].opc, ARRAY_SIZE(tests[i].opc));
+        place_ret(ptr + ARRAY_SIZE(tests[i].opc));
         unmap_domain_page(ptr);
 
         asm volatile ( "INDIRECT_CALL %[stb]\n"
diff --git a/xen/arch/x86/include/asm/alternative.h b/xen/arch/x86/include/asm/alternative.h
index 89b7bdcb82e5..841a63ebf1b6 100644
--- a/xen/arch/x86/include/asm/alternative.h
+++ b/xen/arch/x86/include/asm/alternative.h
@@ -30,6 +30,8 @@ struct __packed alt_instr {
 #define ALT_REPL_PTR(a)     __ALT_PTR(a, repl_offset)
 
 extern void add_nops(void *insns, unsigned int len);
+void *place_ret(void *ptr);
+
 /* Similar to alternative_instructions except it can be run with IRQs enabled. */
 extern int apply_alternatives(struct alt_instr *start, struct alt_instr *end);
 extern void alternative_instructions(void);
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index 70150c272276..ff5d1c9f8634 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -76,7 +76,6 @@ static io_emul_stub_t *io_emul_stub_setup(struct priv_op_ctxt *ctxt, u8 opcode,
         0x41, 0x5c, /* pop %r12  */
         0x5d,       /* pop %rbp  */
         0x5b,       /* pop %rbx  */
-        0xc3,       /* ret       */
     };
 
     const struct stubs *this_stubs = &this_cpu(stubs);
@@ -126,11 +125,13 @@ static io_emul_stub_t *io_emul_stub_setup(struct priv_op_ctxt *ctxt, u8 opcode,
 
     APPEND_CALL(save_guest_gprs);
     APPEND_BUFF(epilogue);
+    p = place_ret(p);
 
     /* Build-time best effort attempt to catch problems. */
     BUILD_BUG_ON(STUB_BUF_SIZE / 2 <
                  (sizeof(prologue) + sizeof(epilogue) + 10 /* 2x call */ +
-                  MAX(3 /* default stub */, IOEMUL_QUIRK_STUB_BYTES)));
+                  MAX(3 /* default stub */, IOEMUL_QUIRK_STUB_BYTES) +
+                  1 /* ret */));
     /* Runtime confirmation that we haven't clobbered an adjacent stub. */
     BUG_ON(STUB_BUF_SIZE / 2 < (p - ctxt->io_emul_stub));
 
diff --git a/xen/arch/x86/x86_emulate/fpu.c b/xen/arch/x86/x86_emulate/fpu.c
index 480d87965705..03612d00a2ce 100644
--- a/xen/arch/x86/x86_emulate/fpu.c
+++ b/xen/arch/x86/x86_emulate/fpu.c
@@ -32,36 +32,42 @@ static inline bool fpu_check_write(void)
 
 #define emulate_fpu_insn_memdst(opc, ext, arg)                          \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     /* ModRM: mod=0, reg=ext, rm=0, i.e. a (%rax) operand */            \
     *insn_bytes = 2;                                                    \
-    memcpy(get_stub(stub),                                              \
-           ((uint8_t[]){ opc, ((ext) & 7) << 3, 0xc3 }), 3);            \
+    memcpy(_p, ((uint8_t[]){ opc, ((ext) & 7) << 3 }), 2); _p += 2;     \
+    place_ret(_p);                                                      \
     invoke_stub("", "", "+m" (arg) : "a" (&(arg)));                     \
     put_stub(stub);                                                     \
 } while (0)
 
 #define emulate_fpu_insn_memsrc(opc, ext, arg)                          \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     /* ModRM: mod=0, reg=ext, rm=0, i.e. a (%rax) operand */            \
-    memcpy(get_stub(stub),                                              \
-           ((uint8_t[]){ opc, ((ext) & 7) << 3, 0xc3 }), 3);            \
+    memcpy(_p, ((uint8_t[]){ opc, ((ext) & 7) << 3 }), 2); _p += 2;     \
+    place_ret(_p);                                                      \
     invoke_stub("", "", "=m" (dummy) : "m" (arg), "a" (&(arg)));        \
     put_stub(stub);                                                     \
 } while (0)
 
 #define emulate_fpu_insn_stub(bytes...)                                 \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     unsigned int nr_ = sizeof((uint8_t[]){ bytes });                    \
-    memcpy(get_stub(stub), ((uint8_t[]){ bytes, 0xc3 }), nr_ + 1);      \
+    memcpy(_p, ((uint8_t[]){ bytes }), nr_); _p += nr_;                 \
+    place_ret(_p);                                                      \
     invoke_stub("", "", "=m" (dummy) : "i" (0));                        \
     put_stub(stub);                                                     \
 } while (0)
 
 #define emulate_fpu_insn_stub_eflags(bytes...)                          \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     unsigned int nr_ = sizeof((uint8_t[]){ bytes });                    \
     unsigned long tmp_;                                                 \
-    memcpy(get_stub(stub), ((uint8_t[]){ bytes, 0xc3 }), nr_ + 1);      \
+    memcpy(_p, ((uint8_t[]){ bytes }), nr_); _p += nr_;                 \
+    place_ret(_p);                                                      \
     invoke_stub(_PRE_EFLAGS("[eflags]", "[mask]", "[tmp]"),             \
                 _POST_EFLAGS("[eflags]", "[mask]", "[tmp]"),            \
                 [eflags] "+g" (regs->eflags), [tmp] "=&r" (tmp_)        \
diff --git a/xen/arch/x86/x86_emulate/x86_emulate.c b/xen/arch/x86/x86_emulate/x86_emulate.c
index b1d192cbbf1e..f40709682484 100644
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -1396,7 +1396,7 @@ x86_emulate(
         stb[3] = 0x91;
         stb[4] = evex.opmsk << 3;
         insn_bytes = 5;
-        stb[5] = 0xc3;
+        place_ret(&stb[5]);
 
         invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
 
@@ -3627,7 +3627,7 @@ x86_emulate(
         }
         opc[1] = (modrm & 0x38) | 0xc0;
         insn_bytes = EVEX_PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_EVEX(opc, evex);
         invoke_stub("", "", "=g" (dummy) : "a" (src.val));
@@ -3694,7 +3694,7 @@ x86_emulate(
             insn_bytes = PFX_BYTES + 2;
             copy_REX_VEX(opc, rex_prefix, vex);
         }
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         ea.reg = decode_gpr(&_regs, modrm_reg);
         invoke_stub("", "", "=a" (*ea.reg) : "c" (mmvalp), "m" (*mmvalp));
@@ -3768,7 +3768,7 @@ x86_emulate(
             insn_bytes = PFX_BYTES + 2;
             copy_REX_VEX(opc, rex_prefix, vex);
         }
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         _regs.eflags &= ~EFLAGS_MASK;
         invoke_stub("",
@@ -4004,7 +4004,7 @@ x86_emulate(
         opc[1] = modrm & 0xc7;
         insn_bytes = PFX_BYTES + 2;
     simd_0f_to_gpr:
-        opc[insn_bytes - PFX_BYTES] = 0xc3;
+        place_ret(&opc[insn_bytes - PFX_BYTES]);
 
         generate_exception_if(ea.type != OP_REG, X86_EXC_UD);
 
@@ -4401,7 +4401,7 @@ x86_emulate(
             vex.w = 0;
         opc[1] = modrm & 0x38;
         insn_bytes = PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", "+m" (src.val) : "a" (&src.val));
@@ -4438,7 +4438,7 @@ x86_emulate(
             evex.w = 0;
         opc[1] = modrm & 0x38;
         insn_bytes = EVEX_PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_EVEX(opc, evex);
         invoke_stub("", "", "+m" (src.val) : "a" (&src.val));
@@ -4633,7 +4633,7 @@ x86_emulate(
 #endif /* X86EMUL_NO_SIMD */
 
     simd_0f_reg_only:
-        opc[insn_bytes - PFX_BYTES] = 0xc3;
+        place_ret(&opc[insn_bytes - PFX_BYTES]);
 
         copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", [dummy_out] "=g" (dummy) : [dummy_in] "i" (0) );
@@ -4967,7 +4967,7 @@ x86_emulate(
         if ( !mode_64bit() )
             vex.w = 0;
         opc[1] = modrm & 0xf8;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_VEX(opc, vex);
         ea.reg = decode_gpr(&_regs, modrm_rm);
@@ -5010,7 +5010,7 @@ x86_emulate(
         if ( !mode_64bit() )
             vex.w = 0;
         opc[1] = modrm & 0xc7;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_VEX(opc, vex);
         invoke_stub("", "", "=a" (dst.val) : [dummy] "i" (0));
@@ -5040,7 +5040,7 @@ x86_emulate(
         opc = init_prefixes(stub);
         opc[0] = b;
         opc[1] = modrm;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_VEX(opc, vex);
         _regs.eflags &= ~EFLAGS_MASK;
@@ -5608,7 +5608,7 @@ x86_emulate(
         if ( !mode_64bit() )
             vex.w = 0;
         opc[1] = modrm & 0xc7;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", "=a" (ea.val) : [dummy] "i" (0));
@@ -5726,7 +5726,7 @@ x86_emulate(
             opc[1] &= 0x38;
         }
         insn_bytes = PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
         if ( vex.opcx == vex_none )
         {
             /* Cover for extra prefix byte. */
@@ -6006,7 +6006,7 @@ x86_emulate(
         pvex->b = !mode_64bit() || (vex.reg >> 3);
         opc[1] = 0xc0 | (~vex.reg & 7);
         pvex->reg = 0xf;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=a" (ea.val) : [dummy] "i" (0));
         put_stub(stub);
@@ -6290,7 +6290,7 @@ x86_emulate(
             evex.w = 0;
         opc[1] = modrm & 0xf8;
         insn_bytes = EVEX_PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_EVEX(opc, evex);
         invoke_stub("", "", "=g" (dummy) : "a" (src.val));
@@ -6389,7 +6389,7 @@ x86_emulate(
         pvex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pvex->reg = 0xf;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
 
@@ -6459,7 +6459,7 @@ x86_emulate(
         pvex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pvex->reg = 0xf;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "+m" (*mmvalp) : "a" (mmvalp));
 
@@ -6515,7 +6515,7 @@ x86_emulate(
         pevex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pevex->RX = 1;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
 
@@ -6580,7 +6580,7 @@ x86_emulate(
         pevex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pevex->RX = 1;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "+m" (*mmvalp) : "a" (mmvalp));
 
@@ -6594,7 +6594,7 @@ x86_emulate(
         opc[2] = 0x90;
         /* Use (%rax) as source. */
         opc[3] = evex.opmsk << 3;
-        opc[4] = 0xc3;
+        place_ret(&opc[4]);
 
         invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
         put_stub(stub);
@@ -6688,7 +6688,7 @@ x86_emulate(
         pevex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pevex->RX = 1;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
 
@@ -6766,7 +6766,7 @@ x86_emulate(
         opc[2] = 0x90;
         /* Use (%rax) as source. */
         opc[3] = evex.opmsk << 3;
-        opc[4] = 0xc3;
+        place_ret(&opc[4]);
 
         invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
         put_stub(stub);
@@ -6848,7 +6848,7 @@ x86_emulate(
         pevex->r = !mode_64bit() || !(state->sib_index & 0x08);
         pevex->R = !mode_64bit() || !(state->sib_index & 0x10);
         pevex->RX = 1;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=m" (index) : "a" (&index));
         put_stub(stub);
@@ -7058,7 +7058,7 @@ x86_emulate(
         pvex->reg = 0xf; /* rAX */
         buf[3] = b;
         buf[4] = 0x09; /* reg=rCX r/m=(%rCX) */
-        buf[5] = 0xc3;
+        place_ret(&buf[5]);
 
         src.reg = decode_vex_gpr(vex.reg, &_regs, ctxt);
         emulate_stub([dst] "=&c" (dst.val), "[dst]" (&src.val), "a" (*src.reg));
@@ -7094,7 +7094,7 @@ x86_emulate(
         pvex->reg = 0xf; /* rAX */
         buf[3] = b;
         buf[4] = (modrm & 0x38) | 0x01; /* r/m=(%rCX) */
-        buf[5] = 0xc3;
+        place_ret(&buf[5]);
 
         dst.reg = decode_vex_gpr(vex.reg, &_regs, ctxt);
         emulate_stub("=&a" (dst.val), "c" (&src.val));
@@ -7335,7 +7335,7 @@ x86_emulate(
             evex.w = vex.w = 0;
         opc[1] = modrm & 0x38;
         opc[2] = imm1;
-        opc[3] = 0xc3;
+        place_ret(&opc[3]);
         if ( vex.opcx == vex_none )
         {
             /* Cover for extra prefix byte. */
@@ -7502,7 +7502,7 @@ x86_emulate(
             insn_bytes = PFX_BYTES + 3;
             copy_VEX(opc, vex);
         }
-        opc[3] = 0xc3;
+        place_ret(&opc[3]);
 
         /* Latch MXCSR - we may need to restore it below. */
         invoke_stub("stmxcsr %[mxcsr]", "",
@@ -7748,7 +7748,7 @@ x86_emulate(
         }
         opc[2] = imm1;
         insn_bytes = PFX_BYTES + 3;
-        opc[3] = 0xc3;
+        place_ret(&opc[3]);
         if ( vex.opcx == vex_none )
         {
             /* Cover for extra prefix byte. */
@@ -8094,7 +8094,7 @@ x86_emulate(
         pxop->reg = 0xf; /* rAX */
         buf[3] = b;
         buf[4] = (modrm & 0x38) | 0x01; /* r/m=(%rCX) */
-        buf[5] = 0xc3;
+        place_ret(&buf[5]);
 
         dst.reg = decode_vex_gpr(vex.reg, &_regs, ctxt);
         emulate_stub([dst] "=&a" (dst.val), "c" (&src.val));
@@ -8203,7 +8203,7 @@ x86_emulate(
         buf[3] = b;
         buf[4] = 0x09; /* reg=rCX r/m=(%rCX) */
         *(uint32_t *)(buf + 5) = imm1;
-        buf[9] = 0xc3;
+        place_ret(&buf[9]);
 
         emulate_stub([dst] "=&c" (dst.val), "[dst]" (&src.val));
 
@@ -8293,12 +8293,12 @@ x86_emulate(
             BUG();
         if ( evex_encoded() )
         {
-            opc[insn_bytes - EVEX_PFX_BYTES] = 0xc3;
+            place_ret(&opc[insn_bytes - EVEX_PFX_BYTES]);
             copy_EVEX(opc, evex);
         }
         else
         {
-            opc[insn_bytes - PFX_BYTES] = 0xc3;
+            place_ret(&opc[insn_bytes - PFX_BYTES]);
             copy_REX_VEX(opc, rex_prefix, vex);
         }
 
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/thunk: Build Xen with Return Thunks

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.

In order to mitigate this, build with return thunks and arrange for
__x86_return_thunk to be (mis)aligned in the same manner as
__x86_indirect_thunk_* so the RET instruction is placed in a safe location.

place_ret() needs to conditionally emit JMP __x86_return_thunk instead of RET.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
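
As an aside (not part of the patch), the safety test which place_ret()
applies below boils down to a single address bit: a 64-byte cacheline is
unsafe in its first 32 bytes, i.e. whenever bit 5 of the address is clear.
A minimal C sketch of that predicate:

    #include <stdbool.h>
    #include <stdint.h>

    /* A RET at @addr is ITS-safe iff it sits in the second half of its
     * 64-byte cacheline, i.e. iff bit 5 (0x20) of the address is set.
     * place_ret() emits JMP __x86_return_thunk otherwise. */
    static bool its_safe_ret_position(uintptr_t addr)
    {
        return addr & 0x20;
    }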

diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
index 7e03e4bc5546..4542ea8408c7 100644
--- a/xen/arch/x86/Kconfig
+++ b/xen/arch/x86/Kconfig
@@ -37,9 +37,14 @@ config ARCH_DEFCONFIG
 	default "arch/x86/configs/x86_64_defconfig"
 
 config CC_HAS_INDIRECT_THUNK
+	# GCC >= 8 or Clang >= 6
 	def_bool $(cc-option,-mindirect-branch-register) || \
 	         $(cc-option,-mretpoline-external-thunk)
 
+config CC_HAS_RETURN_THUNK
+	# GCC >= 8 or Clang >= 15
+	def_bool $(cc-option,-mfunction-return=thunk-extern)
+
 config HAS_AS_CET_SS
 	# binutils >= 2.29 or LLVM >= 6
 	def_bool $(as-instr,wrssq %rax$(comma)0;setssbsy)
diff --git a/xen/arch/x86/Makefile b/xen/arch/x86/Makefile
index a7e5a82689de..27806a81aca8 100644
--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -43,6 +43,7 @@ obj-$(CONFIG_LIVEPATCH) += livepatch.o
 obj-y += msi.o
 obj-y += msr.o
 obj-$(CONFIG_INDIRECT_THUNK) += indirect-thunk.o
+obj-$(CONFIG_RETURN_THUNK) += indirect-thunk.o
 obj-$(CONFIG_PV) += ioport_emulate.o
 obj-y += irq.o
 obj-$(CONFIG_KEXEC) += machine_kexec.o
diff --git a/xen/arch/x86/acpi/wakeup_prot.S b/xen/arch/x86/acpi/wakeup_prot.S
index 66f799339913..97bd676aaee2 100644
--- a/xen/arch/x86/acpi/wakeup_prot.S
+++ b/xen/arch/x86/acpi/wakeup_prot.S
@@ -133,7 +133,7 @@ ENTRY(s3_resume)
         pop     %r12
         pop     %rbx
         pop     %rbp
-        ret
+        RET
 
 .data
         .align 16
diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
index ec451d962c10..1b71ae959abe 100644
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -137,16 +137,45 @@ void init_or_livepatch add_nops(void *insns, unsigned int len)
     }
 }
 
+void nocall __x86_return_thunk(void);
+
 /*
  * Place a return at @ptr.  @ptr must be in the writable alias of a stub.
  *
+ * When CONFIG_RETURN_THUNK is active, this may be a JMP __x86_return_thunk
+ * instead, depending on the safety of @ptr with respect to Indirect Target
+ * Selection.
+ *
  * Returns the next position to write into the stub.
  */
 void *place_ret(void *ptr)
 {
+    unsigned long addr = (unsigned long)ptr;
     uint8_t *p = ptr;
 
-    *p++ = 0xc3;
+    /*
+     * When Return Thunks are used, if a RET would be unsafe at this location
+     * with respect to Indirect Target Selection (i.e. if addr is in the first
+     * half of a cacheline), insert a JMP __x86_return_thunk instead.
+     *
+     * The displacement needs to be relative to the executable alias of the
+     * stub, not to @ptr which is the writeable alias.
+     */
+    if ( IS_ENABLED(CONFIG_RETURN_THUNK) && !(addr & 0x20) )
+    {
+        long stub_va = (this_cpu(stubs.addr) & PAGE_MASK) + (addr & ~PAGE_MASK);
+        long disp = (long)__x86_return_thunk - (stub_va + 5);
+
+        BUG_ON((int32_t)disp != disp);
+
+        *p++ = 0xe9;
+        *(int32_t *)p = disp;
+        p += 4;
+    }
+    else
+    {
+        *p++ = 0xc3;
+    }
 
     return p;
 }
diff --git a/xen/arch/x86/arch.mk b/xen/arch/x86/arch.mk
index b88d097a844b..85d3e7cbfeeb 100644
--- a/xen/arch/x86/arch.mk
+++ b/xen/arch/x86/arch.mk
@@ -46,6 +46,9 @@ CFLAGS-$(CONFIG_CC_IS_GCC) += -fno-jump-tables
 CFLAGS-$(CONFIG_CC_IS_CLANG) += -mretpoline-external-thunk
 endif
 
+# Compile with return thunk support if selected.
+CFLAGS-$(CONFIG_RETURN_THUNK) += -mfunction-return=thunk-extern
+
 # Disable the addition of a .note.gnu.property section to object files when
 # livepatch support is enabled.  The contents of that section can change
 # depending on the instructions used, and livepatch-build-tools doesn't know
diff --git a/xen/arch/x86/bhb-thunk.S b/xen/arch/x86/bhb-thunk.S
index 52625f4e2c17..7f92201a3cbb 100644
--- a/xen/arch/x86/bhb-thunk.S
+++ b/xen/arch/x86/bhb-thunk.S
@@ -23,7 +23,7 @@ FUNC(clear_bhb_tsx)
 0:      .byte 0xc6, 0xf8, 0             /* xabort $0 */
         int3
 1:
-        ret
+        RET
 END(clear_bhb_tsx)
 
 /*
diff --git a/xen/arch/x86/clear_page.S b/xen/arch/x86/clear_page.S
index d6c076f1d8bc..dc3c3c26bfb7 100644
--- a/xen/arch/x86/clear_page.S
+++ b/xen/arch/x86/clear_page.S
@@ -1,6 +1,8 @@
         .file __FILE__
 
 #include <xen/linkage.h>
+
+#include <asm/asm_defns.h>
 #include <asm/page.h>
 
 FUNC(clear_page_sse2)
@@ -16,5 +18,5 @@ FUNC(clear_page_sse2)
         jnz     0b
 
         sfence
-        ret
+        RET
 END(clear_page_sse2)
diff --git a/xen/arch/x86/copy_page.S b/xen/arch/x86/copy_page.S
index c3c436545bac..e43e5370c815 100644
--- a/xen/arch/x86/copy_page.S
+++ b/xen/arch/x86/copy_page.S
@@ -1,6 +1,8 @@
         .file __FILE__
 
 #include <xen/linkage.h>
+
+#include <asm/asm_defns.h>
 #include <asm/page.h>
 
 #define src_reg %rsi
@@ -41,5 +43,5 @@ FUNC(copy_page_sse2)
         movnti  tmp4_reg, 3*WORD_SIZE(dst_reg)
 
         sfence
-        ret
+        RET
 END(copy_page_sse2)
diff --git a/xen/arch/x86/efi/check.c b/xen/arch/x86/efi/check.c
index 9e473faad3c9..23ba30abf330 100644
--- a/xen/arch/x86/efi/check.c
+++ b/xen/arch/x86/efi/check.c
@@ -3,6 +3,9 @@ int __attribute__((__ms_abi__)) test(int i)
     return i;
 }
 
+/* In case -mfunction-return is in use. */
+void __x86_return_thunk(void) {};
+
 /*
  * Populate an array with "addresses" of relocatable and absolute values.
  * This is to probe ld for (a) emitting base relocations at all and (b) not
diff --git a/xen/arch/x86/include/asm/asm-defns.h b/xen/arch/x86/include/asm/asm-defns.h
index 32d6b4491063..97ebe21298a2 100644
--- a/xen/arch/x86/include/asm/asm-defns.h
+++ b/xen/arch/x86/include/asm/asm-defns.h
@@ -58,6 +58,12 @@
     .endif
 .endm
 
+#ifdef CONFIG_RETURN_THUNK
+# define RET jmp __x86_return_thunk
+#else
+# define RET ret
+#endif
+
 #ifdef CONFIG_XEN_IBT
 # define ENDBR64 endbr64
 #else
diff --git a/xen/arch/x86/indirect-thunk.S b/xen/arch/x86/indirect-thunk.S
index c4b978d67b8e..26dad15f12c9 100644
--- a/xen/arch/x86/indirect-thunk.S
+++ b/xen/arch/x86/indirect-thunk.S
@@ -15,6 +15,8 @@
 #undef SYM_ALIGN
 #define SYM_ALIGN(align...)
 
+#ifdef CONFIG_INDIRECT_THUNK
+
 .macro IND_THUNK_RETPOLINE reg:req
         call 1f
         int3
@@ -62,3 +64,25 @@ END(__x86_indirect_thunk_\reg)
 .irp reg, ax, cx, dx, bx, bp, si, di, 8, 9, 10, 11, 12, 13, 14, 15
         GEN_INDIRECT_THUNK reg=r\reg
 .endr
+
+#endif /* CONFIG_INDIRECT_THUNK */
+
+#ifdef CONFIG_RETURN_THUNK
+        .section .text.entry.__x86_return_thunk, "ax", @progbits
+
+        /*
+         * The Indirect Target Selection speculative vulnerability means that
+         * indirect branches (including RETs) are unsafe when in the first
+         * half of a cacheline.  Arrange for them to be in the second half.
+         *
+         * Align to 64, then skip 32.
+         */
+        .balign 64
+        .fill 32, 1, 0xcc
+
+FUNC(__x86_return_thunk)
+        ret
+        int3 /* Halt straight-line speculation */
+END(__x86_return_thunk)
+
+#endif /* CONFIG_RETURN_THUNK */
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index ff5d1c9f8634..295d847ea24c 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -131,7 +131,7 @@ static io_emul_stub_t *io_emul_stub_setup(struct priv_op_ctxt *ctxt, u8 opcode,
     BUILD_BUG_ON(STUB_BUF_SIZE / 2 <
                  (sizeof(prologue) + sizeof(epilogue) + 10 /* 2x call */ +
                   MAX(3 /* default stub */, IOEMUL_QUIRK_STUB_BYTES) +
-                  1 /* ret */));
+                  (IS_ENABLED(CONFIG_RETURN_THUNK) ? 5 : 1) /* ret */));
     /* Runtime confirmation that we haven't clobbered an adjacent stub. */
     BUG_ON(STUB_BUF_SIZE / 2 < (p - ctxt->io_emul_stub));
 
diff --git a/xen/arch/x86/pv/gpr_switch.S b/xen/arch/x86/pv/gpr_switch.S
index 5409ad3b1447..362b5d241623 100644
--- a/xen/arch/x86/pv/gpr_switch.S
+++ b/xen/arch/x86/pv/gpr_switch.S
@@ -26,7 +26,7 @@ FUNC(load_guest_gprs)
         movq  UREGS_r15(%rdi), %r15
         movq  UREGS_rcx(%rdi), %rcx
         movq  UREGS_rdi(%rdi), %rdi
-        ret
+        RET
 END(load_guest_gprs)
 
 /* Save guest GPRs.  Parameter on the stack above the return address. */
@@ -48,5 +48,5 @@ FUNC(save_guest_gprs)
         movq  %rbx, UREGS_rbx(%rdi)
         movq  %rdx, UREGS_rdx(%rdi)
         movq  %rcx, UREGS_rcx(%rdi)
-        ret
+        RET
 END(save_guest_gprs)
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 35351044f901..019a0a81f4a7 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -569,6 +569,9 @@ static void __init print_details(enum ind_thunk thunk)
 #ifdef CONFIG_INDIRECT_THUNK
                " INDIRECT_THUNK"
 #endif
+#ifdef CONFIG_RETURN_THUNK
+               " RETURN_THUNK"
+#endif
 #ifdef CONFIG_SHADOW_PAGING
                " SHADOW_PAGING"
 #endif
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index a99646c0cd4e..18f46c78cfbe 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -180,7 +180,7 @@ FUNC(cr4_pv32_restore)
         or    cr4_pv32_mask(%rip), %rax
         mov   %rax, %cr4
         mov   %rax, (%rcx)
-        ret
+        RET
 0:
 #ifndef NDEBUG
         /* Check that _all_ of the bits intended to be set actually are. */
@@ -198,7 +198,7 @@ FUNC(cr4_pv32_restore)
 1:
 #endif
         xor   %eax, %eax
-        ret
+        RET
 END(cr4_pv32_restore)
 
 FUNC(compat_syscall)
@@ -329,7 +329,7 @@ __UNLIKELY_END(compat_bounce_null_selector)
         xor   %eax, %eax
         mov   %ax,  TRAPBOUNCE_cs(%rdx)
         mov   %al,  TRAPBOUNCE_flags(%rdx)
-        ret
+        RET
 
 .section .fixup,"ax"
 .Lfx13:
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 9b0cdb76408b..eb62e7c329bd 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -604,7 +604,7 @@ __UNLIKELY_END(create_bounce_frame_bad_bounce_ip)
         xor   %eax, %eax
         mov   %rax, TRAPBOUNCE_eip(%rdx)
         mov   %al,  TRAPBOUNCE_flags(%rdx)
-        ret
+        RET
 
         .pushsection .fixup, "ax", @progbits
         # Numeric tags below represent the intended overall %rsi adjustment.
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index 9a1dfe1b340a..506993867502 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -82,6 +82,7 @@ SECTIONS
        . = ALIGN(PAGE_SIZE);
        _stextentry = .;
        *(.text.entry)
+       *(.text.entry.*)
        . = ALIGN(PAGE_SIZE);
        _etextentry = .;
 
diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index 565ceda741b9..da0fa7527643 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -130,6 +130,17 @@ config INDIRECT_THUNK
 	  When enabled, indirect branches are implemented using a new construct
 	  called "retpoline" that prevents speculation.
 
+config RETURN_THUNK
+	bool "Out-of-line Returns"
+	depends on CC_HAS_RETURN_THUNK
+	default INDIRECT_THUNK
+	help
+	  Compile Xen with out-of-line returns.
+
+	  This allows Xen to mitigate a variety of speculative vulnerabilities
+	  by choosing a hardware-dependent instruction sequence to implement
+	  function returns safely.
+
 config SPECULATIVE_HARDEN_ARRAY
 	bool "Speculative Array Hardening"
 	default y
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Synthesise ITS_NO to guests on unaffected hardware

It is easier to express feature word 17 in terms of word 16 + bits [32, 64),
as that is how the layout is given in Intel's documentation.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
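
To see the arithmetic (illustrative only): ITS_NO is bit 62 of
MSR_ARCH_CAPS, and word 17 covers ARCH_CAPS bits [32, 64), so the same
featureset position can be spelled either way:

    #include <assert.h>

    /* 16*32 + 62 == 574 == 17*32 + 30; the former matches Intel's
     * numbering of MSR_ARCH_CAPS bits. */
    static_assert(16 * 32 + 62 == 17 * 32 + 30, "same featureset bit");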

diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index 9bc553681f4a..1729ba0c3097 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -216,6 +216,7 @@ static inline bool boot_cpu_has(unsigned int feat)
 #define cpu_has_gds_no          boot_cpu_has(X86_FEATURE_GDS_NO)
 #define cpu_has_rfds_no         boot_cpu_has(X86_FEATURE_RFDS_NO)
 #define cpu_has_rfds_clear      boot_cpu_has(X86_FEATURE_RFDS_CLEAR)
+#define cpu_has_its_no          boot_cpu_has(X86_FEATURE_ITS_NO)
 
 /* Synthesized. */
 #define cpu_has_arch_perfmon    boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 019a0a81f4a7..94cdbd521c4d 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -1781,6 +1781,90 @@ static void __init bhi_calculations(void)
     }
 }
 
+/*
+ * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/indirect-target-selection.html
+ */
+static void __init its_calculations(void)
+{
+    /*
+     * Indirect Target Selection is a Branch Prediction bug whereby certain
+     * indirect branches (including RETs) get predicted using a direct branch
+     * target, rather than a suitable indirect target, bypassing hardware
+     * isolation protections.
+     *
+     * ITS affects Core (but not Atom) processors starting from the
+     * introduction of eIBRS, up to but not including Golden Cove cores
+     * (checked here with BHI_CTRL).
+     *
+     * The ITS_NO feature is not expected to be enumerated by hardware, and is
+     * only for VMMs to synthesise for guests.
+     *
+     * ITS comes in 3 flavours:
+     *
+     *   1) Across-IBPB.  Indirect branches after the IBPB can be controlled
+     *      by direct targets which existed prior to the IBPB.  This is
+     *      addressed in the IPU 2025.1 microcode drop, and has no other
+     *      software interaction.
+     *
+     *   2) Guest/Host.  Indirect branches in the VMM can be controlled by
+     *      direct targets from the guest.  This applies equally to PV guests
+     *      (Ring3) and HVM guests (VMX), and applies to all Skylake-uarch
+     *      cores with eIBRS.
+     *
+     *   3) Intra-mode.  Indirect branches in the VMM can be controlled by
+     *      other execution in the same mode.
+     */
+
+    /*
+     * If we can see ITS_NO, or we're virtualised, do nothing.  We are, or
+     * may migrate to, somewhere unsafe.
+     */
+    if ( cpu_has_its_no || cpu_has_hypervisor )
+        return;
+
+    /* ITS is only known to affect Intel processors at this time. */
+    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL )
+        return;
+
+    /*
+     * ITS does not exist on:
+     *  - non-Family 6 CPUs
+     *  - those without eIBRS
+     *  - those with BHI_CTRL
+     * but we still need to synthesise ITS_NO.
+     */
+    if ( boot_cpu_data.x86 != 6 || !cpu_has_eibrs ||
+         boot_cpu_has(X86_FEATURE_BHI_CTRL) )
+        goto synthesise;
+
+    switch ( boot_cpu_data.x86_model )
+    {
+        /* These Skylake-uarch cores suffer cases #2 and #3. */
+    case INTEL_FAM6_SKYLAKE_X:
+    case INTEL_FAM6_KABYLAKE_L:
+    case INTEL_FAM6_KABYLAKE:
+    case INTEL_FAM6_COMETLAKE:
+    case INTEL_FAM6_COMETLAKE_L:
+        return;
+
+        /* These Sunny/Willow/Cypress Cove cores suffer case #3. */
+    case INTEL_FAM6_ICELAKE_X:
+    case INTEL_FAM6_ICELAKE_D:
+    case INTEL_FAM6_ICELAKE_L:
+    case INTEL_FAM6_TIGERLAKE_L:
+    case INTEL_FAM6_TIGERLAKE:
+    case INTEL_FAM6_ROCKETLAKE:
+        return;
+
+    default:
+        break;
+    }
+
+    /* Platforms remaining are not believed to be vulnerable to ITS. */
+ synthesise:
+    setup_force_cpu_cap(X86_FEATURE_ITS_NO);
+}
+
 void spec_ctrl_init_domain(struct domain *d)
 {
     bool pv = is_pv_domain(d);
@@ -2331,6 +2415,8 @@ void __init init_speculation_mitigations(void)
 
     bhi_calculations();
 
+    its_calculations();
+
     print_details(thunk);
 
     /*
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index 9c98e4992861..4d9e468af653 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -365,7 +365,8 @@ XEN_CPUFEATURE(GDS_NO,             16*32+26) /*A  No Gather Data Sampling */
 XEN_CPUFEATURE(RFDS_NO,            16*32+27) /*A  No Register File Data Sampling */
 XEN_CPUFEATURE(RFDS_CLEAR,         16*32+28) /*!A| Register File(s) cleared by VERW */
 
-/* Intel-defined CPU features, MSR_ARCH_CAPS 0x10a.edx, word 17 */
+/* Intel-defined CPU features, MSR_ARCH_CAPS 0x10a.edx, word 17 (express in terms of word 16) */
+XEN_CPUFEATURE(ITS_NO,             16*32+62) /*!A No Indirect Target Selection */
 
 #endif /* XEN_CPUFEATURE */
 
diff --git a/xen/tools/gen-cpuid.py b/xen/tools/gen-cpuid.py
index 601eec608983..dc33ca3181b1 100755
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -51,7 +51,7 @@ def parse_definitions(state):
         r"\s+/\*([\w!|]*) .*$")
 
     word_regex = re.compile(
-        r"^/\* .* word (\d*) \*/$")
+        r"^/\* .* word (\d*) .*\*/$")
     last_word = -1
 
     this = sys.modules[__name__]
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/alternative: Support replacements when a feature is not present

Use the top bit of a->cpuid to express inverted polarity.  This requires
stripping the top bit back out when performing the sanity checks.

Although it is only used once, introduce a 'replace' boolean to express the
decision more clearly in _apply_alternatives().

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
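
A minimal sketch (not the Xen implementation) of the resulting decision
logic, with the polarity flag carried in the top bit of the cpuid field:

    #include <stdbool.h>
    #include <stdint.h>

    #define ALT_FLAG_NOT (1u << 15)

    /* Should this alternative be applied?  With inverted polarity
     * (ALT_NOT), the feature test is simply XORed with the flag. */
    static bool should_replace(uint16_t cpuid_field, bool feature_present)
    {
        bool inv = cpuid_field & ALT_FLAG_NOT;

        return feature_present ^ inv;
    }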

diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
index 1ba35cb9ede9..88c90044c20d 100644
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -197,6 +197,8 @@ static int init_or_livepatch _apply_alternatives(struct alt_instr *start,
         uint8_t *repl = ALT_REPL_PTR(a);
         uint8_t buf[MAX_PATCH_LEN];
         unsigned int total_len = a->orig_len + a->pad_len;
+        unsigned int feat = a->cpuid & ~ALT_FLAG_NOT;
+        bool inv = a->cpuid & ALT_FLAG_NOT, replace;
 
         if ( a->repl_len > total_len )
         {
@@ -214,11 +216,11 @@ static int init_or_livepatch _apply_alternatives(struct alt_instr *start,
             return -ENOSPC;
         }
 
-        if ( a->cpuid >= NCAPINTS * 32 )
+        if ( feat >= NCAPINTS * 32 )
         {
              printk(XENLOG_ERR
                    "Alt for %ps, feature %#x outside of featureset range %#x\n",
-                   ALT_ORIG_PTR(a), a->cpuid, NCAPINTS * 32);
+                   ALT_ORIG_PTR(a), feat, NCAPINTS * 32);
             return -ERANGE;
         }
 
@@ -243,8 +245,14 @@ static int init_or_livepatch _apply_alternatives(struct alt_instr *start,
             continue;
         }
 
+        /*
+         * Should a replacement be performed?  Most replacements have positive
+         * polarity, but we support negative polarity too.
+         */
+        replace = boot_cpu_has(feat) ^ inv;
+
         /* If there is no replacement to make, see about optimising the nops. */
-        if ( !boot_cpu_has(a->cpuid) )
+        if ( !replace )
         {
            /* Origin site already touched?  Don't nop anything. */
             if ( base->priv )
diff --git a/xen/arch/x86/include/asm/alternative-asm.h b/xen/arch/x86/include/asm/alternative-asm.h
index 83e8594f0eaf..ed3f4cb055e8 100644
--- a/xen/arch/x86/include/asm/alternative-asm.h
+++ b/xen/arch/x86/include/asm/alternative-asm.h
@@ -12,7 +12,7 @@
  * instruction. See apply_alternatives().
  */
 .macro altinstruction_entry orig, repl, feature, orig_len, repl_len, pad_len
-    .if \feature >= NCAPINTS * 32
+    .if ((\feature) & ~ALT_FLAG_NOT) >= NCAPINTS * 32
         .error "alternative feature outside of featureset range"
     .endif
     .long \orig - .
diff --git a/xen/arch/x86/include/asm/alternative.h b/xen/arch/x86/include/asm/alternative.h
index 38472fb58e2d..ef63e257ad78 100644
--- a/xen/arch/x86/include/asm/alternative.h
+++ b/xen/arch/x86/include/asm/alternative.h
@@ -1,6 +1,13 @@
 #ifndef __X86_ALTERNATIVE_H__
 #define __X86_ALTERNATIVE_H__
 
+/*
+ * Common to both C and ASM.  Express a replacement when a feature is not
+ * available.
+ */
+#define ALT_FLAG_NOT (1 << 15)
+#define ALT_NOT(x) (ALT_FLAG_NOT | (x))
+
 #ifdef __ASSEMBLY__
 #include <asm/alternative-asm.h>
 #else
@@ -12,7 +19,7 @@
 struct __packed alt_instr {
     int32_t  orig_offset;   /* original instruction */
     int32_t  repl_offset;   /* offset to replacement instruction */
-    uint16_t cpuid;         /* cpuid bit set for replacement */
+    uint16_t cpuid;         /* cpuid bit set for replacement (top bit is polarity) */
     uint8_t  orig_len;      /* length of original instruction */
     uint8_t  repl_len;      /* length of new instruction */
     uint8_t  pad_len;       /* length of build-time padding */
@@ -60,7 +67,7 @@ extern void alternative_branches(void);
                     alt_repl_len(n2)) "-" alt_orig_len)
 
 #define ALTINSTR_ENTRY(feature, num)                                    \
-        " .if " STR(feature) " >= " STR(NCAPINTS * 32) "\n"             \
+        " .if (" STR(feature & ~ALT_FLAG_NOT) ") >= " STR(NCAPINTS * 32) "\n" \
         " .error \"alternative feature outside of featureset range\"\n" \
         " .endif\n"                                                     \
         " .long .LXEN%=_orig_s - .\n"             /* label           */ \

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/guest: Remove use of the Xen hypercall_page

In order to protect against ITS, Xen needs to start using return thunks.
Therefore the advice in XSA-466 becomes relevant, and the hypercall_page needs
to be removed.

Implement early_hypercall(), with infrastructure to figure out the correct
instruction on first use.  Use ALTERNATIVE()s to result in inline hypercalls,
including the ALT_NOT() form so we only need a single synthetic feature bit.

No overall change.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
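
In C terms, the early-boot flow added below looks roughly as follows.
This is a sketch only: do_vmcall()/do_vmmcall() are hypothetical
stand-ins for the raw VMCALL/VMMCALL instructions.

    #include <stdint.h>

    extern void early_hypercall_setup(void); /* vendor probe, runs once */
    extern long do_vmcall(long op, long a1);  /* hypothetical helper */
    extern long do_vmmcall(long op, long a1); /* hypothetical helper */

    /* < 0 needs setup, 0 => vmcall (Intel-like), > 0 => vmmcall (AMD-like) */
    static int8_t early_hypercall_insn = -1;

    static long early_hypercall(long op, long a1)
    {
        if ( early_hypercall_insn < 0 )
            early_hypercall_setup();

        return early_hypercall_insn ? do_vmmcall(op, a1)
                                    : do_vmcall(op, a1);
    }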

diff --git a/xen/arch/x86/guest/xen/Makefile b/xen/arch/x86/guest/xen/Makefile
index 26fb4b1007c0..8b3250aa8886 100644
--- a/xen/arch/x86/guest/xen/Makefile
+++ b/xen/arch/x86/guest/xen/Makefile
@@ -1,4 +1,4 @@
-obj-y += hypercall_page.o
+obj-bin-y += hypercall.init.o
 obj-y += xen.o
 
 obj-bin-$(CONFIG_PVH_GUEST) += pvh-boot.init.o
diff --git a/xen/arch/x86/guest/xen/hypercall.S b/xen/arch/x86/guest/xen/hypercall.S
new file mode 100644
index 000000000000..05e429794cc4
--- /dev/null
+++ b/xen/arch/x86/guest/xen/hypercall.S
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#include <xen/linkage.h>
+
+        .section .init.text, "ax", @progbits
+
+        /*
+         * Used during early boot, before alternatives have run and inlined
+         * the appropriate instruction.  Called using the hypercall ABI.
+         */
+FUNC(early_hypercall)
+        cmpb    $0, early_hypercall_insn(%rip)
+        jl      .L_setup
+        je      1f
+
+        vmmcall
+        ret
+
+1:      vmcall
+        ret
+
+.L_setup:
+        /*
+         * When setting up the first time around, all registers need
+         * preserving.  Save the non-callee-saved ones.
+         */
+        push    %r11
+        push    %r10
+        push    %r9
+        push    %r8
+        push    %rdi
+        push    %rsi
+        push    %rdx
+        push    %rcx
+        push    %rax
+
+        call    early_hypercall_setup
+
+        pop     %rax
+        pop     %rcx
+        pop     %rdx
+        pop     %rsi
+        pop     %rdi
+        pop     %r8
+        pop     %r9
+        pop     %r10
+        pop     %r11
+
+        jmp     early_hypercall
+END(early_hypercall)
diff --git a/xen/arch/x86/guest/xen/hypercall_page.S b/xen/arch/x86/guest/xen/hypercall_page.S
deleted file mode 100644
index 7ab55fc1f6e6..000000000000
--- a/xen/arch/x86/guest/xen/hypercall_page.S
+++ /dev/null
@@ -1,76 +0,0 @@
-#include <asm/page.h>
-#include <asm/asm_defns.h>
-#include <public/xen.h>
-
-        .section ".text.page_aligned", "ax", @progbits
-
-DATA(hypercall_page, PAGE_SIZE)
-         /* Poisoned with `ret` for safety before hypercalls are set up. */
-        .fill PAGE_SIZE, 1, 0xc3
-END(hypercall_page)
-
-/*
- * Identify a specific hypercall in the hypercall page
- * @param name Hypercall name.
- */
-#define DECLARE_HYPERCALL(name)                                                 \
-        .globl HYPERCALL_ ## name;                                              \
-        .type  HYPERCALL_ ## name, STT_FUNC;                                    \
-        .size  HYPERCALL_ ## name, 32;                                          \
-        .set   HYPERCALL_ ## name, hypercall_page + __HYPERVISOR_ ## name * 32
-
-DECLARE_HYPERCALL(set_trap_table)
-DECLARE_HYPERCALL(mmu_update)
-DECLARE_HYPERCALL(set_gdt)
-DECLARE_HYPERCALL(stack_switch)
-DECLARE_HYPERCALL(set_callbacks)
-DECLARE_HYPERCALL(fpu_taskswitch)
-DECLARE_HYPERCALL(sched_op_compat)
-DECLARE_HYPERCALL(platform_op)
-DECLARE_HYPERCALL(set_debugreg)
-DECLARE_HYPERCALL(get_debugreg)
-DECLARE_HYPERCALL(update_descriptor)
-DECLARE_HYPERCALL(memory_op)
-DECLARE_HYPERCALL(multicall)
-DECLARE_HYPERCALL(update_va_mapping)
-DECLARE_HYPERCALL(set_timer_op)
-DECLARE_HYPERCALL(event_channel_op_compat)
-DECLARE_HYPERCALL(xen_version)
-DECLARE_HYPERCALL(console_io)
-DECLARE_HYPERCALL(physdev_op_compat)
-DECLARE_HYPERCALL(grant_table_op)
-DECLARE_HYPERCALL(vm_assist)
-DECLARE_HYPERCALL(update_va_mapping_otherdomain)
-DECLARE_HYPERCALL(iret)
-DECLARE_HYPERCALL(vcpu_op)
-DECLARE_HYPERCALL(set_segment_base)
-DECLARE_HYPERCALL(mmuext_op)
-DECLARE_HYPERCALL(xsm_op)
-DECLARE_HYPERCALL(nmi_op)
-DECLARE_HYPERCALL(sched_op)
-DECLARE_HYPERCALL(callback_op)
-DECLARE_HYPERCALL(xenoprof_op)
-DECLARE_HYPERCALL(event_channel_op)
-DECLARE_HYPERCALL(physdev_op)
-DECLARE_HYPERCALL(hvm_op)
-DECLARE_HYPERCALL(sysctl)
-DECLARE_HYPERCALL(domctl)
-DECLARE_HYPERCALL(kexec_op)
-DECLARE_HYPERCALL(argo_op)
-DECLARE_HYPERCALL(xenpmu_op)
-
-DECLARE_HYPERCALL(arch_0)
-DECLARE_HYPERCALL(arch_1)
-DECLARE_HYPERCALL(arch_2)
-DECLARE_HYPERCALL(arch_3)
-DECLARE_HYPERCALL(arch_4)
-DECLARE_HYPERCALL(arch_5)
-DECLARE_HYPERCALL(arch_6)
-DECLARE_HYPERCALL(arch_7)
-
-/*
- * Local variables:
- * tab-width: 8
- * indent-tabs-mode: nil
- * End:
- */
diff --git a/xen/arch/x86/guest/xen/xen.c b/xen/arch/x86/guest/xen/xen.c
index c17f5c447ab6..5c9f393c7514 100644
--- a/xen/arch/x86/guest/xen/xen.c
+++ b/xen/arch/x86/guest/xen/xen.c
@@ -26,7 +26,6 @@
 bool __read_mostly xen_guest;
 
 uint32_t __read_mostly xen_cpuid_base;
-extern char hypercall_page[];
 static struct rangeset *mem;
 
 DEFINE_PER_CPU(unsigned int, vcpu_id);
@@ -35,6 +34,50 @@ static struct vcpu_info *vcpu_info;
 static unsigned long vcpu_info_mapped[BITS_TO_LONGS(NR_CPUS)];
 DEFINE_PER_CPU(struct vcpu_info *, vcpu_info);
 
+/*
+ * Which instruction to use for early hypercalls:
+ *   < 0 setup
+ *     0 vmcall
+ *   > 0 vmmcall
+ */
+int8_t __initdata early_hypercall_insn = -1;
+
+/*
+ * Called once during the first hypercall to figure out which instruction to
+ * use.  Error handling options are limited.
+ */
+void asmlinkage __init early_hypercall_setup(void)
+{
+    BUG_ON(early_hypercall_insn != -1);
+
+    if ( !boot_cpu_data.x86_vendor )
+    {
+        unsigned int eax, ebx, ecx, edx;
+
+        cpuid(0, &eax, &ebx, &ecx, &edx);
+
+        boot_cpu_data.x86_vendor = x86_cpuid_lookup_vendor(ebx, ecx, edx);
+    }
+
+    switch ( boot_cpu_data.x86_vendor )
+    {
+    case X86_VENDOR_INTEL:
+    case X86_VENDOR_CENTAUR:
+    case X86_VENDOR_SHANGHAI:
+        early_hypercall_insn = 0;
+        setup_force_cpu_cap(X86_FEATURE_USE_VMCALL);
+        break;
+
+    case X86_VENDOR_AMD:
+    case X86_VENDOR_HYGON:
+        early_hypercall_insn = 1;
+        break;
+
+    default:
+        BUG();
+    }
+}
+
 static void __init find_xen_leaves(void)
 {
     uint32_t eax, ebx, ecx, edx, base;
@@ -337,9 +380,6 @@ const struct hypervisor_ops *__init xg_probe(void)
     if ( !xen_cpuid_base )
         return NULL;
 
-    /* Fill the hypercall page. */
-    wrmsrl(cpuid_ebx(xen_cpuid_base + 2), __pa(hypercall_page));
-
     xen_guest = true;
 
     return &ops;
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index ba3df174b76e..9e3ed21c026d 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -42,6 +42,7 @@ XEN_CPUFEATURE(XEN_SHSTK,         X86_SYNTH(26)) /* Xen uses CET Shadow Stacks *
 XEN_CPUFEATURE(XEN_IBT,           X86_SYNTH(27)) /* Xen uses CET Indirect Branch Tracking */
 XEN_CPUFEATURE(IBPB_ENTRY_PV,     X86_SYNTH(28)) /* MSR_PRED_CMD used by Xen for PV */
 XEN_CPUFEATURE(IBPB_ENTRY_HVM,    X86_SYNTH(29)) /* MSR_PRED_CMD used by Xen for HVM */
+XEN_CPUFEATURE(USE_VMCALL,        X86_SYNTH(30)) /* Use VMCALL instead of VMMCALL */
 
 /* Bug words follow the synthetic words. */
 #define X86_NR_BUG 1
diff --git a/xen/arch/x86/include/asm/guest/xen-hcall.h b/xen/arch/x86/include/asm/guest/xen-hcall.h
index 665b472d05ac..96004dec9909 100644
--- a/xen/arch/x86/include/asm/guest/xen-hcall.h
+++ b/xen/arch/x86/include/asm/guest/xen-hcall.h
@@ -30,9 +30,11 @@
     ({                                                                  \
         long res, tmp__;                                                \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__) ASM_CALL_CONSTRAINT              \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1))                                          \
             : "memory" );                                               \
         (type)res;                                                      \
@@ -42,10 +44,12 @@
     ({                                                                  \
         long res, tmp__;                                                \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__), "=S" (tmp__)                    \
               ASM_CALL_CONSTRAINT                                       \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1)), "2" ((long)(a2))                        \
             : "memory" );                                               \
         (type)res;                                                      \
@@ -55,10 +59,12 @@
     ({                                                                  \
         long res, tmp__;                                                \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__), "=S" (tmp__), "=d" (tmp__)      \
               ASM_CALL_CONSTRAINT                                       \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1)), "2" ((long)(a2)), "3" ((long)(a3))      \
             : "memory" );                                               \
         (type)res;                                                      \
@@ -69,10 +75,12 @@
         long res, tmp__;                                                \
         register long _a4 asm ("r10") = ((long)(a4));                   \
         asm volatile (                                                  \
-            "call hypercall_page + %c[offset]"                          \
+            ALTERNATIVE_2("call early_hypercall",                       \
+                          "vmmcall", ALT_NOT(X86_FEATURE_USE_VMCALL),   \
+                          "vmcall", X86_FEATURE_USE_VMCALL)             \
             : "=a" (res), "=D" (tmp__), "=S" (tmp__), "=d" (tmp__),     \
               "=&r" (tmp__) ASM_CALL_CONSTRAINT                         \
-            : [offset] "i" (hcall * 32),                                \
+            : "0" (hcall),                                              \
               "1" ((long)(a1)), "2" ((long)(a2)), "3" ((long)(a3)),     \
               "4" (_a4)                                                 \
             : "memory" );                                               \
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/thunk: (Mis)align __x86_indirect_thunk_* to mitigate ITS

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.

Arrange for __x86_indirect_thunk_* to always be in the second half.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
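
The effect of the padding, as a worked check (assuming thunks of at most
32 bytes, which holds for all the sequences generated here):

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* After ".balign 64" plus 32 bytes of 0xcc fill, a thunk entry
     * satisfies (entry % 64) == 32, so a thunk of up to 32 bytes lies
     * entirely in the ITS-safe second half of its cacheline. */
    static void check_thunk_placement(uintptr_t entry, size_t len)
    {
        assert(entry % 64 == 32);
        assert(len <= 32);
    }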

diff --git a/xen/arch/x86/indirect-thunk.S b/xen/arch/x86/indirect-thunk.S
index fd5493c22b16..c4b978d67b8e 100644
--- a/xen/arch/x86/indirect-thunk.S
+++ b/xen/arch/x86/indirect-thunk.S
@@ -11,6 +11,10 @@
 
 #include <asm/asm_defns.h>
 
+/* Alignment is dealt with explicitly here; override the respective macro. */
+#undef SYM_ALIGN
+#define SYM_ALIGN(align...)
+
 .macro IND_THUNK_RETPOLINE reg:req
         call 1f
         int3
@@ -35,6 +39,16 @@
 .macro GEN_INDIRECT_THUNK reg:req
         .section .text.__x86_indirect_thunk_\reg, "ax", @progbits
 
+        /*
+         * The Indirect Target Selection speculative vulnerability means that
+         * indirect branches (including RETs) are unsafe when in the first
+         * half of a cacheline.  Arrange for them to be in the second half.
+         *
+         * Align to 64, then skip 32.
+         */
+        .balign 64
+        .fill 32, 1, 0xcc
+
 FUNC(__x86_indirect_thunk_\reg)
         ALTERNATIVE_2 __stringify(IND_THUNK_RETPOLINE \reg),              \
         __stringify(IND_THUNK_LFENCE \reg), X86_FEATURE_IND_THUNK_LFENCE, \
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/thunk: (Mis)align the RETs in clear_bhb_loops() to mitigate ITS

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.

clear_bhb_loops() has a precise layout of branches.  The alignment, done for
performance, causes the RETs to always be in an unsafe position, and converting
those to return thunks changes the branching pattern.  While such a conversion
is believed to be safe, clear_bhb_loops() is also a performance-relevant
fastpath, so (mis)align the RETs to be in a safe position.

No functional change.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
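
The padding arithmetic, for reference: each .skip is sized so that the
RET itself lands on byte 32 of its cacheline.  In the first loop the
code ahead of the RET is a 5-byte CALL (fill of 32 - 5 = 27); in the
second it is 18 bytes (fill of 32 - 18 = 14).  As a sketch:

    /* Fill needed ahead of an aligned block so that a RET located
     * ret_off bytes into the block lands exactly at byte 32. */
    static unsigned int its_ret_fill(unsigned int ret_off)
    {
        return 32 - ret_off; /* e.g. 32 - 5 = 27, or 32 - 18 = 14 */
    }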

diff --git a/xen/arch/x86/bhb-thunk.S b/xen/arch/x86/bhb-thunk.S
index 678c00c5d06f..52625f4e2c17 100644
--- a/xen/arch/x86/bhb-thunk.S
+++ b/xen/arch/x86/bhb-thunk.S
@@ -50,7 +50,12 @@ END(clear_bhb_tsx)
  *   ret
  *
  * The CALL/RETs are necessary to prevent the Loop Stream Detector from
- * interfering.  The alignment is for performance and not safety.
+ * interfering.
+ *
+ * The .balign's are for performance, but they cause the RETs to be in unsafe
+ * positions with respect to Indirect Target Selection.  The .skips are to
+ * move the RETs into ITS-safe positions, rather than using the slowpath
+ * through __x86_return_thunk.
  *
  * The "short" sequence (5 and 5) is for CPUs prior to Alder Lake / Sapphire
  * Rapids (i.e. Cores prior to Golden Cove and/or Gracemont).
@@ -66,12 +71,14 @@ FUNC(clear_bhb_loops)
         jmp     5f
         int3
 
-        .align 64
+        .balign 64
+        .skip   32 - (.Lr1 - 1f), 0xcc
 1:      call    2f
-        ret
+.Lr1:   ret
         int3
 
-        .align 64
+        .balign 64
+        .skip   32 - 18 /* (.Lr2 - 2f) but Clang IAS doesn't like this */, 0xcc
 2:      ALTERNATIVE "mov $5, %eax", "mov $7, %eax", X86_SPEC_BHB_LOOPS_LONG
 
 3:      jmp     4f
@@ -83,7 +90,7 @@ FUNC(clear_bhb_loops)
         sub     $1, %ecx
         jnz     1b
 
-        ret
+.Lr2:   ret
 5:
         /*
          * The Intel sequence has an LFENCE here.  The purpose is to ensure
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/stubs: Introduce place_ret() to abstract away raw 0xc3's

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.
This means it's not safe for logic using the stubs to write raw 0xc3's.

Introduce place_ret() which, for now, writes a raw 0xc3 but will contain
additional logic when return thunks are in use.

stub_selftest() doesn't strictly need to be converted as they only run on
boot, but doing so gets us a partial test of place_ret() too.

No functional change.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
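
A usage sketch, mirroring the stub_selftest() conversion below (the
pointer is assumed to reference the writable alias of a stub):

    #include <stdint.h>
    #include <string.h>

    void *place_ret(void *ptr); /* as introduced by this patch */

    /* Build a trivial stub: ud1, then a return placed by place_ret(),
     * which also yields the next free byte of the stub. */
    static void *build_ud1_stub(uint8_t *p)
    {
        static const uint8_t ud1[] = { 0x0f, 0xb9, 0x90 };

        memcpy(p, ud1, sizeof(ud1));
        return place_ret(p + sizeof(ud1));
    }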

diff --git a/tools/tests/x86_emulator/x86-emulate.h b/tools/tests/x86_emulator/x86-emulate.h
index 929c1a72ae88..4c292ac338f6 100644
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -77,6 +77,12 @@
 
 #define is_canonical_address(x) (((int64_t)(x) >> 47) == ((int64_t)(x) >> 63))
 
+static inline void *place_ret(void *ptr)
+{
+    *(uint8_t *)ptr = 0xc3;
+    return ptr + 1;
+}
+
 extern uint32_t mxcsr_mask;
 extern struct cpu_policy cpu_policy;
 
diff --git a/xen/arch/x86/Makefile b/xen/arch/x86/Makefile
index 0ded68d06958..02ab235de7ff 100644
--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -11,9 +11,7 @@ obj-$(CONFIG_PV) += pv/
 obj-y += x86_64/
 obj-y += x86_emulate/
 
-alternative-y := alternative.init.o
-alternative-$(CONFIG_LIVEPATCH) :=
-obj-bin-y += $(alternative-y)
+obj-y += alternative.o
 obj-y += apic.o
 obj-y += bhb-thunk.o
 obj-y += bitops.o
@@ -41,7 +39,7 @@ obj-y += hypercall.o
 obj-y += i387.o
 obj-y += i8259.o
 obj-y += io_apic.o
-obj-$(CONFIG_LIVEPATCH) += alternative.o livepatch.o
+obj-$(CONFIG_LIVEPATCH) += livepatch.o
 obj-y += msi.o
 obj-y += msr.o
 obj-$(CONFIG_INDIRECT_THUNK) += indirect-thunk.o
diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
index 88c90044c20d..ec451d962c10 100644
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -137,6 +137,20 @@ void init_or_livepatch add_nops(void *insns, unsigned int len)
     }
 }
 
+/*
+ * Place a return at @ptr.  @ptr must be in the writable alias of a stub.
+ *
+ * Returns the next position to write into the stub.
+ */
+void *place_ret(void *ptr)
+{
+    uint8_t *p = ptr;
+
+    *p++ = 0xc3;
+
+    return p;
+}
+
 /*
  * text_poke - Update instructions on a live kernel or non-executed code.
  * @addr: address to modify
diff --git a/xen/arch/x86/extable.c b/xen/arch/x86/extable.c
index 705cf9eb94ca..1572efa69a00 100644
--- a/xen/arch/x86/extable.c
+++ b/xen/arch/x86/extable.c
@@ -151,20 +151,20 @@ search_exception_table(const struct cpu_user_regs *regs, unsigned long *stub_ra)
 int __init cf_check stub_selftest(void)
 {
     static const struct {
-        uint8_t opc[8];
+        uint8_t opc[7];
         uint64_t rax;
         union stub_exception_token res;
     } tests[] __initconst = {
 #define endbr64 0xf3, 0x0f, 0x1e, 0xfa
-        { .opc = { endbr64, 0x0f, 0xb9, 0xc3, 0xc3 }, /* ud1 */
+        { .opc = { endbr64, 0x0f, 0xb9, 0x90 }, /* ud1 */
           .res.fields.trapnr = X86_EXC_UD },
-        { .opc = { endbr64, 0x90, 0x02, 0x00, 0xc3 }, /* nop; add (%rax),%al */
+        { .opc = { endbr64, 0x90, 0x02, 0x00 }, /* nop; add (%rax),%al */
           .rax = 0x0123456789abcdef,
           .res.fields.trapnr = X86_EXC_GP },
-        { .opc = { endbr64, 0x02, 0x04, 0x04, 0xc3 }, /* add (%rsp,%rax),%al */
+        { .opc = { endbr64, 0x02, 0x04, 0x04 }, /* add (%rsp,%rax),%al */
           .rax = 0xfedcba9876543210UL,
           .res.fields.trapnr = X86_EXC_SS },
-        { .opc = { endbr64, 0xcc, 0xc3, 0xc3, 0xc3 }, /* int3 */
+        { .opc = { endbr64, 0xcc, 0x90, 0x90 }, /* int3 */
           .res.fields.trapnr = X86_EXC_BP },
 #undef endbr64
     };
@@ -183,6 +183,7 @@ int __init cf_check stub_selftest(void)
 
         memset(ptr, 0xcc, STUB_BUF_SIZE / 2);
         memcpy(ptr, tests[i].opc, ARRAY_SIZE(tests[i].opc));
+        place_ret(ptr + ARRAY_SIZE(tests[i].opc));
         unmap_domain_page(ptr);
 
         asm volatile ( "INDIRECT_CALL %[stb]\n"
diff --git a/xen/arch/x86/include/asm/alternative.h b/xen/arch/x86/include/asm/alternative.h
index ef63e257ad78..28d4ec0f9c4a 100644
--- a/xen/arch/x86/include/asm/alternative.h
+++ b/xen/arch/x86/include/asm/alternative.h
@@ -31,6 +31,8 @@ struct __packed alt_instr {
 #define ALT_REPL_PTR(a)     __ALT_PTR(a, repl_offset)
 
 extern void add_nops(void *insns, unsigned int len);
+void *place_ret(void *ptr);
+
 /* Similar to alternative_instructions except it can be run with IRQs enabled. */
 extern int apply_alternatives(struct alt_instr *start, struct alt_instr *end);
 extern void alternative_instructions(void);
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index 70150c272276..ff5d1c9f8634 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -76,7 +76,6 @@ static io_emul_stub_t *io_emul_stub_setup(struct priv_op_ctxt *ctxt, u8 opcode,
         0x41, 0x5c, /* pop %r12  */
         0x5d,       /* pop %rbp  */
         0x5b,       /* pop %rbx  */
-        0xc3,       /* ret       */
     };
 
     const struct stubs *this_stubs = &this_cpu(stubs);
@@ -126,11 +125,13 @@ static io_emul_stub_t *io_emul_stub_setup(struct priv_op_ctxt *ctxt, u8 opcode,
 
     APPEND_CALL(save_guest_gprs);
     APPEND_BUFF(epilogue);
+    p = place_ret(p);
 
     /* Build-time best effort attempt to catch problems. */
     BUILD_BUG_ON(STUB_BUF_SIZE / 2 <
                  (sizeof(prologue) + sizeof(epilogue) + 10 /* 2x call */ +
-                  MAX(3 /* default stub */, IOEMUL_QUIRK_STUB_BYTES)));
+                  MAX(3 /* default stub */, IOEMUL_QUIRK_STUB_BYTES) +
+                  1 /* ret */));
     /* Runtime confirmation that we haven't clobbered an adjacent stub. */
     BUG_ON(STUB_BUF_SIZE / 2 < (p - ctxt->io_emul_stub));
 
diff --git a/xen/arch/x86/x86_emulate/fpu.c b/xen/arch/x86/x86_emulate/fpu.c
index 54c8621421c6..9cc37a1d8e3c 100644
--- a/xen/arch/x86/x86_emulate/fpu.c
+++ b/xen/arch/x86/x86_emulate/fpu.c
@@ -32,36 +32,42 @@ static inline bool fpu_check_write(void)
 
 #define emulate_fpu_insn_memdst(opc, ext, arg)                          \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     /* ModRM: mod=0, reg=ext, rm=0, i.e. a (%rax) operand */            \
     *insn_bytes = 2;                                                    \
-    memcpy(get_stub(stub),                                              \
-           ((uint8_t[]){ opc, ((ext) & 7) << 3, 0xc3 }), 3);            \
+    memcpy(_p, ((uint8_t[]){ opc, ((ext) & 7) << 3 }), 2); _p += 2;     \
+    place_ret(_p);                                                      \
     invoke_stub("", "", "+m" (arg) : "a" (&(arg)));                     \
     put_stub(stub);                                                     \
 } while (0)
 
 #define emulate_fpu_insn_memsrc(opc, ext, arg)                          \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     /* ModRM: mod=0, reg=ext, rm=0, i.e. a (%rax) operand */            \
-    memcpy(get_stub(stub),                                              \
-           ((uint8_t[]){ opc, ((ext) & 7) << 3, 0xc3 }), 3);            \
+    memcpy(_p, ((uint8_t[]){ opc, ((ext) & 7) << 3 }), 2); _p += 2;     \
+    place_ret(_p);                                                      \
     invoke_stub("", "", "=m" (dummy) : "m" (arg), "a" (&(arg)));        \
     put_stub(stub);                                                     \
 } while (0)
 
 #define emulate_fpu_insn_stub(bytes...)                                 \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     unsigned int nr_ = sizeof((uint8_t[]){ bytes });                    \
-    memcpy(get_stub(stub), ((uint8_t[]){ bytes, 0xc3 }), nr_ + 1);      \
+    memcpy(_p, ((uint8_t[]){ bytes }), nr_); _p += nr_;                 \
+    place_ret(_p);                                                      \
     invoke_stub("", "", "=m" (dummy) : "i" (0));                        \
     put_stub(stub);                                                     \
 } while (0)
 
 #define emulate_fpu_insn_stub_eflags(bytes...)                          \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     unsigned int nr_ = sizeof((uint8_t[]){ bytes });                    \
     unsigned long tmp_;                                                 \
-    memcpy(get_stub(stub), ((uint8_t[]){ bytes, 0xc3 }), nr_ + 1);      \
+    memcpy(_p, ((uint8_t[]){ bytes }), nr_); _p += nr_;                 \
+    place_ret(_p);                                                      \
     invoke_stub(_PRE_EFLAGS("[eflags]", "[mask]", "[tmp]"),             \
                 _POST_EFLAGS("[eflags]", "[mask]", "[tmp]"),            \
                 [eflags] "+g" (regs->eflags), [tmp] "=&r" (tmp_)        \
diff --git a/xen/arch/x86/x86_emulate/x86_emulate.c b/xen/arch/x86/x86_emulate/x86_emulate.c
index e8cba5a27f56..a6a1c9937f5d 100644
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -1398,7 +1398,7 @@ x86_emulate(
         stb[3] = 0x91;
         stb[4] = evex.opmsk << 3;
         insn_bytes = 5;
-        stb[5] = 0xc3;
+        place_ret(&stb[5]);
 
         invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
 
@@ -3631,7 +3631,7 @@ x86_emulate(
         }
         opc[1] = (modrm & 0x38) | 0xc0;
         insn_bytes = EVEX_PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_EVEX(opc, evex);
         invoke_stub("", "", "=g" (dummy) : "a" (src.val));
@@ -3698,7 +3698,7 @@ x86_emulate(
             insn_bytes = PFX_BYTES + 2;
             copy_REX_VEX(opc, rex_prefix, vex);
         }
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         ea.reg = decode_gpr(&_regs, modrm_reg);
         invoke_stub("", "", "=a" (*ea.reg) : "c" (mmvalp), "m" (*mmvalp));
@@ -3772,7 +3772,7 @@ x86_emulate(
             insn_bytes = PFX_BYTES + 2;
             copy_REX_VEX(opc, rex_prefix, vex);
         }
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         _regs.eflags &= ~EFLAGS_MASK;
         invoke_stub("",
@@ -4008,7 +4008,7 @@ x86_emulate(
         opc[1] = modrm & 0xc7;
         insn_bytes = PFX_BYTES + 2;
     simd_0f_to_gpr:
-        opc[insn_bytes - PFX_BYTES] = 0xc3;
+        place_ret(&opc[insn_bytes - PFX_BYTES]);
 
         generate_exception_if(ea.type != OP_REG, X86_EXC_UD);
 
@@ -4405,7 +4405,7 @@ x86_emulate(
             vex.w = 0;
         opc[1] = modrm & 0x38;
         insn_bytes = PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", "+m" (src.val) : "a" (&src.val));
@@ -4442,7 +4442,7 @@ x86_emulate(
             evex.w = 0;
         opc[1] = modrm & 0x38;
         insn_bytes = EVEX_PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_EVEX(opc, evex);
         invoke_stub("", "", "+m" (src.val) : "a" (&src.val));
@@ -4637,7 +4637,7 @@ x86_emulate(
 #endif /* X86EMUL_NO_SIMD */
 
     simd_0f_reg_only:
-        opc[insn_bytes - PFX_BYTES] = 0xc3;
+        place_ret(&opc[insn_bytes - PFX_BYTES]);
 
         copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", [dummy_out] "=g" (dummy) : [dummy_in] "i" (0) );
@@ -4971,7 +4971,7 @@ x86_emulate(
         if ( !mode_64bit() )
             vex.w = 0;
         opc[1] = modrm & 0xf8;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_VEX(opc, vex);
         ea.reg = decode_gpr(&_regs, modrm_rm);
@@ -5014,7 +5014,7 @@ x86_emulate(
         if ( !mode_64bit() )
             vex.w = 0;
         opc[1] = modrm & 0xc7;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_VEX(opc, vex);
         invoke_stub("", "", "=a" (dst.val) : [dummy] "i" (0));
@@ -5044,7 +5044,7 @@ x86_emulate(
         opc = init_prefixes(stub);
         opc[0] = b;
         opc[1] = modrm;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_VEX(opc, vex);
         _regs.eflags &= ~EFLAGS_MASK;
@@ -5612,7 +5612,7 @@ x86_emulate(
         if ( !mode_64bit() )
             vex.w = 0;
         opc[1] = modrm & 0xc7;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", "=a" (ea.val) : [dummy] "i" (0));
@@ -5730,7 +5730,7 @@ x86_emulate(
             opc[1] &= 0x38;
         }
         insn_bytes = PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
         if ( vex.opcx == vex_none )
         {
             /* Cover for extra prefix byte. */
@@ -6010,7 +6010,7 @@ x86_emulate(
         pvex->b = !mode_64bit() || (vex.reg >> 3);
         opc[1] = 0xc0 | (~vex.reg & 7);
         pvex->reg = 0xf;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=a" (ea.val) : [dummy] "i" (0));
         put_stub(stub);
@@ -6284,7 +6284,7 @@ x86_emulate(
             evex.w = 0;
         opc[1] = modrm & 0xf8;
         insn_bytes = EVEX_PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_EVEX(opc, evex);
         invoke_stub("", "", "=g" (dummy) : "a" (src.val));
@@ -6383,7 +6383,7 @@ x86_emulate(
         pvex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pvex->reg = 0xf;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
 
@@ -6453,7 +6453,7 @@ x86_emulate(
         pvex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pvex->reg = 0xf;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "+m" (*mmvalp) : "a" (mmvalp));
 
@@ -6509,7 +6509,7 @@ x86_emulate(
         pevex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pevex->RX = 1;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
 
@@ -6574,7 +6574,7 @@ x86_emulate(
         pevex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pevex->RX = 1;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "+m" (*mmvalp) : "a" (mmvalp));
 
@@ -6588,7 +6588,7 @@ x86_emulate(
         opc[2] = 0x90;
         /* Use (%rax) as source. */
         opc[3] = evex.opmsk << 3;
-        opc[4] = 0xc3;
+        place_ret(&opc[4]);
 
         invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
         put_stub(stub);
@@ -6664,7 +6664,7 @@ x86_emulate(
         pevex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pevex->RX = 1;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
 
@@ -6741,7 +6741,7 @@ x86_emulate(
         opc[2] = 0x90;
         /* Use (%rax) as source. */
         opc[3] = evex.opmsk << 3;
-        opc[4] = 0xc3;
+        place_ret(&opc[4]);
 
         invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
         put_stub(stub);
@@ -6940,7 +6940,7 @@ x86_emulate(
         pvex->reg = 0xf; /* rAX */
         buf[3] = b;
         buf[4] = 0x09; /* reg=rCX r/m=(%rCX) */
-        buf[5] = 0xc3;
+        place_ret(&buf[5]);
 
         src.reg = decode_vex_gpr(vex.reg, &_regs, ctxt);
         emulate_stub([dst] "=&c" (dst.val), "[dst]" (&src.val), "a" (*src.reg));
@@ -6976,7 +6976,7 @@ x86_emulate(
         pvex->reg = 0xf; /* rAX */
         buf[3] = b;
         buf[4] = (modrm & 0x38) | 0x01; /* r/m=(%rCX) */
-        buf[5] = 0xc3;
+        place_ret(&buf[5]);
 
         dst.reg = decode_vex_gpr(vex.reg, &_regs, ctxt);
         emulate_stub("=&a" (dst.val), "c" (&src.val));
@@ -7217,7 +7217,7 @@ x86_emulate(
             evex.w = vex.w = 0;
         opc[1] = modrm & 0x38;
         opc[2] = imm1;
-        opc[3] = 0xc3;
+        place_ret(&opc[3]);
         if ( vex.opcx == vex_none )
         {
             /* Cover for extra prefix byte. */
@@ -7384,7 +7384,7 @@ x86_emulate(
             insn_bytes = PFX_BYTES + 3;
             copy_VEX(opc, vex);
         }
-        opc[3] = 0xc3;
+        place_ret(&opc[3]);
 
         /* Latch MXCSR - we may need to restore it below. */
         invoke_stub("stmxcsr %[mxcsr]", "",
@@ -7630,7 +7630,7 @@ x86_emulate(
         }
         opc[2] = imm1;
         insn_bytes = PFX_BYTES + 3;
-        opc[3] = 0xc3;
+        place_ret(&opc[3]);
         if ( vex.opcx == vex_none )
         {
             /* Cover for extra prefix byte. */
@@ -7976,7 +7976,7 @@ x86_emulate(
         pxop->reg = 0xf; /* rAX */
         buf[3] = b;
         buf[4] = (modrm & 0x38) | 0x01; /* r/m=(%rCX) */
-        buf[5] = 0xc3;
+        place_ret(&buf[5]);
 
         dst.reg = decode_vex_gpr(vex.reg, &_regs, ctxt);
         emulate_stub([dst] "=&a" (dst.val), "c" (&src.val));
@@ -8085,7 +8085,7 @@ x86_emulate(
         buf[3] = b;
         buf[4] = 0x09; /* reg=rCX r/m=(%rCX) */
         *(uint32_t *)(buf + 5) = imm1;
-        buf[9] = 0xc3;
+        place_ret(&buf[9]);
 
         emulate_stub([dst] "=&c" (dst.val), "[dst]" (&src.val));
 
@@ -8181,12 +8181,12 @@ x86_emulate(
 
         if ( evex_encoded() )
         {
-            opc[insn_bytes - EVEX_PFX_BYTES] = 0xc3;
+            place_ret(&opc[insn_bytes - EVEX_PFX_BYTES]);
             copy_EVEX(opc, evex);
         }
         else
         {
-            opc[insn_bytes - PFX_BYTES] = 0xc3;
+            place_ret(&opc[insn_bytes - PFX_BYTES]);
             copy_REX_VEX(opc, rex_prefix, vex);
         }
 
@@ -8510,7 +8510,7 @@ int x86_emul_rmw(
         pvex->reg = 0xf; /* rAX */
         buf[3] = ctxt->opcode;
         buf[4] = 0x11; /* reg=rDX r/m=(%RCX) */
-        buf[5] = 0xc3;
+        place_ret(&buf[5]);
 
         *eflags &= ~EFLAGS_MASK;
         invoke_stub("",
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/thunk: Build Xen with Return Thunks

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.

In order to mitigate this, build with return thunks and arrange for
__x86_return_thunk to be (mis)aligned in the same manner as
__x86_indirect_thunk_* so the RET instruction is placed in a safe location.

place_ret() needs to conditionally emit JMP __x86_return_thunk instead of RET.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
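
A rough sketch of the positioning rule this relies on (ret_safe_here() is a
hypothetical helper for illustration, not part of the patch): with 64-byte
cachelines, bit 5 of an address selects the half, so a RET whose first byte
has bit 5 clear sits in the unsafe first half and gets a JMP
__x86_return_thunk emitted instead.

    #include <stdbool.h>
    #include <stdint.h>

    /* Mirrors the check place_ret() gains below. */
    static bool ret_safe_here(uintptr_t addr)
    {
        return addr & 0x20;   /* set => second half of a 64-byte line */
    }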

diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
index 9cdd04721afa..96fd1c3272d1 100644
--- a/xen/arch/x86/Kconfig
+++ b/xen/arch/x86/Kconfig
@@ -38,9 +38,14 @@ config ARCH_DEFCONFIG
 	default "arch/x86/configs/x86_64_defconfig"
 
 config CC_HAS_INDIRECT_THUNK
+	# GCC >= 8 or Clang >= 6
 	def_bool $(cc-option,-mindirect-branch-register) || \
 	         $(cc-option,-mretpoline-external-thunk)
 
+config CC_HAS_RETURN_THUNK
+	# GCC >= 8 or Clang >= 15
+	def_bool $(cc-option,-mfunction-return=thunk-extern)
+
 config HAS_AS_CET_SS
 	# binutils >= 2.29 or LLVM >= 6
 	def_bool $(as-instr,wrssq %rax$(comma)0;setssbsy)
diff --git a/xen/arch/x86/Makefile b/xen/arch/x86/Makefile
index 02ab235de7ff..a0954a6a8c14 100644
--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -43,6 +43,7 @@ obj-$(CONFIG_LIVEPATCH) += livepatch.o
 obj-y += msi.o
 obj-y += msr.o
 obj-$(CONFIG_INDIRECT_THUNK) += indirect-thunk.o
+obj-$(CONFIG_RETURN_THUNK) += indirect-thunk.o
 obj-$(CONFIG_PV) += ioport_emulate.o
 obj-y += irq.o
 obj-$(CONFIG_KEXEC) += machine_kexec.o
diff --git a/xen/arch/x86/acpi/wakeup_prot.S b/xen/arch/x86/acpi/wakeup_prot.S
index a741d58b9117..92af6230b31f 100644
--- a/xen/arch/x86/acpi/wakeup_prot.S
+++ b/xen/arch/x86/acpi/wakeup_prot.S
@@ -133,7 +133,7 @@ LABEL(s3_resume)
         pop     %r12
         pop     %rbx
         pop     %rbp
-        ret
+        RET
 END(do_suspend_lowlevel)
 
 .data
diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
index ec451d962c10..1b71ae959abe 100644
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -137,16 +137,45 @@ void init_or_livepatch add_nops(void *insns, unsigned int len)
     }
 }
 
+void nocall __x86_return_thunk(void);
+
 /*
  * Place a return at @ptr.  @ptr must be in the writable alias of a stub.
  *
+ * When CONFIG_RETURN_THUNK is active, this may be a JMP __x86_return_thunk
+ * instead, depending on the safety of @ptr with respect to Indirect Target
+ * Selection.
+ *
  * Returns the next position to write into the stub.
  */
 void *place_ret(void *ptr)
 {
+    unsigned long addr = (unsigned long)ptr;
     uint8_t *p = ptr;
 
-    *p++ = 0xc3;
+    /*
+     * When Return Thunks are used, if a RET would be unsafe at this location
+     * with respect to Indirect Target Selection (i.e. if addr is in the first
+     * half of a cacheline), insert a JMP __x86_return_thunk instead.
+     *
+     * The displacement needs to be relative to the executable alias of the
+     * stub, not to @ptr which is the writeable alias.
+     */
+    if ( IS_ENABLED(CONFIG_RETURN_THUNK) && !(addr & 0x20) )
+    {
+        long stub_va = (this_cpu(stubs.addr) & PAGE_MASK) + (addr & ~PAGE_MASK);
+        long disp = (long)__x86_return_thunk - (stub_va + 5);
+
+        BUG_ON((int32_t)disp != disp);
+
+        *p++ = 0xe9;
+        *(int32_t *)p = disp;
+        p += 4;
+    }
+    else
+    {
+        *p++ = 0xc3;
+    }
 
     return p;
 }
diff --git a/xen/arch/x86/arch.mk b/xen/arch/x86/arch.mk
index cb47d729911e..efdbdb6045be 100644
--- a/xen/arch/x86/arch.mk
+++ b/xen/arch/x86/arch.mk
@@ -44,6 +44,9 @@ CFLAGS-$(CONFIG_CC_IS_GCC) += -fno-jump-tables
 CFLAGS-$(CONFIG_CC_IS_CLANG) += -mretpoline-external-thunk
 endif
 
+# Compile with return thunk support if selected.
+CFLAGS-$(CONFIG_RETURN_THUNK) += -mfunction-return=thunk-extern
+
 # Disable the addition of a .note.gnu.property section to object files when
 # livepatch support is enabled.  The contents of that section can change
 # depending on the instructions used, and livepatch-build-tools doesn't know
diff --git a/xen/arch/x86/bhb-thunk.S b/xen/arch/x86/bhb-thunk.S
index 52625f4e2c17..7f92201a3cbb 100644
--- a/xen/arch/x86/bhb-thunk.S
+++ b/xen/arch/x86/bhb-thunk.S
@@ -23,7 +23,7 @@ FUNC(clear_bhb_tsx)
 0:      .byte 0xc6, 0xf8, 0             /* xabort $0 */
         int3
 1:
-        ret
+        RET
 END(clear_bhb_tsx)
 
 /*
diff --git a/xen/arch/x86/clear_page.S b/xen/arch/x86/clear_page.S
index d6c076f1d8bc..dc3c3c26bfb7 100644
--- a/xen/arch/x86/clear_page.S
+++ b/xen/arch/x86/clear_page.S
@@ -1,6 +1,8 @@
         .file __FILE__
 
 #include <xen/linkage.h>
+
+#include <asm/asm_defns.h>
 #include <asm/page.h>
 
 FUNC(clear_page_sse2)
@@ -16,5 +18,5 @@ FUNC(clear_page_sse2)
         jnz     0b
 
         sfence
-        ret
+        RET
 END(clear_page_sse2)
diff --git a/xen/arch/x86/copy_page.S b/xen/arch/x86/copy_page.S
index c3c436545bac..e43e5370c815 100644
--- a/xen/arch/x86/copy_page.S
+++ b/xen/arch/x86/copy_page.S
@@ -1,6 +1,8 @@
         .file __FILE__
 
 #include <xen/linkage.h>
+
+#include <asm/asm_defns.h>
 #include <asm/page.h>
 
 #define src_reg %rsi
@@ -41,5 +43,5 @@ FUNC(copy_page_sse2)
         movnti  tmp4_reg, 3*WORD_SIZE(dst_reg)
 
         sfence
-        ret
+        RET
 END(copy_page_sse2)
diff --git a/xen/arch/x86/efi/check.c b/xen/arch/x86/efi/check.c
index 9e473faad3c9..23ba30abf330 100644
--- a/xen/arch/x86/efi/check.c
+++ b/xen/arch/x86/efi/check.c
@@ -3,6 +3,9 @@ int __attribute__((__ms_abi__)) test(int i)
     return i;
 }
 
+/* In case -mfunction-return is in use. */
+void __x86_return_thunk(void) {};
+
 /*
  * Populate an array with "addresses" of relocatable and absolute values.
  * This is to probe ld for (a) emitting base relocations at all and (b) not
diff --git a/xen/arch/x86/include/asm/asm-defns.h b/xen/arch/x86/include/asm/asm-defns.h
index 32d6b4491063..97ebe21298a2 100644
--- a/xen/arch/x86/include/asm/asm-defns.h
+++ b/xen/arch/x86/include/asm/asm-defns.h
@@ -58,6 +58,12 @@
     .endif
 .endm
 
+#ifdef CONFIG_RETURN_THUNK
+# define RET jmp __x86_return_thunk
+#else
+# define RET ret
+#endif
+
 #ifdef CONFIG_XEN_IBT
 # define ENDBR64 endbr64
 #else
diff --git a/xen/arch/x86/indirect-thunk.S b/xen/arch/x86/indirect-thunk.S
index c4b978d67b8e..26dad15f12c9 100644
--- a/xen/arch/x86/indirect-thunk.S
+++ b/xen/arch/x86/indirect-thunk.S
@@ -15,6 +15,8 @@
 #undef SYM_ALIGN
 #define SYM_ALIGN(align...)
 
+#ifdef CONFIG_INDIRECT_THUNK
+
 .macro IND_THUNK_RETPOLINE reg:req
         call 1f
         int3
@@ -62,3 +64,25 @@ END(__x86_indirect_thunk_\reg)
 .irp reg, ax, cx, dx, bx, bp, si, di, 8, 9, 10, 11, 12, 13, 14, 15
         GEN_INDIRECT_THUNK reg=r\reg
 .endr
+
+#endif /* CONFIG_INDIRECT_THUNK */
+
+#ifdef CONFIG_RETURN_THUNK
+        .section .text.entry.__x86_return_thunk, "ax", @progbits
+
+        /*
+         * The Indirect Target Selection speculative vulnerability means that
+         * indirect branches (including RETs) are unsafe when in the first
+         * half of a cacheline.  Arrange for them to be in the second half.
+         *
+         * Align to 64, then skip 32.
+         */
+        .balign 64
+        .fill 32, 1, 0xcc
+
+FUNC(__x86_return_thunk)
+        ret
+        int3 /* Halt straight-line speculation */
+END(__x86_return_thunk)
+
+#endif /* CONFIG_RETURN_THUNK */
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index ff5d1c9f8634..295d847ea24c 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -131,7 +131,7 @@ static io_emul_stub_t *io_emul_stub_setup(struct priv_op_ctxt *ctxt, u8 opcode,
     BUILD_BUG_ON(STUB_BUF_SIZE / 2 <
                  (sizeof(prologue) + sizeof(epilogue) + 10 /* 2x call */ +
                   MAX(3 /* default stub */, IOEMUL_QUIRK_STUB_BYTES) +
-                  1 /* ret */));
+                  (IS_ENABLED(CONFIG_RETURN_THUNK) ? 5 : 1) /* ret */));
     /* Runtime confirmation that we haven't clobbered an adjacent stub. */
     BUG_ON(STUB_BUF_SIZE / 2 < (p - ctxt->io_emul_stub));
 
diff --git a/xen/arch/x86/pv/gpr_switch.S b/xen/arch/x86/pv/gpr_switch.S
index 5409ad3b1447..362b5d241623 100644
--- a/xen/arch/x86/pv/gpr_switch.S
+++ b/xen/arch/x86/pv/gpr_switch.S
@@ -26,7 +26,7 @@ FUNC(load_guest_gprs)
         movq  UREGS_r15(%rdi), %r15
         movq  UREGS_rcx(%rdi), %rcx
         movq  UREGS_rdi(%rdi), %rdi
-        ret
+        RET
 END(load_guest_gprs)
 
 /* Save guest GPRs.  Parameter on the stack above the return address. */
@@ -48,5 +48,5 @@ FUNC(save_guest_gprs)
         movq  %rbx, UREGS_rbx(%rdi)
         movq  %rdx, UREGS_rdx(%rdi)
         movq  %rcx, UREGS_rcx(%rdi)
-        ret
+        RET
 END(save_guest_gprs)
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index ced84750015c..fe27e82a4707 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -571,6 +571,9 @@ static void __init print_details(enum ind_thunk thunk)
 #ifdef CONFIG_INDIRECT_THUNK
                " INDIRECT_THUNK"
 #endif
+#ifdef CONFIG_RETURN_THUNK
+               " RETURN_THUNK"
+#endif
 #ifdef CONFIG_SHADOW_PAGING
                " SHADOW_PAGING"
 #endif
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index 1e87652f4bcb..d7b381ea546d 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -180,7 +180,7 @@ FUNC(cr4_pv32_restore)
         or    cr4_pv32_mask(%rip), %rax
         mov   %rax, %cr4
         mov   %rax, (%rcx)
-        ret
+        RET
 0:
 #ifndef NDEBUG
         /* Check that _all_ of the bits intended to be set actually are. */
@@ -198,7 +198,7 @@ FUNC(cr4_pv32_restore)
 1:
 #endif
         xor   %eax, %eax
-        ret
+        RET
 END(cr4_pv32_restore)
 
 FUNC(compat_syscall)
@@ -329,7 +329,7 @@ __UNLIKELY_END(compat_bounce_null_selector)
         xor   %eax, %eax
         mov   %ax,  TRAPBOUNCE_cs(%rdx)
         mov   %al,  TRAPBOUNCE_flags(%rdx)
-        ret
+        RET
 
 .section .fixup,"ax"
 .Lfx13:
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 40d094d5b2ee..6efc6bab1edb 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -604,7 +604,7 @@ __UNLIKELY_END(create_bounce_frame_bad_bounce_ip)
         xor   %eax, %eax
         mov   %rax, TRAPBOUNCE_eip(%rdx)
         mov   %al,  TRAPBOUNCE_flags(%rdx)
-        ret
+        RET
 
         .pushsection .fixup, "ax", @progbits
         # Numeric tags below represent the intended overall %rsi adjustment.
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index 42217eaf2485..e6c71a8e3e21 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -83,6 +83,7 @@ SECTIONS
        . = ALIGN(PAGE_SIZE);
        _stextentry = .;
        *(.text.entry)
+       *(.text.entry.*)
        . = ALIGN(PAGE_SIZE);
        _etextentry = .;
 
diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index 6166327f4d14..7d9442ec05c1 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -136,6 +136,17 @@ config INDIRECT_THUNK
 	  When enabled, indirect branches are implemented using a new construct
 	  called "retpoline" that prevents speculation.
 
+config RETURN_THUNK
+	bool "Out-of-line Returns"
+	depends on CC_HAS_RETURN_THUNK
+	default INDIRECT_THUNK
+	help
+	  Compile Xen with out-of-line returns.
+
+	  This allows Xen to mitigate a variety of speculative vulnerabilities
+	  by choosing a hardware-dependent instruction sequence to implement
+	  function returns safely.
+
 config SPECULATIVE_HARDEN_ARRAY
 	bool "Speculative Array Hardening"
 	default y
diff --git a/xen/lib/x86-generic-hweightl.c b/xen/lib/x86-generic-hweightl.c
index 123a5b43928d..1cab68952a42 100644
--- a/xen/lib/x86-generic-hweightl.c
+++ b/xen/lib/x86-generic-hweightl.c
@@ -51,7 +51,11 @@ asm (
     "pop    %rdx\n\t"
     "pop    %rdi\n\t"
 
+#ifdef CONFIG_RETURN_THUNK
+    "jmp    __x86_return_thunk\n\t"
+#else
     "ret\n\t"
+#endif
 
     ".size arch_generic_hweightl, . - arch_generic_hweightl\n\t"
 );
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Synthesise ITS_NO to guests on unaffected hardware

It is easier to express feature word 17 in terms of word 16 + bits [32, 64),
as that is how the layout is given in the documentation.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
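
For concreteness, the word-16-relative encoding works out as follows: ITS_NO
is defined at bit position 16*32 + 62 = 574, which decomposes as 574 / 32 =
word 17 (the MSR_ARCH_CAPS edx half) and 574 % 32 = bit 30, i.e. bit 62 of
the 64-bit MSR overall.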

diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index 3a06b6f297b9..24b021eeffc8 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -218,6 +218,7 @@ static inline bool boot_cpu_has(unsigned int feat)
 #define cpu_has_gds_no          boot_cpu_has(X86_FEATURE_GDS_NO)
 #define cpu_has_rfds_no         boot_cpu_has(X86_FEATURE_RFDS_NO)
 #define cpu_has_rfds_clear      boot_cpu_has(X86_FEATURE_RFDS_CLEAR)
+#define cpu_has_its_no          boot_cpu_has(X86_FEATURE_ITS_NO)
 
 /* Synthesized. */
 #define cpu_has_arch_perfmon    boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index fe27e82a4707..0a635025e488 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -1760,6 +1760,90 @@ static void __init bhi_calculations(void)
     }
 }
 
+/*
+ * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/indirect-target-selection.html
+ */
+static void __init its_calculations(void)
+{
+    /*
+     * Indirect Target Selection is a Branch Prediction bug whereby certain
+     * indirect branches (including RETs) get predicted using a direct branch
+     * target, rather than a suitable indirect target, bypassing hardware
+     * isolation protections.
+     *
+     * ITS affects Core (but not Atom) processors starting from the
+     * introduction of eIBRS, up to but not including Golden Cove cores
+     * (checked here with BHI_CTRL).
+     *
+     * The ITS_NO feature is not expected to be enumerated by hardware, and is
+     * only for VMMs to synthesise for guests.
+     *
+     * ITS comes in 3 flavours:
+     *
+     *   1) Across-IBPB.  Indirect branches after the IBPB can be controlled
+     *      by direct targets which existed prior to the IBPB.  This is
+     *      addressed in the IPU 2025.1 microcode drop, and has no other
+     *      software interaction.
+     *
+     *   2) Guest/Host.  Indirect branches in the VMM can be controlled by
+     *      direct targets from the guest.  This applies equally to PV guests
+     *      (Ring3) and HVM guests (VMX), and applies to all Skylake-uarch
+     *      cores with eIBRS.
+     *
+     *   3) Intra-mode.  Indirect branches in the VMM can be controlled by
+     *      other execution in the same mode.
+     */
+
+    /*
+     * If we can see ITS_NO, or we're virtualised, do nothing.  We are or may
+     * migrate somewhere unsafe.
+     */
+    if ( cpu_has_its_no || cpu_has_hypervisor )
+        return;
+
+    /* ITS is only known to affect Intel processors at this time. */
+    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL )
+        return;
+
+    /*
+     * ITS does not exist on:
+     *  - non-Family 6 CPUs
+     *  - those without eIBRS
+     *  - those with BHI_CTRL
+     * but we still need to synthesise ITS_NO.
+     */
+    if ( boot_cpu_data.x86 != 6 || !cpu_has_eibrs ||
+         boot_cpu_has(X86_FEATURE_BHI_CTRL) )
+        goto synthesise;
+
+    switch ( boot_cpu_data.x86_model )
+    {
+        /* These Skylake-uarch cores suffer cases #2 and #3. */
+    case INTEL_FAM6_SKYLAKE_X:
+    case INTEL_FAM6_KABYLAKE_L:
+    case INTEL_FAM6_KABYLAKE:
+    case INTEL_FAM6_COMETLAKE:
+    case INTEL_FAM6_COMETLAKE_L:
+        return;
+
+        /* These Sunny/Willow/Cypress Cove cores suffer case #3. */
+    case INTEL_FAM6_ICELAKE_X:
+    case INTEL_FAM6_ICELAKE_D:
+    case INTEL_FAM6_ICELAKE_L:
+    case INTEL_FAM6_TIGERLAKE_L:
+    case INTEL_FAM6_TIGERLAKE:
+    case INTEL_FAM6_ROCKETLAKE:
+        return;
+
+    default:
+        break;
+    }
+
+    /* Platforms remaining are not believed to be vulnerable to ITS. */
+ synthesise:
+    setup_force_cpu_cap(X86_FEATURE_ITS_NO);
+}
+
 void spec_ctrl_init_domain(struct domain *d)
 {
     bool pv = is_pv_domain(d);
@@ -2316,6 +2400,8 @@ void __init init_speculation_mitigations(void)
 
     bhi_calculations();
 
+    its_calculations();
+
     print_details(thunk);
 
     /*
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index 16207e3817bb..1e0e870c5c7d 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -379,7 +379,8 @@ XEN_CPUFEATURE(GDS_NO,             16*32+26) /*A  No Gather Data Sampling */
 XEN_CPUFEATURE(RFDS_NO,            16*32+27) /*A  No Register File Data Sampling */
 XEN_CPUFEATURE(RFDS_CLEAR,         16*32+28) /*!A| Register File(s) cleared by VERW */
 
-/* Intel-defined CPU features, MSR_ARCH_CAPS 0x10a.edx, word 17 */
+/* Intel-defined CPU features, MSR_ARCH_CAPS 0x10a.edx, word 17 (express in terms of word 16) */
+XEN_CPUFEATURE(ITS_NO,             16*32+62) /*!A No Indirect Target Selection */
 
 #endif /* XEN_CPUFEATURE */
 
diff --git a/xen/tools/gen-cpuid.py b/xen/tools/gen-cpuid.py
index a77cb30bdb4f..163b105bc6a7 100755
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -51,7 +51,7 @@ def parse_definitions(state):
         r"\s+/\*([\w!|]*) .*$")
 
     word_regex = re.compile(
-        r"^/\* .* word (\d*) \*/$")
+        r"^/\* .* word (\d*) .*\*/$")
     last_word = -1
 
     this = sys.modules[__name__]
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/thunk: (Mis)align the RETs in clear_bhb_loops() to mitigate ITS

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.

clear_bhb_loops() has a precise layout of branches.  The alignment for
performance causes the RETs to always be in an unsafe position, and converting
those to return thunks changes the branching pattern.  While such a conversion
is believed to be safe, clear_bhb_loops() is also a performance-relevant
fastpath, so (mis)align the RETs to be in a safe position.

No functional change.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
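
The padding arithmetic works out as follows (assuming the usual 5-byte
E8 rel32 encoding for the near CALLs): in the first block, .Lr1 - 1f is 5,
so ".skip 32 - (.Lr1 - 1f)" pads 27 bytes and the RET byte lands exactly at
offset 32, the start of the ITS-safe half of the cacheline.  In the second
block, the distance from 2: to .Lr2 is 18 bytes, hence the hard-coded
"32 - 18" pad of 14 bytes to accommodate Clang's integrated assembler.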

diff --git a/xen/arch/x86/bhb-thunk.S b/xen/arch/x86/bhb-thunk.S
index f5ac41834bbd..7175778bdd11 100644
--- a/xen/arch/x86/bhb-thunk.S
+++ b/xen/arch/x86/bhb-thunk.S
@@ -50,7 +50,12 @@ END(clear_bhb_tsx)
  *   ret
  *
  * The CALL/RETs are necessary to prevent the Loop Stream Detector from
- * interfering.  The alignment is for performance and not safety.
+ * interfering.
+ *
+ * The .balign's are for performance, but they cause the RETs to be in unsafe
+ * positions with respect to Indirect Target Selection.  The .skips are to
+ * move the RETs into ITS-safe positions, rather than using the slowpath
+ * through __x86_return_thunk.
  *
  * The "short" sequence (5 and 5) is for CPUs prior to Alder Lake / Sapphire
  * Rapids (i.e. Cores prior to Golden Cove and/or Gracemont).
@@ -66,12 +71,14 @@ FUNC(clear_bhb_loops)
         jmp     5f
         int3
 
-        .align 64
+        .balign 64
+        .skip   32 - (.Lr1 - 1f), 0xcc
 1:      call    2f
-        ret
+.Lr1:   ret
         int3
 
-        .align 64
+        .balign 64
+        .skip   32 - 18 /* (.Lr2 - 2f) but Clang IAS doesn't like this */, 0xcc
 2:      ALTERNATIVE "mov $5, %eax", "mov $7, %eax", X86_SPEC_BHB_LOOPS_LONG
 
 3:      jmp     4f
@@ -83,7 +90,7 @@ FUNC(clear_bhb_loops)
         sub     $1, %ecx
         jnz     1b
 
-        ret
+.Lr2:   ret
 5:
         /*
          * The Intel sequence has an LFENCE here.  The purpose is to ensure
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/stubs: Introduce place_ret() to abstract away raw 0xc3's

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.
This means it's not safe for logic using the stubs to write raw 0xc3's.

Introduce place_ret() which, for now, writes a raw 0xc3 but will contain
additional logic when return thunks are in use.

stub_selftest() doesn't strictly need to be converted as it only runs at
boot, but doing so gets us a partial test of place_ret() too.

No functional change.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
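
A brief usage sketch (build_stub() and its opcode bytes are hypothetical;
place_ret() is as introduced below):

    #include <stdint.h>

    void *place_ret(void *ptr);

    /* Write a stub body, then terminate it via place_ret(). */
    static void *build_stub(uint8_t *stub)
    {
        uint8_t *p = stub;

        *p++ = 0x0f;            /* cpuid, for example */
        *p++ = 0xa2;

        return place_ret(p);    /* previously: *p++ = 0xc3; */
    }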

diff --git a/tools/tests/x86_emulator/x86-emulate.h b/tools/tests/x86_emulator/x86-emulate.h
index 929c1a72ae88..4c292ac338f6 100644
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -77,6 +77,12 @@
 
 #define is_canonical_address(x) (((int64_t)(x) >> 47) == ((int64_t)(x) >> 63))
 
+static inline void *place_ret(void *ptr)
+{
+    *(uint8_t *)ptr = 0xc3;
+    return ptr + 1;
+}
+
 extern uint32_t mxcsr_mask;
 extern struct cpu_policy cpu_policy;
 
diff --git a/xen/arch/x86/Makefile b/xen/arch/x86/Makefile
index c2f1dcf301d6..c32aba029d44 100644
--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -11,9 +11,7 @@ obj-$(CONFIG_PV) += pv/
 obj-y += x86_64/
 obj-y += x86_emulate/
 
-alternative-y := alternative.init.o
-alternative-$(CONFIG_LIVEPATCH) :=
-obj-bin-y += $(alternative-y)
+obj-y += alternative.o
 obj-y += apic.o
 obj-y += bhb-thunk.o
 obj-y += bitops.o
@@ -41,7 +39,7 @@ obj-y += hypercall.o
 obj-y += i387.o
 obj-y += i8259.o
 obj-y += io_apic.o
-obj-$(CONFIG_LIVEPATCH) += alternative.o livepatch.o
+obj-$(CONFIG_LIVEPATCH) += livepatch.o
 obj-y += msi.o
 obj-y += msr.o
 obj-$(CONFIG_INDIRECT_THUNK) += indirect-thunk.o
diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
index 4d1bbe7313fd..0449b9c1b7e5 100644
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -135,6 +135,20 @@ void init_or_livepatch add_nops(void *insns, unsigned int len)
     }
 }
 
+/*
+ * Place a return at @ptr.  @ptr must be in the writable alias of a stub.
+ *
+ * Returns the next position to write into the stub.
+ */
+void *place_ret(void *ptr)
+{
+    uint8_t *p = ptr;
+
+    *p++ = 0xc3;
+
+    return p;
+}
+
 /*
  * text_poke - Update instructions on a live kernel or non-executed code.
  * @addr: address to modify
diff --git a/xen/arch/x86/extable.c b/xen/arch/x86/extable.c
index 705cf9eb94ca..1572efa69a00 100644
--- a/xen/arch/x86/extable.c
+++ b/xen/arch/x86/extable.c
@@ -151,20 +151,20 @@ search_exception_table(const struct cpu_user_regs *regs, unsigned long *stub_ra)
 int __init cf_check stub_selftest(void)
 {
     static const struct {
-        uint8_t opc[8];
+        uint8_t opc[7];
         uint64_t rax;
         union stub_exception_token res;
     } tests[] __initconst = {
 #define endbr64 0xf3, 0x0f, 0x1e, 0xfa
-        { .opc = { endbr64, 0x0f, 0xb9, 0xc3, 0xc3 }, /* ud1 */
+        { .opc = { endbr64, 0x0f, 0xb9, 0x90 }, /* ud1 */
           .res.fields.trapnr = X86_EXC_UD },
-        { .opc = { endbr64, 0x90, 0x02, 0x00, 0xc3 }, /* nop; add (%rax),%al */
+        { .opc = { endbr64, 0x90, 0x02, 0x00 }, /* nop; add (%rax),%al */
           .rax = 0x0123456789abcdef,
           .res.fields.trapnr = X86_EXC_GP },
-        { .opc = { endbr64, 0x02, 0x04, 0x04, 0xc3 }, /* add (%rsp,%rax),%al */
+        { .opc = { endbr64, 0x02, 0x04, 0x04 }, /* add (%rsp,%rax),%al */
           .rax = 0xfedcba9876543210UL,
           .res.fields.trapnr = X86_EXC_SS },
-        { .opc = { endbr64, 0xcc, 0xc3, 0xc3, 0xc3 }, /* int3 */
+        { .opc = { endbr64, 0xcc, 0x90, 0x90 }, /* int3 */
           .res.fields.trapnr = X86_EXC_BP },
 #undef endbr64
     };
@@ -183,6 +183,7 @@ int __init cf_check stub_selftest(void)
 
         memset(ptr, 0xcc, STUB_BUF_SIZE / 2);
         memcpy(ptr, tests[i].opc, ARRAY_SIZE(tests[i].opc));
+        place_ret(ptr + ARRAY_SIZE(tests[i].opc));
         unmap_domain_page(ptr);
 
         asm volatile ( "INDIRECT_CALL %[stb]\n"
diff --git a/xen/arch/x86/include/asm/alternative.h b/xen/arch/x86/include/asm/alternative.h
index b9ea49bd1c3d..e17be8ddfd82 100644
--- a/xen/arch/x86/include/asm/alternative.h
+++ b/xen/arch/x86/include/asm/alternative.h
@@ -33,6 +33,8 @@ struct __packed alt_instr {
 #define ALT_REPL_PTR(a)     __ALT_PTR(a, repl_offset)
 
 extern void add_nops(void *insns, unsigned int len);
+void *place_ret(void *ptr);
+
 /* Similar to alternative_instructions except it can be run with IRQs enabled. */
 extern int apply_alternatives(struct alt_instr *start, struct alt_instr *end);
 extern void alternative_instructions(void);
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index 70150c272276..ff5d1c9f8634 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -76,7 +76,6 @@ static io_emul_stub_t *io_emul_stub_setup(struct priv_op_ctxt *ctxt, u8 opcode,
         0x41, 0x5c, /* pop %r12  */
         0x5d,       /* pop %rbp  */
         0x5b,       /* pop %rbx  */
-        0xc3,       /* ret       */
     };
 
     const struct stubs *this_stubs = &this_cpu(stubs);
@@ -126,11 +125,13 @@ static io_emul_stub_t *io_emul_stub_setup(struct priv_op_ctxt *ctxt, u8 opcode,
 
     APPEND_CALL(save_guest_gprs);
     APPEND_BUFF(epilogue);
+    p = place_ret(p);
 
     /* Build-time best effort attempt to catch problems. */
     BUILD_BUG_ON(STUB_BUF_SIZE / 2 <
                  (sizeof(prologue) + sizeof(epilogue) + 10 /* 2x call */ +
-                  MAX(3 /* default stub */, IOEMUL_QUIRK_STUB_BYTES)));
+                  MAX(3 /* default stub */, IOEMUL_QUIRK_STUB_BYTES) +
+                  1 /* ret */));
     /* Runtime confirmation that we haven't clobbered an adjacent stub. */
     BUG_ON(STUB_BUF_SIZE / 2 < (p - ctxt->io_emul_stub));
 
diff --git a/xen/arch/x86/x86_emulate/fpu.c b/xen/arch/x86/x86_emulate/fpu.c
index 54c8621421c6..9cc37a1d8e3c 100644
--- a/xen/arch/x86/x86_emulate/fpu.c
+++ b/xen/arch/x86/x86_emulate/fpu.c
@@ -32,36 +32,42 @@ static inline bool fpu_check_write(void)
 
 #define emulate_fpu_insn_memdst(opc, ext, arg)                          \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     /* ModRM: mod=0, reg=ext, rm=0, i.e. a (%rax) operand */            \
     *insn_bytes = 2;                                                    \
-    memcpy(get_stub(stub),                                              \
-           ((uint8_t[]){ opc, ((ext) & 7) << 3, 0xc3 }), 3);            \
+    memcpy(_p, ((uint8_t[]){ opc, ((ext) & 7) << 3 }), 2); _p += 2;     \
+    place_ret(_p);                                                      \
     invoke_stub("", "", "+m" (arg) : "a" (&(arg)));                     \
     put_stub(stub);                                                     \
 } while (0)
 
 #define emulate_fpu_insn_memsrc(opc, ext, arg)                          \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     /* ModRM: mod=0, reg=ext, rm=0, i.e. a (%rax) operand */            \
-    memcpy(get_stub(stub),                                              \
-           ((uint8_t[]){ opc, ((ext) & 7) << 3, 0xc3 }), 3);            \
+    memcpy(_p, ((uint8_t[]){ opc, ((ext) & 7) << 3 }), 2); _p += 2;     \
+    place_ret(_p);                                                      \
     invoke_stub("", "", "=m" (dummy) : "m" (arg), "a" (&(arg)));        \
     put_stub(stub);                                                     \
 } while (0)
 
 #define emulate_fpu_insn_stub(bytes...)                                 \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     unsigned int nr_ = sizeof((uint8_t[]){ bytes });                    \
-    memcpy(get_stub(stub), ((uint8_t[]){ bytes, 0xc3 }), nr_ + 1);      \
+    memcpy(_p, ((uint8_t[]){ bytes }), nr_); _p += nr_;                 \
+    place_ret(_p);                                                      \
     invoke_stub("", "", "=m" (dummy) : "i" (0));                        \
     put_stub(stub);                                                     \
 } while (0)
 
 #define emulate_fpu_insn_stub_eflags(bytes...)                          \
 do {                                                                    \
+    void *_p = get_stub(stub);                                          \
     unsigned int nr_ = sizeof((uint8_t[]){ bytes });                    \
     unsigned long tmp_;                                                 \
-    memcpy(get_stub(stub), ((uint8_t[]){ bytes, 0xc3 }), nr_ + 1);      \
+    memcpy(_p, ((uint8_t[]){ bytes }), nr_); _p += nr_;                 \
+    place_ret(_p);                                                      \
     invoke_stub(_PRE_EFLAGS("[eflags]", "[mask]", "[tmp]"),             \
                 _POST_EFLAGS("[eflags]", "[mask]", "[tmp]"),            \
                 [eflags] "+g" (regs->eflags), [tmp] "=&r" (tmp_)        \
diff --git a/xen/arch/x86/x86_emulate/x86_emulate.c b/xen/arch/x86/x86_emulate/x86_emulate.c
index 7d68ebe7ea05..0a98b4347631 100644
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -1400,7 +1400,7 @@ x86_emulate(
         stb[3] = 0x91;
         stb[4] = evex.opmsk << 3;
         insn_bytes = 5;
-        stb[5] = 0xc3;
+        place_ret(&stb[5]);
 
         invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
 
@@ -3633,7 +3633,7 @@ x86_emulate(
         }
         opc[1] = (modrm & 0x38) | 0xc0;
         insn_bytes = EVEX_PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_EVEX(opc, evex);
         invoke_stub("", "", "=g" (dummy) : "a" (src.val));
@@ -3700,7 +3700,7 @@ x86_emulate(
             insn_bytes = PFX_BYTES + 2;
             copy_REX_VEX(opc, rex_prefix, vex);
         }
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         ea.reg = decode_gpr(&_regs, modrm_reg);
         invoke_stub("", "", "=a" (*ea.reg) : "c" (mmvalp), "m" (*mmvalp));
@@ -3774,7 +3774,7 @@ x86_emulate(
             insn_bytes = PFX_BYTES + 2;
             copy_REX_VEX(opc, rex_prefix, vex);
         }
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         _regs.eflags &= ~EFLAGS_MASK;
         invoke_stub("",
@@ -4010,7 +4010,7 @@ x86_emulate(
         opc[1] = modrm & 0xc7;
         insn_bytes = PFX_BYTES + 2;
     simd_0f_to_gpr:
-        opc[insn_bytes - PFX_BYTES] = 0xc3;
+        place_ret(&opc[insn_bytes - PFX_BYTES]);
 
         generate_exception_if(ea.type != OP_REG, X86_EXC_UD);
 
@@ -4407,7 +4407,7 @@ x86_emulate(
             vex.w = 0;
         opc[1] = modrm & 0x38;
         insn_bytes = PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", "+m" (src.val) : "a" (&src.val));
@@ -4444,7 +4444,7 @@ x86_emulate(
             evex.w = 0;
         opc[1] = modrm & 0x38;
         insn_bytes = EVEX_PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_EVEX(opc, evex);
         invoke_stub("", "", "+m" (src.val) : "a" (&src.val));
@@ -4639,7 +4639,7 @@ x86_emulate(
 #endif /* X86EMUL_NO_SIMD */
 
     simd_0f_reg_only:
-        opc[insn_bytes - PFX_BYTES] = 0xc3;
+        place_ret(&opc[insn_bytes - PFX_BYTES]);
 
         copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", [dummy_out] "=g" (dummy) : [dummy_in] "i" (0) );
@@ -4973,7 +4973,7 @@ x86_emulate(
         if ( !mode_64bit() )
             vex.w = 0;
         opc[1] = modrm & 0xf8;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_VEX(opc, vex);
         ea.reg = decode_gpr(&_regs, modrm_rm);
@@ -5016,7 +5016,7 @@ x86_emulate(
         if ( !mode_64bit() )
             vex.w = 0;
         opc[1] = modrm & 0xc7;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_VEX(opc, vex);
         invoke_stub("", "", "=a" (dst.val) : [dummy] "i" (0));
@@ -5046,7 +5046,7 @@ x86_emulate(
         opc = init_prefixes(stub);
         opc[0] = b;
         opc[1] = modrm;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_VEX(opc, vex);
         _regs.eflags &= ~EFLAGS_MASK;
@@ -5614,7 +5614,7 @@ x86_emulate(
         if ( !mode_64bit() )
             vex.w = 0;
         opc[1] = modrm & 0xc7;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_REX_VEX(opc, rex_prefix, vex);
         invoke_stub("", "", "=a" (ea.val) : [dummy] "i" (0));
@@ -5732,7 +5732,7 @@ x86_emulate(
             opc[1] &= 0x38;
         }
         insn_bytes = PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
         if ( vex.opcx == vex_none )
         {
             /* Cover for extra prefix byte. */
@@ -6012,7 +6012,7 @@ x86_emulate(
         pvex->b = !mode_64bit() || (vex.reg >> 3);
         opc[1] = 0xc0 | (~vex.reg & 7);
         pvex->reg = 0xf;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=a" (ea.val) : [dummy] "i" (0));
         put_stub(stub);
@@ -6286,7 +6286,7 @@ x86_emulate(
             evex.w = 0;
         opc[1] = modrm & 0xf8;
         insn_bytes = EVEX_PFX_BYTES + 2;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         copy_EVEX(opc, evex);
         invoke_stub("", "", "=g" (dummy) : "a" (src.val));
@@ -6385,7 +6385,7 @@ x86_emulate(
         pvex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pvex->reg = 0xf;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
 
@@ -6455,7 +6455,7 @@ x86_emulate(
         pvex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pvex->reg = 0xf;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "+m" (*mmvalp) : "a" (mmvalp));
 
@@ -6511,7 +6511,7 @@ x86_emulate(
         pevex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pevex->RX = 1;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
 
@@ -6576,7 +6576,7 @@ x86_emulate(
         pevex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pevex->RX = 1;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "+m" (*mmvalp) : "a" (mmvalp));
 
@@ -6590,7 +6590,7 @@ x86_emulate(
         opc[2] = 0x90;
         /* Use (%rax) as source. */
         opc[3] = evex.opmsk << 3;
-        opc[4] = 0xc3;
+        place_ret(&opc[4]);
 
         invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
         put_stub(stub);
@@ -6666,7 +6666,7 @@ x86_emulate(
         pevex->b = 1;
         opc[1] = (modrm_reg & 7) << 3;
         pevex->RX = 1;
-        opc[2] = 0xc3;
+        place_ret(&opc[2]);
 
         invoke_stub("", "", "=m" (*mmvalp) : "a" (mmvalp));
 
@@ -6743,7 +6743,7 @@ x86_emulate(
         opc[2] = 0x90;
         /* Use (%rax) as source. */
         opc[3] = evex.opmsk << 3;
-        opc[4] = 0xc3;
+        place_ret(&opc[4]);
 
         invoke_stub("", "", "+m" (op_mask) : "a" (&op_mask));
         put_stub(stub);
@@ -6941,7 +6941,7 @@ x86_emulate(
         pvex->reg = 0xf; /* rAX */
         buf[3] = b;
         buf[4] = 0x09; /* reg=rCX r/m=(%rCX) */
-        buf[5] = 0xc3;
+        place_ret(&buf[5]);
 
         src.reg = decode_vex_gpr(vex.reg, &_regs, ctxt);
         emulate_stub([dst] "=&c" (dst.val), "[dst]" (&src.val), "a" (*src.reg));
@@ -6977,7 +6977,7 @@ x86_emulate(
         pvex->reg = 0xf; /* rAX */
         buf[3] = b;
         buf[4] = (modrm & 0x38) | 0x01; /* r/m=(%rCX) */
-        buf[5] = 0xc3;
+        place_ret(&buf[5]);
 
         dst.reg = decode_vex_gpr(vex.reg, &_regs, ctxt);
         emulate_stub("=&a" (dst.val), "c" (&src.val));
@@ -7218,7 +7218,7 @@ x86_emulate(
             evex.w = vex.w = 0;
         opc[1] = modrm & 0x38;
         opc[2] = imm1;
-        opc[3] = 0xc3;
+        place_ret(&opc[3]);
         if ( vex.opcx == vex_none )
         {
             /* Cover for extra prefix byte. */
@@ -7385,7 +7385,7 @@ x86_emulate(
             insn_bytes = PFX_BYTES + 3;
             copy_VEX(opc, vex);
         }
-        opc[3] = 0xc3;
+        place_ret(&opc[3]);
 
         /* Latch MXCSR - we may need to restore it below. */
         invoke_stub("stmxcsr %[mxcsr]", "",
@@ -7631,7 +7631,7 @@ x86_emulate(
         }
         opc[2] = imm1;
         insn_bytes = PFX_BYTES + 3;
-        opc[3] = 0xc3;
+        place_ret(&opc[3]);
         if ( vex.opcx == vex_none )
         {
             /* Cover for extra prefix byte. */
@@ -7977,7 +7977,7 @@ x86_emulate(
         pxop->reg = 0xf; /* rAX */
         buf[3] = b;
         buf[4] = (modrm & 0x38) | 0x01; /* r/m=(%rCX) */
-        buf[5] = 0xc3;
+        place_ret(&buf[5]);
 
         dst.reg = decode_vex_gpr(vex.reg, &_regs, ctxt);
         emulate_stub([dst] "=&a" (dst.val), "c" (&src.val));
@@ -8086,7 +8086,7 @@ x86_emulate(
         buf[3] = b;
         buf[4] = 0x09; /* reg=rCX r/m=(%rCX) */
         *(uint32_t *)(buf + 5) = imm1;
-        buf[9] = 0xc3;
+        place_ret(&buf[9]);
 
         emulate_stub([dst] "=&c" (dst.val), "[dst]" (&src.val));
 
@@ -8182,12 +8182,12 @@ x86_emulate(
 
         if ( evex_encoded() )
         {
-            opc[insn_bytes - EVEX_PFX_BYTES] = 0xc3;
+            place_ret(&opc[insn_bytes - EVEX_PFX_BYTES]);
             copy_EVEX(opc, evex);
         }
         else
         {
-            opc[insn_bytes - PFX_BYTES] = 0xc3;
+            place_ret(&opc[insn_bytes - PFX_BYTES]);
             copy_REX_VEX(opc, rex_prefix, vex);
         }
 
@@ -8511,7 +8511,7 @@ int x86_emul_rmw(
         pvex->reg = 0xf; /* rAX */
         buf[3] = ctxt->opcode;
         buf[4] = 0x11; /* reg=rDX r/m=(%RCX) */
-        buf[5] = 0xc3;
+        place_ret(&buf[5]);
 
         *eflags &= ~EFLAGS_MASK;
         invoke_stub("",
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/thunk: Build Xen with Return Thunks

The Indirect Target Selection speculative vulnerability means that indirect
branches (including RETs) are unsafe when in the first half of a cacheline.

In order to mitigate this, build with return thunks and arrange for
__x86_return_thunk to be (mis)aligned in the same manner as
__x86_indirect_thunk_* so the RET instruction is placed in a safe location.

place_ret() needs to conditionally emit JMP __x86_return_thunk instead of RET.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
index de2fa37f088d..7afe879710be 100644
--- a/xen/arch/x86/Kconfig
+++ b/xen/arch/x86/Kconfig
@@ -39,9 +39,14 @@ config ARCH_DEFCONFIG
 	default "arch/x86/configs/x86_64_defconfig"
 
 config CC_HAS_INDIRECT_THUNK
+	# GCC >= 8 or Clang >= 6
 	def_bool $(cc-option,-mindirect-branch-register) || \
 	         $(cc-option,-mretpoline-external-thunk)
 
+config CC_HAS_RETURN_THUNK
+	# GCC >= 8 or Clang >= 15
+	def_bool $(cc-option,-mfunction-return=thunk-extern)
+
 config HAS_AS_CET_SS
 	# binutils >= 2.29 or LLVM >= 6
 	def_bool $(as-instr,wrssq %rax$(comma)0;setssbsy)
diff --git a/xen/arch/x86/Makefile b/xen/arch/x86/Makefile
index c32aba029d44..bedb97cbeed9 100644
--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -43,6 +43,7 @@ obj-$(CONFIG_LIVEPATCH) += livepatch.o
 obj-y += msi.o
 obj-y += msr.o
 obj-$(CONFIG_INDIRECT_THUNK) += indirect-thunk.o
+obj-$(CONFIG_RETURN_THUNK) += indirect-thunk.o
 obj-$(CONFIG_PV) += ioport_emulate.o
 obj-y += irq.o
 obj-$(CONFIG_KEXEC) += machine_kexec.o
diff --git a/xen/arch/x86/acpi/wakeup_prot.S b/xen/arch/x86/acpi/wakeup_prot.S
index a741d58b9117..92af6230b31f 100644
--- a/xen/arch/x86/acpi/wakeup_prot.S
+++ b/xen/arch/x86/acpi/wakeup_prot.S
@@ -133,7 +133,7 @@ LABEL(s3_resume)
         pop     %r12
         pop     %rbx
         pop     %rbp
-        ret
+        RET
 END(do_suspend_lowlevel)
 
 .data
diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
index 0449b9c1b7e5..ecc56964bd9c 100644
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -135,16 +135,45 @@ void init_or_livepatch add_nops(void *insns, unsigned int len)
     }
 }
 
+void nocall __x86_return_thunk(void);
+
 /*
  * Place a return at @ptr.  @ptr must be in the writable alias of a stub.
  *
+ * When CONFIG_RETURN_THUNK is active, this may be a JMP __x86_return_thunk
+ * instead, depending on the safety of @ptr with respect to Indirect Target
+ * Selection.
+ *
  * Returns the next position to write into the stub.
  */
 void *place_ret(void *ptr)
 {
+    unsigned long addr = (unsigned long)ptr;
     uint8_t *p = ptr;
 
-    *p++ = 0xc3;
+    /*
+     * When Return Thunks are used, if a RET would be unsafe at this location
+     * with respect to Indirect Target Selection (i.e. if addr is in the first
+     * half of a cacheline), insert a JMP __x86_return_thunk instead.
+     *
+     * The displacement needs to be relative to the executable alias of the
+     * stub, not to @ptr which is the writeable alias.
+     */
+    if ( IS_ENABLED(CONFIG_RETURN_THUNK) && !(addr & 0x20) )
+    {
+        long stub_va = (this_cpu(stubs.addr) & PAGE_MASK) + (addr & ~PAGE_MASK);
+        long disp = (long)__x86_return_thunk - (stub_va + 5);
+
+        BUG_ON((int32_t)disp != disp);
+
+        *p++ = 0xe9;
+        *(int32_t *)p = disp;
+        p += 4;
+    }
+    else
+    {
+        *p++ = 0xc3;
+    }
 
     return p;
 }
diff --git a/xen/arch/x86/arch.mk b/xen/arch/x86/arch.mk
index 09b6d8758e95..49915451259f 100644
--- a/xen/arch/x86/arch.mk
+++ b/xen/arch/x86/arch.mk
@@ -34,6 +34,9 @@ CFLAGS-$(CONFIG_CC_IS_GCC) += -fno-jump-tables
 CFLAGS-$(CONFIG_CC_IS_CLANG) += -mretpoline-external-thunk
 endif
 
+# Compile with return thunk support if selected.
+CFLAGS-$(CONFIG_RETURN_THUNK) += -mfunction-return=thunk-extern
+
 # Disable the addition of a .note.gnu.property section to object files when
 # livepatch support is enabled.  The contents of that section can change
 # depending on the instructions used, and livepatch-build-tools doesn't know
diff --git a/xen/arch/x86/bhb-thunk.S b/xen/arch/x86/bhb-thunk.S
index 7175778bdd11..606b378d84cf 100644
--- a/xen/arch/x86/bhb-thunk.S
+++ b/xen/arch/x86/bhb-thunk.S
@@ -23,7 +23,7 @@ FUNC(clear_bhb_tsx)
         xabort  $0
         int3
 1:
-        ret
+        RET
 END(clear_bhb_tsx)
 
 /*
diff --git a/xen/arch/x86/clear_page.S b/xen/arch/x86/clear_page.S
index d6c076f1d8bc..dc3c3c26bfb7 100644
--- a/xen/arch/x86/clear_page.S
+++ b/xen/arch/x86/clear_page.S
@@ -1,6 +1,8 @@
         .file __FILE__
 
 #include <xen/linkage.h>
+
+#include <asm/asm_defns.h>
 #include <asm/page.h>
 
 FUNC(clear_page_sse2)
@@ -16,5 +18,5 @@ FUNC(clear_page_sse2)
         jnz     0b
 
         sfence
-        ret
+        RET
 END(clear_page_sse2)
diff --git a/xen/arch/x86/copy_page.S b/xen/arch/x86/copy_page.S
index c3c436545bac..e43e5370c815 100644
--- a/xen/arch/x86/copy_page.S
+++ b/xen/arch/x86/copy_page.S
@@ -1,6 +1,8 @@
         .file __FILE__
 
 #include <xen/linkage.h>
+
+#include <asm/asm_defns.h>
 #include <asm/page.h>
 
 #define src_reg %rsi
@@ -41,5 +43,5 @@ FUNC(copy_page_sse2)
         movnti  tmp4_reg, 3*WORD_SIZE(dst_reg)
 
         sfence
-        ret
+        RET
 END(copy_page_sse2)
diff --git a/xen/arch/x86/efi/check.c b/xen/arch/x86/efi/check.c
index 9e473faad3c9..23ba30abf330 100644
--- a/xen/arch/x86/efi/check.c
+++ b/xen/arch/x86/efi/check.c
@@ -3,6 +3,9 @@ int __attribute__((__ms_abi__)) test(int i)
     return i;
 }
 
+/* In case -mfunction-return is in use. */
+void __x86_return_thunk(void) {};
+
 /*
  * Populate an array with "addresses" of relocatable and absolute values.
  * This is to probe ld for (a) emitting base relocations at all and (b) not
diff --git a/xen/arch/x86/include/asm/asm-defns.h b/xen/arch/x86/include/asm/asm-defns.h
index 1b821db49ca3..96370dd92a58 100644
--- a/xen/arch/x86/include/asm/asm-defns.h
+++ b/xen/arch/x86/include/asm/asm-defns.h
@@ -36,6 +36,12 @@
     .endif
 .endm
 
+#ifdef CONFIG_RETURN_THUNK
+# define RET jmp __x86_return_thunk
+#else
+# define RET ret
+#endif
+
 #ifdef CONFIG_XEN_IBT
 # define ENDBR64 endbr64
 #else
diff --git a/xen/arch/x86/indirect-thunk.S b/xen/arch/x86/indirect-thunk.S
index c4b978d67b8e..26dad15f12c9 100644
--- a/xen/arch/x86/indirect-thunk.S
+++ b/xen/arch/x86/indirect-thunk.S
@@ -15,6 +15,8 @@
 #undef SYM_ALIGN
 #define SYM_ALIGN(align...)
 
+#ifdef CONFIG_INDIRECT_THUNK
+
 .macro IND_THUNK_RETPOLINE reg:req
         call 1f
         int3
@@ -62,3 +64,25 @@ END(__x86_indirect_thunk_\reg)
 .irp reg, ax, cx, dx, bx, bp, si, di, 8, 9, 10, 11, 12, 13, 14, 15
         GEN_INDIRECT_THUNK reg=r\reg
 .endr
+
+#endif /* CONFIG_INDIRECT_THUNK */
+
+#ifdef CONFIG_RETURN_THUNK
+        .section .text.entry.__x86_return_thunk, "ax", @progbits
+
+        /*
+         * The Indirect Target Selection speculative vulnerability means that
+         * indirect branches (including RETs) are unsafe when in the first
+         * half of a cacheline.  Arrange for them to be in the second half.
+         *
+         * Align to 64, then skip 32.
+         */
+        .balign 64
+        .fill 32, 1, 0xcc
+
+FUNC(__x86_return_thunk)
+        ret
+        int3 /* Halt straight-line speculation */
+END(__x86_return_thunk)
+
+#endif /* CONFIG_RETURN_THUNK */
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index ff5d1c9f8634..295d847ea24c 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -131,7 +131,7 @@ static io_emul_stub_t *io_emul_stub_setup(struct priv_op_ctxt *ctxt, u8 opcode,
     BUILD_BUG_ON(STUB_BUF_SIZE / 2 <
                  (sizeof(prologue) + sizeof(epilogue) + 10 /* 2x call */ +
                   MAX(3 /* default stub */, IOEMUL_QUIRK_STUB_BYTES) +
-                  1 /* ret */));
+                  (IS_ENABLED(CONFIG_RETURN_THUNK) ? 5 : 1) /* ret */));
     /* Runtime confirmation that we haven't clobbered an adjacent stub. */
     BUG_ON(STUB_BUF_SIZE / 2 < (p - ctxt->io_emul_stub));
 
diff --git a/xen/arch/x86/pv/gpr_switch.S b/xen/arch/x86/pv/gpr_switch.S
index 5409ad3b1447..362b5d241623 100644
--- a/xen/arch/x86/pv/gpr_switch.S
+++ b/xen/arch/x86/pv/gpr_switch.S
@@ -26,7 +26,7 @@ FUNC(load_guest_gprs)
         movq  UREGS_r15(%rdi), %r15
         movq  UREGS_rcx(%rdi), %rcx
         movq  UREGS_rdi(%rdi), %rdi
-        ret
+        RET
 END(load_guest_gprs)
 
 /* Save guest GPRs.  Parameter on the stack above the return address. */
@@ -48,5 +48,5 @@ FUNC(save_guest_gprs)
         movq  %rbx, UREGS_rbx(%rdi)
         movq  %rdx, UREGS_rdx(%rdi)
         movq  %rcx, UREGS_rcx(%rdi)
-        ret
+        RET
 END(save_guest_gprs)
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index ced84750015c..fe27e82a4707 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -571,6 +571,9 @@ static void __init print_details(enum ind_thunk thunk)
 #ifdef CONFIG_INDIRECT_THUNK
                " INDIRECT_THUNK"
 #endif
+#ifdef CONFIG_RETURN_THUNK
+               " RETURN_THUNK"
+#endif
 #ifdef CONFIG_SHADOW_PAGING
                " SHADOW_PAGING"
 #endif
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index 1e87652f4bcb..d7b381ea546d 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -180,7 +180,7 @@ FUNC(cr4_pv32_restore)
         or    cr4_pv32_mask(%rip), %rax
         mov   %rax, %cr4
         mov   %rax, (%rcx)
-        ret
+        RET
 0:
 #ifndef NDEBUG
         /* Check that _all_ of the bits intended to be set actually are. */
@@ -198,7 +198,7 @@ FUNC(cr4_pv32_restore)
 1:
 #endif
         xor   %eax, %eax
-        ret
+        RET
 END(cr4_pv32_restore)
 
 FUNC(compat_syscall)
@@ -329,7 +329,7 @@ __UNLIKELY_END(compat_bounce_null_selector)
         xor   %eax, %eax
         mov   %ax,  TRAPBOUNCE_cs(%rdx)
         mov   %al,  TRAPBOUNCE_flags(%rdx)
-        ret
+        RET
 
 .section .fixup,"ax"
 .Lfx13:
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index d81a626d1667..39c7b9d17f9e 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -604,7 +604,7 @@ __UNLIKELY_END(create_bounce_frame_bad_bounce_ip)
         xor   %eax, %eax
         mov   %rax, TRAPBOUNCE_eip(%rdx)
         mov   %al,  TRAPBOUNCE_flags(%rdx)
-        ret
+        RET
 
         .pushsection .fixup, "ax", @progbits
         # Numeric tags below represent the intended overall %rsi adjustment.
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index 53bafc98a536..bf956b6c5fc0 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -85,6 +85,7 @@ SECTIONS
        . = ALIGN(PAGE_SIZE);
        _stextentry = .;
        *(.text.entry)
+       *(.text.entry.*)
        . = ALIGN(PAGE_SIZE);
        _etextentry = .;
 
diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index bf7b081ad0b7..6d43be2e6e8a 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -172,6 +172,17 @@ config INDIRECT_THUNK
 	  by choosing a hardware-dependent instruction sequence to implement
 	  (e.g. function pointers) safely.  "Retpoline" is one such sequence.
 
+config RETURN_THUNK
+	bool "Out-of-line Returns"
+	depends on CC_HAS_RETURN_THUNK
+	default INDIRECT_THUNK
+	help
+	  Compile Xen with out-of-line returns.
+
+	  This allows Xen to mitigate a variety of speculative vulnerabilities
+	  by choosing a hardware-dependent instruction sequence to implement
+	  function returns safely.
+
 config SPECULATIVE_HARDEN_ARRAY
 	bool "Speculative Array Hardening"
 	default y
diff --git a/xen/lib/x86-generic-hweightl.c b/xen/lib/x86-generic-hweightl.c
index 123a5b43928d..1cab68952a42 100644
--- a/xen/lib/x86-generic-hweightl.c
+++ b/xen/lib/x86-generic-hweightl.c
@@ -51,7 +51,11 @@ asm (
     "pop    %rdx\n\t"
     "pop    %rdi\n\t"
 
+#ifdef CONFIG_RETURN_THUNK
+    "jmp    __x86_return_thunk\n\t"
+#else
     "ret\n\t"
+#endif
 
     ".size arch_generic_hweightl, . - arch_generic_hweightl\n\t"
 );
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/spec-ctrl: Synthesise ITS_NO to guests on unaffected hardware

It is easier to express feature word 17 in terms of word 16 + bits [32, 64),
as that is how the layout is given in the documentation.

This is part of XSA-469 / CVE-2024-28956

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index 05399fb9c982..397a04af41a1 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -219,6 +219,7 @@ static inline bool boot_cpu_has(unsigned int feat)
 #define cpu_has_gds_no          boot_cpu_has(X86_FEATURE_GDS_NO)
 #define cpu_has_rfds_no         boot_cpu_has(X86_FEATURE_RFDS_NO)
 #define cpu_has_rfds_clear      boot_cpu_has(X86_FEATURE_RFDS_CLEAR)
+#define cpu_has_its_no          boot_cpu_has(X86_FEATURE_ITS_NO)
 
 /* Synthesized. */
 #define cpu_has_arch_perfmon    boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index fe27e82a4707..0a635025e488 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -1760,6 +1760,90 @@ static void __init bhi_calculations(void)
     }
 }
 
+/*
+ * https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/indirect-target-selection.html
+ */
+static void __init its_calculations(void)
+{
+    /*
+     * Indirect Target Selection is a Branch Prediction bug whereby certain
+     * indirect branches (including RETs) get predicted using a direct branch
+     * target, rather than a suitable indirect target, bypassing hardware
+     * isolation protections.
+     *
+     * ITS affects Core (but not Atom) processors starting from the
+     * introduction of eIBRS, up to but not including Golden Cove cores
+     * (checked here with BHI_CTRL).
+     *
+     * The ITS_NO feature is not expected to be enumerated by hardware, and is
+     * only for VMMs to synthesise for guests.
+     *
+     * ITS comes in 3 flavours:
+     *
+     *   1) Across-IBPB.  Indirect branches after the IBPB can be controlled
+     *      by direct targets which existed prior to the IBPB.  This is
+     *      addressed in the IPU 2025.1 microcode drop, and has no other
+     *      software interaction.
+     *
+     *   2) Guest/Host.  Indirect branches in the VMM can be controlled by
+     *      direct targets from the guest.  This applies equally to PV guests
+     *      (Ring3) and HVM guests (VMX), and applies to all Skylake-uarch
+     *      cores with eIBRS.
+     *
+     *   3) Intra-mode.  Indirect branches in the VMM can be controlled by
+     *      other execution in the same mode.
+     */
+
+    /*
+     * If we can see ITS_NO, or we're virtualised, do nothing.  We are or may
+     * migrate somewhere unsafe.
+     */
+    if ( cpu_has_its_no || cpu_has_hypervisor )
+        return;
+
+    /* ITS is only known to affect Intel processors at this time. */
+    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL )
+        return;
+
+    /*
+     * ITS does not exist on:
+     *  - non-Family 6 CPUs
+     *  - those without eIBRS
+     *  - those with BHI_CTRL
+     * but we still need to synthesise ITS_NO.
+     */
+    if ( boot_cpu_data.x86 != 6 || !cpu_has_eibrs ||
+         boot_cpu_has(X86_FEATURE_BHI_CTRL) )
+        goto synthesise;
+
+    switch ( boot_cpu_data.x86_model )
+    {
+        /* These Skylake-uarch cores suffer cases #2 and #3. */
+    case INTEL_FAM6_SKYLAKE_X:
+    case INTEL_FAM6_KABYLAKE_L:
+    case INTEL_FAM6_KABYLAKE:
+    case INTEL_FAM6_COMETLAKE:
+    case INTEL_FAM6_COMETLAKE_L:
+        return;
+
+        /* These Sunny/Willow/Cypress Cove cores suffer case #3. */
+    case INTEL_FAM6_ICELAKE_X:
+    case INTEL_FAM6_ICELAKE_D:
+    case INTEL_FAM6_ICELAKE_L:
+    case INTEL_FAM6_TIGERLAKE_L:
+    case INTEL_FAM6_TIGERLAKE:
+    case INTEL_FAM6_ROCKETLAKE:
+        return;
+
+    default:
+        break;
+    }
+
+    /* Platforms remaining are not believed to be vulnerable to ITS. */
+ synthesise:
+    setup_force_cpu_cap(X86_FEATURE_ITS_NO);
+}
+
 void spec_ctrl_init_domain(struct domain *d)
 {
     bool pv = is_pv_domain(d);
@@ -2316,6 +2400,8 @@ void __init init_speculation_mitigations(void)
 
     bhi_calculations();
 
+    its_calculations();
+
     print_details(thunk);
 
     /*
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index c290d1668358..a6d4a0cba7d8 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -391,7 +391,8 @@ XEN_CPUFEATURE(RFDS_CLEAR,         16*32+28) /*!A| Register File(s) cleared by V
 XEN_CPUFEATURE(IGN_UMONITOR,       16*32+29) /*   MCU_OPT_CTRL.IGN_UMONITOR */
 XEN_CPUFEATURE(MON_UMON_MITG,      16*32+30) /*   MCU_OPT_CTRL.MON_UMON_MITG */
 
-/* Intel-defined CPU features, MSR_ARCH_CAPS 0x10a.edx, word 17 */
+/* Intel-defined CPU features, MSR_ARCH_CAPS 0x10a.edx, word 17 (express in terms of word 16) */
+XEN_CPUFEATURE(ITS_NO,             16*32+62) /*!A No Indirect Target Selection */
 
 #endif /* XEN_CPUFEATURE */
 
diff --git a/xen/tools/gen-cpuid.py b/xen/tools/gen-cpuid.py
index a77cb30bdb4f..163b105bc6a7 100755
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -51,7 +51,7 @@ def parse_definitions(state):
         r"\s+/\*([\w!|]*) .*$")
 
     word_regex = re.compile(
-        r"^/\* .* word (\d*) \*/$")
+        r"^/\* .* word (\d*) .*\*/$")
     last_word = -1
 
     this = sys.modules[__name__]