-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
Xen Security Advisory CVE-2024-36350,CVE-2024-36357 / XSA-471
x86: Transitive Scheduler Attacks
ISSUE DESCRIPTION
=================
Researchers from Microsoft and ETH Zurich have discovered several new
speculative sidechannel attacks which bypass current protections. They
are detailed in a paper titled "Enter, Exit, Page Fault, Leak: Testing
Isolation Boundaries for Microarchitectural Leaks".
Two issues, which AMD have named Transitive Scheduler Attacks, utilise
timing information from instruction execution. These are:
* CVE-2024-36350: TSA-SQ (TSA in the Store Queues)
* CVE-2024-36357: TSA-L1 (TSA in the L1 data cache)
For more information, see:
https://www.amd.com/content/dam/amd/en/documents/resources/bulletin/technical-guidance-for-mitigating-transient-scheduler-attacks.pdf
https://www.amd.com/en/resources/product-security/bulletin/amd-sb-7029.html
https://aka.ms/enter-exit-leak
The paper also details other speculative attacks. See below.
IMPACT
======
An attacker might be able to infer data belonging to other contexts,
including data belonging to other guests.
VULNERABLE SYSTEMS
==================
Systems running all versions of Xen are affected.
Only AMD Fam19h CPUs (Zen3/4 microarchitectures) are believed to be
vulnerable. Other AMD CPUs, and CPUs from other manufacturers are not
known to be affected.
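As a quick, illustrative check from a Linux dom0 (the exact formatting of
/proc/cpuinfo varies by kernel), a reported "cpu family" of 25 (0x19)
together with vendor "AuthenticAMD" indicates a Fam19h part:

$ grep -E -m 2 'vendor_id|cpu family' /proc/cpuinfo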
MITIGATION
==========
There are no mitigations.
RESOLUTION
==========
AMD are producing microcode to address TSA, which adds scrubbing side
effects to the VERW instruction. This was included in the firmware
fixes for the Entrysign signature vulnerability from ~December 2024, but
is also available in an OS-loadable form on older firmware. Consult
your dom0 OS vendor and/or hardware vendor for updated microcode.
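As an illustrative sketch only (the microcode container path below is an
assumption and is distro-dependent), an OS-loadable image can typically be
late-loaded from dom0 with the xen-ucode utility shipped with Xen, and the
hypervisor log inspected afterwards:

$ xen-ucode /lib/firmware/amd-ucode/microcode_amd_fam19h.bin
$ xl dmesg | grep -i microcode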
In addition to the microcode, changes are required in Xen to make use of
VERW scrubbing at suitable points.
Applying the appropriate set of attached patches resolves this issue.
Note that patches for released versions are generally prepared to
apply to the stable branches, and may not apply cleanly to the most
recent release tarball. Downstreams are encouraged to update to the
tip of the stable branch before applying these patches.
xsa471/xsa471-??.patch xen-unstable
xsa471/xsa471-4.20-??.patch Xen 4.20.x
xsa471/xsa471-4.19-??.patch Xen 4.19.x
xsa471/xsa471-4.18-??.patch Xen 4.18.x
xsa471/xsa471-4.17-??.patch Xen 4.17.x
$ sha256sum xsa471*/*
4cc8b54d3cae4864053c4d608061675564cc322c6cd362e33ac59ac4c9371358 xsa471/xsa471-01.patch
9bdfd0ad8d34114e69bb0e264ffdcb176e54211753cc1eed247e73cd3fe752e9 xsa471/xsa471-02.patch
62706c1593cb64bfd053f5ee2e8fa26f5414835c6ef5f694c52a61e18017aa1d xsa471/xsa471-03.patch
e06162c55de0b3ca79302ea47c8169079b0f2cd65a48d3e0509677452c9887da xsa471/xsa471-4.17-01.patch
742c59d776c73993c44e72ceada2b83b61fa77a988f5c2c593b6594b4f4078de xsa471/xsa471-4.17-02.patch
a8ed5e90e405273115d6a06199de3190319ba12aac33bea17495db42b6a9becc xsa471/xsa471-4.17-03.patch
855c9528d93109a1a673dd8f4feb87a688457908c9dd821d0e1a3326efb19257 xsa471/xsa471-4.17-04.patch
f8dbe5c8dbdf921c7f9b4bc7d8234b9bb291df6a4a8cef604284bf4f36947d4e xsa471/xsa471-4.17-05.patch
fa2682955663e0966cf285bf79770fe76b387fbf449e1ac64206a6ae4bf79bc4 xsa471/xsa471-4.17-06.patch
dc4695aced6ba65e8b16088aefe817e8e4d965cd94e9e3340bd48d77c1369902 xsa471/xsa471-4.17-07.patch
0393a8711805f40eeb936dcca56f5bea174ff94098ffaee7064cd6801eb55099 xsa471/xsa471-4.17-08.patch
346493cc12d9a0fa4154968bc0c8dca9d87e583a25ae9ecd22b8ae2c7bdeae19 xsa471/xsa471-4.17-09.patch
0d652e9a6bb89782036b39160f2db9c1fd1dfc0e659fb4e351f04eb66ffbdba8 xsa471/xsa471-4.17-10.patch
ff3ae74f6cfccfaee55f007b2410813068928d20a23729ae2766ddcc65d2e82b xsa471/xsa471-4.17-11.patch
5c6d133f626fdcbc148b596bf3cee1c46260d73fd833aeb6c59cf7c1b315f2b8 xsa471/xsa471-4.17-12.patch
acbdca53d713122545ba210a795c4185e842d4cca71802775adc1a4ce971bf3a xsa471/xsa471-4.17-13.patch
4a7fa23b7f501cb88100b55fab13b7315a01fc1e4a3eff46b5d0c867fcc03ca8 xsa471/xsa471-4.17-14.patch
164d626a2f446ad7692bd70ab7e109f8b6259aaea34bfb9f51df68def98a2e62 xsa471/xsa471-4.17-15.patch
35699b19590ccf1872c8da8731b4c47e95cc38d119510e182d196427ea4455f8 xsa471/xsa471-4.17-16.patch
296c95410b6dbf55fa092d15e0eee66125a87b012095f84c550eea54078d0490 xsa471/xsa471-4.17-17.patch
0212aec278afe0dcc6479b756a0c1821d2bfba646fa9ec56f1b9b37ff51756f4 xsa471/xsa471-4.17-18.patch
4e05073bc960b7f43dd383356d1b56fa9c55dc021205678bd8ac456f3a1d00f0 xsa471/xsa471-4.17-19.patch
0d4166420a9e69afe3303d6d3232ee43cf27e88e5bbd8a52a17521934455ed65 xsa471/xsa471-4.18-01.patch
3ff24a622a8ef97af7dddf480dd8c6c12efb8a2dc74ae8d68836543a6cdb8329 xsa471/xsa471-4.18-02.patch
fe69ab8c1d45e0d23f58126b22e9914d2269d416cd802619000dc3933c49129e xsa471/xsa471-4.18-03.patch
60b563119ed38a052ed6e6a261b56db5e7b8b40befacd4904d5ce50b2d75d280 xsa471/xsa471-4.18-04.patch
864643c643cfe1f03d28bb36aadcd5bdd1dc7276c30357ea8be1cd1d20ef6f69 xsa471/xsa471-4.18-05.patch
9d5c58339aea8afbeea0bdf34c34cbeb4178ac0a475a32e688317b9810d0f148 xsa471/xsa471-4.18-06.patch
38347e4d096a880cc6d91f09d60277914ad6aa8a6b588913f211097574714ab0 xsa471/xsa471-4.18-07.patch
f0db078f811b5c06170f0016fec84a4bbd958b9f8a8d999567c5680d90141c2c xsa471/xsa471-4.18-08.patch
6d2f9de12d113790bbc74327cf94ae08234bec95d88468767462d3a11d0c40d4 xsa471/xsa471-4.18-09.patch
04f63468fca093f8fb5716c0ecafd1ef0be14dd5a464cddc20e719e0c2979980 xsa471/xsa471-4.18-10.patch
5c6e030d1258ce703ddf27fb48ee7b33ca0dbb09657cb38fc7d5b432d215322e xsa471/xsa471-4.18-11.patch
18b17089aa643ae2d6d9d394137a7fe21bf6b8f9743f2237481b68920f3f8f06 xsa471/xsa471-4.18-12.patch
d2c35d0a93e9a98fa04623c024a6e152f4d4d6568e6b603ee0cf7f4e4c9dca82 xsa471/xsa471-4.18-13.patch
f5b3f0aa8a59033bad4f221709eb4f6f14c82f75ab229ff53ef52b917d0f4021 xsa471/xsa471-4.18-14.patch
dc5c0da74f4a6faba0b2af5539cb38a44525379a2f9cafdff18f71cda5280d42 xsa471/xsa471-4.18-15.patch
296c95410b6dbf55fa092d15e0eee66125a87b012095f84c550eea54078d0490 xsa471/xsa471-4.18-16.patch
04c5587d19749a261ca9edc5212d606f2bf577e890c8f4474c55a9c5fe9605d7 xsa471/xsa471-4.18-17.patch
1b10f901d218bafc35d21366e57be89191c0b7c3bcc9def4da5c79bcf93a2e9d xsa471/xsa471-4.18-18.patch
d8b010138f5a2773a07902617c65f5d419bae6445410251c9dc1a777b6bf3378 xsa471/xsa471-4.19-01.patch
d72ab177637179cccbd32d2dbedfbb399ff4ba59360391e898e3c5fc069803a9 xsa471/xsa471-4.19-02.patch
365ee7e6fb3da83e6238cd4f9138de2018fddb65c1604bbd968e73ce97451fe9 xsa471/xsa471-4.19-03.patch
690cbd2b4b5ce5a855e75cf44c098ca2e231a272d2ebdc1e68d267c5c4e50db0 xsa471/xsa471-4.19-04.patch
82ac6bbca376e33fe1e03569ed76f559f18066000aec6dc72f1722245f5e9a54 xsa471/xsa471-4.19-05.patch
06c475bbb74d86375c17e183bdf74e1ef145a49af2aa237ce69f6ca8e6f78a7c xsa471/xsa471-4.19-06.patch
03bacbdb4cabb3e9eee079a847fa2eda3ff30c86bbfc5d5b1987ee028774a507 xsa471/xsa471-4.19-07.patch
b144cea707793e73d6dcbaa0e0ef268bb3cd389e12c080ec687a64a8a3e6ad61 xsa471/xsa471-4.19-08.patch
df35ded3dfe5ca84d459eda720699a35e3e49d4b4d461a3f834d05c30b0bcf59 xsa471/xsa471-4.19-09.patch
6719417c0ead056d83ef003cc3b08bf95a3430560fd8f27357c09ca55b6a3993 xsa471/xsa471-4.19-10.patch
b7c6ff2f529c6d6cc656b42142d06e5462e652ade57cc2ff5d90320af1234a27 xsa471/xsa471-4.19-11.patch
4b8a05edf04f5b43b1edcf44412ec4be734b011a7b8d2d739ffe0bdc04abce82 xsa471/xsa471-4.19-12.patch
e2bca0cd6f66465fdae9e3d251e67ba8a28a96a05201ac939a599dd95a0b3bc6 xsa471/xsa471-4.19-13.patch
8dc65ba84572a090d1bf8ffeb9b5871d9533e4da324fbcdfb1ab32ed83b10fad xsa471/xsa471-4.19-14.patch
2cb102830a29c6c2a898f8f580a9d554c332d6c31dd1608af0fb22b7340f650b xsa471/xsa471-4.19-15.patch
2a873ae56866b9986183e18ea9b70712a15f6df3af299b2d583cbda40a816f58 xsa471/xsa471-4.19-16.patch
77c7634a7d59056f92de619e034f31e63fd6ac6b26dc6e6af65e80fe3e4e5feb xsa471/xsa471-4.20-01.patch
22db1def1859cc7c742b79fee78c994ac4c9cc63daa3663533f324e93e9ef9e0 xsa471/xsa471-4.20-02.patch
fb9a103c606552188c05c14092cce084b52b4df75659f4d8013aa30978708ee8 xsa471/xsa471-4.20-03.patch
6930b94a1997b118692a2e0bd5e32bba2e0269b66de4019e3e870304d695c315 xsa471/xsa471-4.20-04.patch
4a67ef27f84eced8fa9cf3ae42d9f79f74a16659ab004fd79a7ee09fce823cf6 xsa471/xsa471-4.20-05.patch
9c62b492be0f1961d5d8062b7d4ac95b9d120e44ca4bf7e009a499fad9c0fcc3 xsa471/xsa471-4.20-06.patch
511728ef65068fa8bda25c31e3fd578aebc8400597d117f31fd2ba436fbb3776 xsa471/xsa471-4.20-07.patch
9a66742ec752a9f58a02f170a4213a22d32bd487e49bfff799800851ba9650e1 xsa471/xsa471-4.20-08.patch
5a02afd655d29b7eba7ac24a8665d64db39994d84e5125a7511f3e5fb7cafacf xsa471/xsa471-4.20-09.patch
f55ba571fa668a1ba9fb318c082e684780cc9b8d3c4e7f33db17bf7cc2afcdb3 xsa471/xsa471-4.20-10.patch
4cc8b54d3cae4864053c4d608061675564cc322c6cd362e33ac59ac4c9371358 xsa471/xsa471-4.20-11.patch
b180fec77659ce67d24c076301a3d10486afe0c1f224c30b5af7f22f678e8834 xsa471/xsa471-4.20-12.patch
60155cf04e25ad5c95f744dced34c530e0606150e1ca7617e38a9e3d8933eff3 xsa471/xsa471-4.20-13.patch
3d4eb5835d331581fd5c502ea77a0bf3f35c8e12ff9a95d38d32acfed735fefa xsa471/xsa471-4.20-14.patch
bc8590f2187d52a727f2354fda9d006087eaae17c34899bec0257ed7e870e7b6 xsa471/xsa471-4.20-15.patch
91c9100a964b0ecaae5ed019e2c846ea0a8a1e5d734e01853be737bb1799d5dd xsa471/xsa471-04.patch
5fce1dfbf084ccabbba9fcb7a8f758cffc1c8ca93a4f1d2a1c6ad49b4fe9e5da xsa471/xsa471-05.patch
$
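As an illustrative sequence only (branch name and directory layout are
assumptions; adjust to your checkout), the hashes can be re-checked against
the list above and, for example, the 4.20 series applied with:

$ sha256sum xsa471/xsa471-4.20-*.patch   # compare against the list above
$ cd xen && git checkout stable-4.20
$ git am ../xsa471/xsa471-4.20-*.patch   # or: patch -p1 with each patch in order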
NOTE CONCERNING OTHER SPECULATIVE ATTACKS
=========================================
The paper describes two further attacks:
* CVE-2024-36348: Rogue execution of SMSW
* CVE-2024-36349: Rogue read of MSR_TSC_AUX
which are both examples of Rogue System Register Read (sometimes called
Spectre-v3a). No fix is planned, because these registers do not
typically contain sensitive information.
-----BEGIN PGP SIGNATURE-----
iQFABAEBCAAqFiEEI+MiLBRfRHX6gGCng/4UyVfoK9kFAmhtJggMHHBncEB4ZW4u
b3JnAAoJEIP+FMlX6CvZq+0H/0DAl85Esb0oZTu2VugMbZjxbaROEghLa+CaJPeK
5IJEn3E+gHPil9P88nktO8P3SipbXHYzuZeCKzg3FFPZskv+x294zdLCgndPcB1Q
Qfx9wKX8IA+hrgfafUORCjQbAeq+ahxTCG6jwrwaSOSuuU1aAM3RZL+haDlhJ8cH
Ib5pdfxZnX5BkJc/Fb/1qrwfW1nHSrvtWJkza79hAyi6d1GnhcSPA9QLfbl4KSSP
DBNHaWyAzKWQc3yjvekO+1h0XKnvcpGRMIa3jQOgemceXcRO2Vrp7gSB6BnG+CNh
ZODnfZM+2zbbXDscdckujoD/0vywEPhEq4RUv2BaDYKna3I=
=lnmx
-----END PGP SIGNATURE-----
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Fri, 27 Jun 2025 17:19:19 +0100
Subject: x86/cpu-policy: Rearrange guest_common_*_feature_adjustments()
Turn the if()s into switch()es, as we're going to need AMD sections.
Move the RTM adjustments into the Intel section, where they ought to live.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index c3aaac861d15..47ee1ff47460 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -418,8 +418,9 @@ static void __init guest_common_default_leaves(struct cpu_policy *p)
static void __init guest_common_max_feature_adjustments(uint32_t *fs)
{
- if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL )
+ switch ( boot_cpu_data.x86_vendor )
{
+ case X86_VENDOR_INTEL:
/*
* MSR_ARCH_CAPS is just feature data, and we can offer it to guests
* unconditionally, although limit it to Intel systems as it is highly
@@ -464,6 +465,22 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
boot_cpu_data.x86_model == INTEL_FAM6_SKYLAKE_X &&
raw_cpu_policy.feat.clwb )
__set_bit(X86_FEATURE_CLWB, fs);
+
+ /*
+ * To mitigate Native-BHI, one option is to use a TSX Abort on capable
+ * systems. This is safe even if RTM has been disabled for other
+ * reasons via MSR_TSX_{CTRL,FORCE_ABORT}. However, a guest kernel
+ * doesn't get to know this type of information.
+ *
+ * Therefore the meaning of RTM_ALWAYS_ABORT has been adjusted, to
+ * instead mean "XBEGIN won't fault". This is enough for a guest
+ * kernel to make an informed choice WRT mitigating Native-BHI.
+ *
+ * If RTM-capable, we can run a VM which has seen RTM_ALWAYS_ABORT.
+ */
+ if ( test_bit(X86_FEATURE_RTM, fs) )
+ __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
+ break;
}
/*
@@ -475,27 +492,13 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
*/
__set_bit(X86_FEATURE_HTT, fs);
__set_bit(X86_FEATURE_CMP_LEGACY, fs);
-
- /*
- * To mitigate Native-BHI, one option is to use a TSX Abort on capable
- * systems. This is safe even if RTM has been disabled for other reasons
- * via MSR_TSX_{CTRL,FORCE_ABORT}. However, a guest kernel doesn't get to
- * know this type of information.
- *
- * Therefore the meaning of RTM_ALWAYS_ABORT has been adjusted, to instead
- * mean "XBEGIN won't fault". This is enough for a guest kernel to make
- * an informed choice WRT mitigating Native-BHI.
- *
- * If RTM-capable, we can run a VM which has seen RTM_ALWAYS_ABORT.
- */
- if ( test_bit(X86_FEATURE_RTM, fs) )
- __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
}
static void __init guest_common_default_feature_adjustments(uint32_t *fs)
{
- if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL )
+ switch ( boot_cpu_data.x86_vendor )
{
+ case X86_VENDOR_INTEL:
/*
* IvyBridge client parts suffer from leakage of RDRAND data due to SRBDS
* (XSA-320 / CVE-2020-0543), and won't be receiving microcode to
@@ -539,6 +542,23 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
boot_cpu_data.x86_model == INTEL_FAM6_SKYLAKE_X &&
raw_cpu_policy.feat.clwb )
__clear_bit(X86_FEATURE_CLWB, fs);
+
+ /*
+ * On certain hardware, speculative or errata workarounds can result
+ * in TSX being placed in "force-abort" mode, where it doesn't
+ * actually function as expected, but is technically compatible with
+ * the ISA.
+ *
+ * Do not advertise RTM to guests by default if it won't actually
+ * work. Instead, advertise RTM_ALWAYS_ABORT indicating that TSX
+ * Aborts are safe to use, e.g. for mitigating Native-BHI.
+ */
+ if ( rtm_disabled )
+ {
+ __clear_bit(X86_FEATURE_RTM, fs);
+ __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
+ }
+ break;
}
/*
@@ -550,21 +570,6 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
if ( !cpu_has_cmp_legacy )
__clear_bit(X86_FEATURE_CMP_LEGACY, fs);
-
- /*
- * On certain hardware, speculative or errata workarounds can result in
- * TSX being placed in "force-abort" mode, where it doesn't actually
- * function as expected, but is technically compatible with the ISA.
- *
- * Do not advertise RTM to guests by default if it won't actually work.
- * Instead, advertise RTM_ALWAYS_ABORT indicating that TSX Aborts are safe
- * to use, e.g. for mitigating Native-BHI.
- */
- if ( rtm_disabled )
- {
- __clear_bit(X86_FEATURE_RTM, fs);
- __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
- }
}
static void __init guest_common_feature_adjustments(uint32_t *fs)
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 10 Sep 2024 19:55:15 +0100
Subject: x86/cpu-policy: Infrastructure for CPUID leaf 0x80000021.ecx
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
diff --git a/tools/libs/light/libxl_cpuid.c b/tools/libs/light/libxl_cpuid.c
index 063fe86eb72f..f738e17b19e4 100644
--- a/tools/libs/light/libxl_cpuid.c
+++ b/tools/libs/light/libxl_cpuid.c
@@ -342,6 +342,7 @@ int libxl_cpuid_parse_config(libxl_cpuid_policy_list *policy, const char* str)
CPUID_ENTRY(0x00000007, 1, CPUID_REG_EDX),
MSR_ENTRY(0x10a, CPUID_REG_EAX),
MSR_ENTRY(0x10a, CPUID_REG_EDX),
+ CPUID_ENTRY(0x80000021, NA, CPUID_REG_ECX),
#undef MSR_ENTRY
#undef CPUID_ENTRY
};
diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
index 4c4593528dfe..8e36b8e69600 100644
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -37,6 +37,7 @@ static const struct {
{ "CPUID 0x00000007:1.edx", "7d1" },
{ "MSR_ARCH_CAPS.lo", "m10Al" },
{ "MSR_ARCH_CAPS.hi", "m10Ah" },
+ { "CPUID 0x80000021.ecx", "e21c" },
};
#define COL_ALIGN "24"
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 47ee1ff47460..9d1ff6268d79 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -330,7 +330,6 @@ static void recalculate_misc(struct cpu_policy *p)
p->extd.raw[0x1f] = EMPTY_LEAF; /* SEV */
p->extd.raw[0x20] = EMPTY_LEAF; /* Platform QoS */
p->extd.raw[0x21].b = 0;
- p->extd.raw[0x21].c = 0;
p->extd.raw[0x21].d = 0;
break;
}
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index b934ce7ca487..77364fd728db 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -480,7 +480,9 @@ static void generic_identify(struct cpuinfo_x86 *c)
if (c->extended_cpuid_level >= 0x80000008)
c->x86_capability[FEATURESET_e8b] = cpuid_ebx(0x80000008);
if (c->extended_cpuid_level >= 0x80000021)
- c->x86_capability[FEATURESET_e21a] = cpuid_eax(0x80000021);
+ cpuid(0x80000021,
+ &c->x86_capability[FEATURESET_e21a], &tmp,
+ &c->x86_capability[FEATURESET_e21c], &tmp);
/* Intel-defined flags: level 0x00000007 */
if (c->cpuid_level >= 7) {
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index 044230bfe854..480d5f58ce09 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -395,6 +395,8 @@ XEN_CPUFEATURE(MON_UMON_MITG, 16*32+30) /* MCU_OPT_CTRL.MON_UMON_MITG */
XEN_CPUFEATURE(PB_OPT_CTRL, 16*32+32) /* MSR_PB_OPT_CTRL.IBPB_ALT */
XEN_CPUFEATURE(ITS_NO, 16*32+62) /*!A No Indirect Target Selection */
+/* AMD-defined CPU features, CPUID level 0x80000021.ecx, word 18 */
+
#endif /* XEN_CPUFEATURE */
/* Clean up from a default include. Close the enum (for C). */
diff --git a/xen/include/xen/lib/x86/cpu-policy.h b/xen/include/xen/lib/x86/cpu-policy.h
index f08f30afeca3..dd204a825b07 100644
--- a/xen/include/xen/lib/x86/cpu-policy.h
+++ b/xen/include/xen/lib/x86/cpu-policy.h
@@ -22,6 +22,7 @@
#define FEATURESET_7d1 15 /* 0x00000007:1.edx */
#define FEATURESET_m10Al 16 /* 0x0000010a.eax */
#define FEATURESET_m10Ah 17 /* 0x0000010a.edx */
+#define FEATURESET_e21c 18 /* 0x80000021.ecx */
struct cpuid_leaf
{
@@ -328,7 +329,11 @@ struct cpu_policy
uint16_t ucode_size; /* Units of 16 bytes */
uint8_t rap_size; /* Units of 8 entries */
uint8_t :8;
- uint32_t /* c */:32, /* d */:32;
+ union {
+ uint32_t e21c;
+ struct { DECL_BITFIELD(e21c); };
+ };
+ uint32_t /* d */:32;
};
} extd;
diff --git a/xen/lib/x86/cpuid.c b/xen/lib/x86/cpuid.c
index eb7698dc7325..6298d051f2a6 100644
--- a/xen/lib/x86/cpuid.c
+++ b/xen/lib/x86/cpuid.c
@@ -81,6 +81,7 @@ void x86_cpu_policy_to_featureset(
fs[FEATURESET_7d1] = p->feat._7d1;
fs[FEATURESET_m10Al] = p->arch_caps.lo;
fs[FEATURESET_m10Ah] = p->arch_caps.hi;
+ fs[FEATURESET_e21c] = p->extd.e21c;
}
void x86_cpu_featureset_to_policy(
@@ -104,6 +105,7 @@ void x86_cpu_featureset_to_policy(
p->feat._7d1 = fs[FEATURESET_7d1];
p->arch_caps.lo = fs[FEATURESET_m10Al];
p->arch_caps.hi = fs[FEATURESET_m10Ah];
+ p->extd.e21c = fs[FEATURESET_e21c];
}
void x86_cpu_policy_recalc_synth(struct cpu_policy *p)
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Fri, 27 Sep 2024 11:28:39 +0100
Subject: x86/ucode: Digests for TSA microcode
AMD are releasing microcode for TSA, so extend the known-provenance list with
their hashes. These were produced before the remediation of the microcode
signature issues (the entrysign vulnerability), so can be OS-loaded on
out-of-date firmware.
Include an off-by-default check for the sorted-ness of patch_digests[]. It's
not worth running generally under SELF_TESTS, but is useful when editing the
digest list.
This is part of XSA-471 / CVE-2024-36350 / CVE-2024-36357.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
diff --git a/xen/arch/x86/cpu/microcode/amd-patch-digests.c b/xen/arch/x86/cpu/microcode/amd-patch-digests.c
index d32761226712..d2c4e0178a1e 100644
--- a/xen/arch/x86/cpu/microcode/amd-patch-digests.c
+++ b/xen/arch/x86/cpu/microcode/amd-patch-digests.c
@@ -80,6 +80,15 @@
0x0d, 0x5b, 0x65, 0x34, 0x69, 0xb2, 0x62, 0x21,
},
},
+{
+ .patch_id = 0x0a0011d7,
+ .digest = {
+ 0x35, 0x07, 0xcd, 0x40, 0x94, 0xbc, 0x81, 0x6b,
+ 0xfc, 0x61, 0x56, 0x1a, 0xe2, 0xdb, 0x96, 0x12,
+ 0x1c, 0x1c, 0x31, 0xb1, 0x02, 0x6f, 0xe5, 0xd2,
+ 0xfe, 0x1b, 0x04, 0x03, 0x2c, 0x8f, 0x4c, 0x36,
+ },
+},
{
.patch_id = 0x0a001238,
.digest = {
@@ -89,6 +98,15 @@
0xc0, 0xcd, 0x33, 0xf2, 0x8d, 0xf9, 0xef, 0x59,
},
},
+{
+ .patch_id = 0x0a00123b,
+ .digest = {
+ 0xef, 0xa1, 0x1e, 0x71, 0xf1, 0xc3, 0x2c, 0xe2,
+ 0xc3, 0xef, 0x69, 0x41, 0x7a, 0x54, 0xca, 0xc3,
+ 0x8f, 0x62, 0x84, 0xee, 0xc2, 0x39, 0xd9, 0x28,
+ 0x95, 0xa7, 0x12, 0x49, 0x1e, 0x30, 0x71, 0x72,
+ },
+},
{
.patch_id = 0x0a00820c,
.digest = {
@@ -98,6 +116,15 @@
0xe1, 0x3b, 0x8d, 0xb2, 0xf8, 0x22, 0x03, 0xe2,
},
},
+{
+ .patch_id = 0x0a00820d,
+ .digest = {
+ 0xf9, 0x2a, 0xc0, 0xf4, 0x9e, 0xa4, 0x87, 0xa4,
+ 0x7d, 0x87, 0x00, 0xfd, 0xab, 0xda, 0x19, 0xca,
+ 0x26, 0x51, 0x32, 0xc1, 0x57, 0x91, 0xdf, 0xc1,
+ 0x05, 0xeb, 0x01, 0x7c, 0x5a, 0x95, 0x21, 0xb7,
+ },
+},
{
.patch_id = 0x0a101148,
.digest = {
@@ -107,6 +134,15 @@
0xf1, 0x5e, 0xb0, 0xde, 0xb4, 0x98, 0xae, 0xc4,
},
},
+{
+ .patch_id = 0x0a10114c,
+ .digest = {
+ 0x9e, 0xb6, 0xa2, 0xd9, 0x87, 0x38, 0xc5, 0x64,
+ 0xd8, 0x88, 0xfa, 0x78, 0x98, 0xf9, 0x6f, 0x74,
+ 0x39, 0x90, 0x1b, 0xa5, 0xcf, 0x5e, 0xb4, 0x2a,
+ 0x02, 0xff, 0xd4, 0x8c, 0x71, 0x8b, 0xe2, 0xc0,
+ },
+},
{
.patch_id = 0x0a101248,
.digest = {
@@ -116,6 +152,15 @@
0x1b, 0x7d, 0x64, 0x9d, 0x4b, 0x53, 0x13, 0x75,
},
},
+{
+ .patch_id = 0x0a10124c,
+ .digest = {
+ 0x29, 0xea, 0xf1, 0x2c, 0xb2, 0xe4, 0xef, 0x90,
+ 0xa4, 0xcd, 0x1d, 0x86, 0x97, 0x17, 0x61, 0x46,
+ 0xfc, 0x22, 0xcb, 0x57, 0x75, 0x19, 0xc8, 0xcc,
+ 0x0c, 0xf5, 0xbc, 0xac, 0x81, 0x9d, 0x9a, 0xd2,
+ },
+},
{
.patch_id = 0x0a108108,
.digest = {
@@ -125,6 +170,15 @@
0x28, 0x1e, 0x9c, 0x59, 0x69, 0x99, 0x4d, 0x16,
},
},
+{
+ .patch_id = 0x0a108109,
+ .digest = {
+ 0x85, 0xb4, 0xbd, 0x7c, 0x49, 0xa7, 0xbd, 0xfa,
+ 0x49, 0x36, 0x80, 0x81, 0xc5, 0xb7, 0x39, 0x1b,
+ 0x9a, 0xaa, 0x50, 0xde, 0x9b, 0xe9, 0x32, 0x35,
+ 0x42, 0x7e, 0x51, 0x4f, 0x52, 0x2c, 0x28, 0x59,
+ },
+},
{
.patch_id = 0x0a20102d,
.digest = {
@@ -134,6 +188,15 @@
0x8c, 0xe9, 0x19, 0x3e, 0xcc, 0x3f, 0x7b, 0xb4,
},
},
+{
+ .patch_id = 0x0a20102e,
+ .digest = {
+ 0xbe, 0x1f, 0x32, 0x04, 0x0d, 0x3c, 0x9c, 0xdd,
+ 0xe1, 0xa4, 0xbf, 0x76, 0x3a, 0xec, 0xc2, 0xf6,
+ 0x11, 0x00, 0xa7, 0xaf, 0x0f, 0xe5, 0x02, 0xc5,
+ 0x54, 0x3a, 0x1f, 0x8c, 0x16, 0xb5, 0xff, 0xbe,
+ },
+},
{
.patch_id = 0x0a201210,
.digest = {
@@ -143,6 +206,15 @@
0xf7, 0x55, 0xf0, 0x13, 0xbb, 0x22, 0xf6, 0x41,
},
},
+{
+ .patch_id = 0x0a201211,
+ .digest = {
+ 0x69, 0xa1, 0x17, 0xec, 0xd0, 0xf6, 0x6c, 0x95,
+ 0xe2, 0x1e, 0xc5, 0x59, 0x1a, 0x52, 0x0a, 0x27,
+ 0xc4, 0xed, 0xd5, 0x59, 0x1f, 0xbf, 0x00, 0xff,
+ 0x08, 0x88, 0xb5, 0xe1, 0x12, 0xb6, 0xcc, 0x27,
+ },
+},
{
.patch_id = 0x0a404107,
.digest = {
@@ -152,6 +224,15 @@
0x13, 0xbc, 0xc5, 0x25, 0xe4, 0xc5, 0xc3, 0x99,
},
},
+{
+ .patch_id = 0x0a404108,
+ .digest = {
+ 0x69, 0x67, 0x43, 0x06, 0xf8, 0x0c, 0x62, 0xdc,
+ 0xa4, 0x21, 0x30, 0x4f, 0x0f, 0x21, 0x2c, 0xcb,
+ 0xcc, 0x37, 0xf1, 0x1c, 0xc3, 0xf8, 0x2f, 0x19,
+ 0xdf, 0x53, 0x53, 0x46, 0xb1, 0x15, 0xea, 0x00,
+ },
+},
{
.patch_id = 0x0a500011,
.digest = {
@@ -161,6 +242,15 @@
0x11, 0x5e, 0x96, 0x7e, 0x71, 0xe9, 0xfc, 0x74,
},
},
+{
+ .patch_id = 0x0a500012,
+ .digest = {
+ 0xeb, 0x74, 0x0d, 0x47, 0xa1, 0x8e, 0x09, 0xe4,
+ 0x93, 0x4c, 0xad, 0x03, 0x32, 0x4c, 0x38, 0x16,
+ 0x10, 0x39, 0xdd, 0x06, 0xaa, 0xce, 0xd6, 0x0f,
+ 0x62, 0x83, 0x9d, 0x8e, 0x64, 0x55, 0xbe, 0x63,
+ },
+},
{
.patch_id = 0x0a601209,
.digest = {
@@ -170,6 +260,15 @@
0xe8, 0x73, 0xe2, 0xd6, 0xdb, 0xd2, 0x77, 0x1d,
},
},
+{
+ .patch_id = 0x0a60120a,
+ .digest = {
+ 0x0c, 0x8b, 0x3d, 0xfd, 0x52, 0x52, 0x85, 0x7d,
+ 0x20, 0x3a, 0xe1, 0x7e, 0xa4, 0x21, 0x3b, 0x7b,
+ 0x17, 0x86, 0xae, 0xac, 0x13, 0xb8, 0x63, 0x9d,
+ 0x06, 0x01, 0xd0, 0xa0, 0x51, 0x9a, 0x91, 0x2c,
+ },
+},
{
.patch_id = 0x0a704107,
.digest = {
@@ -179,6 +278,15 @@
0x64, 0x39, 0x71, 0x8c, 0xce, 0xe7, 0x41, 0x39,
},
},
+{
+ .patch_id = 0x0a704108,
+ .digest = {
+ 0xd7, 0x55, 0x15, 0x2b, 0xfe, 0xc4, 0xbc, 0x93,
+ 0xec, 0x91, 0xa0, 0xae, 0x45, 0xb7, 0xc3, 0x98,
+ 0x4e, 0xff, 0x61, 0x77, 0x88, 0xc2, 0x70, 0x49,
+ 0xe0, 0x3a, 0x1d, 0x84, 0x38, 0x52, 0xbf, 0x5a,
+ },
+},
{
.patch_id = 0x0a705206,
.digest = {
@@ -188,6 +296,15 @@
0x03, 0x35, 0xe9, 0xbe, 0xfb, 0x06, 0xdf, 0xfc,
},
},
+{
+ .patch_id = 0x0a705208,
+ .digest = {
+ 0x30, 0x1d, 0x55, 0x24, 0xbc, 0x6b, 0x5a, 0x19,
+ 0x0c, 0x7d, 0x1d, 0x74, 0xaa, 0xd1, 0xeb, 0xd2,
+ 0x16, 0x62, 0xf7, 0x5b, 0xe1, 0x1f, 0x18, 0x11,
+ 0x5c, 0xf0, 0x94, 0x90, 0x26, 0xec, 0x69, 0xff,
+ },
+},
{
.patch_id = 0x0a708007,
.digest = {
@@ -197,6 +314,15 @@
0xdf, 0x92, 0x73, 0x84, 0x87, 0x3c, 0x73, 0x93,
},
},
+{
+ .patch_id = 0x0a708008,
+ .digest = {
+ 0x08, 0x6e, 0xf0, 0x22, 0x4b, 0x8e, 0xc4, 0x46,
+ 0x58, 0x34, 0xe6, 0x47, 0xa2, 0x28, 0xfd, 0xab,
+ 0x22, 0x3d, 0xdd, 0xd8, 0x52, 0x9e, 0x1d, 0x16,
+ 0xfa, 0x01, 0x68, 0x14, 0x79, 0x3e, 0xe8, 0x6b,
+ },
+},
{
.patch_id = 0x0a70c005,
.digest = {
@@ -206,6 +332,15 @@
0xee, 0x49, 0xac, 0xe1, 0x8b, 0x13, 0xc5, 0x13,
},
},
+{
+ .patch_id = 0x0a70c008,
+ .digest = {
+ 0x0f, 0xdb, 0x37, 0xa1, 0x10, 0xaf, 0xd4, 0x21,
+ 0x94, 0x0d, 0xa4, 0xa2, 0xe9, 0x86, 0x6c, 0x0e,
+ 0x85, 0x7c, 0x36, 0x30, 0xa3, 0x3a, 0x78, 0x66,
+ 0x18, 0x10, 0x60, 0x0d, 0x78, 0x3d, 0x44, 0xd0,
+ },
+},
{
.patch_id = 0x0aa00116,
.digest = {
@@ -224,3 +359,12 @@
0x68, 0x2f, 0x46, 0xee, 0xfe, 0xc6, 0x6d, 0xef,
},
},
+{
+ .patch_id = 0x0aa00216,
+ .digest = {
+ 0x79, 0xfb, 0x5b, 0x9f, 0xb6, 0xe6, 0xa8, 0xf5,
+ 0x4e, 0x7c, 0x4f, 0x8e, 0x1d, 0xad, 0xd0, 0x08,
+ 0xc2, 0x43, 0x7c, 0x8b, 0xe6, 0xdb, 0xd0, 0xd2,
+ 0xe8, 0x39, 0x26, 0xc1, 0xe5, 0x5a, 0x48, 0xf1,
+ },
+},
diff --git a/xen/arch/x86/cpu/microcode/amd.c b/xen/arch/x86/cpu/microcode/amd.c
index ebd9ecbeef0f..8b09231c6c66 100644
--- a/xen/arch/x86/cpu/microcode/amd.c
+++ b/xen/arch/x86/cpu/microcode/amd.c
@@ -528,3 +528,18 @@ void __init ucode_probe_amd(struct microcode_ops *ops)
*ops = amd_ucode_ops;
}
+
+#if 0 /* Manual CONFIG_SELF_TESTS */
+static void __init __constructor test_digests_sorted(void)
+{
+ for ( unsigned int i = 1; i < ARRAY_SIZE(patch_digests); ++i )
+ {
+ if ( patch_digests[i - 1].patch_id < patch_digests[i].patch_id )
+ continue;
+
+ panic("patch_digests[] not sorted: %08x >= %08x\n",
+ patch_digests[i - 1].patch_id,
+ patch_digests[i].patch_id);
+ }
+}
+#endif /* CONFIG_SELF_TESTS */
From: Roger Pau Monne <roger.pau@citrix.com>
Date: Thu, 17 Apr 2025 12:35:28 +0200
Subject: x86/intel: workaround several MONITOR/MWAIT errata
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
There are several errata on Intel regarding the usage of the MONITOR/MWAIT
instructions, all having in common that stores to the monitored region
might not wake up the CPU.
Fix them by forcing the sending of an IPI for the affected models.
The Ice Lake issue has been reproduced internally on XenServer hardware,
and the fix does seem to prevent it. The symptom was APs getting stuck in
the idle loop immediately after bring up, which in turn prevented the BSP
from making progress. This would happen before the watchdog was
initialized, and hence the whole system would get stuck.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 4aae4452efeee3d3bba092b875e37d1e7c8f6db9)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 557bc6ef8642..2d8b66c9100a 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -453,8 +453,14 @@ void cpuidle_wakeup_mwait(cpumask_t *mask)
cpumask_andnot(mask, mask, &target);
}
+/* Force sending of a wakeup IPI regardless of mwait usage. */
+bool __ro_after_init force_mwait_ipi_wakeup;
+
bool arch_skip_send_event_check(unsigned int cpu)
{
+ if ( force_mwait_ipi_wakeup )
+ return false;
+
/*
* This relies on softirq_pending() and mwait_wakeup() to access data
* on the same cache line.
diff --git a/xen/arch/x86/cpu/intel.c b/xen/arch/x86/cpu/intel.c
index 490f7ff6f1fe..c6ca42d07ad8 100644
--- a/xen/arch/x86/cpu/intel.c
+++ b/xen/arch/x86/cpu/intel.c
@@ -7,6 +7,7 @@
#include <asm/intel-family.h>
#include <asm/processor.h>
#include <asm/msr.h>
+#include <asm/mwait.h>
#include <asm/uaccess.h>
#include <asm/mpspec.h>
#include <asm/apic.h>
@@ -363,7 +364,6 @@ static void probe_c3_errata(const struct cpuinfo_x86 *c)
INTEL_FAM6_MODEL(0x25),
{ }
};
-#undef INTEL_FAM6_MODEL
/* Serialized by the AP bringup code. */
if ( max_cstate > 1 && (c->apicid & (c->x86_num_siblings - 1)) &&
@@ -375,6 +375,38 @@ static void probe_c3_errata(const struct cpuinfo_x86 *c)
}
}
+/*
+ * APL30: One use of the MONITOR/MWAIT instruction pair is to allow a logical
+ * processor to wait in a sleep state until a store to the armed address range
+ * occurs. Due to this erratum, stores to the armed address range may not
+ * trigger MWAIT to resume execution.
+ *
+ * ICX143: Under complex microarchitectural conditions, a monitor that is armed
+ * with the MWAIT instruction may not be triggered, leading to a processor
+ * hang.
+ *
+ * LNL030: Problem P-cores may not exit power state Core C6 on monitor hit.
+ *
+ * Force the sending of an IPI in those cases.
+ */
+static void __init probe_mwait_errata(void)
+{
+ static const struct x86_cpu_id __initconst models[] = {
+ INTEL_FAM6_MODEL(INTEL_FAM6_ATOM_GOLDMONT), /* APL30 */
+ INTEL_FAM6_MODEL(INTEL_FAM6_ICELAKE_X), /* ICX143 */
+ INTEL_FAM6_MODEL(INTEL_FAM6_LUNARLAKE_M), /* LNL030 */
+ { }
+ };
+#undef INTEL_FAM6_MODEL
+
+ if ( boot_cpu_has(X86_FEATURE_MONITOR) && x86_match_cpu(models) )
+ {
+ printk(XENLOG_WARNING
+ "Forcing IPI MWAIT wakeup due to CPU erratum\n");
+ force_mwait_ipi_wakeup = true;
+ }
+}
+
/*
* P4 Xeon errata 037 workaround.
* Hardware prefetcher may cause stale data to be loaded into the cache.
@@ -401,6 +433,8 @@ static void Intel_errata_workarounds(struct cpuinfo_x86 *c)
__set_bit(X86_FEATURE_CLFLUSH_MONITOR, c->x86_capability);
probe_c3_errata(c);
+ if (system_state < SYS_STATE_smp_boot)
+ probe_mwait_errata();
}
diff --git a/xen/arch/x86/include/asm/mwait.h b/xen/arch/x86/include/asm/mwait.h
index f377d9fdcad4..97bf361505f0 100644
--- a/xen/arch/x86/include/asm/mwait.h
+++ b/xen/arch/x86/include/asm/mwait.h
@@ -13,6 +13,9 @@
#define MWAIT_ECX_INTERRUPT_BREAK 0x1
+/* Force sending of a wakeup IPI regardless of mwait usage. */
+extern bool force_mwait_ipi_wakeup;
+
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx);
bool mwait_pc10_supported(void);
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 10 Sep 2024 20:59:37 +0100
Subject: x86/cpufeature: Reposition cpu_has_{lfence_dispatch,nscb}
LFENCE_DISPATCH used to be a synthetic feature, but was given a real CPUID bit
by AMD. The define wasn't moved when this was changed.
NSCB has always been a real CPUID bit, and was misplaced when introduced in
the synthetic block alongside LFENCE_DISPATCH.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 6a039b050071eba644ab414d76ac5d5fc9e067a5)
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index d9aedfc25ab0..020414e98c4d 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -148,6 +148,10 @@
#define cpu_has_avx_vnni boot_cpu_has(X86_FEATURE_AVX_VNNI)
#define cpu_has_avx512_bf16 boot_cpu_has(X86_FEATURE_AVX512_BF16)
+/* CPUID level 0x80000021.eax */
+#define cpu_has_lfence_dispatch boot_cpu_has(X86_FEATURE_LFENCE_DISPATCH)
+#define cpu_has_nscb boot_cpu_has(X86_FEATURE_NSCB)
+
/* MSR_ARCH_CAPS */
#define cpu_has_rdcl_no boot_cpu_has(X86_FEATURE_RDCL_NO)
#define cpu_has_eibrs boot_cpu_has(X86_FEATURE_EIBRS)
@@ -170,8 +174,6 @@
#define cpu_has_arch_perfmon boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
#define cpu_has_cpuid_faulting boot_cpu_has(X86_FEATURE_CPUID_FAULTING)
#define cpu_has_aperfmperf boot_cpu_has(X86_FEATURE_APERFMPERF)
-#define cpu_has_lfence_dispatch boot_cpu_has(X86_FEATURE_LFENCE_DISPATCH)
-#define cpu_has_nscb boot_cpu_has(X86_FEATURE_NSCB)
#define cpu_has_xen_lbr boot_cpu_has(X86_FEATURE_XEN_LBR)
#define cpu_has_xen_shstk boot_cpu_has(X86_FEATURE_XEN_SHSTK)
#define cpu_has_xen_ibt boot_cpu_has(X86_FEATURE_XEN_IBT)
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Apr 2025 14:59:01 +0100
Subject: x86/idle: Move monitor()/mwait() wrappers into cpu-idle.c
They're not used by any other translation unit, so shouldn't live in
asm/processor.h, which is included almost everywhere.
Our new toolchain baseline knows the MONITOR/MWAIT instructions, so use them
directly rather than using raw hex.
Change the hint/extension parameters from long to int. They're specified to
remain 32-bit operands even in 64-bit mode.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 61e10fc28ccddff7c72c14acec56dc7ef2b155d1)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 2d8b66c9100a..773eaecc2bbf 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -59,6 +59,19 @@
/*#define DEBUG_PM_CX*/
+static always_inline void monitor(
+ const void *addr, unsigned int ecx, unsigned int edx)
+{
+ asm volatile ( "monitor"
+ :: "a" (addr), "c" (ecx), "d" (edx) );
+}
+
+static always_inline void mwait(unsigned int eax, unsigned int ecx)
+{
+ asm volatile ( "mwait"
+ :: "a" (eax), "c" (ecx) );
+}
+
#define GET_HW_RES_IN_NS(msr, val) \
do { rdmsrl(msr, val); val = tsc_ticks2ns(val); } while( 0 )
#define GET_MC6_RES(val) GET_HW_RES_IN_NS(0x664, val)
@@ -482,7 +495,7 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
mb();
}
- __monitor(monitor_addr, 0, 0);
+ monitor(monitor_addr, 0, 0);
smp_mb();
/*
@@ -496,7 +509,7 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
cpumask_set_cpu(cpu, &cpuidle_mwait_flags);
spec_ctrl_enter_idle(info);
- __mwait(eax, ecx);
+ mwait(eax, ecx);
spec_ctrl_exit_idle(info);
cpumask_clear_cpu(cpu, &cpuidle_mwait_flags);
@@ -927,9 +940,9 @@ void cf_check acpi_dead_idle(void)
*/
mb();
clflush(mwait_ptr);
- __monitor(mwait_ptr, 0, 0);
+ monitor(mwait_ptr, 0, 0);
mb();
- __mwait(cx->address, 0);
+ mwait(cx->address, 0);
}
}
else if ( (current_cpu_data.x86_vendor &
diff --git a/xen/arch/x86/include/asm/processor.h b/xen/arch/x86/include/asm/processor.h
index 8ff96388e8b3..07328d44bf4e 100644
--- a/xen/arch/x86/include/asm/processor.h
+++ b/xen/arch/x86/include/asm/processor.h
@@ -405,23 +405,6 @@ static inline bool_t read_pkru_wd(uint32_t pkru, unsigned int pkey)
return (pkru >> (pkey * PKRU_ATTRS + PKRU_WRITE)) & 1;
}
-static always_inline void __monitor(const void *eax, unsigned long ecx,
- unsigned long edx)
-{
- /* "monitor %eax,%ecx,%edx;" */
- asm volatile (
- ".byte 0x0f,0x01,0xc8;"
- : : "a" (eax), "c" (ecx), "d"(edx) );
-}
-
-static always_inline void __mwait(unsigned long eax, unsigned long ecx)
-{
- /* "mwait %eax,%ecx;" */
- asm volatile (
- ".byte 0x0f,0x01,0xc9;"
- : : "a" (eax), "c" (ecx) );
-}
-
#define IOBMP_BYTES 8192
#define IOBMP_INVALID_OFFSET 0x8000
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Apr 2025 15:55:29 +0100
Subject: x86/idle: Remove MFENCEs for CLFLUSH_MONITOR
Commit 48d32458bcd4 ("x86, idle: add barriers to CLFLUSH workaround") was
inherited from Linux and added MFENCEs around the AAI65 errata fix.
The SDM now states:
Executions of the CLFLUSH instruction are ordered with respect to each
other and with respect to writes, locked read-modify-write instructions,
and fence instructions[1].
with footnote 1 reading:
Earlier versions of this manual specified that executions of the CLFLUSH
instruction were ordered only by the MFENCE instruction. All processors
implementing the CLFLUSH instruction also order it relative to the other
operations enumerated above.
I.e. the MFENCEs came about because of an incorrect statement in the SDM.
The Spec Update (no longer available on Intel's website) simply says "issue a
CLFLUSH", with no mention of MFENCEs.
As this erratum is specific to Intel, it's fine to remove the MFENCEs; AMD
CPUs of a similar vintage do sport otherwise-unordered CLFLUSHs.
Move the feature bit into the BUG range (rather than FEATURE), and move the
workaround into monitor() itself.
The erratum check itself must use setup_force_cpu_cap(). It needs activating
if any CPU needs it, not if all of them need it.
Fixes: 48d32458bcd4 ("x86, idle: add barriers to CLFLUSH workaround")
Fixes: 96d1b237ae9b ("x86/Intel: work around Xeon 7400 series erratum AAI65")
Link: https://web.archive.org/web/20090219054841/http://download.intel.com/design/xeon/specupdt/32033601.pdf
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f77ef3443542a2c2bbd59ee66178287d4fa5b43f)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 773eaecc2bbf..110e467d6375 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -62,6 +62,9 @@
static always_inline void monitor(
const void *addr, unsigned int ecx, unsigned int edx)
{
+ alternative_input("", "clflush (%[addr])", X86_BUG_CLFLUSH_MONITOR,
+ [addr] "a" (addr));
+
asm volatile ( "monitor"
:: "a" (addr), "c" (ecx), "d" (edx) );
}
@@ -488,13 +491,6 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
s_time_t expires = per_cpu(timer_deadline, cpu);
const void *monitor_addr = &mwait_wakeup(cpu);
- if ( boot_cpu_has(X86_FEATURE_CLFLUSH_MONITOR) )
- {
- mb();
- clflush(monitor_addr);
- mb();
- }
-
monitor(monitor_addr, 0, 0);
smp_mb();
@@ -929,19 +925,7 @@ void cf_check acpi_dead_idle(void)
while ( 1 )
{
- /*
- * 1. The CLFLUSH is a workaround for erratum AAI65 for
- * the Xeon 7400 series.
- * 2. The WBINVD is insufficient due to the spurious-wakeup
- * case where we return around the loop.
- * 3. Unlike wbinvd, clflush is a light weight but not serializing
- * instruction, hence memory fence is necessary to make sure all
- * load/store visible before flush cache line.
- */
- mb();
- clflush(mwait_ptr);
monitor(mwait_ptr, 0, 0);
- mb();
mwait(cx->address, 0);
}
}
diff --git a/xen/arch/x86/cpu/intel.c b/xen/arch/x86/cpu/intel.c
index c6ca42d07ad8..3c96bafe395f 100644
--- a/xen/arch/x86/cpu/intel.c
+++ b/xen/arch/x86/cpu/intel.c
@@ -413,6 +413,7 @@ static void __init probe_mwait_errata(void)
*
* Xeon 7400 erratum AAI65 (and further newer Xeons)
* MONITOR/MWAIT may have excessive false wakeups
+ * https://web.archive.org/web/20090219054841/http://download.intel.com/design/xeon/specupdt/32033601.pdf
*/
static void Intel_errata_workarounds(struct cpuinfo_x86 *c)
{
@@ -430,7 +431,7 @@ static void Intel_errata_workarounds(struct cpuinfo_x86 *c)
if (c->x86 == 6 && cpu_has_clflush &&
(c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47))
- __set_bit(X86_FEATURE_CLFLUSH_MONITOR, c->x86_capability);
+ setup_force_cpu_cap(X86_BUG_CLFLUSH_MONITOR);
probe_c3_errata(c);
if (system_state < SYS_STATE_smp_boot)
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index 9e3ed21c026d..84c93292c80c 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -19,7 +19,7 @@ XEN_CPUFEATURE(ARCH_PERFMON, X86_SYNTH( 3)) /* Intel Architectural PerfMon
XEN_CPUFEATURE(TSC_RELIABLE, X86_SYNTH( 4)) /* TSC is known to be reliable */
XEN_CPUFEATURE(XTOPOLOGY, X86_SYNTH( 5)) /* cpu topology enum extensions */
XEN_CPUFEATURE(CPUID_FAULTING, X86_SYNTH( 6)) /* cpuid faulting */
-XEN_CPUFEATURE(CLFLUSH_MONITOR, X86_SYNTH( 7)) /* clflush reqd with monitor */
+/* Bit 7 unused */
XEN_CPUFEATURE(APERFMPERF, X86_SYNTH( 8)) /* APERFMPERF */
XEN_CPUFEATURE(MFENCE_RDTSC, X86_SYNTH( 9)) /* MFENCE synchronizes RDTSC */
XEN_CPUFEATURE(XEN_SMEP, X86_SYNTH(10)) /* SMEP gets used by Xen itself */
@@ -52,6 +52,7 @@ XEN_CPUFEATURE(USE_VMCALL, X86_SYNTH(30)) /* Use VMCALL instead of VMMCAL
#define X86_BUG_NULL_SEG X86_BUG( 1) /* NULL-ing a selector preserves the base and limit. */
#define X86_BUG_CLFLUSH_MFENCE X86_BUG( 2) /* MFENCE needed to serialise CLFLUSH */
#define X86_BUG_IBPB_NO_RET X86_BUG( 3) /* IBPB doesn't flush the RSB/RAS */
+#define X86_BUG_CLFLUSH_MONITOR X86_BUG( 4) /* MONITOR requires CLFLUSH */
#define X86_SPEC_NO_LFENCE_ENTRY_PV X86_BUG(16) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_PV. */
#define X86_SPEC_NO_LFENCE_ENTRY_INTR X86_BUG(17) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_INTR. */
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 24 Jun 2025 15:20:52 +0100
Subject: Revert part of "x86/mwait-idle: disable IBRS during long idle"
Most of the patch (handling of CPUIDLE_FLAG_IBRS) is fine, but the
adjustments to mwait_idle() are not; spec_ctrl_enter_idle() does more than
just alter MSR_SPEC_CTRL.IBRS.
The only reason this doesn't need an XSA is because the unconditional
spec_ctrl_{enter,exit}_idle() in mwait_idle_with_hints() were left unaltered,
and thus the MWAIT remained properly protected.
There (would have been) two problems. In the ibrs_disable (== deep C) case:
* On entry, VERW and RSB-stuffing are architecturally skipped.
* On exit, there's a branch crossing the WRMSR which reinstates the
speculative safety for indirect branches.
All this change did was double up the expensive operations in the deep C case,
and fail to optimise the intended case.
I have an idea of how to plumb this more nicely, but it requires larger
changes to legacy IBRS handling to not make spec_ctrl_enter_idle() vulnerable
in other ways. In the short term, simply take out the perf hit.
Fixes: 08acdf9a2615 ("x86/mwait-idle: disable IBRS during long idle")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 07d7163334a7507d329958b19d976be769580999)
diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index ffdc6fb2fc0e..12c36257b7be 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -902,7 +902,6 @@ static const struct cpuidle_state snr_cstates[] = {
static void cf_check mwait_idle(void)
{
unsigned int cpu = smp_processor_id();
- struct cpu_info *info = get_cpu_info();
struct acpi_processor_power *power = processor_powers[cpu];
struct acpi_processor_cx *cx = NULL;
unsigned int next_state;
@@ -929,6 +928,8 @@ static void cf_check mwait_idle(void)
pm_idle_save();
else
{
+ struct cpu_info *info = get_cpu_info();
+
spec_ctrl_enter_idle(info);
safe_halt();
spec_ctrl_exit_idle(info);
@@ -955,11 +956,6 @@ static void cf_check mwait_idle(void)
if ((cx->type >= 3) && errata_c6_workaround())
cx = power->safe_state;
- if (cx->ibrs_disable) {
- ASSERT(!cx->irq_enable_early);
- spec_ctrl_enter_idle(info);
- }
-
#if 0 /* XXX Can we/do we need to do something similar on Xen? */
/*
* leave_mm() to avoid costly and often unnecessary wakeups
@@ -991,10 +987,6 @@ static void cf_check mwait_idle(void)
/* Now back in C0. */
update_idle_stats(power, cx, before, after);
-
- if (cx->ibrs_disable)
- spec_ctrl_exit_idle(info);
-
local_irq_enable();
TRACE_6D(TRC_PM_IDLE_EXIT, cx->type, after,
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Fri, 27 Jun 2025 14:46:01 +0100
Subject: x86/cpu-policy: Simplify logic in
guest_common_default_feature_adjustments()
For features which are unconditionally set in the max policies, making the
default policy match the host can be done with a conditional clear.
This is simpler than the unconditional clear, conditional set currently
performed.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 30f8fed68f3c2e63594ff9202b3d05b971781e36)
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index c813df35cbb0..42bd039e7c2e 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -511,17 +511,14 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
* reasons, so reset the default policy back to the host values in
* case we're unaffected.
*/
- __clear_bit(X86_FEATURE_MD_CLEAR, fs);
- if ( cpu_has_md_clear )
- __set_bit(X86_FEATURE_MD_CLEAR, fs);
+ if ( !cpu_has_md_clear )
+ __clear_bit(X86_FEATURE_MD_CLEAR, fs);
- __clear_bit(X86_FEATURE_FB_CLEAR, fs);
- if ( cpu_has_fb_clear )
- __set_bit(X86_FEATURE_FB_CLEAR, fs);
+ if ( !cpu_has_fb_clear )
+ __clear_bit(X86_FEATURE_FB_CLEAR, fs);
- __clear_bit(X86_FEATURE_RFDS_CLEAR, fs);
- if ( cpu_has_rfds_clear )
- __set_bit(X86_FEATURE_RFDS_CLEAR, fs);
+ if ( !cpu_has_rfds_clear )
+ __clear_bit(X86_FEATURE_RFDS_CLEAR, fs);
/*
* The Gather Data Sampling microcode mitigation (August 2023) has an
@@ -541,13 +538,11 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
* Topology information is at the toolstack's discretion so these are
* unconditionally set in max, but pick a default which matches the host.
*/
- __clear_bit(X86_FEATURE_HTT, fs);
- if ( cpu_has_htt )
- __set_bit(X86_FEATURE_HTT, fs);
+ if ( !cpu_has_htt )
+ __clear_bit(X86_FEATURE_HTT, fs);
- __clear_bit(X86_FEATURE_CMP_LEGACY, fs);
- if ( cpu_has_cmp_legacy )
- __set_bit(X86_FEATURE_CMP_LEGACY, fs);
+ if ( !cpu_has_cmp_legacy )
+ __clear_bit(X86_FEATURE_CMP_LEGACY, fs);
/*
* On certain hardware, speculative or errata workarounds can result in
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 11:33:41 +0100
Subject: x86/cpu-policy: Fix handling of leaf 0x80000021
When support was originally introduced, ebx, ecx and edx were reserved and
should have been zeroed in recalculate_misc() to avoid leaking into guests.
Since then, fields have been added into ebx. Guests can't load microcode, so
shouldn't see ucode_size, and while in principle we do want to support larger
RAP sizes in guests, virtualising this for guests depends on AMD producing any
official documentation for ERAPS, which is long overdue and with no ETA.
This patch will cause a difference in guests on Zen5 CPUs, but as the main
ERAPS feature is hidden, guests should be ignoring the rap_size field too.
Fixes: e9b4fe263649 ("x86/cpuid: support LFENCE always serialising CPUID bit")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 10dc35c516f7b9224590a7a4e2722bbfd70fa87a)
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 42bd039e7c2e..8f006fe08acb 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -340,6 +340,9 @@ static void recalculate_misc(struct cpu_policy *p)
p->extd.raw[0x1e] = EMPTY_LEAF; /* TopoExt APIC ID/Core/Node */
p->extd.raw[0x1f] = EMPTY_LEAF; /* SEV */
p->extd.raw[0x20] = EMPTY_LEAF; /* Platform QoS */
+ p->extd.raw[0x21].b = 0;
+ p->extd.raw[0x21].c = 0;
+ p->extd.raw[0x21].d = 0;
break;
}
}
diff --git a/xen/include/xen/lib/x86/cpu-policy.h b/xen/include/xen/lib/x86/cpu-policy.h
index 6d5e9edd269b..ba29bfe9b414 100644
--- a/xen/include/xen/lib/x86/cpu-policy.h
+++ b/xen/include/xen/lib/x86/cpu-policy.h
@@ -324,7 +324,10 @@ struct cpu_policy
uint32_t e21a;
struct { DECL_BITFIELD(e21a); };
};
- uint32_t /* b */:32, /* c */:32, /* d */:32;
+ uint16_t ucode_size; /* Units of 16 bytes */
+ uint8_t rap_size; /* Units of 8 entries */
+ uint8_t :8;
+ uint32_t /* c */:32, /* d */:32;
};
} extd;
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 15:51:53 +0100
Subject: x86/idle: Remove broken MWAIT implementation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
cpuidle_wakeup_mwait() is a TOCTOU race. The cpumask_and() sampling
cpuidle_mwait_flags can take an arbitrary period of time, and there's no
guarantee that the target CPUs are still in MWAIT when writing into
mwait_wakeup(cpu).
The consequence of the race is that we'll fail to IPI certain targets. Also,
there's no guarantee that mwait_idle_with_hints() will raise a TIMER_SOFTIRQ
on its way out.
The fundamental bug is that the "in_mwait" variable needs to be in the
monitored line, and not in a separate cpuidle_mwait_flags variable, in order
to do this in a race-free way.
Arranging to fix this while keeping the old implementation is prohibitive, so
strip the current one out in order to implement the new one cleanly. In the
interim, this causes IPIs to be used unconditionally, which is safe albeit
suboptimal.
Fixes: 3d521e933e1b ("cpuidle: mwait on softirq_pending & remove wakeup ipis")
Fixes: 1adb34ea846d ("CPUIDLE: re-implement mwait wakeup process")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 3faf0866a33070b926ab78e6298290403f85e76c)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 110e467d6375..7e3b3d7543bf 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -448,27 +448,6 @@ static int __init cf_check cpu_idle_key_init(void)
}
__initcall(cpu_idle_key_init);
-/*
- * The bit is set iff cpu use monitor/mwait to enter C state
- * with this flag set, CPU can be waken up from C state
- * by writing to specific memory address, instead of sending an IPI.
- */
-static cpumask_t cpuidle_mwait_flags;
-
-void cpuidle_wakeup_mwait(cpumask_t *mask)
-{
- cpumask_t target;
- unsigned int cpu;
-
- cpumask_and(&target, mask, &cpuidle_mwait_flags);
-
- /* CPU is MWAITing on the cpuidle_mwait_wakeup flag. */
- for_each_cpu(cpu, &target)
- mwait_wakeup(cpu) = 0;
-
- cpumask_andnot(mask, mask, &target);
-}
-
/* Force sending of a wakeup IPI regardless of mwait usage. */
bool __ro_after_init force_mwait_ipi_wakeup;
@@ -477,42 +456,25 @@ bool arch_skip_send_event_check(unsigned int cpu)
if ( force_mwait_ipi_wakeup )
return false;
- /*
- * This relies on softirq_pending() and mwait_wakeup() to access data
- * on the same cache line.
- */
- smp_mb();
- return !!cpumask_test_cpu(cpu, &cpuidle_mwait_flags);
+ return false;
}
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
- s_time_t expires = per_cpu(timer_deadline, cpu);
- const void *monitor_addr = &mwait_wakeup(cpu);
+ const unsigned int *this_softirq_pending = &softirq_pending(cpu);
- monitor(monitor_addr, 0, 0);
+ monitor(this_softirq_pending, 0, 0);
smp_mb();
- /*
- * Timer deadline passing is the event on which we will be woken via
- * cpuidle_mwait_wakeup. So check it now that the location is armed.
- */
- if ( (expires > NOW() || expires == 0) && !softirq_pending(cpu) )
+ if ( !*this_softirq_pending )
{
struct cpu_info *info = get_cpu_info();
- cpumask_set_cpu(cpu, &cpuidle_mwait_flags);
-
spec_ctrl_enter_idle(info);
mwait(eax, ecx);
spec_ctrl_exit_idle(info);
-
- cpumask_clear_cpu(cpu, &cpuidle_mwait_flags);
}
-
- if ( expires <= NOW() && expires > 0 )
- raise_softirq(TIMER_SOFTIRQ);
}
static void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
@@ -913,7 +875,7 @@ void cf_check acpi_dead_idle(void)
if ( cx->entry_method == ACPI_CSTATE_EM_FFH )
{
- void *mwait_ptr = &mwait_wakeup(smp_processor_id());
+ void *mwait_ptr = &softirq_pending(smp_processor_id());
/*
* Cache must be flushed as the last operation before sleeping.
diff --git a/xen/arch/x86/hpet.c b/xen/arch/x86/hpet.c
index 50d788cb6e72..30ff3de029a4 100644
--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -187,8 +187,6 @@ static void evt_do_broadcast(cpumask_t *mask)
if ( __cpumask_test_and_clear_cpu(cpu, mask) )
raise_softirq(TIMER_SOFTIRQ);
- cpuidle_wakeup_mwait(mask);
-
if ( !cpumask_empty(mask) )
cpumask_raise_softirq(mask, TIMER_SOFTIRQ);
}
diff --git a/xen/arch/x86/include/asm/hardirq.h b/xen/arch/x86/include/asm/hardirq.h
index 276e3419d778..f3e93cc9b507 100644
--- a/xen/arch/x86/include/asm/hardirq.h
+++ b/xen/arch/x86/include/asm/hardirq.h
@@ -5,11 +5,10 @@
#include <xen/types.h>
typedef struct {
- unsigned int __softirq_pending;
- unsigned int __local_irq_count;
- unsigned int nmi_count;
- unsigned int mce_count;
- bool_t __mwait_wakeup;
+ unsigned int __softirq_pending;
+ unsigned int __local_irq_count;
+ unsigned int nmi_count;
+ unsigned int mce_count;
} __cacheline_aligned irq_cpustat_t;
#include <xen/irq_cpustat.h> /* Standard mappings for irq_cpustat_t above */
diff --git a/xen/include/xen/cpuidle.h b/xen/include/xen/cpuidle.h
index 521a8deb04c2..ddd37fe27a2e 100644
--- a/xen/include/xen/cpuidle.h
+++ b/xen/include/xen/cpuidle.h
@@ -92,8 +92,6 @@ extern struct cpuidle_governor *cpuidle_current_governor;
bool cpuidle_using_deep_cstate(void);
void cpuidle_disable_deep_cstate(void);
-extern void cpuidle_wakeup_mwait(cpumask_t *mask);
-
#define CPUIDLE_DRIVER_STATE_START 1
extern void menu_get_trace_data(u32 *expected, u32 *pred);
diff --git a/xen/include/xen/irq_cpustat.h b/xen/include/xen/irq_cpustat.h
index b9629f25c266..5f039b4b9a76 100644
--- a/xen/include/xen/irq_cpustat.h
+++ b/xen/include/xen/irq_cpustat.h
@@ -24,6 +24,5 @@ extern irq_cpustat_t irq_stat[];
/* arch independent irq_stat fields */
#define softirq_pending(cpu) __IRQ_STAT((cpu), __softirq_pending)
#define local_irq_count(cpu) __IRQ_STAT((cpu), __local_irq_count)
-#define mwait_wakeup(cpu) __IRQ_STAT((cpu), __mwait_wakeup)
#endif /* __irq_cpustat_h */
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 18:13:27 +0100
Subject: x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
With the recent simplifications, it becomes obvious that smp_mb() isn't the
right barrier. Strictly speaking, MONITOR is ordered as a load, but smp_rmb()
isn't correct either, as this only pertains to local ordering. All we need is
a compiler barrier().
Merge the barrier() into the monitor() itself, along with an explanation.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit e7710dd843ba9d204f6ee2973d6120c1984958a6)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 7e3b3d7543bf..33d1fbb02855 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -65,8 +65,12 @@ static always_inline void monitor(
alternative_input("", "clflush (%[addr])", X86_BUG_CLFLUSH_MONITOR,
[addr] "a" (addr));
+ /*
+ * The memory clobber is a compiler barrier. Subseqeunt reads from the
+ * monitored cacheline must not be reordered over MONITOR.
+ */
asm volatile ( "monitor"
- :: "a" (addr), "c" (ecx), "d" (edx) );
+ :: "a" (addr), "c" (ecx), "d" (edx) : "memory" );
}
static always_inline void mwait(unsigned int eax, unsigned int ecx)
@@ -465,7 +469,6 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
const unsigned int *this_softirq_pending = &softirq_pending(cpu);
monitor(this_softirq_pending, 0, 0);
- smp_mb();
if ( !*this_softirq_pending )
{
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 21:40:51 +0100
Subject: x86/idle: Convert force_mwait_ipi_wakeup to X86_BUG_MONITOR
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
We're going to want alternative-patch based on it.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit b0ca0f93f47c43f8984981137af07ca3d161e3ec)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 33d1fbb02855..b085f36df219 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -452,14 +452,8 @@ static int __init cf_check cpu_idle_key_init(void)
}
__initcall(cpu_idle_key_init);
-/* Force sending of a wakeup IPI regardless of mwait usage. */
-bool __ro_after_init force_mwait_ipi_wakeup;
-
bool arch_skip_send_event_check(unsigned int cpu)
{
- if ( force_mwait_ipi_wakeup )
- return false;
-
return false;
}
diff --git a/xen/arch/x86/cpu/intel.c b/xen/arch/x86/cpu/intel.c
index 3c96bafe395f..af4a52ec1ba5 100644
--- a/xen/arch/x86/cpu/intel.c
+++ b/xen/arch/x86/cpu/intel.c
@@ -403,7 +403,7 @@ static void __init probe_mwait_errata(void)
{
printk(XENLOG_WARNING
"Forcing IPI MWAIT wakeup due to CPU erratum\n");
- force_mwait_ipi_wakeup = true;
+ setup_force_cpu_cap(X86_BUG_MONITOR);
}
}
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index 84c93292c80c..56231b00f15d 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -53,6 +53,7 @@ XEN_CPUFEATURE(USE_VMCALL, X86_SYNTH(30)) /* Use VMCALL instead of VMMCAL
#define X86_BUG_CLFLUSH_MFENCE X86_BUG( 2) /* MFENCE needed to serialise CLFLUSH */
#define X86_BUG_IBPB_NO_RET X86_BUG( 3) /* IBPB doesn't flush the RSB/RAS */
#define X86_BUG_CLFLUSH_MONITOR X86_BUG( 4) /* MONITOR requires CLFLUSH */
+#define X86_BUG_MONITOR X86_BUG( 5) /* MONITOR doesn't always notice writes (force IPIs) */
#define X86_SPEC_NO_LFENCE_ENTRY_PV X86_BUG(16) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_PV. */
#define X86_SPEC_NO_LFENCE_ENTRY_INTR X86_BUG(17) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_INTR. */
diff --git a/xen/arch/x86/include/asm/mwait.h b/xen/arch/x86/include/asm/mwait.h
index 97bf361505f0..f377d9fdcad4 100644
--- a/xen/arch/x86/include/asm/mwait.h
+++ b/xen/arch/x86/include/asm/mwait.h
@@ -13,9 +13,6 @@
#define MWAIT_ECX_INTERRUPT_BREAK 0x1
-/* Force sending of a wakeup IPI regardless of mwait usage. */
-extern bool force_mwait_ipi_wakeup;
-
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx);
bool mwait_pc10_supported(void);
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 21:04:17 +0100
Subject: xen/softirq: Rework arch_skip_send_event_check() into
arch_set_softirq()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
x86 is the only architecture wanting an optimisation here, but the
test_and_set_bit() is a store into the monitored line (i.e. will wake up the
target) and, prior to the removal of the broken IPI-elision algorithm, was
racy, causing unnecessary IPIs to be sent.
To do this in a race-free way, the store to the monitored line needs to also
sample the status of the target in one atomic action. Implement a new arch
helper with different semantics: to make the softirq pending and decide about
IPIs together. For now, implement the default helper. It will be overridden
by x86 in a subsequent change.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit b473e5e212e445d3c193c1c83b52b129af571b19)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index b085f36df219..7c7676e9ce91 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -452,11 +452,6 @@ static int __init cf_check cpu_idle_key_init(void)
}
__initcall(cpu_idle_key_init);
-bool arch_skip_send_event_check(unsigned int cpu)
-{
- return false;
-}
-
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
diff --git a/xen/arch/x86/include/asm/softirq.h b/xen/arch/x86/include/asm/softirq.h
index 415ee866c79d..e4b194f069fb 100644
--- a/xen/arch/x86/include/asm/softirq.h
+++ b/xen/arch/x86/include/asm/softirq.h
@@ -9,6 +9,4 @@
#define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
#define NR_ARCH_SOFTIRQS 5
-bool arch_skip_send_event_check(unsigned int cpu);
-
#endif /* __ASM_SOFTIRQ_H__ */
diff --git a/xen/common/softirq.c b/xen/common/softirq.c
index 063e93cbe33b..89685b381d85 100644
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -94,9 +94,7 @@ void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
raise_mask = &per_cpu(batch_mask, this_cpu);
for_each_cpu(cpu, mask)
- if ( !test_and_set_bit(nr, &softirq_pending(cpu)) &&
- cpu != this_cpu &&
- !arch_skip_send_event_check(cpu) )
+ if ( !arch_set_softirq(nr, cpu) && cpu != this_cpu )
__cpumask_set_cpu(cpu, raise_mask);
if ( raise_mask == &send_mask )
@@ -107,9 +105,7 @@ void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
{
unsigned int this_cpu = smp_processor_id();
- if ( test_and_set_bit(nr, &softirq_pending(cpu))
- || (cpu == this_cpu)
- || arch_skip_send_event_check(cpu) )
+ if ( arch_set_softirq(nr, cpu) || cpu == this_cpu )
return;
if ( !per_cpu(batching, this_cpu) || in_irq() )
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index 1f6c4783da87..5b4c03bfe37c 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -21,6 +21,22 @@ enum {
#define NR_SOFTIRQS (NR_COMMON_SOFTIRQS + NR_ARCH_SOFTIRQS)
+/*
+ * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
+ * skipped, false if the IPI cannot be skipped.
+ */
+#ifndef arch_set_softirq
+static always_inline bool arch_set_softirq(unsigned int nr, unsigned int cpu)
+{
+ /*
+ * Try to set the softirq pending. If we set the bit (i.e. the old bit
+ * was 0), we're responsible to send the IPI. If the softirq was already
+ * pending (i.e. the old bit was 1), no IPI is needed.
+ */
+ return test_and_set_bit(nr, &softirq_pending(cpu));
+}
+#endif
+
typedef void (*softirq_handler)(void);
void do_softirq(void);
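The contract of the new helper, as used by the common-code hunks above, is
"mark the softirq pending and report whether the IPI may be elided". A hedged
standalone sketch of that contract, using C11 atomics in place of Xen's
test_and_set_bit() (all names hypothetical):

    #include <stdatomic.h>
    #include <stdbool.h>

    static _Atomic unsigned int softirq_pending_sketch;

    /* Returns true if the IPI can be skipped, false if it must be sent. */
    static bool arch_set_softirq_sketch(unsigned int nr)
    {
        unsigned int bit = 1u << nr;

        /* If the bit was already set, a previous caller owns sending the IPI. */
        return atomic_fetch_or(&softirq_pending_sketch, bit) & bit;
    }

    static void cpu_raise_softirq_sketch(unsigned int nr, bool is_self)
    {
        if ( arch_set_softirq_sketch(nr) || is_self )
            return;     /* already pending, or raising on ourselves: no IPI */

        /* send_IPI(cpu) would go here */
    }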
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 21:26:24 +0100
Subject: x86/idle: Implement a new MWAIT IPI-elision algorithm
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
In order to elide IPIs, we must be able to identify whether a target CPU is in
MWAIT at the point it is woken up. i.e. the store to wake it up must also
identify the state.
Create a new in_mwait variable beside __softirq_pending, so we can use a
CMPXCHG to set the softirq while also observing the status safely. Implement
an x86 version of arch_set_softirq() which does this.
In mwait_idle_with_hints(), advertise in_mwait, with an explanation of
precisely what it means. X86_BUG_MONITOR can be accounted for simply by not
advertising in_mwait.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 3e0bc4b50350bd357304fd79a5dc0472790dba91)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 7c7676e9ce91..b876c7781eef 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -455,7 +455,21 @@ __initcall(cpu_idle_key_init);
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
- const unsigned int *this_softirq_pending = &softirq_pending(cpu);
+ irq_cpustat_t *stat = &irq_stat[cpu];
+ const unsigned int *this_softirq_pending = &stat->__softirq_pending;
+
+ /*
+ * By setting in_mwait, we promise to other CPUs that we'll notice changes
+ * to __softirq_pending without being sent an IPI. We achieve this by
+ * either not going to sleep, or by having hardware notice on our behalf.
+ *
+ * Some errata exist where MONITOR doesn't work properly, and the
+ * workaround is to force the use of an IPI. Cause this to happen by
+ * simply not advertising ourselves as being in_mwait.
+ */
+ alternative_io("movb $1, %[in_mwait]",
+ "", X86_BUG_MONITOR,
+ [in_mwait] "=m" (stat->in_mwait));
monitor(this_softirq_pending, 0, 0);
@@ -467,6 +481,10 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
mwait(eax, ecx);
spec_ctrl_exit_idle(info);
}
+
+ alternative_io("movb $0, %[in_mwait]",
+ "", X86_BUG_MONITOR,
+ [in_mwait] "=m" (stat->in_mwait));
}
static void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
diff --git a/xen/arch/x86/include/asm/hardirq.h b/xen/arch/x86/include/asm/hardirq.h
index f3e93cc9b507..1647cff04dc8 100644
--- a/xen/arch/x86/include/asm/hardirq.h
+++ b/xen/arch/x86/include/asm/hardirq.h
@@ -5,7 +5,19 @@
#include <xen/types.h>
typedef struct {
- unsigned int __softirq_pending;
+ /*
+ * The layout is important. Any CPU can set bits in __softirq_pending,
+ * but in_mwait is a status bit owned by the CPU. softirq_mwait_raw must
+ * cover both, and must be in a single cacheline.
+ */
+ union {
+ struct {
+ unsigned int __softirq_pending;
+ bool in_mwait;
+ };
+ uint64_t softirq_mwait_raw;
+ };
+
unsigned int __local_irq_count;
unsigned int nmi_count;
unsigned int mce_count;
diff --git a/xen/arch/x86/include/asm/softirq.h b/xen/arch/x86/include/asm/softirq.h
index e4b194f069fb..55b65c9747b1 100644
--- a/xen/arch/x86/include/asm/softirq.h
+++ b/xen/arch/x86/include/asm/softirq.h
@@ -1,6 +1,8 @@
#ifndef __ASM_SOFTIRQ_H__
#define __ASM_SOFTIRQ_H__
+#include <asm/system.h>
+
#define NMI_SOFTIRQ (NR_COMMON_SOFTIRQS + 0)
#define TIME_CALIBRATE_SOFTIRQ (NR_COMMON_SOFTIRQS + 1)
#define VCPU_KICK_SOFTIRQ (NR_COMMON_SOFTIRQS + 2)
@@ -9,4 +11,50 @@
#define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
#define NR_ARCH_SOFTIRQS 5
+/*
+ * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
+ * skipped, false if the IPI cannot be skipped.
+ *
+ * We use a CMPXCHG covering both __softirq_pending and in_mwait, in order to
+ * set softirq @nr while also observing in_mwait in a race-free way.
+ */
+static always_inline bool arch_set_softirq(unsigned int nr, unsigned int cpu)
+{
+ uint64_t *ptr = &irq_stat[cpu].softirq_mwait_raw;
+ uint64_t prev, old, new;
+ unsigned int softirq = 1U << nr;
+
+ old = ACCESS_ONCE(*ptr);
+
+ for ( ;; )
+ {
+ if ( old & softirq )
+ /* Softirq already pending, nothing to do. */
+ return true;
+
+ new = old | softirq;
+
+ prev = cmpxchg(ptr, old, new);
+ if ( prev == old )
+ break;
+
+ old = prev;
+ }
+
+ /*
+ * We have caused the softirq to become pending. If in_mwait was set, the
+ * target CPU will notice the modification and act on it.
+ *
+ * We can't access the in_mwait field nicely, so use some BUILD_BUG_ON()'s
+ * to cross-check the (1UL << 32) opencoding.
+ */
+ BUILD_BUG_ON(sizeof(irq_stat[0].softirq_mwait_raw) != 8);
+ BUILD_BUG_ON((offsetof(irq_cpustat_t, in_mwait) -
+ offsetof(irq_cpustat_t, softirq_mwait_raw)) != 4);
+
+ return new & (1UL << 32) /* in_mwait */;
+
+}
+#define arch_set_softirq arch_set_softirq
+
#endif /* __ASM_SOFTIRQ_H__ */
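The layout trick above (pending bits in the low 32 bits, in_mwait at bit 32 of
the same 64-bit word) can be exercised in isolation. A minimal sketch using
C11 atomics rather than Xen's cmpxchg() (names hypothetical):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Low 32 bits: pending softirqs.  Bit 32: target CPU is in MWAIT. */
    static _Atomic uint64_t softirq_mwait_sketch;

    #define IN_MWAIT_SKETCH (UINT64_C(1) << 32)

    /* Returns true if the IPI can be skipped. */
    static bool set_softirq_sketch(unsigned int nr)
    {
        uint64_t old = atomic_load(&softirq_mwait_sketch);
        uint64_t new;

        do {
            if ( old & (1u << nr) )
                return true;                    /* already pending: no IPI */
            new = old | (1u << nr);
        } while ( !atomic_compare_exchange_weak(&softirq_mwait_sketch,
                                                &old, new) );

        /*
         * We made the softirq pending.  If the target advertised in_mwait in
         * the same word, the write to the monitored line is enough to wake
         * it, so the IPI can be skipped.
         */
        return new & IN_MWAIT_SKETCH;
    }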
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Wed, 2 Jul 2025 14:51:38 +0100
Subject: x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on
Xeons"
The check of this_softirq_pending must be performed with irqs disabled, but
this property was broken by an attempt to optimise entry/exit latency.
Commit c227233ad64c in Linux (which we copied into Xen) was fixed up by
edc8fc01f608 in Linux, which we have so far missed.
Going to sleep without waking on interrupts is nonsensical outside of
play_dead(), so overload this to select between two possible MWAITs, the
second using the STI shadow to cover MWAIT for exactly the same reason as we
do in safe_halt().
Fixes: b17e0ec72ede ("x86/mwait-idle: enable interrupts before C1 on Xeons")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9b0f0f6e235618c2764e925b58c4bfe412730ced)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index b876c7781eef..0b7e7636bc0c 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -79,6 +79,13 @@ static always_inline void mwait(unsigned int eax, unsigned int ecx)
:: "a" (eax), "c" (ecx) );
}
+static always_inline void sti_mwait_cli(unsigned int eax, unsigned int ecx)
+{
+ /* STI shadow covers MWAIT. */
+ asm volatile ( "sti; mwait; cli"
+ :: "a" (eax), "c" (ecx) );
+}
+
#define GET_HW_RES_IN_NS(msr, val) \
do { rdmsrl(msr, val); val = tsc_ticks2ns(val); } while( 0 )
#define GET_MC6_RES(val) GET_HW_RES_IN_NS(0x664, val)
@@ -473,12 +480,19 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
monitor(this_softirq_pending, 0, 0);
+ ASSERT(!local_irq_is_enabled());
+
if ( !*this_softirq_pending )
{
struct cpu_info *info = get_cpu_info();
spec_ctrl_enter_idle(info);
- mwait(eax, ecx);
+
+ if ( ecx & MWAIT_ECX_INTERRUPT_BREAK )
+ mwait(eax, ecx);
+ else
+ sti_mwait_cli(eax, ecx);
+
spec_ctrl_exit_idle(info);
}
diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index 12c36257b7be..ad421d8bb76a 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -973,12 +973,8 @@ static void cf_check mwait_idle(void)
update_last_cx_stat(power, cx, before);
- if (cx->irq_enable_early)
- local_irq_enable();
-
- mwait_idle_with_hints(cx->address, MWAIT_ECX_INTERRUPT_BREAK);
-
- local_irq_disable();
+ mwait_idle_with_hints(cx->address,
+ cx->irq_enable_early ? 0 : MWAIT_ECX_INTERRUPT_BREAK);
after = alternative_call(cpuidle_get_tick);
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Fri, 4 Jul 2025 17:53:15 +0100
Subject: x86/xen-cpuid: Fix backports of new features
Xen 4.18 doesn't automatically generate feature names like Xen 4.19 does, and
these hunks were missed on prior security fixes.
Fixes: 8bced9a15c8c ("x86/spec-ctrl: Support for SRSO_U/S_NO and SRSO_MSR_FIX")
Fixes: f132c82fa65d ("x86/spec-ctrl: Synthesise ITS_NO to guests on unaffected hardware")
Fixes: dba055661292 ("x86/spec-ctrl: Support Intel's new PB-OPT")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
[For 4.17, PB-OPT wasn't backported]
diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
index 5ceea8be073b..b277b78b654f 100644
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -199,6 +199,7 @@ static const char *const str_e21a[32] =
/* 26 */ [27] = "sbpb",
[28] = "ibpb-brtype", [29] = "srso-no",
+ [30] = "srso-us-no", [31] = "srso-msr-fix",
};
static const char *const str_7b1[32] =
@@ -222,7 +223,7 @@ static const char *const str_7d2[32] =
[ 4] = "bhi-ctrl", [ 5] = "mcdt-no",
};
-static const char *const str_m10Al[32] =
+static const char *const str_m10Al[64] =
{
[ 0] = "rdcl-no", [ 1] = "eibrs",
[ 2] = "rsba", [ 3] = "skip-l1dfl",
@@ -239,10 +240,8 @@ static const char *const str_m10Al[32] =
[24] = "pbrsb-no", [25] = "gds-ctrl",
[26] = "gds-no", [27] = "rfds-no",
[28] = "rfds-clear",
-};
-static const char *const str_m10Ah[32] =
-{
+ [62] = "its-no",
};
static const struct {
@@ -268,7 +267,7 @@ static const struct {
{ "CPUID 0x00000007:1.ecx", "7c1", str_7c1 },
{ "CPUID 0x00000007:1.edx", "7d1", str_7d1 },
{ "MSR_ARCH_CAPS.lo", "m10Al", str_m10Al },
- { "MSR_ARCH_CAPS.hi", "m10Ah", str_m10Ah },
+ { "MSR_ARCH_CAPS.hi", "m10Ah", str_m10Al + 32 },
};
#define COL_ALIGN "24"
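The "str_m10Al + 32" expression works because the array now has 64 entries:
the decoder for the high half of MSR_ARCH_CAPS simply indexes into the second
half of the same table. A tiny standalone illustration (hypothetical names):

    #include <stdio.h>

    static const char *const names64[64] = {
        [ 0] = "rdcl-no",
        [62] = "its-no",            /* lives in the high 32 bits */
    };

    int main(void)
    {
        const char *const *hi = names64 + 32;   /* decoder for the high word */

        /* Bit 30 of the high word is bit 62 of the 64-bit register. */
        printf("%s\n", hi[30] ? hi[30] : "(unnamed)");
        return 0;
    }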
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Fri, 27 Jun 2025 17:19:19 +0100
Subject: x86/cpu-policy: Rearrange guest_common_*_feature_adjustments()
Turn the if()s into switch()es, as we're going to need AMD sections.
Move the RTM adjustments into the Intel section, where they ought to live.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 8f006fe08acb..498fa4f9957c 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -414,8 +414,9 @@ static void __init calculate_host_policy(void)
static void __init guest_common_max_feature_adjustments(uint32_t *fs)
{
- if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL )
+ switch ( boot_cpu_data.x86_vendor )
{
+ case X86_VENDOR_INTEL:
/*
* MSR_ARCH_CAPS is just feature data, and we can offer it to guests
* unconditionally, although limit it to Intel systems as it is highly
@@ -460,6 +461,22 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
boot_cpu_data.x86_model == INTEL_FAM6_SKYLAKE_X &&
raw_cpu_policy.feat.clwb )
__set_bit(X86_FEATURE_CLWB, fs);
+
+ /*
+ * To mitigate Native-BHI, one option is to use a TSX Abort on capable
+ * systems. This is safe even if RTM has been disabled for other
+ * reasons via MSR_TSX_{CTRL,FORCE_ABORT}. However, a guest kernel
+ * doesn't get to know this type of information.
+ *
+ * Therefore the meaning of RTM_ALWAYS_ABORT has been adjusted, to
+ * instead mean "XBEGIN won't fault". This is enough for a guest
+ * kernel to make an informed choice WRT mitigating Native-BHI.
+ *
+ * If RTM-capable, we can run a VM which has seen RTM_ALWAYS_ABORT.
+ */
+ if ( test_bit(X86_FEATURE_RTM, fs) )
+ __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
+ break;
}
/*
@@ -471,27 +488,13 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
*/
__set_bit(X86_FEATURE_HTT, fs);
__set_bit(X86_FEATURE_CMP_LEGACY, fs);
-
- /*
- * To mitigate Native-BHI, one option is to use a TSX Abort on capable
- * systems. This is safe even if RTM has been disabled for other reasons
- * via MSR_TSX_{CTRL,FORCE_ABORT}. However, a guest kernel doesn't get to
- * know this type of information.
- *
- * Therefore the meaning of RTM_ALWAYS_ABORT has been adjusted, to instead
- * mean "XBEGIN won't fault". This is enough for a guest kernel to make
- * an informed choice WRT mitigating Native-BHI.
- *
- * If RTM-capable, we can run a VM which has seen RTM_ALWAYS_ABORT.
- */
- if ( test_bit(X86_FEATURE_RTM, fs) )
- __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
}
static void __init guest_common_default_feature_adjustments(uint32_t *fs)
{
- if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL )
+ switch ( boot_cpu_data.x86_vendor )
{
+ case X86_VENDOR_INTEL:
/*
* IvyBridge client parts suffer from leakage of RDRAND data due to SRBDS
* (XSA-320 / CVE-2020-0543), and won't be receiving microcode to
@@ -535,6 +538,23 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
boot_cpu_data.x86_model == INTEL_FAM6_SKYLAKE_X &&
raw_cpu_policy.feat.clwb )
__clear_bit(X86_FEATURE_CLWB, fs);
+
+ /*
+ * On certain hardware, speculative or errata workarounds can result
+ * in TSX being placed in "force-abort" mode, where it doesn't
+ * actually function as expected, but is technically compatible with
+ * the ISA.
+ *
+ * Do not advertise RTM to guests by default if it won't actually
+ * work. Instead, advertise RTM_ALWAYS_ABORT indicating that TSX
+ * Aborts are safe to use, e.g. for mitigating Native-BHI.
+ */
+ if ( rtm_disabled )
+ {
+ __clear_bit(X86_FEATURE_RTM, fs);
+ __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
+ }
+ break;
}
/*
@@ -546,21 +566,6 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
if ( !cpu_has_cmp_legacy )
__clear_bit(X86_FEATURE_CMP_LEGACY, fs);
-
- /*
- * On certain hardware, speculative or errata workarounds can result in
- * TSX being placed in "force-abort" mode, where it doesn't actually
- * function as expected, but is technically compatible with the ISA.
- *
- * Do not advertise RTM to guests by default if it won't actually work.
- * Instead, advertise RTM_ALWAYS_ABORT indicating that TSX Aborts are safe
- * to use, e.g. for mitigating Native-BHI.
- */
- if ( rtm_disabled )
- {
- __clear_bit(X86_FEATURE_RTM, fs);
- __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
- }
}
static void __init guest_common_feature_adjustments(uint32_t *fs)
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 10 Sep 2024 19:55:15 +0100
Subject: x86/cpu-policy: Infrastructure for CPUID leaf 0x80000021.ecx
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
diff --git a/tools/libs/light/libxl_cpuid.c b/tools/libs/light/libxl_cpuid.c
index 5c66d094b2dc..493f615c9d35 100644
--- a/tools/libs/light/libxl_cpuid.c
+++ b/tools/libs/light/libxl_cpuid.c
@@ -344,6 +344,7 @@ int libxl_cpuid_parse_config(libxl_cpuid_policy_list *policy, const char* str)
CPUID_ENTRY(0x00000007, 1, CPUID_REG_EDX),
MSR_ENTRY(0x10a, CPUID_REG_EAX),
MSR_ENTRY(0x10a, CPUID_REG_EDX),
+ CPUID_ENTRY(0x80000021, NA, CPUID_REG_ECX),
#undef MSR_ENTRY
#undef CPUID_ENTRY
};
diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
index b277b78b654f..7704980b8a9b 100644
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -244,6 +244,10 @@ static const char *const str_m10Al[64] =
[62] = "its-no",
};
+static const char *const str_e21c[32] =
+{
+};
+
static const struct {
const char *name;
const char *abbr;
@@ -268,6 +272,7 @@ static const struct {
{ "CPUID 0x00000007:1.edx", "7d1", str_7d1 },
{ "MSR_ARCH_CAPS.lo", "m10Al", str_m10Al },
{ "MSR_ARCH_CAPS.hi", "m10Ah", str_m10Al + 32 },
+ { "CPUID 0x80000021.ecx", "e21c", str_e21c },
};
#define COL_ALIGN "24"
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 498fa4f9957c..b4d3fa824363 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -341,7 +341,6 @@ static void recalculate_misc(struct cpu_policy *p)
p->extd.raw[0x1f] = EMPTY_LEAF; /* SEV */
p->extd.raw[0x20] = EMPTY_LEAF; /* Platform QoS */
p->extd.raw[0x21].b = 0;
- p->extd.raw[0x21].c = 0;
p->extd.raw[0x21].d = 0;
break;
}
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index 14a3a97806f0..5d648622fe6f 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -469,7 +469,9 @@ static void generic_identify(struct cpuinfo_x86 *c)
if (c->extended_cpuid_level >= 0x80000008)
c->x86_capability[FEATURESET_e8b] = cpuid_ebx(0x80000008);
if (c->extended_cpuid_level >= 0x80000021)
- c->x86_capability[FEATURESET_e21a] = cpuid_eax(0x80000021);
+ cpuid(0x80000021,
+ &c->x86_capability[FEATURESET_e21a], &tmp,
+ &c->x86_capability[FEATURESET_e21c], &tmp);
/* Intel-defined flags: level 0x00000007 */
if (c->cpuid_level >= 7) {
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index 99c4dc1ffd40..03cd1419c5cb 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -337,6 +337,8 @@ XEN_CPUFEATURE(RFDS_CLEAR, 16*32+28) /*!A Register File(s) cleared by VE
/* Intel-defined CPU features, MSR_ARCH_CAPS 0x10a.edx, word 17 (express in terms of word 16) */
XEN_CPUFEATURE(ITS_NO, 16*32+62) /*!A No Indirect Target Selection */
+/* AMD-defined CPU features, CPUID level 0x80000021.ecx, word 18 */
+
#endif /* XEN_CPUFEATURE */
/* Clean up from a default include. Close the enum (for C). */
diff --git a/xen/include/xen/lib/x86/cpu-policy.h b/xen/include/xen/lib/x86/cpu-policy.h
index ba29bfe9b414..7bb90edf830d 100644
--- a/xen/include/xen/lib/x86/cpu-policy.h
+++ b/xen/include/xen/lib/x86/cpu-policy.h
@@ -22,6 +22,7 @@
#define FEATURESET_7d1 15 /* 0x00000007:1.edx */
#define FEATURESET_m10Al 16 /* 0x0000010a.eax */
#define FEATURESET_m10Ah 17 /* 0x0000010a.edx */
+#define FEATURESET_e21c 18 /* 0x80000021.ecx */
struct cpuid_leaf
{
@@ -327,7 +328,11 @@ struct cpu_policy
uint16_t ucode_size; /* Units of 16 bytes */
uint8_t rap_size; /* Units of 8 entries */
uint8_t :8;
- uint32_t /* c */:32, /* d */:32;
+ union {
+ uint32_t e21c;
+ struct { DECL_BITFIELD(e21c); };
+ };
+ uint32_t /* d */:32;
};
} extd;
diff --git a/xen/lib/x86/cpuid.c b/xen/lib/x86/cpuid.c
index 07e550191448..22fd162c9dca 100644
--- a/xen/lib/x86/cpuid.c
+++ b/xen/lib/x86/cpuid.c
@@ -81,6 +81,7 @@ void x86_cpu_policy_to_featureset(
fs[FEATURESET_7d1] = p->feat._7d1;
fs[FEATURESET_m10Al] = p->arch_caps.lo;
fs[FEATURESET_m10Ah] = p->arch_caps.hi;
+ fs[FEATURESET_e21c] = p->extd.e21c;
}
void x86_cpu_featureset_to_policy(
@@ -104,6 +105,7 @@ void x86_cpu_featureset_to_policy(
p->feat._7d1 = fs[FEATURESET_7d1];
p->arch_caps.lo = fs[FEATURESET_m10Al];
p->arch_caps.hi = fs[FEATURESET_m10Ah];
+ p->extd.e21c = fs[FEATURESET_e21c];
}
void x86_cpu_policy_recalc_synth(struct cpu_policy *p)
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Fri, 27 Sep 2024 11:28:39 +0100
Subject: x86/ucode: Digests for TSA microcode
AMD are releasing microcode for TSA, so extend the known-provenance list with
their hashes. These were produced before the remediation of the microcode
signature issues (the entrysign vulnerability), so can be OS-loaded on
out-of-date firmware.
Include an off-by-default check for the sorted-ness of patch_digests[]. It's
not worth running generally under SELF_TESTS, but is useful when editing the
digest list.
This is part of XSA-471 / CVE-2024-36350 / CVE-2024-36357.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
diff --git a/xen/arch/x86/cpu/microcode/amd-patch-digests.c b/xen/arch/x86/cpu/microcode/amd-patch-digests.c
index d32761226712..d2c4e0178a1e 100644
--- a/xen/arch/x86/cpu/microcode/amd-patch-digests.c
+++ b/xen/arch/x86/cpu/microcode/amd-patch-digests.c
@@ -80,6 +80,15 @@
0x0d, 0x5b, 0x65, 0x34, 0x69, 0xb2, 0x62, 0x21,
},
},
+{
+ .patch_id = 0x0a0011d7,
+ .digest = {
+ 0x35, 0x07, 0xcd, 0x40, 0x94, 0xbc, 0x81, 0x6b,
+ 0xfc, 0x61, 0x56, 0x1a, 0xe2, 0xdb, 0x96, 0x12,
+ 0x1c, 0x1c, 0x31, 0xb1, 0x02, 0x6f, 0xe5, 0xd2,
+ 0xfe, 0x1b, 0x04, 0x03, 0x2c, 0x8f, 0x4c, 0x36,
+ },
+},
{
.patch_id = 0x0a001238,
.digest = {
@@ -89,6 +98,15 @@
0xc0, 0xcd, 0x33, 0xf2, 0x8d, 0xf9, 0xef, 0x59,
},
},
+{
+ .patch_id = 0x0a00123b,
+ .digest = {
+ 0xef, 0xa1, 0x1e, 0x71, 0xf1, 0xc3, 0x2c, 0xe2,
+ 0xc3, 0xef, 0x69, 0x41, 0x7a, 0x54, 0xca, 0xc3,
+ 0x8f, 0x62, 0x84, 0xee, 0xc2, 0x39, 0xd9, 0x28,
+ 0x95, 0xa7, 0x12, 0x49, 0x1e, 0x30, 0x71, 0x72,
+ },
+},
{
.patch_id = 0x0a00820c,
.digest = {
@@ -98,6 +116,15 @@
0xe1, 0x3b, 0x8d, 0xb2, 0xf8, 0x22, 0x03, 0xe2,
},
},
+{
+ .patch_id = 0x0a00820d,
+ .digest = {
+ 0xf9, 0x2a, 0xc0, 0xf4, 0x9e, 0xa4, 0x87, 0xa4,
+ 0x7d, 0x87, 0x00, 0xfd, 0xab, 0xda, 0x19, 0xca,
+ 0x26, 0x51, 0x32, 0xc1, 0x57, 0x91, 0xdf, 0xc1,
+ 0x05, 0xeb, 0x01, 0x7c, 0x5a, 0x95, 0x21, 0xb7,
+ },
+},
{
.patch_id = 0x0a101148,
.digest = {
@@ -107,6 +134,15 @@
0xf1, 0x5e, 0xb0, 0xde, 0xb4, 0x98, 0xae, 0xc4,
},
},
+{
+ .patch_id = 0x0a10114c,
+ .digest = {
+ 0x9e, 0xb6, 0xa2, 0xd9, 0x87, 0x38, 0xc5, 0x64,
+ 0xd8, 0x88, 0xfa, 0x78, 0x98, 0xf9, 0x6f, 0x74,
+ 0x39, 0x90, 0x1b, 0xa5, 0xcf, 0x5e, 0xb4, 0x2a,
+ 0x02, 0xff, 0xd4, 0x8c, 0x71, 0x8b, 0xe2, 0xc0,
+ },
+},
{
.patch_id = 0x0a101248,
.digest = {
@@ -116,6 +152,15 @@
0x1b, 0x7d, 0x64, 0x9d, 0x4b, 0x53, 0x13, 0x75,
},
},
+{
+ .patch_id = 0x0a10124c,
+ .digest = {
+ 0x29, 0xea, 0xf1, 0x2c, 0xb2, 0xe4, 0xef, 0x90,
+ 0xa4, 0xcd, 0x1d, 0x86, 0x97, 0x17, 0x61, 0x46,
+ 0xfc, 0x22, 0xcb, 0x57, 0x75, 0x19, 0xc8, 0xcc,
+ 0x0c, 0xf5, 0xbc, 0xac, 0x81, 0x9d, 0x9a, 0xd2,
+ },
+},
{
.patch_id = 0x0a108108,
.digest = {
@@ -125,6 +170,15 @@
0x28, 0x1e, 0x9c, 0x59, 0x69, 0x99, 0x4d, 0x16,
},
},
+{
+ .patch_id = 0x0a108109,
+ .digest = {
+ 0x85, 0xb4, 0xbd, 0x7c, 0x49, 0xa7, 0xbd, 0xfa,
+ 0x49, 0x36, 0x80, 0x81, 0xc5, 0xb7, 0x39, 0x1b,
+ 0x9a, 0xaa, 0x50, 0xde, 0x9b, 0xe9, 0x32, 0x35,
+ 0x42, 0x7e, 0x51, 0x4f, 0x52, 0x2c, 0x28, 0x59,
+ },
+},
{
.patch_id = 0x0a20102d,
.digest = {
@@ -134,6 +188,15 @@
0x8c, 0xe9, 0x19, 0x3e, 0xcc, 0x3f, 0x7b, 0xb4,
},
},
+{
+ .patch_id = 0x0a20102e,
+ .digest = {
+ 0xbe, 0x1f, 0x32, 0x04, 0x0d, 0x3c, 0x9c, 0xdd,
+ 0xe1, 0xa4, 0xbf, 0x76, 0x3a, 0xec, 0xc2, 0xf6,
+ 0x11, 0x00, 0xa7, 0xaf, 0x0f, 0xe5, 0x02, 0xc5,
+ 0x54, 0x3a, 0x1f, 0x8c, 0x16, 0xb5, 0xff, 0xbe,
+ },
+},
{
.patch_id = 0x0a201210,
.digest = {
@@ -143,6 +206,15 @@
0xf7, 0x55, 0xf0, 0x13, 0xbb, 0x22, 0xf6, 0x41,
},
},
+{
+ .patch_id = 0x0a201211,
+ .digest = {
+ 0x69, 0xa1, 0x17, 0xec, 0xd0, 0xf6, 0x6c, 0x95,
+ 0xe2, 0x1e, 0xc5, 0x59, 0x1a, 0x52, 0x0a, 0x27,
+ 0xc4, 0xed, 0xd5, 0x59, 0x1f, 0xbf, 0x00, 0xff,
+ 0x08, 0x88, 0xb5, 0xe1, 0x12, 0xb6, 0xcc, 0x27,
+ },
+},
{
.patch_id = 0x0a404107,
.digest = {
@@ -152,6 +224,15 @@
0x13, 0xbc, 0xc5, 0x25, 0xe4, 0xc5, 0xc3, 0x99,
},
},
+{
+ .patch_id = 0x0a404108,
+ .digest = {
+ 0x69, 0x67, 0x43, 0x06, 0xf8, 0x0c, 0x62, 0xdc,
+ 0xa4, 0x21, 0x30, 0x4f, 0x0f, 0x21, 0x2c, 0xcb,
+ 0xcc, 0x37, 0xf1, 0x1c, 0xc3, 0xf8, 0x2f, 0x19,
+ 0xdf, 0x53, 0x53, 0x46, 0xb1, 0x15, 0xea, 0x00,
+ },
+},
{
.patch_id = 0x0a500011,
.digest = {
@@ -161,6 +242,15 @@
0x11, 0x5e, 0x96, 0x7e, 0x71, 0xe9, 0xfc, 0x74,
},
},
+{
+ .patch_id = 0x0a500012,
+ .digest = {
+ 0xeb, 0x74, 0x0d, 0x47, 0xa1, 0x8e, 0x09, 0xe4,
+ 0x93, 0x4c, 0xad, 0x03, 0x32, 0x4c, 0x38, 0x16,
+ 0x10, 0x39, 0xdd, 0x06, 0xaa, 0xce, 0xd6, 0x0f,
+ 0x62, 0x83, 0x9d, 0x8e, 0x64, 0x55, 0xbe, 0x63,
+ },
+},
{
.patch_id = 0x0a601209,
.digest = {
@@ -170,6 +260,15 @@
0xe8, 0x73, 0xe2, 0xd6, 0xdb, 0xd2, 0x77, 0x1d,
},
},
+{
+ .patch_id = 0x0a60120a,
+ .digest = {
+ 0x0c, 0x8b, 0x3d, 0xfd, 0x52, 0x52, 0x85, 0x7d,
+ 0x20, 0x3a, 0xe1, 0x7e, 0xa4, 0x21, 0x3b, 0x7b,
+ 0x17, 0x86, 0xae, 0xac, 0x13, 0xb8, 0x63, 0x9d,
+ 0x06, 0x01, 0xd0, 0xa0, 0x51, 0x9a, 0x91, 0x2c,
+ },
+},
{
.patch_id = 0x0a704107,
.digest = {
@@ -179,6 +278,15 @@
0x64, 0x39, 0x71, 0x8c, 0xce, 0xe7, 0x41, 0x39,
},
},
+{
+ .patch_id = 0x0a704108,
+ .digest = {
+ 0xd7, 0x55, 0x15, 0x2b, 0xfe, 0xc4, 0xbc, 0x93,
+ 0xec, 0x91, 0xa0, 0xae, 0x45, 0xb7, 0xc3, 0x98,
+ 0x4e, 0xff, 0x61, 0x77, 0x88, 0xc2, 0x70, 0x49,
+ 0xe0, 0x3a, 0x1d, 0x84, 0x38, 0x52, 0xbf, 0x5a,
+ },
+},
{
.patch_id = 0x0a705206,
.digest = {
@@ -188,6 +296,15 @@
0x03, 0x35, 0xe9, 0xbe, 0xfb, 0x06, 0xdf, 0xfc,
},
},
+{
+ .patch_id = 0x0a705208,
+ .digest = {
+ 0x30, 0x1d, 0x55, 0x24, 0xbc, 0x6b, 0x5a, 0x19,
+ 0x0c, 0x7d, 0x1d, 0x74, 0xaa, 0xd1, 0xeb, 0xd2,
+ 0x16, 0x62, 0xf7, 0x5b, 0xe1, 0x1f, 0x18, 0x11,
+ 0x5c, 0xf0, 0x94, 0x90, 0x26, 0xec, 0x69, 0xff,
+ },
+},
{
.patch_id = 0x0a708007,
.digest = {
@@ -197,6 +314,15 @@
0xdf, 0x92, 0x73, 0x84, 0x87, 0x3c, 0x73, 0x93,
},
},
+{
+ .patch_id = 0x0a708008,
+ .digest = {
+ 0x08, 0x6e, 0xf0, 0x22, 0x4b, 0x8e, 0xc4, 0x46,
+ 0x58, 0x34, 0xe6, 0x47, 0xa2, 0x28, 0xfd, 0xab,
+ 0x22, 0x3d, 0xdd, 0xd8, 0x52, 0x9e, 0x1d, 0x16,
+ 0xfa, 0x01, 0x68, 0x14, 0x79, 0x3e, 0xe8, 0x6b,
+ },
+},
{
.patch_id = 0x0a70c005,
.digest = {
@@ -206,6 +332,15 @@
0xee, 0x49, 0xac, 0xe1, 0x8b, 0x13, 0xc5, 0x13,
},
},
+{
+ .patch_id = 0x0a70c008,
+ .digest = {
+ 0x0f, 0xdb, 0x37, 0xa1, 0x10, 0xaf, 0xd4, 0x21,
+ 0x94, 0x0d, 0xa4, 0xa2, 0xe9, 0x86, 0x6c, 0x0e,
+ 0x85, 0x7c, 0x36, 0x30, 0xa3, 0x3a, 0x78, 0x66,
+ 0x18, 0x10, 0x60, 0x0d, 0x78, 0x3d, 0x44, 0xd0,
+ },
+},
{
.patch_id = 0x0aa00116,
.digest = {
@@ -224,3 +359,12 @@
0x68, 0x2f, 0x46, 0xee, 0xfe, 0xc6, 0x6d, 0xef,
},
},
+{
+ .patch_id = 0x0aa00216,
+ .digest = {
+ 0x79, 0xfb, 0x5b, 0x9f, 0xb6, 0xe6, 0xa8, 0xf5,
+ 0x4e, 0x7c, 0x4f, 0x8e, 0x1d, 0xad, 0xd0, 0x08,
+ 0xc2, 0x43, 0x7c, 0x8b, 0xe6, 0xdb, 0xd0, 0xd2,
+ 0xe8, 0x39, 0x26, 0xc1, 0xe5, 0x5a, 0x48, 0xf1,
+ },
+},
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Wed, 2 Apr 2025 03:18:59 +0100
Subject: x86/idle: Rearrange VERW and MONITOR in mwait_idle_with_hints()
In order to mitigate TSA, Xen will need to issue VERW before going idle.
On AMD CPUs, the VERW scrubbing side effects cancel an active MONITOR, causing
the MWAIT to exit without entering an idle state. Therefore the VERW must be
ahead of MONITOR.
Split spec_ctrl_enter_idle() in two and allow the VERW aspect to be handled
separately. While adjusting, update a stale comment concerning MSBDS; more
issues have been mitigated using VERW since it was written.
By moving VERW earlier, it is ahead of the determination of whether to go
idle. We can't move the check on softirq_pending (for correctness reasons),
but we can duplicate it earlier as a best effort attempt to skip the
speculative overhead.
This is part of XSA-471 / CVE-2024-36350 / CVE-2024-36357.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 0b7e7636bc0c..3ba1bd500dad 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -462,9 +462,18 @@ __initcall(cpu_idle_key_init);
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
+ struct cpu_info *info = get_cpu_info();
irq_cpustat_t *stat = &irq_stat[cpu];
const unsigned int *this_softirq_pending = &stat->__softirq_pending;
+ /*
+ * Heuristic: if we're definitely not going to idle, bail early as the
+ * speculative safety can be expensive. This is a performance
+ * consideration not a correctness issue.
+ */
+ if ( *this_softirq_pending )
+ return;
+
/*
* By setting in_mwait, we promise to other CPUs that we'll notice changes
* to __softirq_pending without being sent an IPI. We achieve this by
@@ -478,15 +487,19 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
"", X86_BUG_MONITOR,
[in_mwait] "=m" (stat->in_mwait));
+ /*
+ * On AMD systems, side effects from VERW cancel MONITOR, causing MWAIT to
+ * wake up immediately. Therefore, VERW must come ahead of MONITOR.
+ */
+ __spec_ctrl_enter_idle_verw(info);
+
monitor(this_softirq_pending, 0, 0);
ASSERT(!local_irq_is_enabled());
if ( !*this_softirq_pending )
{
- struct cpu_info *info = get_cpu_info();
-
- spec_ctrl_enter_idle(info);
+ __spec_ctrl_enter_idle(info, false /* VERW handled above */);
if ( ecx & MWAIT_ECX_INTERRUPT_BREAK )
mwait(eax, ecx);
diff --git a/xen/arch/x86/include/asm/spec_ctrl.h b/xen/arch/x86/include/asm/spec_ctrl.h
index 4439a1b24346..60844b755dff 100644
--- a/xen/arch/x86/include/asm/spec_ctrl.h
+++ b/xen/arch/x86/include/asm/spec_ctrl.h
@@ -126,8 +126,22 @@ static inline void init_shadow_spec_ctrl_state(void)
info->verw_sel = __HYPERVISOR_DS32;
}
+static always_inline void __spec_ctrl_enter_idle_verw(struct cpu_info *info)
+{
+ /*
+ * Flush/scrub structures which are statically partitioned between active
+ * threads. Otherwise data of ours (of unknown sensitivity) will become
+ * available to our sibling when we go idle.
+ *
+ * Note: VERW must be encoded with a memory operand, as it is only that
+ * form with side effects.
+ */
+ alternative_input("", "verw %[sel]", X86_FEATURE_SC_VERW_IDLE,
+ [sel] "m" (info->verw_sel));
+}
+
/* WARNING! `ret`, `call *`, `jmp *` not safe after this call. */
-static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
+static always_inline void __spec_ctrl_enter_idle(struct cpu_info *info, bool verw)
{
uint32_t val = 0;
@@ -146,21 +160,8 @@ static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
"a" (val), "c" (MSR_SPEC_CTRL), "d" (0));
barrier();
- /*
- * Microarchitectural Store Buffer Data Sampling:
- *
- * On vulnerable systems, store buffer entries are statically partitioned
- * between active threads. When entering idle, our store buffer entries
- * are re-partitioned to allow the other threads to use them.
- *
- * Flush the buffers to ensure that no sensitive data of ours can be
- * leaked by a sibling after it gets our store buffer entries.
- *
- * Note: VERW must be encoded with a memory operand, as it is only that
- * form which causes a flush.
- */
- alternative_input("", "verw %[sel]", X86_FEATURE_SC_VERW_IDLE,
- [sel] "m" (info->verw_sel));
+ if ( verw ) /* Expected to be const-propagated. */
+ __spec_ctrl_enter_idle_verw(info);
/*
* Cross-Thread Return Address Predictions:
@@ -178,6 +179,12 @@ static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
: "rax", "rcx");
}
+/* WARNING! `ret`, `call *`, `jmp *` not safe after this call. */
+static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
+{
+ __spec_ctrl_enter_idle(info, true /* VERW */);
+}
+
/* WARNING! `ret`, `call *`, `jmp *` not safe before this call. */
static always_inline void spec_ctrl_exit_idle(struct cpu_info *info)
{
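The resulting ordering in mwait_idle_with_hints() is easier to see stripped of
the alternatives machinery. A hedged sketch with stubbed-out helpers
(hypothetical names, not the real Xen code):

    /* Stubs standing in for the real primitives. */
    static void advertise_in_mwait(void) { /* movb $1, in_mwait */ }
    static void clear_in_mwait(void)     { /* movb $0, in_mwait */ }
    static void verw_scrub(void)         { /* verw with a memory operand */ }
    static void monitor_line(const volatile void *p) { (void)p; /* monitor */ }
    static void mwait_sleep(void)        { /* mwait */ }

    static void idle_entry_sketch(const volatile unsigned int *pending)
    {
        if ( *pending )                /* heuristic early exit: perf only */
            return;

        advertise_in_mwait();          /* promise to notice writes to *pending */
        verw_scrub();                  /* VERW first: on AMD it cancels MONITOR */
        monitor_line(pending);         /* arm MONITOR only after the VERW */

        if ( !*pending )               /* correctness check, IRQs disabled */
            mwait_sleep();

        clear_in_mwait();
    }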
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Thu, 29 Aug 2024 17:36:11 +0100
Subject: x86/spec-ctrl: Mitigate Transitive Scheduler Attacks
TSA affects AMD Fam19h CPUs (Zen3 and 4 microarchitectures).
Three new CPUID bits have been defined. Two (TSA_SQ_NO and TSA_L1_NO)
indicate that the system is unaffected, and must be synthesised by Xen on
unaffected parts to date.
A third new bit indicates that VERW now has a flushing side effect. Xen must
synthesise this bit on affected systems based on microcode version. As with
other VERW-based flushing features, VERW_CLEAR needs OR-ing across a resource
pool, and guests which have seen it can safely migrate in.
This is part of XSA-471 / CVE-2024-36350 / CVE-2024-36357.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
index 7704980b8a9b..04035d565eb5 100644
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -194,6 +194,7 @@ static const char *const str_7a1[32] =
static const char *const str_e21a[32] =
{
[ 2] = "lfence+",
+ /* 4 */ [ 5] = "verw-clear",
[ 6] = "nscb",
[ 8] = "auto-ibrs",
@@ -246,6 +247,8 @@ static const char *const str_m10Al[64] =
static const char *const str_e21c[32] =
{
+ /* 0 */ [ 1] = "tsa-sq-no",
+ [ 2] = "tsa-l1-no",
};
static const struct {
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index b4d3fa824363..f259c77435ea 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -476,6 +476,17 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
if ( test_bit(X86_FEATURE_RTM, fs) )
__set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
break;
+
+ case X86_VENDOR_AMD:
+ /*
+ * This bit indicates that the VERW instruction may have gained
+ * scrubbing side effects. With pooling, it means "you might migrate
+ * somewhere where scrubbing is necessary", and may need exposing on
+ * unaffected hardware. This is fine, because the VERW instruction
+ * has been around since the 286.
+ */
+ __set_bit(X86_FEATURE_VERW_CLEAR, fs);
+ break;
}
/*
@@ -554,6 +565,17 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
__set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
}
break;
+
+ case X86_VENDOR_AMD:
+ /*
+ * This bit indicates that the VERW instruction may have gained
+ * scrubbing side effects. The max policy has it set for migration
+ * reasons, so reset the default policy back to the host value in case
+ * we're unaffected.
+ */
+ if ( !cpu_has_verw_clear )
+ __clear_bit(X86_FEATURE_VERW_CLEAR, fs);
+ break;
}
/*
diff --git a/xen/arch/x86/hvm/svm/entry.S b/xen/arch/x86/hvm/svm/entry.S
index 8779856fb5a6..9233d6fbfbc5 100644
--- a/xen/arch/x86/hvm/svm/entry.S
+++ b/xen/arch/x86/hvm/svm/entry.S
@@ -94,6 +94,9 @@ __UNLIKELY_END(nsvm_hap)
pop %rdi
sti
+
+ SPEC_CTRL_COND_VERW /* Req: %rsp=eframe Clob: efl */
+
vmrun
SAVE_ALL
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index 020414e98c4d..801a8cbbc016 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -150,6 +150,7 @@
/* CPUID level 0x80000021.eax */
#define cpu_has_lfence_dispatch boot_cpu_has(X86_FEATURE_LFENCE_DISPATCH)
+#define cpu_has_verw_clear boot_cpu_has(X86_FEATURE_VERW_CLEAR)
#define cpu_has_nscb boot_cpu_has(X86_FEATURE_NSCB)
/* MSR_ARCH_CAPS */
@@ -170,6 +171,10 @@
#define cpu_has_rfds_clear boot_cpu_has(X86_FEATURE_RFDS_CLEAR)
#define cpu_has_its_no boot_cpu_has(X86_FEATURE_ITS_NO)
+/* CPUID level 0x80000021.ecx */
+#define cpu_has_tsa_sq_no boot_cpu_has(X86_FEATURE_TSA_SQ_NO)
+#define cpu_has_tsa_l1_no boot_cpu_has(X86_FEATURE_TSA_L1_NO)
+
/* Synthesized. */
#define cpu_has_arch_perfmon boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
#define cpu_has_cpuid_faulting boot_cpu_has(X86_FEATURE_CPUID_FAULTING)
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 559ee90b44dc..5005f0acdde9 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -496,7 +496,7 @@ custom_param("pv-l1tf", parse_pv_l1tf);
static void __init print_details(enum ind_thunk thunk)
{
- unsigned int _7d0 = 0, _7d2 = 0, e8b = 0, e21a = 0, max = 0, tmp;
+ unsigned int _7d0 = 0, _7d2 = 0, e8b = 0, e21a = 0, e21c = 0, max = 0, tmp;
uint64_t caps = 0;
/* Collect diagnostics about available mitigations. */
@@ -507,7 +507,7 @@ static void __init print_details(enum ind_thunk thunk)
if ( boot_cpu_data.extended_cpuid_level >= 0x80000008 )
cpuid(0x80000008, &tmp, &e8b, &tmp, &tmp);
if ( boot_cpu_data.extended_cpuid_level >= 0x80000021 )
- cpuid(0x80000021, &e21a, &tmp, &tmp, &tmp);
+ cpuid(0x80000021U, &e21a, &tmp, &e21c, &tmp);
if ( cpu_has_arch_caps )
rdmsrl(MSR_ARCH_CAPABILITIES, caps);
@@ -517,7 +517,7 @@ static void __init print_details(enum ind_thunk thunk)
* Hardware read-only information, stating immunity to certain issues, or
* suggestions of which mitigation to use.
*/
- printk(" Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+ printk(" Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
(caps & ARCH_CAPS_RDCL_NO) ? " RDCL_NO" : "",
(caps & ARCH_CAPS_EIBRS) ? " EIBRS" : "",
(caps & ARCH_CAPS_RSBA) ? " RSBA" : "",
@@ -541,10 +541,12 @@ static void __init print_details(enum ind_thunk thunk)
(e8b & cpufeat_mask(X86_FEATURE_BTC_NO)) ? " BTC_NO" : "",
(e8b & cpufeat_mask(X86_FEATURE_IBPB_RET)) ? " IBPB_RET" : "",
(e21a & cpufeat_mask(X86_FEATURE_IBPB_BRTYPE)) ? " IBPB_BRTYPE" : "",
- (e21a & cpufeat_mask(X86_FEATURE_SRSO_NO)) ? " SRSO_NO" : "");
+ (e21a & cpufeat_mask(X86_FEATURE_SRSO_NO)) ? " SRSO_NO" : "",
+ (e21c & cpufeat_mask(X86_FEATURE_TSA_SQ_NO)) ? " TSA_SQ_NO" : "",
+ (e21c & cpufeat_mask(X86_FEATURE_TSA_L1_NO)) ? " TSA_L1_NO" : "");
/* Hardware features which need driving to mitigate issues. */
- printk(" Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+ printk(" Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
(e8b & cpufeat_mask(X86_FEATURE_IBPB)) ||
(_7d0 & cpufeat_mask(X86_FEATURE_IBRSB)) ? " IBPB" : "",
(e8b & cpufeat_mask(X86_FEATURE_IBRS)) ||
@@ -563,7 +565,8 @@ static void __init print_details(enum ind_thunk thunk)
(caps & ARCH_CAPS_FB_CLEAR_CTRL) ? " FB_CLEAR_CTRL" : "",
(caps & ARCH_CAPS_GDS_CTRL) ? " GDS_CTRL" : "",
(caps & ARCH_CAPS_RFDS_CLEAR) ? " RFDS_CLEAR" : "",
- (e21a & cpufeat_mask(X86_FEATURE_SBPB)) ? " SBPB" : "");
+ (e21a & cpufeat_mask(X86_FEATURE_SBPB)) ? " SBPB" : "",
+ (e21a & cpufeat_mask(X86_FEATURE_VERW_CLEAR)) ? " VERW_CLEAR" : "");
/* Compiled-in support which pertains to mitigations. */
if ( IS_ENABLED(CONFIG_INDIRECT_THUNK) || IS_ENABLED(CONFIG_SHADOW_PAGING) ||
@@ -1526,6 +1529,77 @@ static void __init rfds_calculations(void)
setup_force_cpu_cap(X86_FEATURE_RFDS_NO);
}
+/*
+ * Transient Scheduler Attacks
+ *
+ * https://www.amd.com/content/dam/amd/en/documents/resources/bulletin/technical-guidance-for-mitigating-transient-scheduler-attacks.pdf
+ */
+static void __init tsa_calculations(void)
+{
+ unsigned int curr_rev, min_rev;
+
+ /* TSA is only known to affect AMD processors at this time. */
+ if ( boot_cpu_data.x86_vendor != X86_VENDOR_AMD )
+ return;
+
+ /* If we're virtualised, don't attempt to synthesise anything. */
+ if ( cpu_has_hypervisor )
+ return;
+
+ /*
+ * According to the whitepaper, some Fam1A CPUs (Models 0x00...0x4f,
+ * 0x60...0x7f) are not vulnerable but don't enumerate TSA_{SQ,L1}_NO. If
+ * we see either enumerated, assume both are correct ...
+ */
+ if ( cpu_has_tsa_sq_no || cpu_has_tsa_l1_no )
+ return;
+
+ /*
+ * ... otherwise, synthesise them. CPUs other than Fam19 (Zen3/4) are
+ * stated to be not vulnerable.
+ */
+ if ( boot_cpu_data.x86 != 0x19 )
+ {
+ setup_force_cpu_cap(X86_FEATURE_TSA_SQ_NO);
+ setup_force_cpu_cap(X86_FEATURE_TSA_L1_NO);
+ return;
+ }
+
+ /*
+ * Fam19 CPUs get VERW_CLEAR with new enough microcode, but must
+ * synthesise the CPUID bit.
+ */
+ curr_rev = this_cpu(cpu_sig).rev;
+ switch ( curr_rev >> 8 )
+ {
+ case 0x0a0011: min_rev = 0x0a0011d7; break;
+ case 0x0a0012: min_rev = 0x0a00123b; break;
+ case 0x0a0082: min_rev = 0x0a00820d; break;
+ case 0x0a1011: min_rev = 0x0a10114c; break;
+ case 0x0a1012: min_rev = 0x0a10124c; break;
+ case 0x0a1081: min_rev = 0x0a108109; break;
+ case 0x0a2010: min_rev = 0x0a20102e; break;
+ case 0x0a2012: min_rev = 0x0a201211; break;
+ case 0x0a4041: min_rev = 0x0a404108; break;
+ case 0x0a5000: min_rev = 0x0a500012; break;
+ case 0x0a6012: min_rev = 0x0a60120a; break;
+ case 0x0a7041: min_rev = 0x0a704108; break;
+ case 0x0a7052: min_rev = 0x0a705208; break;
+ case 0x0a7080: min_rev = 0x0a708008; break;
+ case 0x0a70c0: min_rev = 0x0a70c008; break;
+ case 0x0aa002: min_rev = 0x0aa00216; break;
+ default:
+ printk(XENLOG_WARNING
+ "Unrecognised CPU %02x-%02x-%02x, ucode 0x%08x for TSA mitigation\n",
+ boot_cpu_data.x86, boot_cpu_data.x86_model,
+ boot_cpu_data.x86_mask, curr_rev);
+ return;
+ }
+
+ if ( curr_rev >= min_rev )
+ setup_force_cpu_cap(X86_FEATURE_VERW_CLEAR);
+}
+
static bool __init cpu_has_gds(void)
{
/*
@@ -2219,6 +2293,7 @@ void __init init_speculation_mitigations(void)
* https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-analysis-microarchitectural-data-sampling.html
* https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
* https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/register-file-data-sampling.html
+ * https://www.amd.com/content/dam/amd/en/documents/resources/bulletin/technical-guidance-for-mitigating-transient-scheduler-attacks.pdf
*
* Relevant ucodes:
*
@@ -2251,9 +2326,18 @@ void __init init_speculation_mitigations(void)
*
* - March 2023, for RFDS. Enumerate RFDS_CLEAR to mean that VERW now
* scrubs non-architectural entries from certain register files.
+ *
+ * - July 2025, for TSA. Introduces VERW side effects to mitigate
+ * TSA_{SQ/L1}. Xen must synthesise the VERW_CLEAR feature based on
+ * microcode version.
+ *
+ * Note, these microcode updates were produced before the remediation of
+ * the microcode signature issues, and are included in the firmware
+ * updates fixing the entrysign vulnerability from ~December 2024.
*/
mds_calculations();
rfds_calculations();
+ tsa_calculations();
/*
* Parts which enumerate FB_CLEAR are those with now-updated microcode
@@ -2285,21 +2369,27 @@ void __init init_speculation_mitigations(void)
* MLPDS/MFBDS when SMT is enabled.
*/
if ( opt_verw_pv == -1 )
- opt_verw_pv = cpu_has_useful_md_clear || cpu_has_rfds_clear;
+ opt_verw_pv = (cpu_has_useful_md_clear || cpu_has_rfds_clear ||
+ cpu_has_verw_clear);
if ( opt_verw_hvm == -1 )
- opt_verw_hvm = cpu_has_useful_md_clear || cpu_has_rfds_clear;
+ opt_verw_hvm = (cpu_has_useful_md_clear || cpu_has_rfds_clear ||
+ cpu_has_verw_clear);
/*
- * If SMT is active, and we're protecting against MDS or MMIO stale data,
+ * If SMT is active, and we're protecting against any of:
+ * - MSBDS
+ * - MMIO stale data
+ * - TSA-SQ
* we need to scrub before going idle as well as on return to guest.
* Various pipeline resources are repartitioned amongst non-idle threads.
*
- * We don't need to scrub on idle for RFDS. There are no affected cores
- * which support SMT, despite there being affected cores in hybrid systems
- * which have SMT elsewhere in the platform.
+ * We don't need to scrub on idle for:
+ * - RFDS (no SMT affected cores)
+ * - TSA-L1 (utags never shared between threads)
*/
if ( ((cpu_has_useful_md_clear && (opt_verw_pv || opt_verw_hvm)) ||
+ (cpu_has_verw_clear && !cpu_has_tsa_sq_no) ||
opt_verw_mmio) && hw_smt_enabled )
setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index 03cd1419c5cb..42db132b4c2f 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -284,6 +284,7 @@ XEN_CPUFEATURE(FSRCS, 10*32+12) /*A Fast Short REP CMPSB/SCASB */
/* AMD-defined CPU features, CPUID level 0x80000021.eax, word 11 */
XEN_CPUFEATURE(LFENCE_DISPATCH, 11*32+ 2) /*A LFENCE always serializing */
+XEN_CPUFEATURE(VERW_CLEAR, 11*32+ 5) /*!A VERW clears microarchitectural buffers */
XEN_CPUFEATURE(NSCB, 11*32+ 6) /*A Null Selector Clears Base (and limit too) */
XEN_CPUFEATURE(AUTO_IBRS, 11*32+ 8) /* Automatic IBRS */
XEN_CPUFEATURE(SBPB, 11*32+27) /*A Selective Branch Predictor Barrier */
@@ -338,6 +339,8 @@ XEN_CPUFEATURE(RFDS_CLEAR, 16*32+28) /*!A Register File(s) cleared by VE
XEN_CPUFEATURE(ITS_NO, 16*32+62) /*!A No Indirect Target Selection */
/* AMD-defined CPU features, CPUID level 0x80000021.ecx, word 18 */
+XEN_CPUFEATURE(TSA_SQ_NO, 18*32+ 1) /*A No Store Queue Transitive Scheduler Attacks */
+XEN_CPUFEATURE(TSA_L1_NO, 18*32+ 2) /*A No L1D Transitive Scheduler Attacks */
#endif /* XEN_CPUFEATURE */
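For reference, the new bits live in CPUID leaf 0x80000021: VERW_CLEAR is bit 5
of EAX, while TSA_SQ_NO and TSA_L1_NO are bits 1 and 2 of ECX. A hedged
userspace sketch using GCC's <cpuid.h> (illustrative only; it is not how Xen
probes the bits):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

        if ( !__get_cpuid_count(0x80000021, 0, &eax, &ebx, &ecx, &edx) )
        {
            puts("CPUID leaf 0x80000021 not available");
            return 0;
        }

        printf("VERW_CLEAR: %u\n", (eax >> 5) & 1);
        printf("TSA_SQ_NO:  %u\n", (ecx >> 1) & 1);
        printf("TSA_L1_NO:  %u\n", (ecx >> 2) & 1);
        return 0;
    }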
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 10 Sep 2024 20:59:37 +0100
Subject: x86/cpufeature: Reposition cpu_has_{lfence_dispatch,nscb}
LFENCE_DISPATCH used to be a synthetic feature, but was given a real CPUID bit
by AMD. The define wasn't moved when this was changed.
NSCB has always been a real CPUID bit, and was misplaced when introduced in
the synthetic block alongside LFENCE_DISPATCH.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 6a039b050071eba644ab414d76ac5d5fc9e067a5)
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index 515f3f64d55b..919a9e31f04e 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -190,6 +190,10 @@ static inline bool boot_cpu_has(unsigned int feat)
#define cpu_has_avx512_bf16 boot_cpu_has(X86_FEATURE_AVX512_BF16)
#define cpu_has_avx_ifma boot_cpu_has(X86_FEATURE_AVX_IFMA)
+/* CPUID level 0x80000021.eax */
+#define cpu_has_lfence_dispatch boot_cpu_has(X86_FEATURE_LFENCE_DISPATCH)
+#define cpu_has_nscb boot_cpu_has(X86_FEATURE_NSCB)
+
/* CPUID level 0x00000007:1.edx */
#define cpu_has_avx_vnni_int8 boot_cpu_has(X86_FEATURE_AVX_VNNI_INT8)
#define cpu_has_avx_ne_convert boot_cpu_has(X86_FEATURE_AVX_NE_CONVERT)
@@ -218,8 +222,6 @@ static inline bool boot_cpu_has(unsigned int feat)
#define cpu_has_arch_perfmon boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
#define cpu_has_cpuid_faulting boot_cpu_has(X86_FEATURE_CPUID_FAULTING)
#define cpu_has_aperfmperf boot_cpu_has(X86_FEATURE_APERFMPERF)
-#define cpu_has_lfence_dispatch boot_cpu_has(X86_FEATURE_LFENCE_DISPATCH)
-#define cpu_has_nscb boot_cpu_has(X86_FEATURE_NSCB)
#define cpu_has_xen_lbr boot_cpu_has(X86_FEATURE_XEN_LBR)
#define cpu_has_xen_shstk boot_cpu_has(X86_FEATURE_XEN_SHSTK)
#define cpu_has_xen_ibt boot_cpu_has(X86_FEATURE_XEN_IBT)
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Apr 2025 14:59:01 +0100
Subject: x86/idle: Move monitor()/mwait() wrappers into cpu-idle.c
They're not used by any other translation unit, so shouldn't live in
asm/processor.h, which is included almost everywhere.
Our new toolchain baseline knows the MONITOR/MWAIT instructions, so use them
directly rather than using raw hex.
Change the hint/extension parameters from long to int. They're specified to
remain 32-bit operands even in 64-bit mode.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 61e10fc28ccddff7c72c14acec56dc7ef2b155d1)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 040bab60b6fb..ec2d570dc27b 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -59,6 +59,19 @@
/*#define DEBUG_PM_CX*/
+static always_inline void monitor(
+ const void *addr, unsigned int ecx, unsigned int edx)
+{
+ asm volatile ( "monitor"
+ :: "a" (addr), "c" (ecx), "d" (edx) );
+}
+
+static always_inline void mwait(unsigned int eax, unsigned int ecx)
+{
+ asm volatile ( "mwait"
+ :: "a" (eax), "c" (ecx) );
+}
+
#define GET_HW_RES_IN_NS(msr, val) \
do { rdmsrl(msr, val); val = tsc_ticks2ns(val); } while( 0 )
#define GET_MC6_RES(val) GET_HW_RES_IN_NS(0x664, val)
@@ -482,7 +495,7 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
mb();
}
- __monitor(monitor_addr, 0, 0);
+ monitor(monitor_addr, 0, 0);
smp_mb();
/*
@@ -496,7 +509,7 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
cpumask_set_cpu(cpu, &cpuidle_mwait_flags);
spec_ctrl_enter_idle(info);
- __mwait(eax, ecx);
+ mwait(eax, ecx);
spec_ctrl_exit_idle(info);
cpumask_clear_cpu(cpu, &cpuidle_mwait_flags);
@@ -927,9 +940,9 @@ void cf_check acpi_dead_idle(void)
*/
mb();
clflush(mwait_ptr);
- __monitor(mwait_ptr, 0, 0);
+ monitor(mwait_ptr, 0, 0);
mb();
- __mwait(cx->address, 0);
+ mwait(cx->address, 0);
}
}
else if ( (current_cpu_data.x86_vendor &
diff --git a/xen/arch/x86/include/asm/processor.h b/xen/arch/x86/include/asm/processor.h
index 1bba4c5002c6..c5e5c72341ad 100644
--- a/xen/arch/x86/include/asm/processor.h
+++ b/xen/arch/x86/include/asm/processor.h
@@ -312,23 +312,6 @@ static always_inline void set_in_cr4 (unsigned long mask)
write_cr4(read_cr4() | mask);
}
-static always_inline void __monitor(const void *eax, unsigned long ecx,
- unsigned long edx)
-{
- /* "monitor %eax,%ecx,%edx;" */
- asm volatile (
- ".byte 0x0f,0x01,0xc8;"
- : : "a" (eax), "c" (ecx), "d"(edx) );
-}
-
-static always_inline void __mwait(unsigned long eax, unsigned long ecx)
-{
- /* "mwait %eax,%ecx;" */
- asm volatile (
- ".byte 0x0f,0x01,0xc9;"
- : : "a" (eax), "c" (ecx) );
-}
-
#define IOBMP_BYTES 8192
#define IOBMP_INVALID_OFFSET 0x8000
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Apr 2025 15:55:29 +0100
Subject: x86/idle: Remove MFENCEs for CLFLUSH_MONITOR
Commit 48d32458bcd4 ("x86, idle: add barriers to CLFLUSH workaround") was
inherited from Linux and added MFENCEs around the AAI65 errata fix.
The SDM now states:
Executions of the CLFLUSH instruction are ordered with respect to each
other and with respect to writes, locked read-modify-write instructions,
and fence instructions[1].
with footnote 1 reading:
Earlier versions of this manual specified that executions of the CLFLUSH
instruction were ordered only by the MFENCE instruction. All processors
implementing the CLFLUSH instruction also order it relative to the other
operations enumerated above.
I.e. the MFENCEs came about because of an incorrect statement in the SDM.
The Spec Update (no longer available on Intel's website) simply says "issue a
CLFLUSH", with no mention of MFENCEs.
As this erratum is specific to Intel, it's fine to remove the MFENCEs; AMD
CPUs of a similar vintage do sport otherwise-unordered CLFLUSHs.
Move the feature bit into the BUG range (rather than FEATURE), and move the
workaround into monitor() itself.
The erratum check itself must use setup_force_cpu_cap(). It needs activating
if any CPU needs it, not if all of them need it.
Fixes: 48d32458bcd4 ("x86, idle: add barriers to CLFLUSH workaround")
Fixes: 96d1b237ae9b ("x86/Intel: work around Xeon 7400 series erratum AAI65")
Link: https://web.archive.org/web/20090219054841/http://download.intel.com/design/xeon/specupdt/32033601.pdf
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f77ef3443542a2c2bbd59ee66178287d4fa5b43f)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index ec2d570dc27b..f1b9e2fbf6e7 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -62,6 +62,9 @@
static always_inline void monitor(
const void *addr, unsigned int ecx, unsigned int edx)
{
+ alternative_input("", "clflush (%[addr])", X86_BUG_CLFLUSH_MONITOR,
+ [addr] "a" (addr));
+
asm volatile ( "monitor"
:: "a" (addr), "c" (ecx), "d" (edx) );
}
@@ -488,13 +491,6 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
s_time_t expires = per_cpu(timer_deadline, cpu);
const void *monitor_addr = &mwait_wakeup(cpu);
- if ( boot_cpu_has(X86_FEATURE_CLFLUSH_MONITOR) )
- {
- mb();
- clflush(monitor_addr);
- mb();
- }
-
monitor(monitor_addr, 0, 0);
smp_mb();
@@ -929,19 +925,7 @@ void cf_check acpi_dead_idle(void)
while ( 1 )
{
- /*
- * 1. The CLFLUSH is a workaround for erratum AAI65 for
- * the Xeon 7400 series.
- * 2. The WBINVD is insufficient due to the spurious-wakeup
- * case where we return around the loop.
- * 3. Unlike wbinvd, clflush is a light weight but not serializing
- * instruction, hence memory fence is necessary to make sure all
- * load/store visible before flush cache line.
- */
- mb();
- clflush(mwait_ptr);
monitor(mwait_ptr, 0, 0);
- mb();
mwait(cx->address, 0);
}
}
diff --git a/xen/arch/x86/cpu/intel.c b/xen/arch/x86/cpu/intel.c
index 6f2b3fffdd34..04a002e1e0c9 100644
--- a/xen/arch/x86/cpu/intel.c
+++ b/xen/arch/x86/cpu/intel.c
@@ -446,6 +446,7 @@ static void __init probe_mwait_errata(void)
*
* Xeon 7400 erratum AAI65 (and further newer Xeons)
* MONITOR/MWAIT may have excessive false wakeups
+ * https://web.archive.org/web/20090219054841/http://download.intel.com/design/xeon/specupdt/32033601.pdf
*/
static void Intel_errata_workarounds(struct cpuinfo_x86 *c)
{
@@ -463,7 +464,7 @@ static void Intel_errata_workarounds(struct cpuinfo_x86 *c)
if (c->x86 == 6 && cpu_has_clflush &&
(c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47))
- __set_bit(X86_FEATURE_CLFLUSH_MONITOR, c->x86_capability);
+ setup_force_cpu_cap(X86_BUG_CLFLUSH_MONITOR);
probe_c3_errata(c);
if (system_state < SYS_STATE_smp_boot)
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index 9e3ed21c026d..84c93292c80c 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -19,7 +19,7 @@ XEN_CPUFEATURE(ARCH_PERFMON, X86_SYNTH( 3)) /* Intel Architectural PerfMon
XEN_CPUFEATURE(TSC_RELIABLE, X86_SYNTH( 4)) /* TSC is known to be reliable */
XEN_CPUFEATURE(XTOPOLOGY, X86_SYNTH( 5)) /* cpu topology enum extensions */
XEN_CPUFEATURE(CPUID_FAULTING, X86_SYNTH( 6)) /* cpuid faulting */
-XEN_CPUFEATURE(CLFLUSH_MONITOR, X86_SYNTH( 7)) /* clflush reqd with monitor */
+/* Bit 7 unused */
XEN_CPUFEATURE(APERFMPERF, X86_SYNTH( 8)) /* APERFMPERF */
XEN_CPUFEATURE(MFENCE_RDTSC, X86_SYNTH( 9)) /* MFENCE synchronizes RDTSC */
XEN_CPUFEATURE(XEN_SMEP, X86_SYNTH(10)) /* SMEP gets used by Xen itself */
@@ -52,6 +52,7 @@ XEN_CPUFEATURE(USE_VMCALL, X86_SYNTH(30)) /* Use VMCALL instead of VMMCAL
#define X86_BUG_NULL_SEG X86_BUG( 1) /* NULL-ing a selector preserves the base and limit. */
#define X86_BUG_CLFLUSH_MFENCE X86_BUG( 2) /* MFENCE needed to serialise CLFLUSH */
#define X86_BUG_IBPB_NO_RET X86_BUG( 3) /* IBPB doesn't flush the RSB/RAS */
+#define X86_BUG_CLFLUSH_MONITOR X86_BUG( 4) /* MONITOR requires CLFLUSH */
#define X86_SPEC_NO_LFENCE_ENTRY_PV X86_BUG(16) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_PV. */
#define X86_SPEC_NO_LFENCE_ENTRY_INTR X86_BUG(17) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_INTR. */
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 24 Jun 2025 15:20:52 +0100
Subject: Revert part of "x86/mwait-idle: disable IBRS during long idle"
Most of the patch (handling of CPUIDLE_FLAG_IBRS) is fine, but the
adjustments to mwait_idle() are not; spec_ctrl_enter_idle() does more than
just alter MSR_SPEC_CTRL.IBRS.
The only reason this doesn't need an XSA is that the unconditional
spec_ctrl_{enter,exit}_idle() calls in mwait_idle_with_hints() were left
unaltered, and thus the MWAIT remained properly protected.
There (would have been) two problems. In the ibrs_disable (== deep C) case:
* On entry, VERW and RSB-stuffing are architecturally skipped.
* On exit, there's a branch crossing the WRMSR which reinstates the
speculative safety for indirect branches.
All this change did was double up the expensive operations in the deep C case,
and fail to optimise the intended case.
I have an idea of how to plumb this more nicely, but it requires larger
changes to legacy IBRS handling to not make spec_ctrl_enter_idle() vulnerable
in other ways. In the short term, simply take out the perf hit.
Fixes: 08acdf9a2615 ("x86/mwait-idle: disable IBRS during long idle")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 07d7163334a7507d329958b19d976be769580999)
diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index ff5c808bc952..e95fe5d88907 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -891,7 +891,6 @@ static const struct cpuidle_state snr_cstates[] = {
static void cf_check mwait_idle(void)
{
unsigned int cpu = smp_processor_id();
- struct cpu_info *info = get_cpu_info();
struct acpi_processor_power *power = processor_powers[cpu];
struct acpi_processor_cx *cx = NULL;
unsigned int next_state;
@@ -918,6 +917,8 @@ static void cf_check mwait_idle(void)
pm_idle_save();
else
{
+ struct cpu_info *info = get_cpu_info();
+
spec_ctrl_enter_idle(info);
safe_halt();
spec_ctrl_exit_idle(info);
@@ -944,11 +945,6 @@ static void cf_check mwait_idle(void)
if ((cx->type >= 3) && errata_c6_workaround())
cx = power->safe_state;
- if (cx->ibrs_disable) {
- ASSERT(!cx->irq_enable_early);
- spec_ctrl_enter_idle(info);
- }
-
#if 0 /* XXX Can we/do we need to do something similar on Xen? */
/*
* leave_mm() to avoid costly and often unnecessary wakeups
@@ -980,10 +976,6 @@ static void cf_check mwait_idle(void)
/* Now back in C0. */
update_idle_stats(power, cx, before, after);
-
- if (cx->ibrs_disable)
- spec_ctrl_exit_idle(info);
-
local_irq_enable();
TRACE_6D(TRC_PM_IDLE_EXIT, cx->type, after,
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Fri, 27 Jun 2025 14:46:01 +0100
Subject: x86/cpu-policy: Simplify logic in
guest_common_default_feature_adjustments()
For features which are unconditionally set in the max policies, making the
default policy match the host can be done with a conditional clear.
This is simpler than the unconditional clear, conditional set currently
performed.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 30f8fed68f3c2e63594ff9202b3d05b971781e36)
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 7745d5d2d50d..608e03fe5e3b 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -540,17 +540,14 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
* reasons, so reset the default policy back to the host values in
* case we're unaffected.
*/
- __clear_bit(X86_FEATURE_MD_CLEAR, fs);
- if ( cpu_has_md_clear )
- __set_bit(X86_FEATURE_MD_CLEAR, fs);
+ if ( !cpu_has_md_clear )
+ __clear_bit(X86_FEATURE_MD_CLEAR, fs);
- __clear_bit(X86_FEATURE_FB_CLEAR, fs);
- if ( cpu_has_fb_clear )
- __set_bit(X86_FEATURE_FB_CLEAR, fs);
+ if ( !cpu_has_fb_clear )
+ __clear_bit(X86_FEATURE_FB_CLEAR, fs);
- __clear_bit(X86_FEATURE_RFDS_CLEAR, fs);
- if ( cpu_has_rfds_clear )
- __set_bit(X86_FEATURE_RFDS_CLEAR, fs);
+ if ( !cpu_has_rfds_clear )
+ __clear_bit(X86_FEATURE_RFDS_CLEAR, fs);
/*
* The Gather Data Sampling microcode mitigation (August 2023) has an
@@ -570,13 +567,11 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
* Topology information is at the toolstack's discretion so these are
* unconditionally set in max, but pick a default which matches the host.
*/
- __clear_bit(X86_FEATURE_HTT, fs);
- if ( cpu_has_htt )
- __set_bit(X86_FEATURE_HTT, fs);
+ if ( !cpu_has_htt )
+ __clear_bit(X86_FEATURE_HTT, fs);
- __clear_bit(X86_FEATURE_CMP_LEGACY, fs);
- if ( cpu_has_cmp_legacy )
- __set_bit(X86_FEATURE_CMP_LEGACY, fs);
+ if ( !cpu_has_cmp_legacy )
+ __clear_bit(X86_FEATURE_CMP_LEGACY, fs);
/*
* On certain hardware, speculative or errata workarounds can result in
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 11:33:41 +0100
Subject: x86/cpu-policy: Fix handling of leaf 0x80000021
When support was originally introduced, ebx, ecx and edx were reserved and
should have been zeroed in recalculate_misc() to avoid leaking into guests.
Since then, fields have been added into ebx. Guests can't load microcode, so
shouldn't see ucode_size, and while in principle we do want to support larger
RAP sizes in guests, virtualising this for guests depends on AMD producing
official documentation for ERAPS, which is long overdue and has no ETA.
This patch will cause a difference in guests on Zen5 CPUs, but as the main
ERAPS feature is hidden, guests should be ignoring the rap_size field too.
Fixes: e9b4fe263649 ("x86/cpuid: support LFENCE always serialising CPUID bit")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 10dc35c516f7b9224590a7a4e2722bbfd70fa87a)
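As a rough illustration of the layout introduced below (a sketch only; the
helper names are hypothetical and the field meanings are taken from the
struct comments), CPUID 0x80000021.ebx could be decoded as:

#include <stdint.h>

/* Sketch: decode CPUID 0x80000021.ebx per the field layout added below. */
static inline uint32_t ucode_size_bytes(uint32_t ebx)
{
    return (ebx & 0xffffU) * 16;        /* ucode_size: units of 16 bytes */
}

static inline uint32_t rap_entries(uint32_t ebx)
{
    return ((ebx >> 16) & 0xffU) * 8;   /* rap_size: units of 8 entries */
}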
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 608e03fe5e3b..8f332fdbd9ae 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -341,6 +341,9 @@ static void recalculate_misc(struct cpu_policy *p)
p->extd.raw[0x1e] = EMPTY_LEAF; /* TopoExt APIC ID/Core/Node */
p->extd.raw[0x1f] = EMPTY_LEAF; /* SEV */
p->extd.raw[0x20] = EMPTY_LEAF; /* Platform QoS */
+ p->extd.raw[0x21].b = 0;
+ p->extd.raw[0x21].c = 0;
+ p->extd.raw[0x21].d = 0;
break;
}
}
diff --git a/xen/include/xen/lib/x86/cpu-policy.h b/xen/include/xen/lib/x86/cpu-policy.h
index bab3eecda6c1..f335929a70c4 100644
--- a/xen/include/xen/lib/x86/cpu-policy.h
+++ b/xen/include/xen/lib/x86/cpu-policy.h
@@ -324,7 +324,10 @@ struct cpu_policy
uint32_t e21a;
struct { DECL_BITFIELD(e21a); };
};
- uint32_t /* b */:32, /* c */:32, /* d */:32;
+ uint16_t ucode_size; /* Units of 16 bytes */
+ uint8_t rap_size; /* Units of 8 entries */
+ uint8_t :8;
+ uint32_t /* c */:32, /* d */:32;
};
} extd;
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 15:51:53 +0100
Subject: x86/idle: Remove broken MWAIT implementation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
cpuidle_wakeup_mwait() is a TOCTOU race. The cpumask_and() sampling
cpuidle_mwait_flags can take an arbitrary period of time, and there's no
guarantee that the target CPUs are still in MWAIT when writing into
mwait_wakeup(cpu).
The consequence of the race is that we'll fail to IPI certain targets. Also,
there's no guarantee that mwait_idle_with_hints() will raise a TIMER_SOFTIRQ
on its way out.
The fundamental bug is that the "in_mwait" variable needs to be in the
monitored line, and not in a separate cpuidle_mwait_flags variable, in order
to do this in a race-free way.
Arranging to fix this while keeping the old implementation is prohibitive, so
strip the current one out in order to implement the new one cleanly. In the
interim, this causes IPIs to be used unconditionally which is safe albeit
suboptimal.
Fixes: 3d521e933e1b ("cpuidle: mwait on softirq_pending & remove wakeup ipis")
Fixes: 1adb34ea846d ("CPUIDLE: re-implement mwait wakeup process")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 3faf0866a33070b926ab78e6298290403f85e76c)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index f1b9e2fbf6e7..1b316e849d6a 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -448,27 +448,6 @@ static int __init cf_check cpu_idle_key_init(void)
}
__initcall(cpu_idle_key_init);
-/*
- * The bit is set iff cpu use monitor/mwait to enter C state
- * with this flag set, CPU can be waken up from C state
- * by writing to specific memory address, instead of sending an IPI.
- */
-static cpumask_t cpuidle_mwait_flags;
-
-void cpuidle_wakeup_mwait(cpumask_t *mask)
-{
- cpumask_t target;
- unsigned int cpu;
-
- cpumask_and(&target, mask, &cpuidle_mwait_flags);
-
- /* CPU is MWAITing on the cpuidle_mwait_wakeup flag. */
- for_each_cpu(cpu, &target)
- mwait_wakeup(cpu) = 0;
-
- cpumask_andnot(mask, mask, &target);
-}
-
/* Force sending of a wakeup IPI regardless of mwait usage. */
bool __ro_after_init force_mwait_ipi_wakeup;
@@ -477,42 +456,25 @@ bool arch_skip_send_event_check(unsigned int cpu)
if ( force_mwait_ipi_wakeup )
return false;
- /*
- * This relies on softirq_pending() and mwait_wakeup() to access data
- * on the same cache line.
- */
- smp_mb();
- return !!cpumask_test_cpu(cpu, &cpuidle_mwait_flags);
+ return false;
}
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
- s_time_t expires = per_cpu(timer_deadline, cpu);
- const void *monitor_addr = &mwait_wakeup(cpu);
+ const unsigned int *this_softirq_pending = &softirq_pending(cpu);
- monitor(monitor_addr, 0, 0);
+ monitor(this_softirq_pending, 0, 0);
smp_mb();
- /*
- * Timer deadline passing is the event on which we will be woken via
- * cpuidle_mwait_wakeup. So check it now that the location is armed.
- */
- if ( (expires > NOW() || expires == 0) && !softirq_pending(cpu) )
+ if ( !*this_softirq_pending )
{
struct cpu_info *info = get_cpu_info();
- cpumask_set_cpu(cpu, &cpuidle_mwait_flags);
-
spec_ctrl_enter_idle(info);
mwait(eax, ecx);
spec_ctrl_exit_idle(info);
-
- cpumask_clear_cpu(cpu, &cpuidle_mwait_flags);
}
-
- if ( expires <= NOW() && expires > 0 )
- raise_softirq(TIMER_SOFTIRQ);
}
static void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
@@ -913,7 +875,7 @@ void cf_check acpi_dead_idle(void)
if ( cx->entry_method == ACPI_CSTATE_EM_FFH )
{
- void *mwait_ptr = &mwait_wakeup(smp_processor_id());
+ void *mwait_ptr = &softirq_pending(smp_processor_id());
/*
* Cache must be flushed as the last operation before sleeping.
diff --git a/xen/arch/x86/hpet.c b/xen/arch/x86/hpet.c
index 317ef63fb5f8..8f5069df2ae7 100644
--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -187,8 +187,6 @@ static void evt_do_broadcast(cpumask_t *mask)
if ( __cpumask_test_and_clear_cpu(cpu, mask) )
raise_softirq(TIMER_SOFTIRQ);
- cpuidle_wakeup_mwait(mask);
-
if ( !cpumask_empty(mask) )
cpumask_raise_softirq(mask, TIMER_SOFTIRQ);
}
diff --git a/xen/arch/x86/include/asm/hardirq.h b/xen/arch/x86/include/asm/hardirq.h
index 276e3419d778..f3e93cc9b507 100644
--- a/xen/arch/x86/include/asm/hardirq.h
+++ b/xen/arch/x86/include/asm/hardirq.h
@@ -5,11 +5,10 @@
#include <xen/types.h>
typedef struct {
- unsigned int __softirq_pending;
- unsigned int __local_irq_count;
- unsigned int nmi_count;
- unsigned int mce_count;
- bool_t __mwait_wakeup;
+ unsigned int __softirq_pending;
+ unsigned int __local_irq_count;
+ unsigned int nmi_count;
+ unsigned int mce_count;
} __cacheline_aligned irq_cpustat_t;
#include <xen/irq_cpustat.h> /* Standard mappings for irq_cpustat_t above */
diff --git a/xen/include/xen/cpuidle.h b/xen/include/xen/cpuidle.h
index 705d0c1135f0..120e354fe340 100644
--- a/xen/include/xen/cpuidle.h
+++ b/xen/include/xen/cpuidle.h
@@ -92,8 +92,6 @@ extern struct cpuidle_governor *cpuidle_current_governor;
bool cpuidle_using_deep_cstate(void);
void cpuidle_disable_deep_cstate(void);
-extern void cpuidle_wakeup_mwait(cpumask_t *mask);
-
#define CPUIDLE_DRIVER_STATE_START 1
extern void menu_get_trace_data(u32 *expected, u32 *pred);
diff --git a/xen/include/xen/irq_cpustat.h b/xen/include/xen/irq_cpustat.h
index b9629f25c266..5f039b4b9a76 100644
--- a/xen/include/xen/irq_cpustat.h
+++ b/xen/include/xen/irq_cpustat.h
@@ -24,6 +24,5 @@ extern irq_cpustat_t irq_stat[];
/* arch independent irq_stat fields */
#define softirq_pending(cpu) __IRQ_STAT((cpu), __softirq_pending)
#define local_irq_count(cpu) __IRQ_STAT((cpu), __local_irq_count)
-#define mwait_wakeup(cpu) __IRQ_STAT((cpu), __mwait_wakeup)
#endif /* __irq_cpustat_h */
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 18:13:27 +0100
Subject: x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
With the recent simplifications, it becomes obvious that smp_mb() isn't the
right barrier. Strictly speaking, MONITOR is ordered as a load, but smp_rmb()
isn't correct either, as this only pertains to local ordering. All we need is
a compiler barrier().
Merge the barrier() into monitor() itself, along with an explanation.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit e7710dd843ba9d204f6ee2973d6120c1984958a6)
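As a minimal standalone sketch of the point above (x86 GCC/Clang only; the
names are hypothetical, and an empty asm template stands in for MONITOR), it
is the "memory" clobber which stops the compiler caching or hoisting the
subsequent read of the monitored line:

#include <stdint.h>

static uint32_t pending;                /* stand-in for softirq_pending(cpu) */

static inline void arm_monitor_sketch(const void *addr)
{
    /*
     * The "memory" clobber acts as a compiler barrier: reads of the
     * monitored line written after this statement cannot be hoisted above
     * it, nor satisfied from a cached value.
     */
    asm volatile ( "" /* MONITOR would go here */
                   :: "a" (addr), "c" (0), "d" (0) : "memory" );
}

int should_sleep_sketch(void)
{
    arm_monitor_sketch(&pending);
    return pending == 0;                /* not reordered above the asm */
}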
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 1b316e849d6a..8de89b117aa3 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -65,8 +65,12 @@ static always_inline void monitor(
alternative_input("", "clflush (%[addr])", X86_BUG_CLFLUSH_MONITOR,
[addr] "a" (addr));
+ /*
+ * The memory clobber is a compiler barrier. Subsequent reads from the
+ * monitored cacheline must not be reordered over MONITOR.
+ */
asm volatile ( "monitor"
- :: "a" (addr), "c" (ecx), "d" (edx) );
+ :: "a" (addr), "c" (ecx), "d" (edx) : "memory" );
}
static always_inline void mwait(unsigned int eax, unsigned int ecx)
@@ -465,7 +469,6 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
const unsigned int *this_softirq_pending = &softirq_pending(cpu);
monitor(this_softirq_pending, 0, 0);
- smp_mb();
if ( !*this_softirq_pending )
{
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 21:40:51 +0100
Subject: x86/idle: Convert force_mwait_ipi_wakeup to X86_BUG_MONITOR
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
We're going to want to alternative-patch based on it.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit b0ca0f93f47c43f8984981137af07ca3d161e3ec)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 8de89b117aa3..4272c33d1ca4 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -452,14 +452,8 @@ static int __init cf_check cpu_idle_key_init(void)
}
__initcall(cpu_idle_key_init);
-/* Force sending of a wakeup IPI regardless of mwait usage. */
-bool __ro_after_init force_mwait_ipi_wakeup;
-
bool arch_skip_send_event_check(unsigned int cpu)
{
- if ( force_mwait_ipi_wakeup )
- return false;
-
return false;
}
diff --git a/xen/arch/x86/cpu/intel.c b/xen/arch/x86/cpu/intel.c
index 04a002e1e0c9..3274e0f5e8af 100644
--- a/xen/arch/x86/cpu/intel.c
+++ b/xen/arch/x86/cpu/intel.c
@@ -436,7 +436,7 @@ static void __init probe_mwait_errata(void)
{
printk(XENLOG_WARNING
"Forcing IPI MWAIT wakeup due to CPU erratum\n");
- force_mwait_ipi_wakeup = true;
+ setup_force_cpu_cap(X86_BUG_MONITOR);
}
}
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index 84c93292c80c..56231b00f15d 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -53,6 +53,7 @@ XEN_CPUFEATURE(USE_VMCALL, X86_SYNTH(30)) /* Use VMCALL instead of VMMCAL
#define X86_BUG_CLFLUSH_MFENCE X86_BUG( 2) /* MFENCE needed to serialise CLFLUSH */
#define X86_BUG_IBPB_NO_RET X86_BUG( 3) /* IBPB doesn't flush the RSB/RAS */
#define X86_BUG_CLFLUSH_MONITOR X86_BUG( 4) /* MONITOR requires CLFLUSH */
+#define X86_BUG_MONITOR X86_BUG( 5) /* MONITOR doesn't always notice writes (force IPIs) */
#define X86_SPEC_NO_LFENCE_ENTRY_PV X86_BUG(16) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_PV. */
#define X86_SPEC_NO_LFENCE_ENTRY_INTR X86_BUG(17) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_INTR. */
diff --git a/xen/arch/x86/include/asm/mwait.h b/xen/arch/x86/include/asm/mwait.h
index 97bf361505f0..f377d9fdcad4 100644
--- a/xen/arch/x86/include/asm/mwait.h
+++ b/xen/arch/x86/include/asm/mwait.h
@@ -13,9 +13,6 @@
#define MWAIT_ECX_INTERRUPT_BREAK 0x1
-/* Force sending of a wakeup IPI regardless of mwait usage. */
-extern bool force_mwait_ipi_wakeup;
-
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx);
bool mwait_pc10_supported(void);
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 21:04:17 +0100
Subject: xen/softirq: Rework arch_skip_send_event_check() into
arch_set_softirq()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
x86 is the only architecture wanting an optimisation here, but the
test_and_set_bit() is a store into the monitored line (i.e. will wake up the
target) and, prior to the removal of the broken IPI-elision algorithm, was
racy, causing unnecessary IPIs to be sent.
To do this in a race-free way, the store to the monitored line needs to also
sample the status of the target in one atomic action. Implement a new arch
helper with different semantics: make the softirq pending and decide about
IPIs together. For now, implement the default helper. It will be overridden
by x86 in a subsequent change.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit b473e5e212e445d3c193c1c83b52b129af571b19)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 4272c33d1ca4..43ab4533b791 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -452,11 +452,6 @@ static int __init cf_check cpu_idle_key_init(void)
}
__initcall(cpu_idle_key_init);
-bool arch_skip_send_event_check(unsigned int cpu)
-{
- return false;
-}
-
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
diff --git a/xen/arch/x86/include/asm/softirq.h b/xen/arch/x86/include/asm/softirq.h
index 415ee866c79d..e4b194f069fb 100644
--- a/xen/arch/x86/include/asm/softirq.h
+++ b/xen/arch/x86/include/asm/softirq.h
@@ -9,6 +9,4 @@
#define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
#define NR_ARCH_SOFTIRQS 5
-bool arch_skip_send_event_check(unsigned int cpu);
-
#endif /* __ASM_SOFTIRQ_H__ */
diff --git a/xen/common/softirq.c b/xen/common/softirq.c
index 321d26902d37..e733c8f74b44 100644
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -94,9 +94,7 @@ void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
raise_mask = &per_cpu(batch_mask, this_cpu);
for_each_cpu(cpu, mask)
- if ( !test_and_set_bit(nr, &softirq_pending(cpu)) &&
- cpu != this_cpu &&
- !arch_skip_send_event_check(cpu) )
+ if ( !arch_set_softirq(nr, cpu) && cpu != this_cpu )
__cpumask_set_cpu(cpu, raise_mask);
if ( raise_mask == &send_mask )
@@ -107,9 +105,7 @@ void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
{
unsigned int this_cpu = smp_processor_id();
- if ( test_and_set_bit(nr, &softirq_pending(cpu))
- || (cpu == this_cpu)
- || arch_skip_send_event_check(cpu) )
+ if ( arch_set_softirq(nr, cpu) || cpu == this_cpu )
return;
if ( !per_cpu(batching, this_cpu) || in_irq() )
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index 33d6f2ecd223..5c2361865b49 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -21,6 +21,22 @@ enum {
#define NR_SOFTIRQS (NR_COMMON_SOFTIRQS + NR_ARCH_SOFTIRQS)
+/*
+ * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
+ * skipped, false if the IPI cannot be skipped.
+ */
+#ifndef arch_set_softirq
+static always_inline bool arch_set_softirq(unsigned int nr, unsigned int cpu)
+{
+ /*
+ * Try to set the softirq pending. If we set the bit (i.e. the old bit
+ * was 0), we're responsible to send the IPI. If the softirq was already
+ * pending (i.e. the old bit was 1), no IPI is needed.
+ */
+ return test_and_set_bit(nr, &softirq_pending(cpu));
+}
+#endif
+
typedef void (*softirq_handler)(void);
void do_softirq(void);
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 21:26:24 +0100
Subject: x86/idle: Implement a new MWAIT IPI-elision algorithm
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
In order to elide IPIs, we must be able to identify whether a target CPU is in
MWAIT at the point it is woken up, i.e. the store to wake it up must also
identify the state.
Create a new in_mwait variable beside __softirq_pending, so we can use a
CMPXCHG to set the softirq while also observing the status safely. Implement
an x86 version of arch_set_softirq() which does this.
In mwait_idle_with_hints(), advertise in_mwait, with an explanation of
precisely what it means. X86_BUG_MONITOR can be accounted for simply by not
advertising in_mwait.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 3e0bc4b50350bd357304fd79a5dc0472790dba91)
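The core of the algorithm can be modelled standalone as follows (a sketch
using GCC atomic builtins, not the Xen code; it assumes a little-endian
layout with in_mwait at byte offset 4, as the real structure arranges):

#include <stdbool.h>
#include <stdint.h>

struct stat_sketch {
    union {
        struct {
            uint32_t softirq_pending;
            bool     in_mwait;
        };
        uint64_t raw;
    };
};

/* Make softirq @nr pending; return true if the IPI can be skipped. */
static bool set_softirq_sketch(struct stat_sketch *stat, unsigned int nr)
{
    uint64_t old = __atomic_load_n(&stat->raw, __ATOMIC_RELAXED);
    uint64_t new;

    do {
        if ( old & (1ULL << nr) )
            return true;                /* Already pending: no IPI needed. */

        new = old | (1ULL << nr);
        /* On CAS failure, 'old' is reloaded with the current value. */
    } while ( !__atomic_compare_exchange_n(&stat->raw, &old, new, false,
                                           __ATOMIC_SEQ_CST, __ATOMIC_RELAXED) );

    /* in_mwait sits at bit 32 of the raw word in this layout. */
    return new & (1ULL << 32);
}

On the wake side, the idle loop advertises in_mwait before MONITOR and clears
it on exit, so a true return here means the target will notice the store
without needing an IPI.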
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 43ab4533b791..be767a2c668f 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -455,7 +455,21 @@ __initcall(cpu_idle_key_init);
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
- const unsigned int *this_softirq_pending = &softirq_pending(cpu);
+ irq_cpustat_t *stat = &irq_stat[cpu];
+ const unsigned int *this_softirq_pending = &stat->__softirq_pending;
+
+ /*
+ * By setting in_mwait, we promise to other CPUs that we'll notice changes
+ * to __softirq_pending without being sent an IPI. We achieve this by
+ * either not going to sleep, or by having hardware notice on our behalf.
+ *
+ * Some errata exist where MONITOR doesn't work properly, and the
+ * workaround is to force the use of an IPI. Cause this to happen by
+ * simply not advertising ourselves as being in_mwait.
+ */
+ alternative_io("movb $1, %[in_mwait]",
+ "", X86_BUG_MONITOR,
+ [in_mwait] "=m" (stat->in_mwait));
monitor(this_softirq_pending, 0, 0);
@@ -467,6 +481,10 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
mwait(eax, ecx);
spec_ctrl_exit_idle(info);
}
+
+ alternative_io("movb $0, %[in_mwait]",
+ "", X86_BUG_MONITOR,
+ [in_mwait] "=m" (stat->in_mwait));
}
static void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
diff --git a/xen/arch/x86/include/asm/hardirq.h b/xen/arch/x86/include/asm/hardirq.h
index f3e93cc9b507..1647cff04dc8 100644
--- a/xen/arch/x86/include/asm/hardirq.h
+++ b/xen/arch/x86/include/asm/hardirq.h
@@ -5,7 +5,19 @@
#include <xen/types.h>
typedef struct {
- unsigned int __softirq_pending;
+ /*
+ * The layout is important. Any CPU can set bits in __softirq_pending,
+ * but in_mwait is a status bit owned by the CPU. softirq_mwait_raw must
+ * cover both, and must be in a single cacheline.
+ */
+ union {
+ struct {
+ unsigned int __softirq_pending;
+ bool in_mwait;
+ };
+ uint64_t softirq_mwait_raw;
+ };
+
unsigned int __local_irq_count;
unsigned int nmi_count;
unsigned int mce_count;
diff --git a/xen/arch/x86/include/asm/softirq.h b/xen/arch/x86/include/asm/softirq.h
index e4b194f069fb..55b65c9747b1 100644
--- a/xen/arch/x86/include/asm/softirq.h
+++ b/xen/arch/x86/include/asm/softirq.h
@@ -1,6 +1,8 @@
#ifndef __ASM_SOFTIRQ_H__
#define __ASM_SOFTIRQ_H__
+#include <asm/system.h>
+
#define NMI_SOFTIRQ (NR_COMMON_SOFTIRQS + 0)
#define TIME_CALIBRATE_SOFTIRQ (NR_COMMON_SOFTIRQS + 1)
#define VCPU_KICK_SOFTIRQ (NR_COMMON_SOFTIRQS + 2)
@@ -9,4 +11,50 @@
#define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
#define NR_ARCH_SOFTIRQS 5
+/*
+ * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
+ * skipped, false if the IPI cannot be skipped.
+ *
+ * We use a CMPXCHG covering both __softirq_pending and in_mwait, in order to
+ * set softirq @nr while also observing in_mwait in a race-free way.
+ */
+static always_inline bool arch_set_softirq(unsigned int nr, unsigned int cpu)
+{
+ uint64_t *ptr = &irq_stat[cpu].softirq_mwait_raw;
+ uint64_t prev, old, new;
+ unsigned int softirq = 1U << nr;
+
+ old = ACCESS_ONCE(*ptr);
+
+ for ( ;; )
+ {
+ if ( old & softirq )
+ /* Softirq already pending, nothing to do. */
+ return true;
+
+ new = old | softirq;
+
+ prev = cmpxchg(ptr, old, new);
+ if ( prev == old )
+ break;
+
+ old = prev;
+ }
+
+ /*
+ * We have caused the softirq to become pending. If in_mwait was set, the
+ * target CPU will notice the modification and act on it.
+ *
+ * We can't access the in_mwait field nicely, so use some BUILD_BUG_ON()'s
+ * to cross-check the (1UL << 32) opencoding.
+ */
+ BUILD_BUG_ON(sizeof(irq_stat[0].softirq_mwait_raw) != 8);
+ BUILD_BUG_ON((offsetof(irq_cpustat_t, in_mwait) -
+ offsetof(irq_cpustat_t, softirq_mwait_raw)) != 4);
+
+ return new & (1UL << 32) /* in_mwait */;
+
+}
+#define arch_set_softirq arch_set_softirq
+
#endif /* __ASM_SOFTIRQ_H__ */
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Wed, 2 Jul 2025 14:51:38 +0100
Subject: x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on
Xeons"
The check of this_softirq_pending must be performed with irqs disabled, but
this property was broken by an attempt to optimise entry/exit latency.
Commit c227233ad64c in Linux (which we copied into Xen) was fixed up by
edc8fc01f608 in Linux, which we have so far missed.
Going to sleep without waking on interrupts is nonsensical outside of
play_dead(), so overload this to select between two possible MWAITs, the
second using the STI shadow to cover MWAIT for exactly the same reason as we
do in safe_halt().
Fixes: b17e0ec72ede ("x86/mwait-idle: enable interrupts before C1 on Xeons")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9b0f0f6e235618c2764e925b58c4bfe412730ced)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index be767a2c668f..1589325baa56 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -79,6 +79,13 @@ static always_inline void mwait(unsigned int eax, unsigned int ecx)
:: "a" (eax), "c" (ecx) );
}
+static always_inline void sti_mwait_cli(unsigned int eax, unsigned int ecx)
+{
+ /* STI shadow covers MWAIT. */
+ asm volatile ( "sti; mwait; cli"
+ :: "a" (eax), "c" (ecx) );
+}
+
#define GET_HW_RES_IN_NS(msr, val) \
do { rdmsrl(msr, val); val = tsc_ticks2ns(val); } while( 0 )
#define GET_MC6_RES(val) GET_HW_RES_IN_NS(0x664, val)
@@ -473,12 +480,19 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
monitor(this_softirq_pending, 0, 0);
+ ASSERT(!local_irq_is_enabled());
+
if ( !*this_softirq_pending )
{
struct cpu_info *info = get_cpu_info();
spec_ctrl_enter_idle(info);
- mwait(eax, ecx);
+
+ if ( ecx & MWAIT_ECX_INTERRUPT_BREAK )
+ mwait(eax, ecx);
+ else
+ sti_mwait_cli(eax, ecx);
+
spec_ctrl_exit_idle(info);
}
diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index e95fe5d88907..8967fb1f6f36 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -962,12 +962,8 @@ static void cf_check mwait_idle(void)
update_last_cx_stat(power, cx, before);
- if (cx->irq_enable_early)
- local_irq_enable();
-
- mwait_idle_with_hints(cx->address, MWAIT_ECX_INTERRUPT_BREAK);
-
- local_irq_disable();
+ mwait_idle_with_hints(cx->address,
+ cx->irq_enable_early ? 0 : MWAIT_ECX_INTERRUPT_BREAK);
after = alternative_call(cpuidle_get_tick);
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Fri, 4 Jul 2025 17:53:15 +0100
Subject: x86/xen-cpuid: Fix backports of new features
Xen 4.18 doesn't automatically generate feature names like Xen 4.19 does, and
these hunks were missed on prior security fixes.
Fixes: 8bced9a15c8c ("x86/spec-ctrl: Support for SRSO_U/S_NO and SRSO_MSR_FIX")
Fixes: f132c82fa65d ("x86/spec-ctrl: Synthesise ITS_NO to guests on unaffected hardware")
Fixes: dba055661292 ("x86/spec-ctrl: Support Intel's new PB-OPT")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
index 52e451a806c1..6bd31f1e156c 100644
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -205,6 +205,7 @@ static const char *const str_e21a[32] =
/* 26 */ [27] = "sbpb",
[28] = "ibpb-brtype", [29] = "srso-no",
+ [30] = "srso-us-no", [31] = "srso-msr-fix",
};
static const char *const str_7b1[32] =
@@ -230,7 +231,7 @@ static const char *const str_7d2[32] =
[ 4] = "bhi-ctrl", [ 5] = "mcdt-no",
};
-static const char *const str_m10Al[32] =
+static const char *const str_m10Al[64] =
{
[ 0] = "rdcl-no", [ 1] = "eibrs",
[ 2] = "rsba", [ 3] = "skip-l1dfl",
@@ -247,10 +248,10 @@ static const char *const str_m10Al[32] =
[24] = "pbrsb-no", [25] = "gds-ctrl",
[26] = "gds-no", [27] = "rfds-no",
[28] = "rfds-clear",
-};
-static const char *const str_m10Ah[32] =
-{
+ [32] = "pb-opt-ctrl",
+
+ [62] = "its-no",
};
static const struct {
@@ -276,7 +277,7 @@ static const struct {
{ "CPUID 0x00000007:1.ecx", "7c1", str_7c1 },
{ "CPUID 0x00000007:1.edx", "7d1", str_7d1 },
{ "MSR_ARCH_CAPS.lo", "m10Al", str_m10Al },
- { "MSR_ARCH_CAPS.hi", "m10Ah", str_m10Ah },
+ { "MSR_ARCH_CAPS.hi", "m10Ah", str_m10Al + 32 },
};
#define COL_ALIGN "24"
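The str_m10Al + 32 construct above works because a single 64-entry table can
describe both halves of a 64-bit MSR, with the decoder for the high half
simply being a pointer 32 entries in. A trivial standalone illustration
(hypothetical names):

#include <stdio.h>

static const char *const bits[64] = {
    [ 0] = "low-bit-0",
    [32] = "high-bit-0",                /* bit 0 of the high 32-bit half */
};

int main(void)
{
    const char *const *hi = bits + 32;  /* decode table for the high half */

    printf("%s\n", hi[0]);              /* prints "high-bit-0" */
    return 0;
}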
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Fri, 27 Jun 2025 17:19:19 +0100
Subject: x86/cpu-policy: Rearrange guest_common_*_feature_adjustments()
Turn the if()s into switch()es, as we're going to need AMD sections.
Move the RTM adjustments into the Intel section, where they ought to live.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 8f332fdbd9ae..36d36ea60e61 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -443,8 +443,9 @@ static void __init guest_common_default_leaves(struct cpu_policy *p)
static void __init guest_common_max_feature_adjustments(uint32_t *fs)
{
- if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL )
+ switch ( boot_cpu_data.x86_vendor )
{
+ case X86_VENDOR_INTEL:
/*
* MSR_ARCH_CAPS is just feature data, and we can offer it to guests
* unconditionally, although limit it to Intel systems as it is highly
@@ -489,6 +490,22 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
boot_cpu_data.x86_model == INTEL_FAM6_SKYLAKE_X &&
raw_cpu_policy.feat.clwb )
__set_bit(X86_FEATURE_CLWB, fs);
+
+ /*
+ * To mitigate Native-BHI, one option is to use a TSX Abort on capable
+ * systems. This is safe even if RTM has been disabled for other
+ * reasons via MSR_TSX_{CTRL,FORCE_ABORT}. However, a guest kernel
+ * doesn't get to know this type of information.
+ *
+ * Therefore the meaning of RTM_ALWAYS_ABORT has been adjusted, to
+ * instead mean "XBEGIN won't fault". This is enough for a guest
+ * kernel to make an informed choice WRT mitigating Native-BHI.
+ *
+ * If RTM-capable, we can run a VM which has seen RTM_ALWAYS_ABORT.
+ */
+ if ( test_bit(X86_FEATURE_RTM, fs) )
+ __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
+ break;
}
/*
@@ -500,27 +517,13 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
*/
__set_bit(X86_FEATURE_HTT, fs);
__set_bit(X86_FEATURE_CMP_LEGACY, fs);
-
- /*
- * To mitigate Native-BHI, one option is to use a TSX Abort on capable
- * systems. This is safe even if RTM has been disabled for other reasons
- * via MSR_TSX_{CTRL,FORCE_ABORT}. However, a guest kernel doesn't get to
- * know this type of information.
- *
- * Therefore the meaning of RTM_ALWAYS_ABORT has been adjusted, to instead
- * mean "XBEGIN won't fault". This is enough for a guest kernel to make
- * an informed choice WRT mitigating Native-BHI.
- *
- * If RTM-capable, we can run a VM which has seen RTM_ALWAYS_ABORT.
- */
- if ( test_bit(X86_FEATURE_RTM, fs) )
- __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
}
static void __init guest_common_default_feature_adjustments(uint32_t *fs)
{
- if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL )
+ switch ( boot_cpu_data.x86_vendor )
{
+ case X86_VENDOR_INTEL:
/*
* IvyBridge client parts suffer from leakage of RDRAND data due to SRBDS
* (XSA-320 / CVE-2020-0543), and won't be receiving microcode to
@@ -564,6 +567,23 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
boot_cpu_data.x86_model == INTEL_FAM6_SKYLAKE_X &&
raw_cpu_policy.feat.clwb )
__clear_bit(X86_FEATURE_CLWB, fs);
+
+ /*
+ * On certain hardware, speculative or errata workarounds can result
+ * in TSX being placed in "force-abort" mode, where it doesn't
+ * actually function as expected, but is technically compatible with
+ * the ISA.
+ *
+ * Do not advertise RTM to guests by default if it won't actually
+ * work. Instead, advertise RTM_ALWAYS_ABORT indicating that TSX
+ * Aborts are safe to use, e.g. for mitigating Native-BHI.
+ */
+ if ( rtm_disabled )
+ {
+ __clear_bit(X86_FEATURE_RTM, fs);
+ __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
+ }
+ break;
}
/*
@@ -575,21 +595,6 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
if ( !cpu_has_cmp_legacy )
__clear_bit(X86_FEATURE_CMP_LEGACY, fs);
-
- /*
- * On certain hardware, speculative or errata workarounds can result in
- * TSX being placed in "force-abort" mode, where it doesn't actually
- * function as expected, but is technically compatible with the ISA.
- *
- * Do not advertise RTM to guests by default if it won't actually work.
- * Instead, advertise RTM_ALWAYS_ABORT indicating that TSX Aborts are safe
- * to use, e.g. for mitigating Native-BHI.
- */
- if ( rtm_disabled )
- {
- __clear_bit(X86_FEATURE_RTM, fs);
- __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
- }
}
static void __init guest_common_feature_adjustments(uint32_t *fs)
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 10 Sep 2024 19:55:15 +0100
Subject: x86/cpu-policy: Infrastructure for CPUID leaf 0x80000021.ecx
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
diff --git a/tools/libs/light/libxl_cpuid.c b/tools/libs/light/libxl_cpuid.c
index ce4f3c7095ba..df5529ac1fa9 100644
--- a/tools/libs/light/libxl_cpuid.c
+++ b/tools/libs/light/libxl_cpuid.c
@@ -342,6 +342,7 @@ int libxl_cpuid_parse_config(libxl_cpuid_policy_list *policy, const char* str)
CPUID_ENTRY(0x00000007, 1, CPUID_REG_EDX),
MSR_ENTRY(0x10a, CPUID_REG_EAX),
MSR_ENTRY(0x10a, CPUID_REG_EDX),
+ CPUID_ENTRY(0x80000021, NA, CPUID_REG_ECX),
#undef MSR_ENTRY
#undef CPUID_ENTRY
};
diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
index 6bd31f1e156c..19b9068d36ec 100644
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -254,6 +254,10 @@ static const char *const str_m10Al[64] =
[62] = "its-no",
};
+static const char *const str_e21c[32] =
+{
+};
+
static const struct {
const char *name;
const char *abbr;
@@ -278,6 +282,7 @@ static const struct {
{ "CPUID 0x00000007:1.edx", "7d1", str_7d1 },
{ "MSR_ARCH_CAPS.lo", "m10Al", str_m10Al },
{ "MSR_ARCH_CAPS.hi", "m10Ah", str_m10Al + 32 },
+ { "CPUID 0x80000021.ecx", "e21c", str_e21c },
};
#define COL_ALIGN "24"
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 36d36ea60e61..662002333879 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -342,7 +342,6 @@ static void recalculate_misc(struct cpu_policy *p)
p->extd.raw[0x1f] = EMPTY_LEAF; /* SEV */
p->extd.raw[0x20] = EMPTY_LEAF; /* Platform QoS */
p->extd.raw[0x21].b = 0;
- p->extd.raw[0x21].c = 0;
p->extd.raw[0x21].d = 0;
break;
}
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index edec0a25462c..007e935992ba 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -479,7 +479,9 @@ static void generic_identify(struct cpuinfo_x86 *c)
if (c->extended_cpuid_level >= 0x80000008)
c->x86_capability[FEATURESET_e8b] = cpuid_ebx(0x80000008);
if (c->extended_cpuid_level >= 0x80000021)
- c->x86_capability[FEATURESET_e21a] = cpuid_eax(0x80000021);
+ cpuid(0x80000021,
+ &c->x86_capability[FEATURESET_e21a], &tmp,
+ &c->x86_capability[FEATURESET_e21c], &tmp);
/* Intel-defined flags: level 0x00000007 */
if (c->cpuid_level >= 7) {
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index e6ed4c02e1da..86e44dd85258 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -348,6 +348,8 @@ XEN_CPUFEATURE(RFDS_CLEAR, 16*32+28) /*!A Register File(s) cleared by VE
XEN_CPUFEATURE(PB_OPT_CTRL, 16*32+32) /* MSR_PB_OPT_CTRL.IBPB_ALT */
XEN_CPUFEATURE(ITS_NO, 16*32+62) /*!A No Indirect Target Selection */
+/* AMD-defined CPU features, CPUID level 0x80000021.ecx, word 18 */
+
#endif /* XEN_CPUFEATURE */
/* Clean up from a default include. Close the enum (for C). */
diff --git a/xen/include/xen/lib/x86/cpu-policy.h b/xen/include/xen/lib/x86/cpu-policy.h
index f335929a70c4..5fe4127d9b75 100644
--- a/xen/include/xen/lib/x86/cpu-policy.h
+++ b/xen/include/xen/lib/x86/cpu-policy.h
@@ -22,6 +22,7 @@
#define FEATURESET_7d1 15 /* 0x00000007:1.edx */
#define FEATURESET_m10Al 16 /* 0x0000010a.eax */
#define FEATURESET_m10Ah 17 /* 0x0000010a.edx */
+#define FEATURESET_e21c 18 /* 0x80000021.ecx */
struct cpuid_leaf
{
@@ -327,7 +328,11 @@ struct cpu_policy
uint16_t ucode_size; /* Units of 16 bytes */
uint8_t rap_size; /* Units of 8 entries */
uint8_t :8;
- uint32_t /* c */:32, /* d */:32;
+ union {
+ uint32_t e21c;
+ struct { DECL_BITFIELD(e21c); };
+ };
+ uint32_t /* d */:32;
};
} extd;
diff --git a/xen/lib/x86/cpuid.c b/xen/lib/x86/cpuid.c
index eb7698dc7325..6298d051f2a6 100644
--- a/xen/lib/x86/cpuid.c
+++ b/xen/lib/x86/cpuid.c
@@ -81,6 +81,7 @@ void x86_cpu_policy_to_featureset(
fs[FEATURESET_7d1] = p->feat._7d1;
fs[FEATURESET_m10Al] = p->arch_caps.lo;
fs[FEATURESET_m10Ah] = p->arch_caps.hi;
+ fs[FEATURESET_e21c] = p->extd.e21c;
}
void x86_cpu_featureset_to_policy(
@@ -104,6 +105,7 @@ void x86_cpu_featureset_to_policy(
p->feat._7d1 = fs[FEATURESET_7d1];
p->arch_caps.lo = fs[FEATURESET_m10Al];
p->arch_caps.hi = fs[FEATURESET_m10Ah];
+ p->extd.e21c = fs[FEATURESET_e21c];
}
void x86_cpu_policy_recalc_synth(struct cpu_policy *p)
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Fri, 27 Sep 2024 11:28:39 +0100
Subject: x86/ucode: Digests for TSA microcode
AMD are releasing microcode for TSA, so extend the known-provenance list with
their hashes. These were produced before the remediation of the microcode
signature issues (the entrysign vulnerability), so can be OS-loaded on
out-of-date firmware.
Include an off-by-default check for the sorted-ness of patch_digests[]. It's
not worth running generally under SELF_TESTS, but is useful when editing the
digest list.
This is part of XSA-471 / CVE-2024-36350 / CVE-2024-36357.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
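The sorted-ness check mentioned above might look roughly like the following
(an illustrative sketch with hypothetical names, not the actual SELF_TESTS
code); patch_digests[] must be sorted by ascending patch_id for the lookup to
work:

#include <stdint.h>
#include <stdio.h>

struct patch_digest_sketch {
    uint32_t patch_id;
    uint8_t digest[32];
};

/* A few IDs from the list below, which must appear in ascending order. */
static const struct patch_digest_sketch digests[] = {
    { .patch_id = 0x0a0011d7 },
    { .patch_id = 0x0a00123b },
    { .patch_id = 0x0a00820d },
};

static int check_sorted(void)
{
    for ( unsigned int i = 1; i < sizeof(digests) / sizeof(digests[0]); i++ )
        if ( digests[i - 1].patch_id >= digests[i].patch_id )
        {
            printf("patch_digests[] misordered at index %u\n", i);
            return -1;
        }

    return 0;
}

int main(void)
{
    return check_sorted() ? 1 : 0;
}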
diff --git a/xen/arch/x86/cpu/microcode/amd-patch-digests.c b/xen/arch/x86/cpu/microcode/amd-patch-digests.c
index d32761226712..d2c4e0178a1e 100644
--- a/xen/arch/x86/cpu/microcode/amd-patch-digests.c
+++ b/xen/arch/x86/cpu/microcode/amd-patch-digests.c
@@ -80,6 +80,15 @@
0x0d, 0x5b, 0x65, 0x34, 0x69, 0xb2, 0x62, 0x21,
},
},
+{
+ .patch_id = 0x0a0011d7,
+ .digest = {
+ 0x35, 0x07, 0xcd, 0x40, 0x94, 0xbc, 0x81, 0x6b,
+ 0xfc, 0x61, 0x56, 0x1a, 0xe2, 0xdb, 0x96, 0x12,
+ 0x1c, 0x1c, 0x31, 0xb1, 0x02, 0x6f, 0xe5, 0xd2,
+ 0xfe, 0x1b, 0x04, 0x03, 0x2c, 0x8f, 0x4c, 0x36,
+ },
+},
{
.patch_id = 0x0a001238,
.digest = {
@@ -89,6 +98,15 @@
0xc0, 0xcd, 0x33, 0xf2, 0x8d, 0xf9, 0xef, 0x59,
},
},
+{
+ .patch_id = 0x0a00123b,
+ .digest = {
+ 0xef, 0xa1, 0x1e, 0x71, 0xf1, 0xc3, 0x2c, 0xe2,
+ 0xc3, 0xef, 0x69, 0x41, 0x7a, 0x54, 0xca, 0xc3,
+ 0x8f, 0x62, 0x84, 0xee, 0xc2, 0x39, 0xd9, 0x28,
+ 0x95, 0xa7, 0x12, 0x49, 0x1e, 0x30, 0x71, 0x72,
+ },
+},
{
.patch_id = 0x0a00820c,
.digest = {
@@ -98,6 +116,15 @@
0xe1, 0x3b, 0x8d, 0xb2, 0xf8, 0x22, 0x03, 0xe2,
},
},
+{
+ .patch_id = 0x0a00820d,
+ .digest = {
+ 0xf9, 0x2a, 0xc0, 0xf4, 0x9e, 0xa4, 0x87, 0xa4,
+ 0x7d, 0x87, 0x00, 0xfd, 0xab, 0xda, 0x19, 0xca,
+ 0x26, 0x51, 0x32, 0xc1, 0x57, 0x91, 0xdf, 0xc1,
+ 0x05, 0xeb, 0x01, 0x7c, 0x5a, 0x95, 0x21, 0xb7,
+ },
+},
{
.patch_id = 0x0a101148,
.digest = {
@@ -107,6 +134,15 @@
0xf1, 0x5e, 0xb0, 0xde, 0xb4, 0x98, 0xae, 0xc4,
},
},
+{
+ .patch_id = 0x0a10114c,
+ .digest = {
+ 0x9e, 0xb6, 0xa2, 0xd9, 0x87, 0x38, 0xc5, 0x64,
+ 0xd8, 0x88, 0xfa, 0x78, 0x98, 0xf9, 0x6f, 0x74,
+ 0x39, 0x90, 0x1b, 0xa5, 0xcf, 0x5e, 0xb4, 0x2a,
+ 0x02, 0xff, 0xd4, 0x8c, 0x71, 0x8b, 0xe2, 0xc0,
+ },
+},
{
.patch_id = 0x0a101248,
.digest = {
@@ -116,6 +152,15 @@
0x1b, 0x7d, 0x64, 0x9d, 0x4b, 0x53, 0x13, 0x75,
},
},
+{
+ .patch_id = 0x0a10124c,
+ .digest = {
+ 0x29, 0xea, 0xf1, 0x2c, 0xb2, 0xe4, 0xef, 0x90,
+ 0xa4, 0xcd, 0x1d, 0x86, 0x97, 0x17, 0x61, 0x46,
+ 0xfc, 0x22, 0xcb, 0x57, 0x75, 0x19, 0xc8, 0xcc,
+ 0x0c, 0xf5, 0xbc, 0xac, 0x81, 0x9d, 0x9a, 0xd2,
+ },
+},
{
.patch_id = 0x0a108108,
.digest = {
@@ -125,6 +170,15 @@
0x28, 0x1e, 0x9c, 0x59, 0x69, 0x99, 0x4d, 0x16,
},
},
+{
+ .patch_id = 0x0a108109,
+ .digest = {
+ 0x85, 0xb4, 0xbd, 0x7c, 0x49, 0xa7, 0xbd, 0xfa,
+ 0x49, 0x36, 0x80, 0x81, 0xc5, 0xb7, 0x39, 0x1b,
+ 0x9a, 0xaa, 0x50, 0xde, 0x9b, 0xe9, 0x32, 0x35,
+ 0x42, 0x7e, 0x51, 0x4f, 0x52, 0x2c, 0x28, 0x59,
+ },
+},
{
.patch_id = 0x0a20102d,
.digest = {
@@ -134,6 +188,15 @@
0x8c, 0xe9, 0x19, 0x3e, 0xcc, 0x3f, 0x7b, 0xb4,
},
},
+{
+ .patch_id = 0x0a20102e,
+ .digest = {
+ 0xbe, 0x1f, 0x32, 0x04, 0x0d, 0x3c, 0x9c, 0xdd,
+ 0xe1, 0xa4, 0xbf, 0x76, 0x3a, 0xec, 0xc2, 0xf6,
+ 0x11, 0x00, 0xa7, 0xaf, 0x0f, 0xe5, 0x02, 0xc5,
+ 0x54, 0x3a, 0x1f, 0x8c, 0x16, 0xb5, 0xff, 0xbe,
+ },
+},
{
.patch_id = 0x0a201210,
.digest = {
@@ -143,6 +206,15 @@
0xf7, 0x55, 0xf0, 0x13, 0xbb, 0x22, 0xf6, 0x41,
},
},
+{
+ .patch_id = 0x0a201211,
+ .digest = {
+ 0x69, 0xa1, 0x17, 0xec, 0xd0, 0xf6, 0x6c, 0x95,
+ 0xe2, 0x1e, 0xc5, 0x59, 0x1a, 0x52, 0x0a, 0x27,
+ 0xc4, 0xed, 0xd5, 0x59, 0x1f, 0xbf, 0x00, 0xff,
+ 0x08, 0x88, 0xb5, 0xe1, 0x12, 0xb6, 0xcc, 0x27,
+ },
+},
{
.patch_id = 0x0a404107,
.digest = {
@@ -152,6 +224,15 @@
0x13, 0xbc, 0xc5, 0x25, 0xe4, 0xc5, 0xc3, 0x99,
},
},
+{
+ .patch_id = 0x0a404108,
+ .digest = {
+ 0x69, 0x67, 0x43, 0x06, 0xf8, 0x0c, 0x62, 0xdc,
+ 0xa4, 0x21, 0x30, 0x4f, 0x0f, 0x21, 0x2c, 0xcb,
+ 0xcc, 0x37, 0xf1, 0x1c, 0xc3, 0xf8, 0x2f, 0x19,
+ 0xdf, 0x53, 0x53, 0x46, 0xb1, 0x15, 0xea, 0x00,
+ },
+},
{
.patch_id = 0x0a500011,
.digest = {
@@ -161,6 +242,15 @@
0x11, 0x5e, 0x96, 0x7e, 0x71, 0xe9, 0xfc, 0x74,
},
},
+{
+ .patch_id = 0x0a500012,
+ .digest = {
+ 0xeb, 0x74, 0x0d, 0x47, 0xa1, 0x8e, 0x09, 0xe4,
+ 0x93, 0x4c, 0xad, 0x03, 0x32, 0x4c, 0x38, 0x16,
+ 0x10, 0x39, 0xdd, 0x06, 0xaa, 0xce, 0xd6, 0x0f,
+ 0x62, 0x83, 0x9d, 0x8e, 0x64, 0x55, 0xbe, 0x63,
+ },
+},
{
.patch_id = 0x0a601209,
.digest = {
@@ -170,6 +260,15 @@
0xe8, 0x73, 0xe2, 0xd6, 0xdb, 0xd2, 0x77, 0x1d,
},
},
+{
+ .patch_id = 0x0a60120a,
+ .digest = {
+ 0x0c, 0x8b, 0x3d, 0xfd, 0x52, 0x52, 0x85, 0x7d,
+ 0x20, 0x3a, 0xe1, 0x7e, 0xa4, 0x21, 0x3b, 0x7b,
+ 0x17, 0x86, 0xae, 0xac, 0x13, 0xb8, 0x63, 0x9d,
+ 0x06, 0x01, 0xd0, 0xa0, 0x51, 0x9a, 0x91, 0x2c,
+ },
+},
{
.patch_id = 0x0a704107,
.digest = {
@@ -179,6 +278,15 @@
0x64, 0x39, 0x71, 0x8c, 0xce, 0xe7, 0x41, 0x39,
},
},
+{
+ .patch_id = 0x0a704108,
+ .digest = {
+ 0xd7, 0x55, 0x15, 0x2b, 0xfe, 0xc4, 0xbc, 0x93,
+ 0xec, 0x91, 0xa0, 0xae, 0x45, 0xb7, 0xc3, 0x98,
+ 0x4e, 0xff, 0x61, 0x77, 0x88, 0xc2, 0x70, 0x49,
+ 0xe0, 0x3a, 0x1d, 0x84, 0x38, 0x52, 0xbf, 0x5a,
+ },
+},
{
.patch_id = 0x0a705206,
.digest = {
@@ -188,6 +296,15 @@
0x03, 0x35, 0xe9, 0xbe, 0xfb, 0x06, 0xdf, 0xfc,
},
},
+{
+ .patch_id = 0x0a705208,
+ .digest = {
+ 0x30, 0x1d, 0x55, 0x24, 0xbc, 0x6b, 0x5a, 0x19,
+ 0x0c, 0x7d, 0x1d, 0x74, 0xaa, 0xd1, 0xeb, 0xd2,
+ 0x16, 0x62, 0xf7, 0x5b, 0xe1, 0x1f, 0x18, 0x11,
+ 0x5c, 0xf0, 0x94, 0x90, 0x26, 0xec, 0x69, 0xff,
+ },
+},
{
.patch_id = 0x0a708007,
.digest = {
@@ -197,6 +314,15 @@
0xdf, 0x92, 0x73, 0x84, 0x87, 0x3c, 0x73, 0x93,
},
},
+{
+ .patch_id = 0x0a708008,
+ .digest = {
+ 0x08, 0x6e, 0xf0, 0x22, 0x4b, 0x8e, 0xc4, 0x46,
+ 0x58, 0x34, 0xe6, 0x47, 0xa2, 0x28, 0xfd, 0xab,
+ 0x22, 0x3d, 0xdd, 0xd8, 0x52, 0x9e, 0x1d, 0x16,
+ 0xfa, 0x01, 0x68, 0x14, 0x79, 0x3e, 0xe8, 0x6b,
+ },
+},
{
.patch_id = 0x0a70c005,
.digest = {
@@ -206,6 +332,15 @@
0xee, 0x49, 0xac, 0xe1, 0x8b, 0x13, 0xc5, 0x13,
},
},
+{
+ .patch_id = 0x0a70c008,
+ .digest = {
+ 0x0f, 0xdb, 0x37, 0xa1, 0x10, 0xaf, 0xd4, 0x21,
+ 0x94, 0x0d, 0xa4, 0xa2, 0xe9, 0x86, 0x6c, 0x0e,
+ 0x85, 0x7c, 0x36, 0x30, 0xa3, 0x3a, 0x78, 0x66,
+ 0x18, 0x10, 0x60, 0x0d, 0x78, 0x3d, 0x44, 0xd0,
+ },
+},
{
.patch_id = 0x0aa00116,
.digest = {
@@ -224,3 +359,12 @@
0x68, 0x2f, 0x46, 0xee, 0xfe, 0xc6, 0x6d, 0xef,
},
},
+{
+ .patch_id = 0x0aa00216,
+ .digest = {
+ 0x79, 0xfb, 0x5b, 0x9f, 0xb6, 0xe6, 0xa8, 0xf5,
+ 0x4e, 0x7c, 0x4f, 0x8e, 0x1d, 0xad, 0xd0, 0x08,
+ 0xc2, 0x43, 0x7c, 0x8b, 0xe6, 0xdb, 0xd0, 0xd2,
+ 0xe8, 0x39, 0x26, 0xc1, 0xe5, 0x5a, 0x48, 0xf1,
+ },
+},
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Wed, 2 Apr 2025 03:18:59 +0100
Subject: x86/idle: Rearrange VERW and MONITOR in mwait_idle_with_hints()
In order to mitigate TSA, Xen will need to issue VERW before going idle.
On AMD CPUs, the VERW scrubbing side effects cancel an active MONITOR, causing
the MWAIT to exit without entering an idle state. Therefore the VERW must be
ahead of MONITOR.
Split spec_ctrl_enter_idle() in two and allow the VERW aspect to be handled
separately. While adjusting, update a stale comment concerning MSBDS; more
issues have been mitigated using VERW since it was written.
By moving VERW earlier, it is ahead of the determination of whether to go
idle. We can't move the check on softirq_pending (for correctness reasons),
but we can duplicate it earlier as a best effort attempt to skip the
speculative overhead.
This is part of XSA-471 / CVE-2024-36350 / CVE-2024-36357.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 1589325baa56..2673bc797f1e 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -462,9 +462,18 @@ __initcall(cpu_idle_key_init);
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
+ struct cpu_info *info = get_cpu_info();
irq_cpustat_t *stat = &irq_stat[cpu];
const unsigned int *this_softirq_pending = &stat->__softirq_pending;
+ /*
+ * Heuristic: if we're definitely not going to idle, bail early as the
+ * speculative safety can be expensive. This is a performance
+ * consideration not a correctness issue.
+ */
+ if ( *this_softirq_pending )
+ return;
+
/*
* By setting in_mwait, we promise to other CPUs that we'll notice changes
* to __softirq_pending without being sent an IPI. We achieve this by
@@ -478,15 +487,19 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
"", X86_BUG_MONITOR,
[in_mwait] "=m" (stat->in_mwait));
+ /*
+ * On AMD systems, side effects from VERW cancel MONITOR, causing MWAIT to
+ * wake up immediately. Therefore, VERW must come ahead of MONITOR.
+ */
+ __spec_ctrl_enter_idle_verw(info);
+
monitor(this_softirq_pending, 0, 0);
ASSERT(!local_irq_is_enabled());
if ( !*this_softirq_pending )
{
- struct cpu_info *info = get_cpu_info();
-
- spec_ctrl_enter_idle(info);
+ __spec_ctrl_enter_idle(info, false /* VERW handled above */);
if ( ecx & MWAIT_ECX_INTERRUPT_BREAK )
mwait(eax, ecx);
diff --git a/xen/arch/x86/include/asm/spec_ctrl.h b/xen/arch/x86/include/asm/spec_ctrl.h
index 077225418956..6724d3812029 100644
--- a/xen/arch/x86/include/asm/spec_ctrl.h
+++ b/xen/arch/x86/include/asm/spec_ctrl.h
@@ -115,8 +115,22 @@ static inline void init_shadow_spec_ctrl_state(void)
info->verw_sel = __HYPERVISOR_DS32;
}
+static always_inline void __spec_ctrl_enter_idle_verw(struct cpu_info *info)
+{
+ /*
+ * Flush/scrub structures which are statically partitioned between active
+ * threads. Otherwise data of ours (of unknown sensitivity) will become
+ * available to our sibling when we go idle.
+ *
+ * Note: VERW must be encoded with a memory operand, as it is only that
+ * form with side effects.
+ */
+ alternative_input("", "verw %[sel]", X86_FEATURE_SC_VERW_IDLE,
+ [sel] "m" (info->verw_sel));
+}
+
/* WARNING! `ret`, `call *`, `jmp *` not safe after this call. */
-static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
+static always_inline void __spec_ctrl_enter_idle(struct cpu_info *info, bool verw)
{
uint32_t val = 0;
@@ -135,21 +149,8 @@ static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
"a" (val), "c" (MSR_SPEC_CTRL), "d" (0));
barrier();
- /*
- * Microarchitectural Store Buffer Data Sampling:
- *
- * On vulnerable systems, store buffer entries are statically partitioned
- * between active threads. When entering idle, our store buffer entries
- * are re-partitioned to allow the other threads to use them.
- *
- * Flush the buffers to ensure that no sensitive data of ours can be
- * leaked by a sibling after it gets our store buffer entries.
- *
- * Note: VERW must be encoded with a memory operand, as it is only that
- * form which causes a flush.
- */
- alternative_input("", "verw %[sel]", X86_FEATURE_SC_VERW_IDLE,
- [sel] "m" (info->verw_sel));
+ if ( verw ) /* Expected to be const-propagated. */
+ __spec_ctrl_enter_idle_verw(info);
/*
* Cross-Thread Return Address Predictions:
@@ -167,6 +168,12 @@ static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
: "rax", "rcx");
}
+/* WARNING! `ret`, `call *`, `jmp *` not safe after this call. */
+static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
+{
+ __spec_ctrl_enter_idle(info, true /* VERW */);
+}
+
/* WARNING! `ret`, `call *`, `jmp *` not safe before this call. */
static always_inline void spec_ctrl_exit_idle(struct cpu_info *info)
{
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Thu, 29 Aug 2024 17:36:11 +0100
Subject: x86/spec-ctrl: Mitigate Transitive Scheduler Attacks
TSA affects AMD Fam19h CPUs (Zen3 and 4 microarchitectures).
Three new CPUID bits have been defined. Two (TSA_SQ_NO and TSA_L1_NO)
indicate that the system is unaffected, and must be synthesised by Xen on
unaffected parts to date.
A third new bit indicates that VERW now has a flushing side effect. Xen must
synthesise this bit on affected systems based on microcode version. As with
other VERW-based flushing features, VERW_CLEAR needs OR-ing across a resource
pool, and guests which have seen it can safely migrate in.
This is part of XSA-471 / CVE-2024-36350 / CVE-2024-36357.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
index 19b9068d36ec..4fbfd63dfdc4 100644
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -198,6 +198,7 @@ static const char *const str_7a1[32] =
static const char *const str_e21a[32] =
{
[ 2] = "lfence+",
+ /* 4 */ [ 5] = "verw-clear",
[ 6] = "nscb",
[ 8] = "auto-ibrs",
@@ -256,6 +257,8 @@ static const char *const str_m10Al[64] =
static const char *const str_e21c[32] =
{
+ /* 0 */ [ 1] = "tsa-sq-no",
+ [ 2] = "tsa-l1-no",
};
static const struct {
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 662002333879..aae8e4983c03 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -505,6 +505,17 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
if ( test_bit(X86_FEATURE_RTM, fs) )
__set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
break;
+
+ case X86_VENDOR_AMD:
+ /*
+ * This bit indicates that the VERW instruction may have gained
+ * scrubbing side effects. With pooling, it means "you might migrate
+ * somewhere where scrubbing is necessary", and may need exposing on
+ * unaffected hardware. This is fine, because the VERW instruction
+ * has been around since the 286.
+ */
+ __set_bit(X86_FEATURE_VERW_CLEAR, fs);
+ break;
}
/*
@@ -583,6 +594,17 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
__set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
}
break;
+
+ case X86_VENDOR_AMD:
+ /*
+ * This bit indicates that the VERW instruction may have gained
+ * scrubbing side effects. The max policy has it set for migration
+ * reasons, so reset the default policy back to the host value in case
+ * we're unaffected.
+ */
+ if ( !cpu_has_verw_clear )
+ __clear_bit(X86_FEATURE_VERW_CLEAR, fs);
+ break;
}
/*
diff --git a/xen/arch/x86/hvm/svm/entry.S b/xen/arch/x86/hvm/svm/entry.S
index 9fb457ad958e..db49813d772a 100644
--- a/xen/arch/x86/hvm/svm/entry.S
+++ b/xen/arch/x86/hvm/svm/entry.S
@@ -99,6 +99,8 @@ __UNLIKELY_END(nsvm_hap)
pop %rsi
pop %rdi
+ SPEC_CTRL_COND_VERW /* Req: %rsp=eframe Clob: efl */
+
vmrun
SAVE_ALL
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index 919a9e31f04e..bbae9305de4e 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -192,6 +192,7 @@ static inline bool boot_cpu_has(unsigned int feat)
/* CPUID level 0x80000021.eax */
#define cpu_has_lfence_dispatch boot_cpu_has(X86_FEATURE_LFENCE_DISPATCH)
+#define cpu_has_verw_clear boot_cpu_has(X86_FEATURE_VERW_CLEAR)
#define cpu_has_nscb boot_cpu_has(X86_FEATURE_NSCB)
/* CPUID level 0x00000007:1.edx */
@@ -218,6 +219,10 @@ static inline bool boot_cpu_has(unsigned int feat)
#define cpu_has_pb_opt_ctrl boot_cpu_has(X86_FEATURE_PB_OPT_CTRL)
#define cpu_has_its_no boot_cpu_has(X86_FEATURE_ITS_NO)
+/* CPUID level 0x80000021.ecx */
+#define cpu_has_tsa_sq_no boot_cpu_has(X86_FEATURE_TSA_SQ_NO)
+#define cpu_has_tsa_l1_no boot_cpu_has(X86_FEATURE_TSA_L1_NO)
+
/* Synthesized. */
#define cpu_has_arch_perfmon boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
#define cpu_has_cpuid_faulting boot_cpu_has(X86_FEATURE_CPUID_FAULTING)
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index d8702b0e189c..c1ad335d9ea9 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -492,7 +492,7 @@ custom_param("pv-l1tf", parse_pv_l1tf);
static void __init print_details(enum ind_thunk thunk)
{
- unsigned int _7d0 = 0, _7d2 = 0, e8b = 0, e21a = 0, max = 0, tmp;
+ unsigned int _7d0 = 0, _7d2 = 0, e8b = 0, e21a = 0, e21c = 0, max = 0, tmp;
uint64_t caps = 0;
/* Collect diagnostics about available mitigations. */
@@ -503,7 +503,7 @@ static void __init print_details(enum ind_thunk thunk)
if ( boot_cpu_data.extended_cpuid_level >= 0x80000008 )
cpuid(0x80000008, &tmp, &e8b, &tmp, &tmp);
if ( boot_cpu_data.extended_cpuid_level >= 0x80000021 )
- cpuid(0x80000021, &e21a, &tmp, &tmp, &tmp);
+ cpuid(0x80000021U, &e21a, &tmp, &e21c, &tmp);
if ( cpu_has_arch_caps )
rdmsrl(MSR_ARCH_CAPABILITIES, caps);
@@ -513,7 +513,7 @@ static void __init print_details(enum ind_thunk thunk)
* Hardware read-only information, stating immunity to certain issues, or
* suggestions of which mitigation to use.
*/
- printk(" Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+ printk(" Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
(caps & ARCH_CAPS_RDCL_NO) ? " RDCL_NO" : "",
(caps & ARCH_CAPS_EIBRS) ? " EIBRS" : "",
(caps & ARCH_CAPS_RSBA) ? " RSBA" : "",
@@ -538,10 +538,12 @@ static void __init print_details(enum ind_thunk thunk)
(e8b & cpufeat_mask(X86_FEATURE_IBPB_RET)) ? " IBPB_RET" : "",
(e21a & cpufeat_mask(X86_FEATURE_IBPB_BRTYPE)) ? " IBPB_BRTYPE" : "",
(e21a & cpufeat_mask(X86_FEATURE_SRSO_NO)) ? " SRSO_NO" : "",
- (e21a & cpufeat_mask(X86_FEATURE_SRSO_US_NO)) ? " SRSO_US_NO" : "");
+ (e21a & cpufeat_mask(X86_FEATURE_SRSO_US_NO)) ? " SRSO_US_NO" : "",
+ (e21c & cpufeat_mask(X86_FEATURE_TSA_SQ_NO)) ? " TSA_SQ_NO" : "",
+ (e21c & cpufeat_mask(X86_FEATURE_TSA_L1_NO)) ? " TSA_L1_NO" : "");
/* Hardware features which need driving to mitigate issues. */
- printk(" Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+ printk(" Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
(e8b & cpufeat_mask(X86_FEATURE_IBPB)) ||
(_7d0 & cpufeat_mask(X86_FEATURE_IBRSB)) ? " IBPB" : "",
(e8b & cpufeat_mask(X86_FEATURE_IBRS)) ||
@@ -561,7 +563,8 @@ static void __init print_details(enum ind_thunk thunk)
(caps & ARCH_CAPS_GDS_CTRL) ? " GDS_CTRL" : "",
(caps & ARCH_CAPS_RFDS_CLEAR) ? " RFDS_CLEAR" : "",
(e21a & cpufeat_mask(X86_FEATURE_SBPB)) ? " SBPB" : "",
- (e21a & cpufeat_mask(X86_FEATURE_SRSO_MSR_FIX)) ? " SRSO_MSR_FIX" : "");
+ (e21a & cpufeat_mask(X86_FEATURE_SRSO_MSR_FIX)) ? " SRSO_MSR_FIX" : "",
+ (e21a & cpufeat_mask(X86_FEATURE_VERW_CLEAR)) ? " VERW_CLEAR" : "");
/* Compiled-in support which pertains to mitigations. */
if ( IS_ENABLED(CONFIG_INDIRECT_THUNK) || IS_ENABLED(CONFIG_SHADOW_PAGING) ||
@@ -1545,6 +1548,77 @@ static void __init rfds_calculations(void)
setup_force_cpu_cap(X86_FEATURE_RFDS_NO);
}
+/*
+ * Transient Scheduler Attacks
+ *
+ * https://www.amd.com/content/dam/amd/en/documents/resources/bulletin/technical-guidance-for-mitigating-transient-scheduler-attacks.pdf
+ */
+static void __init tsa_calculations(void)
+{
+ unsigned int curr_rev, min_rev;
+
+ /* TSA is only known to affect AMD processors at this time. */
+ if ( boot_cpu_data.x86_vendor != X86_VENDOR_AMD )
+ return;
+
+ /* If we're virtualised, don't attempt to synthesise anything. */
+ if ( cpu_has_hypervisor )
+ return;
+
+ /*
+ * According to the whitepaper, some Fam1A CPUs (Models 0x00...0x4f,
+ * 0x60...0x7f) are not vulnerable but don't enumerate TSA_{SQ,L1}_NO. If
+ * we see either enumerated, assume both are correct ...
+ */
+ if ( cpu_has_tsa_sq_no || cpu_has_tsa_l1_no )
+ return;
+
+ /*
+ * ... otherwise, synthesise them. CPUs other than Fam19 (Zen3/4) are
+ * stated to be not vulnerable.
+ */
+ if ( boot_cpu_data.x86 != 0x19 )
+ {
+ setup_force_cpu_cap(X86_FEATURE_TSA_SQ_NO);
+ setup_force_cpu_cap(X86_FEATURE_TSA_L1_NO);
+ return;
+ }
+
+ /*
+ * Fam19 CPUs get VERW_CLEAR with new enough microcode, but must
+ * synthesise the CPUID bit.
+ */
+ curr_rev = this_cpu(cpu_sig).rev;
+ switch ( curr_rev >> 8 )
+ {
+ case 0x0a0011: min_rev = 0x0a0011d7; break;
+ case 0x0a0012: min_rev = 0x0a00123b; break;
+ case 0x0a0082: min_rev = 0x0a00820d; break;
+ case 0x0a1011: min_rev = 0x0a10114c; break;
+ case 0x0a1012: min_rev = 0x0a10124c; break;
+ case 0x0a1081: min_rev = 0x0a108109; break;
+ case 0x0a2010: min_rev = 0x0a20102e; break;
+ case 0x0a2012: min_rev = 0x0a201211; break;
+ case 0x0a4041: min_rev = 0x0a404108; break;
+ case 0x0a5000: min_rev = 0x0a500012; break;
+ case 0x0a6012: min_rev = 0x0a60120a; break;
+ case 0x0a7041: min_rev = 0x0a704108; break;
+ case 0x0a7052: min_rev = 0x0a705208; break;
+ case 0x0a7080: min_rev = 0x0a708008; break;
+ case 0x0a70c0: min_rev = 0x0a70c008; break;
+ case 0x0aa002: min_rev = 0x0aa00216; break;
+ default:
+ printk(XENLOG_WARNING
+ "Unrecognised CPU %02x-%02x-%02x, ucode 0x%08x for TSA mitigation\n",
+ boot_cpu_data.x86, boot_cpu_data.x86_model,
+ boot_cpu_data.x86_mask, curr_rev);
+ return;
+ }
+
+ if ( curr_rev >= min_rev )
+ setup_force_cpu_cap(X86_FEATURE_VERW_CLEAR);
+}
+
static bool __init cpu_has_gds(void)
{
/*
@@ -2238,6 +2312,7 @@ void __init init_speculation_mitigations(void)
* https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-analysis-microarchitectural-data-sampling.html
* https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
* https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/register-file-data-sampling.html
+ * https://www.amd.com/content/dam/amd/en/documents/resources/bulletin/technical-guidance-for-mitigating-transient-scheduler-attacks.pdf
*
* Relevant ucodes:
*
@@ -2270,9 +2345,18 @@ void __init init_speculation_mitigations(void)
*
* - March 2023, for RFDS. Enumerate RFDS_CLEAR to mean that VERW now
* scrubs non-architectural entries from certain register files.
+ *
+ * - July 2025, for TSA. Introduces VERW side effects to mitigate
+ * TSA_{SQ/L1}. Xen must synthesise the VERW_CLEAR feature based on
+ * microcode version.
+ *
+ * Note, these microcode updates were produced before the remediation of
+ * the microcode signature issues, and are included in the firmware
+ * updates fixing the entrysign vulnerability from ~December 2024.
*/
mds_calculations();
rfds_calculations();
+ tsa_calculations();
/*
* Parts which enumerate FB_CLEAR are those with now-updated microcode
@@ -2304,21 +2388,27 @@ void __init init_speculation_mitigations(void)
* MLPDS/MFBDS when SMT is enabled.
*/
if ( opt_verw_pv == -1 )
- opt_verw_pv = cpu_has_useful_md_clear || cpu_has_rfds_clear;
+ opt_verw_pv = (cpu_has_useful_md_clear || cpu_has_rfds_clear ||
+ cpu_has_verw_clear);
if ( opt_verw_hvm == -1 )
- opt_verw_hvm = cpu_has_useful_md_clear || cpu_has_rfds_clear;
+ opt_verw_hvm = (cpu_has_useful_md_clear || cpu_has_rfds_clear ||
+ cpu_has_verw_clear);
/*
- * If SMT is active, and we're protecting against MDS or MMIO stale data,
+ * If SMT is active, and we're protecting against any of:
+ * - MSBDS
+ * - MMIO stale data
+ * - TSA-SQ
* we need to scrub before going idle as well as on return to guest.
* Various pipeline resources are repartitioned amongst non-idle threads.
*
- * We don't need to scrub on idle for RFDS. There are no affected cores
- * which support SMT, despite there being affected cores in hybrid systems
- * which have SMT elsewhere in the platform.
+ * We don't need to scrub on idle for:
+ * - RFDS (no SMT affected cores)
+ * - TSA-L1 (utags never shared between threads)
*/
if ( ((cpu_has_useful_md_clear && (opt_verw_pv || opt_verw_hvm)) ||
+ (cpu_has_verw_clear && !cpu_has_tsa_sq_no) ||
opt_verw_mmio) && hw_smt_enabled )
setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index 86e44dd85258..7b80cd0c19c5 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -288,6 +288,7 @@ XEN_CPUFEATURE(AVX_IFMA, 10*32+23) /*A AVX-IFMA Instructions */
/* AMD-defined CPU features, CPUID level 0x80000021.eax, word 11 */
XEN_CPUFEATURE(LFENCE_DISPATCH, 11*32+ 2) /*A LFENCE always serializing */
+XEN_CPUFEATURE(VERW_CLEAR, 11*32+ 5) /*!A VERW clears microarchitectural buffers */
XEN_CPUFEATURE(NSCB, 11*32+ 6) /*A Null Selector Clears Base (and limit too) */
XEN_CPUFEATURE(AUTO_IBRS, 11*32+ 8) /*S Automatic IBRS */
XEN_CPUFEATURE(CPUID_USER_DIS, 11*32+17) /* CPUID disable for CPL > 0 software */
@@ -349,6 +350,8 @@ XEN_CPUFEATURE(PB_OPT_CTRL, 16*32+32) /* MSR_PB_OPT_CTRL.IBPB_ALT */
XEN_CPUFEATURE(ITS_NO, 16*32+62) /*!A No Indirect Target Selection */
/* AMD-defined CPU features, CPUID level 0x80000021.ecx, word 18 */
+XEN_CPUFEATURE(TSA_SQ_NO, 18*32+ 1) /*A No Store Queue Transitive Scheduler Attacks */
+XEN_CPUFEATURE(TSA_L1_NO, 18*32+ 2) /*A No L1D Transitive Scheduler Attacks */
#endif /* XEN_CPUFEATURE */
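The revision matching in tsa_calculations() above keys the table on the top
24 bits of the microcode revision, then compares the full revision against
the minimum fixed revision for that row. A minimal, standalone sketch of
that check (illustration only; hypothetical function name, most table rows
elided, not part of the patch):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: two rows from the table above; the rest are elided. */
static bool fam19_has_verw_clear(uint32_t rev)
{
    uint32_t min_rev;

    switch ( rev >> 8 )
    {
    case 0x0a0011: min_rev = 0x0a0011d7; break;
    case 0x0a7052: min_rev = 0x0a705208; break;
    default:       return false; /* unrecognised CPU/microcode: don't synthesise */
    }

    return rev >= min_rev;
}

int main(void)
{
    printf("%d\n", fam19_has_verw_clear(0x0a705208)); /* 1: TSA-fixed microcode */
    printf("%d\n", fam19_has_verw_clear(0x0a705206)); /* 0: older microcode */
    return 0;
}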
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 10 Sep 2024 20:59:37 +0100
Subject: x86/cpufeature: Reposition cpu_has_{lfence_dispatch,nscb}
LFENCE_DISPATCH used to be a synthetic feature, but was given a real CPUID bit
by AMD. The define wasn't moved when this was changed.
NSCB has always been a real CPUID bit, and was misplaced when introduced in
the synthetic block alongside LFENCE_DISPATCH.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 6a039b050071eba644ab414d76ac5d5fc9e067a5)
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index 6dbe6dfe0990..5e1090a5470b 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -194,6 +194,10 @@ static inline bool boot_cpu_has(unsigned int feat)
#define cpu_has_avx512_bf16 boot_cpu_has(X86_FEATURE_AVX512_BF16)
#define cpu_has_avx_ifma boot_cpu_has(X86_FEATURE_AVX_IFMA)
+/* CPUID level 0x80000021.eax */
+#define cpu_has_lfence_dispatch boot_cpu_has(X86_FEATURE_LFENCE_DISPATCH)
+#define cpu_has_nscb boot_cpu_has(X86_FEATURE_NSCB)
+
/* CPUID level 0x00000007:1.edx */
#define cpu_has_avx_vnni_int8 boot_cpu_has(X86_FEATURE_AVX_VNNI_INT8)
#define cpu_has_avx_ne_convert boot_cpu_has(X86_FEATURE_AVX_NE_CONVERT)
@@ -223,8 +227,6 @@ static inline bool boot_cpu_has(unsigned int feat)
#define cpu_has_arch_perfmon boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
#define cpu_has_cpuid_faulting boot_cpu_has(X86_FEATURE_CPUID_FAULTING)
#define cpu_has_aperfmperf boot_cpu_has(X86_FEATURE_APERFMPERF)
-#define cpu_has_lfence_dispatch boot_cpu_has(X86_FEATURE_LFENCE_DISPATCH)
-#define cpu_has_nscb boot_cpu_has(X86_FEATURE_NSCB)
#define cpu_has_xen_lbr boot_cpu_has(X86_FEATURE_XEN_LBR)
#define cpu_has_xen_shstk (IS_ENABLED(CONFIG_XEN_SHSTK) && \
boot_cpu_has(X86_FEATURE_XEN_SHSTK))
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Apr 2025 14:59:01 +0100
Subject: x86/idle: Move monitor()/mwait() wrappers into cpu-idle.c
They're not used by any other translation unit, so shouldn't live in
asm/processor.h, which is included almost everywhere.
Our new toolchain baseline knows the MONITOR/MWAIT instructions, so use them
directly rather than using raw hex.
Change the hint/extension parameters from long to int. They're specified to
remain 32-bit operands even in 64-bit mode.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 61e10fc28ccddff7c72c14acec56dc7ef2b155d1)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index d0607d8a6952..45a3140bdc26 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -59,6 +59,19 @@
/*#define DEBUG_PM_CX*/
+static always_inline void monitor(
+ const void *addr, unsigned int ecx, unsigned int edx)
+{
+ asm volatile ( "monitor"
+ :: "a" (addr), "c" (ecx), "d" (edx) );
+}
+
+static always_inline void mwait(unsigned int eax, unsigned int ecx)
+{
+ asm volatile ( "mwait"
+ :: "a" (eax), "c" (ecx) );
+}
+
#define GET_HW_RES_IN_NS(msr, val) \
do { rdmsrl(msr, val); val = tsc_ticks2ns(val); } while( 0 )
#define GET_MC6_RES(val) GET_HW_RES_IN_NS(0x664, val)
@@ -482,7 +495,7 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
mb();
}
- __monitor(monitor_addr, 0, 0);
+ monitor(monitor_addr, 0, 0);
smp_mb();
/*
@@ -496,7 +509,7 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
cpumask_set_cpu(cpu, &cpuidle_mwait_flags);
spec_ctrl_enter_idle(info);
- __mwait(eax, ecx);
+ mwait(eax, ecx);
spec_ctrl_exit_idle(info);
cpumask_clear_cpu(cpu, &cpuidle_mwait_flags);
@@ -927,9 +940,9 @@ void cf_check acpi_dead_idle(void)
*/
mb();
clflush(mwait_ptr);
- __monitor(mwait_ptr, 0, 0);
+ monitor(mwait_ptr, 0, 0);
mb();
- __mwait(cx->address, 0);
+ mwait(cx->address, 0);
}
}
else if ( (current_cpu_data.x86_vendor &
diff --git a/xen/arch/x86/include/asm/processor.h b/xen/arch/x86/include/asm/processor.h
index c709d337c9b9..c02566a915bd 100644
--- a/xen/arch/x86/include/asm/processor.h
+++ b/xen/arch/x86/include/asm/processor.h
@@ -319,23 +319,6 @@ static always_inline void set_in_cr4 (unsigned long mask)
write_cr4(read_cr4() | mask);
}
-static always_inline void __monitor(const void *eax, unsigned long ecx,
- unsigned long edx)
-{
- /* "monitor %eax,%ecx,%edx;" */
- asm volatile (
- ".byte 0x0f,0x01,0xc8;"
- : : "a" (eax), "c" (ecx), "d"(edx) );
-}
-
-static always_inline void __mwait(unsigned long eax, unsigned long ecx)
-{
- /* "mwait %eax,%ecx;" */
- asm volatile (
- ".byte 0x0f,0x01,0xc9;"
- : : "a" (eax), "c" (ecx) );
-}
-
#define IOBMP_BYTES 8192
#define IOBMP_INVALID_OFFSET 0x8000
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Apr 2025 15:55:29 +0100
Subject: x86/idle: Remove MFENCEs for CLFLUSH_MONITOR
Commit 48d32458bcd4 ("x86, idle: add barriers to CLFLUSH workaround") was
inherited from Linux and added MFENCEs around the AAI65 errata fix.
The SDM now states:
Executions of the CLFLUSH instruction are ordered with respect to each
other and with respect to writes, locked read-modify-write instructions,
and fence instructions[1].
with footnote 1 reading:
Earlier versions of this manual specified that executions of the CLFLUSH
instruction were ordered only by the MFENCE instruction. All processors
implementing the CLFLUSH instruction also order it relative to the other
operations enumerated above.
I.e. the MFENCEs came about because of an incorrect statement in the SDM.
The Spec Update (no longer available on Intel's website) simply says "issue a
CLFLUSH", with no mention of MFENCEs.
As this erratum is specific to Intel, it's fine to remove the MFENCEs; AMD
CPUs of a similar vintage do sport otherwise-unordered CLFLUSHs.
Move the feature bit into the BUG range (rather than FEATURE), and move the
workaround into monitor() itself.
The erratum check itself must use setup_force_cpu_cap(). It needs activating
if any CPU needs it, not if all of them need it.
Fixes: 48d32458bcd4 ("x86, idle: add barriers to CLFLUSH workaround")
Fixes: 96d1b237ae9b ("x86/Intel: work around Xeon 7400 series erratum AAI65")
Link: https://web.archive.org/web/20090219054841/http://download.intel.com/design/xeon/specupdt/32033601.pdf
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f77ef3443542a2c2bbd59ee66178287d4fa5b43f)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 45a3140bdc26..41d771d8f395 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -62,6 +62,9 @@
static always_inline void monitor(
const void *addr, unsigned int ecx, unsigned int edx)
{
+ alternative_input("", "clflush (%[addr])", X86_BUG_CLFLUSH_MONITOR,
+ [addr] "a" (addr));
+
asm volatile ( "monitor"
:: "a" (addr), "c" (ecx), "d" (edx) );
}
@@ -488,13 +491,6 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
s_time_t expires = per_cpu(timer_deadline, cpu);
const void *monitor_addr = &mwait_wakeup(cpu);
- if ( boot_cpu_has(X86_FEATURE_CLFLUSH_MONITOR) )
- {
- mb();
- clflush(monitor_addr);
- mb();
- }
-
monitor(monitor_addr, 0, 0);
smp_mb();
@@ -929,19 +925,7 @@ void cf_check acpi_dead_idle(void)
while ( 1 )
{
- /*
- * 1. The CLFLUSH is a workaround for erratum AAI65 for
- * the Xeon 7400 series.
- * 2. The WBINVD is insufficient due to the spurious-wakeup
- * case where we return around the loop.
- * 3. Unlike wbinvd, clflush is a light weight but not serializing
- * instruction, hence memory fence is necessary to make sure all
- * load/store visible before flush cache line.
- */
- mb();
- clflush(mwait_ptr);
monitor(mwait_ptr, 0, 0);
- mb();
mwait(cx->address, 0);
}
}
diff --git a/xen/arch/x86/cpu/intel.c b/xen/arch/x86/cpu/intel.c
index f03eedcc2511..57258220e822 100644
--- a/xen/arch/x86/cpu/intel.c
+++ b/xen/arch/x86/cpu/intel.c
@@ -446,6 +446,7 @@ static void __init probe_mwait_errata(void)
*
* Xeon 7400 erratum AAI65 (and further newer Xeons)
* MONITOR/MWAIT may have excessive false wakeups
+ * https://web.archive.org/web/20090219054841/http://download.intel.com/design/xeon/specupdt/32033601.pdf
*/
static void Intel_errata_workarounds(struct cpuinfo_x86 *c)
{
@@ -463,7 +464,7 @@ static void Intel_errata_workarounds(struct cpuinfo_x86 *c)
if (c->x86 == 6 && cpu_has_clflush &&
(c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47))
- __set_bit(X86_FEATURE_CLFLUSH_MONITOR, c->x86_capability);
+ setup_force_cpu_cap(X86_BUG_CLFLUSH_MONITOR);
probe_c3_errata(c);
if (system_state < SYS_STATE_smp_boot)
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index 9e3ed21c026d..84c93292c80c 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -19,7 +19,7 @@ XEN_CPUFEATURE(ARCH_PERFMON, X86_SYNTH( 3)) /* Intel Architectural PerfMon
XEN_CPUFEATURE(TSC_RELIABLE, X86_SYNTH( 4)) /* TSC is known to be reliable */
XEN_CPUFEATURE(XTOPOLOGY, X86_SYNTH( 5)) /* cpu topology enum extensions */
XEN_CPUFEATURE(CPUID_FAULTING, X86_SYNTH( 6)) /* cpuid faulting */
-XEN_CPUFEATURE(CLFLUSH_MONITOR, X86_SYNTH( 7)) /* clflush reqd with monitor */
+/* Bit 7 unused */
XEN_CPUFEATURE(APERFMPERF, X86_SYNTH( 8)) /* APERFMPERF */
XEN_CPUFEATURE(MFENCE_RDTSC, X86_SYNTH( 9)) /* MFENCE synchronizes RDTSC */
XEN_CPUFEATURE(XEN_SMEP, X86_SYNTH(10)) /* SMEP gets used by Xen itself */
@@ -52,6 +52,7 @@ XEN_CPUFEATURE(USE_VMCALL, X86_SYNTH(30)) /* Use VMCALL instead of VMMCAL
#define X86_BUG_NULL_SEG X86_BUG( 1) /* NULL-ing a selector preserves the base and limit. */
#define X86_BUG_CLFLUSH_MFENCE X86_BUG( 2) /* MFENCE needed to serialise CLFLUSH */
#define X86_BUG_IBPB_NO_RET X86_BUG( 3) /* IBPB doesn't flush the RSB/RAS */
+#define X86_BUG_CLFLUSH_MONITOR X86_BUG( 4) /* MONITOR requires CLFLUSH */
#define X86_SPEC_NO_LFENCE_ENTRY_PV X86_BUG(16) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_PV. */
#define X86_SPEC_NO_LFENCE_ENTRY_INTR X86_BUG(17) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_INTR. */
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 24 Jun 2025 15:20:52 +0100
Subject: Revert part of "x86/mwait-idle: disable IBRS during long idle"
Most of the patch (handling of CPUIDLE_FLAG_IBRS) is fine, but the
adjustments to mwait_idle() are not; spec_ctrl_enter_idle() does more than
just alter MSR_SPEC_CTRL.IBRS.
The only reason this doesn't need an XSA is because the unconditional
spec_ctrl_{enter,exit}_idle() in mwait_idle_with_hints() were left unaltered,
and thus the MWAIT remained properly protected.
There (would have been) two problems. In the ibrs_disable (== deep C) case:
* On entry, VERW and RSB-stuffing are architecturally skipped.
* On exit, there's a branch crossing the WRMSR which reinstates the
speculative safety for indirect branches.
All this change did was double up the expensive operations in the deep C case,
and fail to optimise the intended case.
I have an idea of how to plumb this more nicely, but it requires larger
changes to legacy IBRS handling to not make spec_ctrl_enter_idle() vulnerable
in other ways. In the short term, simply take out the perf hit.
Fixes: 08acdf9a2615 ("x86/mwait-idle: disable IBRS during long idle")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 07d7163334a7507d329958b19d976be769580999)
diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index ae6987117169..182518528a6e 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -891,7 +891,6 @@ static const struct cpuidle_state snr_cstates[] = {
static void cf_check mwait_idle(void)
{
unsigned int cpu = smp_processor_id();
- struct cpu_info *info = get_cpu_info();
struct acpi_processor_power *power = processor_powers[cpu];
struct acpi_processor_cx *cx = NULL;
unsigned int next_state;
@@ -918,6 +917,8 @@ static void cf_check mwait_idle(void)
pm_idle_save();
else
{
+ struct cpu_info *info = get_cpu_info();
+
spec_ctrl_enter_idle(info);
safe_halt();
spec_ctrl_exit_idle(info);
@@ -944,11 +945,6 @@ static void cf_check mwait_idle(void)
if ((cx->type >= 3) && errata_c6_workaround())
cx = power->safe_state;
- if (cx->ibrs_disable) {
- ASSERT(!cx->irq_enable_early);
- spec_ctrl_enter_idle(info);
- }
-
#if 0 /* XXX Can we/do we need to do something similar on Xen? */
/*
* leave_mm() to avoid costly and often unnecessary wakeups
@@ -980,10 +976,6 @@ static void cf_check mwait_idle(void)
/* Now back in C0. */
update_idle_stats(power, cx, before, after);
-
- if (cx->ibrs_disable)
- spec_ctrl_exit_idle(info);
-
local_irq_enable();
TRACE_TIME(TRC_PM_IDLE_EXIT, cx->type, after,
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Fri, 27 Jun 2025 14:46:01 +0100
Subject: x86/cpu-policy: Simplify logic in
guest_common_default_feature_adjustments()
For features which are unconditionally set in the max policies, making the
default policy match the host can be done with a conditional clear.
This is simpler than the unconditional clear, conditional set currently
performed.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 30f8fed68f3c2e63594ff9202b3d05b971781e36)
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 787785c41ae3..e34cba189c75 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -515,17 +515,14 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
* reasons, so reset the default policy back to the host values in
* case we're unaffected.
*/
- __clear_bit(X86_FEATURE_MD_CLEAR, fs);
- if ( cpu_has_md_clear )
- __set_bit(X86_FEATURE_MD_CLEAR, fs);
+ if ( !cpu_has_md_clear )
+ __clear_bit(X86_FEATURE_MD_CLEAR, fs);
- __clear_bit(X86_FEATURE_FB_CLEAR, fs);
- if ( cpu_has_fb_clear )
- __set_bit(X86_FEATURE_FB_CLEAR, fs);
+ if ( !cpu_has_fb_clear )
+ __clear_bit(X86_FEATURE_FB_CLEAR, fs);
- __clear_bit(X86_FEATURE_RFDS_CLEAR, fs);
- if ( cpu_has_rfds_clear )
- __set_bit(X86_FEATURE_RFDS_CLEAR, fs);
+ if ( !cpu_has_rfds_clear )
+ __clear_bit(X86_FEATURE_RFDS_CLEAR, fs);
/*
* The Gather Data Sampling microcode mitigation (August 2023) has an
@@ -545,13 +542,11 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
* Topology information is at the toolstack's discretion so these are
* unconditionally set in max, but pick a default which matches the host.
*/
- __clear_bit(X86_FEATURE_HTT, fs);
- if ( cpu_has_htt )
- __set_bit(X86_FEATURE_HTT, fs);
+ if ( !cpu_has_htt )
+ __clear_bit(X86_FEATURE_HTT, fs);
- __clear_bit(X86_FEATURE_CMP_LEGACY, fs);
- if ( cpu_has_cmp_legacy )
- __set_bit(X86_FEATURE_CMP_LEGACY, fs);
+ if ( !cpu_has_cmp_legacy )
+ __clear_bit(X86_FEATURE_CMP_LEGACY, fs);
/*
* On certain hardware, speculative or errata workarounds can result in
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 15:51:53 +0100
Subject: x86/idle: Remove broken MWAIT implementation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
cpuidle_wakeup_mwait() is a TOCTOU race. The cpumask_and() sampling
cpuidle_mwait_flags can take an arbitrary period of time, and there's no
guarantee that the target CPUs are still in MWAIT when writing into
mwait_wakeup(cpu).
The consequence of the race is that we'll fail to IPI certain targets. Also,
there's no guarantee that mwait_idle_with_hints() will raise a TIMER_SOFTIRQ
on its way out.
The fundamental bug is that the "in_mwait" variable needs to be in the
monitored line, and not in a separate cpuidle_mwait_flags variable, in order
to do this in a race-free way.
Arranging to fix this while keeping the old implementation is prohibitive, so
strip the current one out in order to implement the new one cleanly. In the
interim, this causes IPIs to be used unconditionally, which is safe albeit
suboptimal.
Fixes: 3d521e933e1b ("cpuidle: mwait on softirq_pending & remove wakeup ipis")
Fixes: 1adb34ea846d ("CPUIDLE: re-implement mwait wakeup process")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 3faf0866a33070b926ab78e6298290403f85e76c)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 41d771d8f395..4ed1878e262c 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -448,27 +448,6 @@ static int __init cf_check cpu_idle_key_init(void)
}
__initcall(cpu_idle_key_init);
-/*
- * The bit is set iff cpu use monitor/mwait to enter C state
- * with this flag set, CPU can be waken up from C state
- * by writing to specific memory address, instead of sending an IPI.
- */
-static cpumask_t cpuidle_mwait_flags;
-
-void cpuidle_wakeup_mwait(cpumask_t *mask)
-{
- cpumask_t target;
- unsigned int cpu;
-
- cpumask_and(&target, mask, &cpuidle_mwait_flags);
-
- /* CPU is MWAITing on the cpuidle_mwait_wakeup flag. */
- for_each_cpu(cpu, &target)
- mwait_wakeup(cpu) = 0;
-
- cpumask_andnot(mask, mask, &target);
-}
-
/* Force sending of a wakeup IPI regardless of mwait usage. */
bool __ro_after_init force_mwait_ipi_wakeup;
@@ -477,42 +456,25 @@ bool arch_skip_send_event_check(unsigned int cpu)
if ( force_mwait_ipi_wakeup )
return false;
- /*
- * This relies on softirq_pending() and mwait_wakeup() to access data
- * on the same cache line.
- */
- smp_mb();
- return !!cpumask_test_cpu(cpu, &cpuidle_mwait_flags);
+ return false;
}
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
- s_time_t expires = per_cpu(timer_deadline, cpu);
- const void *monitor_addr = &mwait_wakeup(cpu);
+ const unsigned int *this_softirq_pending = &softirq_pending(cpu);
- monitor(monitor_addr, 0, 0);
+ monitor(this_softirq_pending, 0, 0);
smp_mb();
- /*
- * Timer deadline passing is the event on which we will be woken via
- * cpuidle_mwait_wakeup. So check it now that the location is armed.
- */
- if ( (expires > NOW() || expires == 0) && !softirq_pending(cpu) )
+ if ( !*this_softirq_pending )
{
struct cpu_info *info = get_cpu_info();
- cpumask_set_cpu(cpu, &cpuidle_mwait_flags);
-
spec_ctrl_enter_idle(info);
mwait(eax, ecx);
spec_ctrl_exit_idle(info);
-
- cpumask_clear_cpu(cpu, &cpuidle_mwait_flags);
}
-
- if ( expires <= NOW() && expires > 0 )
- raise_softirq(TIMER_SOFTIRQ);
}
static void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
@@ -913,7 +875,7 @@ void cf_check acpi_dead_idle(void)
if ( cx->entry_method == ACPI_CSTATE_EM_FFH )
{
- void *mwait_ptr = &mwait_wakeup(smp_processor_id());
+ void *mwait_ptr = &softirq_pending(smp_processor_id());
/*
* Cache must be flushed as the last operation before sleeping.
diff --git a/xen/arch/x86/hpet.c b/xen/arch/x86/hpet.c
index 2f54d3188966..84f820fef605 100644
--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -187,8 +187,6 @@ static void evt_do_broadcast(cpumask_t *mask)
if ( __cpumask_test_and_clear_cpu(cpu, mask) )
raise_softirq(TIMER_SOFTIRQ);
- cpuidle_wakeup_mwait(mask);
-
if ( !cpumask_empty(mask) )
cpumask_raise_softirq(mask, TIMER_SOFTIRQ);
}
diff --git a/xen/arch/x86/include/asm/hardirq.h b/xen/arch/x86/include/asm/hardirq.h
index 342361cb6fdd..f3e93cc9b507 100644
--- a/xen/arch/x86/include/asm/hardirq.h
+++ b/xen/arch/x86/include/asm/hardirq.h
@@ -5,11 +5,10 @@
#include <xen/types.h>
typedef struct {
- unsigned int __softirq_pending;
- unsigned int __local_irq_count;
- unsigned int nmi_count;
- unsigned int mce_count;
- bool __mwait_wakeup;
+ unsigned int __softirq_pending;
+ unsigned int __local_irq_count;
+ unsigned int nmi_count;
+ unsigned int mce_count;
} __cacheline_aligned irq_cpustat_t;
#include <xen/irq_cpustat.h> /* Standard mappings for irq_cpustat_t above */
diff --git a/xen/include/xen/cpuidle.h b/xen/include/xen/cpuidle.h
index 705d0c1135f0..120e354fe340 100644
--- a/xen/include/xen/cpuidle.h
+++ b/xen/include/xen/cpuidle.h
@@ -92,8 +92,6 @@ extern struct cpuidle_governor *cpuidle_current_governor;
bool cpuidle_using_deep_cstate(void);
void cpuidle_disable_deep_cstate(void);
-extern void cpuidle_wakeup_mwait(cpumask_t *mask);
-
#define CPUIDLE_DRIVER_STATE_START 1
extern void menu_get_trace_data(u32 *expected, u32 *pred);
diff --git a/xen/include/xen/irq_cpustat.h b/xen/include/xen/irq_cpustat.h
index b9629f25c266..5f039b4b9a76 100644
--- a/xen/include/xen/irq_cpustat.h
+++ b/xen/include/xen/irq_cpustat.h
@@ -24,6 +24,5 @@ extern irq_cpustat_t irq_stat[];
/* arch independent irq_stat fields */
#define softirq_pending(cpu) __IRQ_STAT((cpu), __softirq_pending)
#define local_irq_count(cpu) __IRQ_STAT((cpu), __local_irq_count)
-#define mwait_wakeup(cpu) __IRQ_STAT((cpu), __mwait_wakeup)
#endif /* __irq_cpustat_h */
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 18:13:27 +0100
Subject: x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
With the recent simplifications, it becomes obvious that smp_mb() isn't the
right barrier. Strictly speaking, MONITOR is ordered as a load, but smp_rmb()
isn't correct either, as this only pertains to local ordering. All we need is
a compiler barrier().
Merge the barrier() into the monitor() itself, along with an explanation.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit e7710dd843ba9d204f6ee2973d6120c1984958a6)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 4ed1878e262c..a4a6f8694373 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -65,8 +65,12 @@ static always_inline void monitor(
alternative_input("", "clflush (%[addr])", X86_BUG_CLFLUSH_MONITOR,
[addr] "a" (addr));
+ /*
+ * The memory clobber is a compiler barrier. Subsequent reads from the
+ * monitored cacheline must not be reordered over MONITOR.
+ */
asm volatile ( "monitor"
- :: "a" (addr), "c" (ecx), "d" (edx) );
+ :: "a" (addr), "c" (ecx), "d" (edx) : "memory" );
}
static always_inline void mwait(unsigned int eax, unsigned int ecx)
@@ -465,7 +469,6 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
const unsigned int *this_softirq_pending = &softirq_pending(cpu);
monitor(this_softirq_pending, 0, 0);
- smp_mb();
if ( !*this_softirq_pending )
{
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 21:40:51 +0100
Subject: x86/idle: Convert force_mwait_ipi_wakeup to X86_BUG_MONITOR
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
We're going to want alternative-patch based on it.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit b0ca0f93f47c43f8984981137af07ca3d161e3ec)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index a4a6f8694373..c42ffb244e8b 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -452,14 +452,8 @@ static int __init cf_check cpu_idle_key_init(void)
}
__initcall(cpu_idle_key_init);
-/* Force sending of a wakeup IPI regardless of mwait usage. */
-bool __ro_after_init force_mwait_ipi_wakeup;
-
bool arch_skip_send_event_check(unsigned int cpu)
{
- if ( force_mwait_ipi_wakeup )
- return false;
-
return false;
}
diff --git a/xen/arch/x86/cpu/intel.c b/xen/arch/x86/cpu/intel.c
index 57258220e822..dbf17be1287f 100644
--- a/xen/arch/x86/cpu/intel.c
+++ b/xen/arch/x86/cpu/intel.c
@@ -436,7 +436,7 @@ static void __init probe_mwait_errata(void)
{
printk(XENLOG_WARNING
"Forcing IPI MWAIT wakeup due to CPU erratum\n");
- force_mwait_ipi_wakeup = true;
+ setup_force_cpu_cap(X86_BUG_MONITOR);
}
}
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index 84c93292c80c..56231b00f15d 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -53,6 +53,7 @@ XEN_CPUFEATURE(USE_VMCALL, X86_SYNTH(30)) /* Use VMCALL instead of VMMCAL
#define X86_BUG_CLFLUSH_MFENCE X86_BUG( 2) /* MFENCE needed to serialise CLFLUSH */
#define X86_BUG_IBPB_NO_RET X86_BUG( 3) /* IBPB doesn't flush the RSB/RAS */
#define X86_BUG_CLFLUSH_MONITOR X86_BUG( 4) /* MONITOR requires CLFLUSH */
+#define X86_BUG_MONITOR X86_BUG( 5) /* MONITOR doesn't always notice writes (force IPIs) */
#define X86_SPEC_NO_LFENCE_ENTRY_PV X86_BUG(16) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_PV. */
#define X86_SPEC_NO_LFENCE_ENTRY_INTR X86_BUG(17) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_INTR. */
diff --git a/xen/arch/x86/include/asm/mwait.h b/xen/arch/x86/include/asm/mwait.h
index 1f1e39775b99..9298f987c435 100644
--- a/xen/arch/x86/include/asm/mwait.h
+++ b/xen/arch/x86/include/asm/mwait.h
@@ -13,9 +13,6 @@
#define MWAIT_ECX_INTERRUPT_BREAK 0x1
-/* Force sending of a wakeup IPI regardless of mwait usage. */
-extern bool force_mwait_ipi_wakeup;
-
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx);
bool mwait_pc10_supported(void);
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 21:04:17 +0100
Subject: xen/softirq: Rework arch_skip_send_event_check() into
arch_set_softirq()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
x86 is the only architecture wanting an optimisation here, but the
test_and_set_bit() is a store into the monitored line (i.e. will wake up the
target) and, prior to the removal of the broken IPI-elision algorithm, was
racy, causing unnecessary IPIs to be sent.
To do this in a race-free way, the store to the monitored line needs to also
sample the status of the target in one atomic action. Implement a new arch
helper with different semantics; to make the softirq pending and decide about
IPIs together. For now, implement the default helper. It will be overridden
by x86 in a subsequent change.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit b473e5e212e445d3c193c1c83b52b129af571b19)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index c42ffb244e8b..489d894c2f66 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -452,11 +452,6 @@ static int __init cf_check cpu_idle_key_init(void)
}
__initcall(cpu_idle_key_init);
-bool arch_skip_send_event_check(unsigned int cpu)
-{
- return false;
-}
-
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
diff --git a/xen/arch/x86/include/asm/softirq.h b/xen/arch/x86/include/asm/softirq.h
index 415ee866c79d..e4b194f069fb 100644
--- a/xen/arch/x86/include/asm/softirq.h
+++ b/xen/arch/x86/include/asm/softirq.h
@@ -9,6 +9,4 @@
#define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
#define NR_ARCH_SOFTIRQS 5
-bool arch_skip_send_event_check(unsigned int cpu);
-
#endif /* __ASM_SOFTIRQ_H__ */
diff --git a/xen/common/softirq.c b/xen/common/softirq.c
index bee4a82009c3..626c47de82ac 100644
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -94,9 +94,7 @@ void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
raise_mask = &per_cpu(batch_mask, this_cpu);
for_each_cpu(cpu, mask)
- if ( !test_and_set_bit(nr, &softirq_pending(cpu)) &&
- cpu != this_cpu &&
- !arch_skip_send_event_check(cpu) )
+ if ( !arch_set_softirq(nr, cpu) && cpu != this_cpu )
__cpumask_set_cpu(cpu, raise_mask);
if ( raise_mask == &send_mask )
@@ -107,9 +105,7 @@ void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
{
unsigned int this_cpu = smp_processor_id();
- if ( test_and_set_bit(nr, &softirq_pending(cpu))
- || (cpu == this_cpu)
- || arch_skip_send_event_check(cpu) )
+ if ( arch_set_softirq(nr, cpu) || cpu == this_cpu )
return;
if ( !per_cpu(batching, this_cpu) || in_irq() )
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index 33d6f2ecd223..5c2361865b49 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -21,6 +21,22 @@ enum {
#define NR_SOFTIRQS (NR_COMMON_SOFTIRQS + NR_ARCH_SOFTIRQS)
+/*
+ * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
+ * skipped, false if the IPI cannot be skipped.
+ */
+#ifndef arch_set_softirq
+static always_inline bool arch_set_softirq(unsigned int nr, unsigned int cpu)
+{
+ /*
+ * Try to set the softirq pending. If we set the bit (i.e. the old bit
+ * was 0), we're responsible to send the IPI. If the softirq was already
+ * pending (i.e. the old bit was 1), no IPI is needed.
+ */
+ return test_and_set_bit(nr, &softirq_pending(cpu));
+}
+#endif
+
typedef void (*softirq_handler)(void);
void do_softirq(void);
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 21:26:24 +0100
Subject: x86/idle: Implement a new MWAIT IPI-elision algorithm
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
In order to elide IPIs, we must be able to identify whether a target CPU is in
MWAIT at the point it is woken up. I.e. the store to wake it up must also
identify the state.
Create a new in_mwait variable beside __softirq_pending, so we can use a
CMPXCHG to set the softirq while also observing the status safely. Implement
an x86 version of arch_pend_softirq() which does this.
In mwait_idle_with_hints(), advertise in_mwait, with an explanation of
precisely what it means. X86_BUG_MONITOR can be accounted for simply by not
advertising in_mwait.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 3e0bc4b50350bd357304fd79a5dc0472790dba91)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 489d894c2f66..176df1ed174f 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -455,7 +455,21 @@ __initcall(cpu_idle_key_init);
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
- const unsigned int *this_softirq_pending = &softirq_pending(cpu);
+ irq_cpustat_t *stat = &irq_stat[cpu];
+ const unsigned int *this_softirq_pending = &stat->__softirq_pending;
+
+ /*
+ * By setting in_mwait, we promise to other CPUs that we'll notice changes
+ * to __softirq_pending without being sent an IPI. We achieve this by
+ * either not going to sleep, or by having hardware notice on our behalf.
+ *
+ * Some errata exist where MONITOR doesn't work properly, and the
+ * workaround is to force the use of an IPI. Cause this to happen by
+ * simply not advertising ourselves as being in_mwait.
+ */
+ alternative_io("movb $1, %[in_mwait]",
+ "", X86_BUG_MONITOR,
+ [in_mwait] "=m" (stat->in_mwait));
monitor(this_softirq_pending, 0, 0);
@@ -467,6 +481,10 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
mwait(eax, ecx);
spec_ctrl_exit_idle(info);
}
+
+ alternative_io("movb $0, %[in_mwait]",
+ "", X86_BUG_MONITOR,
+ [in_mwait] "=m" (stat->in_mwait));
}
static void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
diff --git a/xen/arch/x86/include/asm/hardirq.h b/xen/arch/x86/include/asm/hardirq.h
index f3e93cc9b507..1647cff04dc8 100644
--- a/xen/arch/x86/include/asm/hardirq.h
+++ b/xen/arch/x86/include/asm/hardirq.h
@@ -5,7 +5,19 @@
#include <xen/types.h>
typedef struct {
- unsigned int __softirq_pending;
+ /*
+ * The layout is important. Any CPU can set bits in __softirq_pending,
+ * but in_mwait is a status bit owned by the CPU. softirq_mwait_raw must
+ * cover both, and must be in a single cacheline.
+ */
+ union {
+ struct {
+ unsigned int __softirq_pending;
+ bool in_mwait;
+ };
+ uint64_t softirq_mwait_raw;
+ };
+
unsigned int __local_irq_count;
unsigned int nmi_count;
unsigned int mce_count;
diff --git a/xen/arch/x86/include/asm/softirq.h b/xen/arch/x86/include/asm/softirq.h
index e4b194f069fb..55b65c9747b1 100644
--- a/xen/arch/x86/include/asm/softirq.h
+++ b/xen/arch/x86/include/asm/softirq.h
@@ -1,6 +1,8 @@
#ifndef __ASM_SOFTIRQ_H__
#define __ASM_SOFTIRQ_H__
+#include <asm/system.h>
+
#define NMI_SOFTIRQ (NR_COMMON_SOFTIRQS + 0)
#define TIME_CALIBRATE_SOFTIRQ (NR_COMMON_SOFTIRQS + 1)
#define VCPU_KICK_SOFTIRQ (NR_COMMON_SOFTIRQS + 2)
@@ -9,4 +11,50 @@
#define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
#define NR_ARCH_SOFTIRQS 5
+/*
+ * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
+ * skipped, false if the IPI cannot be skipped.
+ *
+ * We use a CMPXCHG covering both __softirq_pending and in_mwait, in order to
+ * set softirq @nr while also observing in_mwait in a race-free way.
+ */
+static always_inline bool arch_set_softirq(unsigned int nr, unsigned int cpu)
+{
+ uint64_t *ptr = &irq_stat[cpu].softirq_mwait_raw;
+ uint64_t prev, old, new;
+ unsigned int softirq = 1U << nr;
+
+ old = ACCESS_ONCE(*ptr);
+
+ for ( ;; )
+ {
+ if ( old & softirq )
+ /* Softirq already pending, nothing to do. */
+ return true;
+
+ new = old | softirq;
+
+ prev = cmpxchg(ptr, old, new);
+ if ( prev == old )
+ break;
+
+ old = prev;
+ }
+
+ /*
+ * We have caused the softirq to become pending. If in_mwait was set, the
+ * target CPU will notice the modification and act on it.
+ *
+ * We can't access the in_mwait field nicely, so use some BUILD_BUG_ON()'s
+ * to cross-check the (1UL << 32) opencoding.
+ */
+ BUILD_BUG_ON(sizeof(irq_stat[0].softirq_mwait_raw) != 8);
+ BUILD_BUG_ON((offsetof(irq_cpustat_t, in_mwait) -
+ offsetof(irq_cpustat_t, softirq_mwait_raw)) != 4);
+
+ return new & (1UL << 32) /* in_mwait */;
+
+}
+#define arch_set_softirq arch_set_softirq
+
#endif /* __ASM_SOFTIRQ_H__ */
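The (1UL << 32) open-coding above relies on the layout cross-checked by the
BUILD_BUG_ON()s: __softirq_pending occupies bits 0-31 of softirq_mwait_raw
and in_mwait starts at byte offset 4. A standalone sketch demonstrating that
relationship (assumes a little-endian x86 ABI; hypothetical union, not the
Xen type):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

union softirq_mwait {
    struct {
        uint32_t softirq_pending; /* bits 0-31 of raw */
        bool     in_mwait;        /* byte offset 4, i.e. bit 32 of raw */
    };
    uint64_t raw;
};

int main(void)
{
    union softirq_mwait s;

    memset(&s, 0, sizeof(s));
    s.in_mwait = true;

    /* Matches the open-coded check: new & (1UL << 32). */
    assert(s.raw & (1ULL << 32));
    return 0;
}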
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Wed, 2 Jul 2025 14:51:38 +0100
Subject: x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on
Xeons"
The check of this_softirq_pending must be performed with irqs disabled, but
this property was broken by an attempt to optimise entry/exit latency.
Commit c227233ad64c in Linux (which we copied into Xen) was fixed up by
edc8fc01f608 in Linux, which we have so far missed.
Going to sleep without waking on interrupts is nonsensical outside of
play_dead(), so overload this to select between two possible MWAITs, the
second using the STI shadow to cover MWAIT for exactly the same reason as we
do in safe_halt().
Fixes: b17e0ec72ede ("x86/mwait-idle: enable interrupts before C1 on Xeons")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9b0f0f6e235618c2764e925b58c4bfe412730ced)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 176df1ed174f..69857a58ef5a 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -79,6 +79,13 @@ static always_inline void mwait(unsigned int eax, unsigned int ecx)
:: "a" (eax), "c" (ecx) );
}
+static always_inline void sti_mwait_cli(unsigned int eax, unsigned int ecx)
+{
+ /* STI shadow covers MWAIT. */
+ asm volatile ( "sti; mwait; cli"
+ :: "a" (eax), "c" (ecx) );
+}
+
#define GET_HW_RES_IN_NS(msr, val) \
do { rdmsrl(msr, val); val = tsc_ticks2ns(val); } while( 0 )
#define GET_MC6_RES(val) GET_HW_RES_IN_NS(0x664, val)
@@ -473,12 +480,19 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
monitor(this_softirq_pending, 0, 0);
+ ASSERT(!local_irq_is_enabled());
+
if ( !*this_softirq_pending )
{
struct cpu_info *info = get_cpu_info();
spec_ctrl_enter_idle(info);
- mwait(eax, ecx);
+
+ if ( ecx & MWAIT_ECX_INTERRUPT_BREAK )
+ mwait(eax, ecx);
+ else
+ sti_mwait_cli(eax, ecx);
+
spec_ctrl_exit_idle(info);
}
diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index 182518528a6e..3c63f0d45a11 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -962,12 +962,8 @@ static void cf_check mwait_idle(void)
update_last_cx_stat(power, cx, before);
- if (cx->irq_enable_early)
- local_irq_enable();
-
- mwait_idle_with_hints(cx->address, MWAIT_ECX_INTERRUPT_BREAK);
-
- local_irq_disable();
+ mwait_idle_with_hints(cx->address,
+ cx->irq_enable_early ? 0 : MWAIT_ECX_INTERRUPT_BREAK);
after = alternative_call(cpuidle_get_tick);
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Fri, 27 Jun 2025 17:19:19 +0100
Subject: x86/cpu-policy: Rearrange guest_common_*_feature_adjustments()
Turn the if()s into switch()es, as we're going to need AMD sections.
Move the RTM adjustments into the Intel section, where they ought to live.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index e34cba189c75..af2b4d7fa000 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -415,8 +415,9 @@ static void __init guest_common_default_leaves(struct cpu_policy *p)
static void __init guest_common_max_feature_adjustments(uint32_t *fs)
{
- if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL )
+ switch ( boot_cpu_data.x86_vendor )
{
+ case X86_VENDOR_INTEL:
/*
* MSR_ARCH_CAPS is just feature data, and we can offer it to guests
* unconditionally, although limit it to Intel systems as it is highly
@@ -461,6 +462,22 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
boot_cpu_data.x86_model == INTEL_FAM6_SKYLAKE_X &&
raw_cpu_policy.feat.clwb )
__set_bit(X86_FEATURE_CLWB, fs);
+
+ /*
+ * To mitigate Native-BHI, one option is to use a TSX Abort on capable
+ * systems. This is safe even if RTM has been disabled for other
+ * reasons via MSR_TSX_{CTRL,FORCE_ABORT}. However, a guest kernel
+ * doesn't get to know this type of information.
+ *
+ * Therefore the meaning of RTM_ALWAYS_ABORT has been adjusted, to
+ * instead mean "XBEGIN won't fault". This is enough for a guest
+ * kernel to make an informed choice WRT mitigating Native-BHI.
+ *
+ * If RTM-capable, we can run a VM which has seen RTM_ALWAYS_ABORT.
+ */
+ if ( test_bit(X86_FEATURE_RTM, fs) )
+ __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
+ break;
}
/*
@@ -472,27 +489,13 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
*/
__set_bit(X86_FEATURE_HTT, fs);
__set_bit(X86_FEATURE_CMP_LEGACY, fs);
-
- /*
- * To mitigate Native-BHI, one option is to use a TSX Abort on capable
- * systems. This is safe even if RTM has been disabled for other reasons
- * via MSR_TSX_{CTRL,FORCE_ABORT}. However, a guest kernel doesn't get to
- * know this type of information.
- *
- * Therefore the meaning of RTM_ALWAYS_ABORT has been adjusted, to instead
- * mean "XBEGIN won't fault". This is enough for a guest kernel to make
- * an informed choice WRT mitigating Native-BHI.
- *
- * If RTM-capable, we can run a VM which has seen RTM_ALWAYS_ABORT.
- */
- if ( test_bit(X86_FEATURE_RTM, fs) )
- __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
}
static void __init guest_common_default_feature_adjustments(uint32_t *fs)
{
- if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL )
+ switch ( boot_cpu_data.x86_vendor )
{
+ case X86_VENDOR_INTEL:
/*
* IvyBridge client parts suffer from leakage of RDRAND data due to SRBDS
* (XSA-320 / CVE-2020-0543), and won't be receiving microcode to
@@ -536,6 +539,23 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
boot_cpu_data.x86_model == INTEL_FAM6_SKYLAKE_X &&
raw_cpu_policy.feat.clwb )
__clear_bit(X86_FEATURE_CLWB, fs);
+
+ /*
+ * On certain hardware, speculative or errata workarounds can result
+ * in TSX being placed in "force-abort" mode, where it doesn't
+ * actually function as expected, but is technically compatible with
+ * the ISA.
+ *
+ * Do not advertise RTM to guests by default if it won't actually
+ * work. Instead, advertise RTM_ALWAYS_ABORT indicating that TSX
+ * Aborts are safe to use, e.g. for mitigating Native-BHI.
+ */
+ if ( rtm_disabled )
+ {
+ __clear_bit(X86_FEATURE_RTM, fs);
+ __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
+ }
+ break;
}
/*
@@ -547,21 +567,6 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
if ( !cpu_has_cmp_legacy )
__clear_bit(X86_FEATURE_CMP_LEGACY, fs);
-
- /*
- * On certain hardware, speculative or errata workarounds can result in
- * TSX being placed in "force-abort" mode, where it doesn't actually
- * function as expected, but is technically compatible with the ISA.
- *
- * Do not advertise RTM to guests by default if it won't actually work.
- * Instead, advertise RTM_ALWAYS_ABORT indicating that TSX Aborts are safe
- * to use, e.g. for mitigating Native-BHI.
- */
- if ( rtm_disabled )
- {
- __clear_bit(X86_FEATURE_RTM, fs);
- __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
- }
}
static void __init guest_common_feature_adjustments(uint32_t *fs)
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 10 Sep 2024 19:55:15 +0100
Subject: x86/cpu-policy: Infrastructure for CPUID leaf 0x80000021.ecx
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
diff --git a/tools/libs/light/libxl_cpuid.c b/tools/libs/light/libxl_cpuid.c
index 063fe86eb72f..f738e17b19e4 100644
--- a/tools/libs/light/libxl_cpuid.c
+++ b/tools/libs/light/libxl_cpuid.c
@@ -342,6 +342,7 @@ int libxl_cpuid_parse_config(libxl_cpuid_policy_list *policy, const char* str)
CPUID_ENTRY(0x00000007, 1, CPUID_REG_EDX),
MSR_ENTRY(0x10a, CPUID_REG_EAX),
MSR_ENTRY(0x10a, CPUID_REG_EDX),
+ CPUID_ENTRY(0x80000021, NA, CPUID_REG_ECX),
#undef MSR_ENTRY
#undef CPUID_ENTRY
};
diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
index 4c4593528dfe..8e36b8e69600 100644
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -37,6 +37,7 @@ static const struct {
{ "CPUID 0x00000007:1.edx", "7d1" },
{ "MSR_ARCH_CAPS.lo", "m10Al" },
{ "MSR_ARCH_CAPS.hi", "m10Ah" },
+ { "CPUID 0x80000021.ecx", "e21c" },
};
#define COL_ALIGN "24"
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index af2b4d7fa000..f40b25c91681 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -327,7 +327,6 @@ static void recalculate_misc(struct cpu_policy *p)
p->extd.raw[0x1f] = EMPTY_LEAF; /* SEV */
p->extd.raw[0x20] = EMPTY_LEAF; /* Platform QoS */
p->extd.raw[0x21].b = 0;
- p->extd.raw[0x21].c = 0;
p->extd.raw[0x21].d = 0;
break;
}
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index ff4cd2289797..d4d21da9c560 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -479,7 +479,9 @@ static void generic_identify(struct cpuinfo_x86 *c)
if (c->extended_cpuid_level >= 0x80000008)
c->x86_capability[FEATURESET_e8b] = cpuid_ebx(0x80000008);
if (c->extended_cpuid_level >= 0x80000021)
- c->x86_capability[FEATURESET_e21a] = cpuid_eax(0x80000021);
+ cpuid(0x80000021,
+ &c->x86_capability[FEATURESET_e21a], &tmp,
+ &c->x86_capability[FEATURESET_e21c], &tmp);
/* Intel-defined flags: level 0x00000007 */
if (c->cpuid_level >= 7) {
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index 3a2b646f0268..03acd49387aa 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -369,6 +369,8 @@ XEN_CPUFEATURE(RFDS_CLEAR, 16*32+28) /*!A| Register File(s) cleared by V
XEN_CPUFEATURE(PB_OPT_CTRL, 16*32+32) /* MSR_PB_OPT_CTRL.IBPB_ALT */
XEN_CPUFEATURE(ITS_NO, 16*32+62) /*!A No Indirect Target Selection */
+/* AMD-defined CPU features, CPUID level 0x80000021.ecx, word 18 */
+
#endif /* XEN_CPUFEATURE */
/* Clean up from a default include. Close the enum (for C). */
diff --git a/xen/include/xen/lib/x86/cpu-policy.h b/xen/include/xen/lib/x86/cpu-policy.h
index 753ac78114da..ae0db6f3e16f 100644
--- a/xen/include/xen/lib/x86/cpu-policy.h
+++ b/xen/include/xen/lib/x86/cpu-policy.h
@@ -22,6 +22,7 @@
#define FEATURESET_7d1 15 /* 0x00000007:1.edx */
#define FEATURESET_m10Al 16 /* 0x0000010a.eax */
#define FEATURESET_m10Ah 17 /* 0x0000010a.edx */
+#define FEATURESET_e21c 18 /* 0x80000021.ecx */
struct cpuid_leaf
{
@@ -328,7 +329,11 @@ struct cpu_policy
uint16_t ucode_size; /* Units of 16 bytes */
uint8_t rap_size; /* Units of 8 entries */
uint8_t :8;
- uint32_t /* c */:32, /* d */:32;
+ union {
+ uint32_t e21c;
+ struct { DECL_BITFIELD(e21c); };
+ };
+ uint32_t /* d */:32;
};
} extd;
diff --git a/xen/lib/x86/cpuid.c b/xen/lib/x86/cpuid.c
index eb7698dc7325..6298d051f2a6 100644
--- a/xen/lib/x86/cpuid.c
+++ b/xen/lib/x86/cpuid.c
@@ -81,6 +81,7 @@ void x86_cpu_policy_to_featureset(
fs[FEATURESET_7d1] = p->feat._7d1;
fs[FEATURESET_m10Al] = p->arch_caps.lo;
fs[FEATURESET_m10Ah] = p->arch_caps.hi;
+ fs[FEATURESET_e21c] = p->extd.e21c;
}
void x86_cpu_featureset_to_policy(
@@ -104,6 +105,7 @@ void x86_cpu_featureset_to_policy(
p->feat._7d1 = fs[FEATURESET_7d1];
p->arch_caps.lo = fs[FEATURESET_m10Al];
p->arch_caps.hi = fs[FEATURESET_m10Ah];
+ p->extd.e21c = fs[FEATURESET_e21c];
}
void x86_cpu_policy_recalc_synth(struct cpu_policy *p)
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Fri, 27 Sep 2024 11:28:39 +0100
Subject: x86/ucode: Digests for TSA microcode
AMD are releasing microcode for TSA, so extend the known-provenance list with
their hashes. These were produced before the remediation of the microcode
signature issues (the entrysign vulnerability), so can be OS-loaded on
out-of-date firmware.
Include an off-by-default check for the sorted-ness of patch_digests[]. It's
not worth running generally under SELF_TESTS, but is useful when editing the
digest list.
This is part of XSA-471 / CVE-2024-36350 / CVE-2024-36357.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
diff --git a/xen/arch/x86/cpu/microcode/amd-patch-digests.c b/xen/arch/x86/cpu/microcode/amd-patch-digests.c
index d32761226712..d2c4e0178a1e 100644
--- a/xen/arch/x86/cpu/microcode/amd-patch-digests.c
+++ b/xen/arch/x86/cpu/microcode/amd-patch-digests.c
@@ -80,6 +80,15 @@
0x0d, 0x5b, 0x65, 0x34, 0x69, 0xb2, 0x62, 0x21,
},
},
+{
+ .patch_id = 0x0a0011d7,
+ .digest = {
+ 0x35, 0x07, 0xcd, 0x40, 0x94, 0xbc, 0x81, 0x6b,
+ 0xfc, 0x61, 0x56, 0x1a, 0xe2, 0xdb, 0x96, 0x12,
+ 0x1c, 0x1c, 0x31, 0xb1, 0x02, 0x6f, 0xe5, 0xd2,
+ 0xfe, 0x1b, 0x04, 0x03, 0x2c, 0x8f, 0x4c, 0x36,
+ },
+},
{
.patch_id = 0x0a001238,
.digest = {
@@ -89,6 +98,15 @@
0xc0, 0xcd, 0x33, 0xf2, 0x8d, 0xf9, 0xef, 0x59,
},
},
+{
+ .patch_id = 0x0a00123b,
+ .digest = {
+ 0xef, 0xa1, 0x1e, 0x71, 0xf1, 0xc3, 0x2c, 0xe2,
+ 0xc3, 0xef, 0x69, 0x41, 0x7a, 0x54, 0xca, 0xc3,
+ 0x8f, 0x62, 0x84, 0xee, 0xc2, 0x39, 0xd9, 0x28,
+ 0x95, 0xa7, 0x12, 0x49, 0x1e, 0x30, 0x71, 0x72,
+ },
+},
{
.patch_id = 0x0a00820c,
.digest = {
@@ -98,6 +116,15 @@
0xe1, 0x3b, 0x8d, 0xb2, 0xf8, 0x22, 0x03, 0xe2,
},
},
+{
+ .patch_id = 0x0a00820d,
+ .digest = {
+ 0xf9, 0x2a, 0xc0, 0xf4, 0x9e, 0xa4, 0x87, 0xa4,
+ 0x7d, 0x87, 0x00, 0xfd, 0xab, 0xda, 0x19, 0xca,
+ 0x26, 0x51, 0x32, 0xc1, 0x57, 0x91, 0xdf, 0xc1,
+ 0x05, 0xeb, 0x01, 0x7c, 0x5a, 0x95, 0x21, 0xb7,
+ },
+},
{
.patch_id = 0x0a101148,
.digest = {
@@ -107,6 +134,15 @@
0xf1, 0x5e, 0xb0, 0xde, 0xb4, 0x98, 0xae, 0xc4,
},
},
+{
+ .patch_id = 0x0a10114c,
+ .digest = {
+ 0x9e, 0xb6, 0xa2, 0xd9, 0x87, 0x38, 0xc5, 0x64,
+ 0xd8, 0x88, 0xfa, 0x78, 0x98, 0xf9, 0x6f, 0x74,
+ 0x39, 0x90, 0x1b, 0xa5, 0xcf, 0x5e, 0xb4, 0x2a,
+ 0x02, 0xff, 0xd4, 0x8c, 0x71, 0x8b, 0xe2, 0xc0,
+ },
+},
{
.patch_id = 0x0a101248,
.digest = {
@@ -116,6 +152,15 @@
0x1b, 0x7d, 0x64, 0x9d, 0x4b, 0x53, 0x13, 0x75,
},
},
+{
+ .patch_id = 0x0a10124c,
+ .digest = {
+ 0x29, 0xea, 0xf1, 0x2c, 0xb2, 0xe4, 0xef, 0x90,
+ 0xa4, 0xcd, 0x1d, 0x86, 0x97, 0x17, 0x61, 0x46,
+ 0xfc, 0x22, 0xcb, 0x57, 0x75, 0x19, 0xc8, 0xcc,
+ 0x0c, 0xf5, 0xbc, 0xac, 0x81, 0x9d, 0x9a, 0xd2,
+ },
+},
{
.patch_id = 0x0a108108,
.digest = {
@@ -125,6 +170,15 @@
0x28, 0x1e, 0x9c, 0x59, 0x69, 0x99, 0x4d, 0x16,
},
},
+{
+ .patch_id = 0x0a108109,
+ .digest = {
+ 0x85, 0xb4, 0xbd, 0x7c, 0x49, 0xa7, 0xbd, 0xfa,
+ 0x49, 0x36, 0x80, 0x81, 0xc5, 0xb7, 0x39, 0x1b,
+ 0x9a, 0xaa, 0x50, 0xde, 0x9b, 0xe9, 0x32, 0x35,
+ 0x42, 0x7e, 0x51, 0x4f, 0x52, 0x2c, 0x28, 0x59,
+ },
+},
{
.patch_id = 0x0a20102d,
.digest = {
@@ -134,6 +188,15 @@
0x8c, 0xe9, 0x19, 0x3e, 0xcc, 0x3f, 0x7b, 0xb4,
},
},
+{
+ .patch_id = 0x0a20102e,
+ .digest = {
+ 0xbe, 0x1f, 0x32, 0x04, 0x0d, 0x3c, 0x9c, 0xdd,
+ 0xe1, 0xa4, 0xbf, 0x76, 0x3a, 0xec, 0xc2, 0xf6,
+ 0x11, 0x00, 0xa7, 0xaf, 0x0f, 0xe5, 0x02, 0xc5,
+ 0x54, 0x3a, 0x1f, 0x8c, 0x16, 0xb5, 0xff, 0xbe,
+ },
+},
{
.patch_id = 0x0a201210,
.digest = {
@@ -143,6 +206,15 @@
0xf7, 0x55, 0xf0, 0x13, 0xbb, 0x22, 0xf6, 0x41,
},
},
+{
+ .patch_id = 0x0a201211,
+ .digest = {
+ 0x69, 0xa1, 0x17, 0xec, 0xd0, 0xf6, 0x6c, 0x95,
+ 0xe2, 0x1e, 0xc5, 0x59, 0x1a, 0x52, 0x0a, 0x27,
+ 0xc4, 0xed, 0xd5, 0x59, 0x1f, 0xbf, 0x00, 0xff,
+ 0x08, 0x88, 0xb5, 0xe1, 0x12, 0xb6, 0xcc, 0x27,
+ },
+},
{
.patch_id = 0x0a404107,
.digest = {
@@ -152,6 +224,15 @@
0x13, 0xbc, 0xc5, 0x25, 0xe4, 0xc5, 0xc3, 0x99,
},
},
+{
+ .patch_id = 0x0a404108,
+ .digest = {
+ 0x69, 0x67, 0x43, 0x06, 0xf8, 0x0c, 0x62, 0xdc,
+ 0xa4, 0x21, 0x30, 0x4f, 0x0f, 0x21, 0x2c, 0xcb,
+ 0xcc, 0x37, 0xf1, 0x1c, 0xc3, 0xf8, 0x2f, 0x19,
+ 0xdf, 0x53, 0x53, 0x46, 0xb1, 0x15, 0xea, 0x00,
+ },
+},
{
.patch_id = 0x0a500011,
.digest = {
@@ -161,6 +242,15 @@
0x11, 0x5e, 0x96, 0x7e, 0x71, 0xe9, 0xfc, 0x74,
},
},
+{
+ .patch_id = 0x0a500012,
+ .digest = {
+ 0xeb, 0x74, 0x0d, 0x47, 0xa1, 0x8e, 0x09, 0xe4,
+ 0x93, 0x4c, 0xad, 0x03, 0x32, 0x4c, 0x38, 0x16,
+ 0x10, 0x39, 0xdd, 0x06, 0xaa, 0xce, 0xd6, 0x0f,
+ 0x62, 0x83, 0x9d, 0x8e, 0x64, 0x55, 0xbe, 0x63,
+ },
+},
{
.patch_id = 0x0a601209,
.digest = {
@@ -170,6 +260,15 @@
0xe8, 0x73, 0xe2, 0xd6, 0xdb, 0xd2, 0x77, 0x1d,
},
},
+{
+ .patch_id = 0x0a60120a,
+ .digest = {
+ 0x0c, 0x8b, 0x3d, 0xfd, 0x52, 0x52, 0x85, 0x7d,
+ 0x20, 0x3a, 0xe1, 0x7e, 0xa4, 0x21, 0x3b, 0x7b,
+ 0x17, 0x86, 0xae, 0xac, 0x13, 0xb8, 0x63, 0x9d,
+ 0x06, 0x01, 0xd0, 0xa0, 0x51, 0x9a, 0x91, 0x2c,
+ },
+},
{
.patch_id = 0x0a704107,
.digest = {
@@ -179,6 +278,15 @@
0x64, 0x39, 0x71, 0x8c, 0xce, 0xe7, 0x41, 0x39,
},
},
+{
+ .patch_id = 0x0a704108,
+ .digest = {
+ 0xd7, 0x55, 0x15, 0x2b, 0xfe, 0xc4, 0xbc, 0x93,
+ 0xec, 0x91, 0xa0, 0xae, 0x45, 0xb7, 0xc3, 0x98,
+ 0x4e, 0xff, 0x61, 0x77, 0x88, 0xc2, 0x70, 0x49,
+ 0xe0, 0x3a, 0x1d, 0x84, 0x38, 0x52, 0xbf, 0x5a,
+ },
+},
{
.patch_id = 0x0a705206,
.digest = {
@@ -188,6 +296,15 @@
0x03, 0x35, 0xe9, 0xbe, 0xfb, 0x06, 0xdf, 0xfc,
},
},
+{
+ .patch_id = 0x0a705208,
+ .digest = {
+ 0x30, 0x1d, 0x55, 0x24, 0xbc, 0x6b, 0x5a, 0x19,
+ 0x0c, 0x7d, 0x1d, 0x74, 0xaa, 0xd1, 0xeb, 0xd2,
+ 0x16, 0x62, 0xf7, 0x5b, 0xe1, 0x1f, 0x18, 0x11,
+ 0x5c, 0xf0, 0x94, 0x90, 0x26, 0xec, 0x69, 0xff,
+ },
+},
{
.patch_id = 0x0a708007,
.digest = {
@@ -197,6 +314,15 @@
0xdf, 0x92, 0x73, 0x84, 0x87, 0x3c, 0x73, 0x93,
},
},
+{
+ .patch_id = 0x0a708008,
+ .digest = {
+ 0x08, 0x6e, 0xf0, 0x22, 0x4b, 0x8e, 0xc4, 0x46,
+ 0x58, 0x34, 0xe6, 0x47, 0xa2, 0x28, 0xfd, 0xab,
+ 0x22, 0x3d, 0xdd, 0xd8, 0x52, 0x9e, 0x1d, 0x16,
+ 0xfa, 0x01, 0x68, 0x14, 0x79, 0x3e, 0xe8, 0x6b,
+ },
+},
{
.patch_id = 0x0a70c005,
.digest = {
@@ -206,6 +332,15 @@
0xee, 0x49, 0xac, 0xe1, 0x8b, 0x13, 0xc5, 0x13,
},
},
+{
+ .patch_id = 0x0a70c008,
+ .digest = {
+ 0x0f, 0xdb, 0x37, 0xa1, 0x10, 0xaf, 0xd4, 0x21,
+ 0x94, 0x0d, 0xa4, 0xa2, 0xe9, 0x86, 0x6c, 0x0e,
+ 0x85, 0x7c, 0x36, 0x30, 0xa3, 0x3a, 0x78, 0x66,
+ 0x18, 0x10, 0x60, 0x0d, 0x78, 0x3d, 0x44, 0xd0,
+ },
+},
{
.patch_id = 0x0aa00116,
.digest = {
@@ -224,3 +359,12 @@
0x68, 0x2f, 0x46, 0xee, 0xfe, 0xc6, 0x6d, 0xef,
},
},
+{
+ .patch_id = 0x0aa00216,
+ .digest = {
+ 0x79, 0xfb, 0x5b, 0x9f, 0xb6, 0xe6, 0xa8, 0xf5,
+ 0x4e, 0x7c, 0x4f, 0x8e, 0x1d, 0xad, 0xd0, 0x08,
+ 0xc2, 0x43, 0x7c, 0x8b, 0xe6, 0xdb, 0xd0, 0xd2,
+ 0xe8, 0x39, 0x26, 0xc1, 0xe5, 0x5a, 0x48, 0xf1,
+ },
+},
diff --git a/xen/arch/x86/cpu/microcode/amd.c b/xen/arch/x86/cpu/microcode/amd.c
index 4f236e439929..f25d74fccba2 100644
--- a/xen/arch/x86/cpu/microcode/amd.c
+++ b/xen/arch/x86/cpu/microcode/amd.c
@@ -521,3 +521,18 @@ void __init ucode_probe_amd(struct microcode_ops *ops)
*ops = amd_ucode_ops;
}
+
+#if 0 /* Manual CONFIG_SELF_TESTS */
+static void __init __constructor test_digests_sorted(void)
+{
+ for ( unsigned int i = 1; i < ARRAY_SIZE(patch_digests); ++i )
+ {
+ if ( patch_digests[i - 1].patch_id < patch_digests[i].patch_id )
+ continue;
+
+ panic("patch_digests[] not sorted: %08x >= %08x\n",
+ patch_digests[i - 1].patch_id,
+ patch_digests[i].patch_id);
+ }
+}
+#endif /* CONFIG_SELF_TESTS */
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Wed, 2 Apr 2025 03:18:59 +0100
Subject: x86/idle: Rearrange VERW and MONITOR in mwait_idle_with_hints()
In order to mitigate TSA, Xen will need to issue VERW before going idle.
On AMD CPUs, the VERW scrubbing side effects cancel an active MONITOR, causing
the MWAIT to exit without entering an idle state. Therefore the VERW must be
ahead of MONITOR.
Split spec_ctrl_enter_idle() in two and allow the VERW aspect to be handled
separately. While adjusting, update a stale comment concerning MSBDS; more
issues have been mitigated using VERW since it was written.
By moving VERW earlier, it is ahead of the determination of whether to go
idle. We can't move the check on softirq_pending (for correctness reasons),
but we can duplicate it earlier as a best effort attempt to skip the
speculative overhead.
This is part of XSA-471 / CVE-2024-36350 / CVE-2024-36357.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
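For orientation, the ordering this patch establishes in
mwait_idle_with_hints() can be condensed as follows. This is an
illustrative sketch, not a verbatim extract; the in_mwait store and the
alternatives machinery are shown as comments, and the names follow the
diff below.

    if ( *this_softirq_pending )        /* heuristic early exit; perf only  */
        return;

    /* advertise in_mwait (alternative_io in the real code) */

    __spec_ctrl_enter_idle_verw(info);  /* VERW must precede MONITOR on AMD */

    monitor(this_softirq_pending, 0, 0);

    ASSERT(!local_irq_is_enabled());

    if ( !*this_softirq_pending )       /* the correctness check, IRQs off  */
    {
        __spec_ctrl_enter_idle(info, false /* VERW handled above */);
        mwait(eax, ecx);
        ...
    }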
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 69857a58ef5a..1045d87eed12 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -462,9 +462,18 @@ __initcall(cpu_idle_key_init);
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
+ struct cpu_info *info = get_cpu_info();
irq_cpustat_t *stat = &irq_stat[cpu];
const unsigned int *this_softirq_pending = &stat->__softirq_pending;
+ /*
+ * Heuristic: if we're definitely not going to idle, bail early as the
+ * speculative safety can be expensive. This is a performance
+ * consideration not a correctness issue.
+ */
+ if ( *this_softirq_pending )
+ return;
+
/*
* By setting in_mwait, we promise to other CPUs that we'll notice changes
* to __softirq_pending without being sent an IPI. We achieve this by
@@ -478,15 +487,19 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
"", X86_BUG_MONITOR,
[in_mwait] "=m" (stat->in_mwait));
+ /*
+ * On AMD systems, side effects from VERW cancel MONITOR, causing MWAIT to
+ * wake up immediately. Therefore, VERW must come ahead of MONITOR.
+ */
+ __spec_ctrl_enter_idle_verw(info);
+
monitor(this_softirq_pending, 0, 0);
ASSERT(!local_irq_is_enabled());
if ( !*this_softirq_pending )
{
- struct cpu_info *info = get_cpu_info();
-
- spec_ctrl_enter_idle(info);
+ __spec_ctrl_enter_idle(info, false /* VERW handled above */);
if ( ecx & MWAIT_ECX_INTERRUPT_BREAK )
mwait(eax, ecx);
diff --git a/xen/arch/x86/include/asm/spec_ctrl.h b/xen/arch/x86/include/asm/spec_ctrl.h
index 077225418956..6724d3812029 100644
--- a/xen/arch/x86/include/asm/spec_ctrl.h
+++ b/xen/arch/x86/include/asm/spec_ctrl.h
@@ -115,8 +115,22 @@ static inline void init_shadow_spec_ctrl_state(void)
info->verw_sel = __HYPERVISOR_DS32;
}
+static always_inline void __spec_ctrl_enter_idle_verw(struct cpu_info *info)
+{
+ /*
+ * Flush/scrub structures which are statically partitioned between active
+ * threads. Otherwise data of ours (of unknown sensitivity) will become
+ * available to our sibling when we go idle.
+ *
+ * Note: VERW must be encoded with a memory operand, as it is only that
+ * form with side effects.
+ */
+ alternative_input("", "verw %[sel]", X86_FEATURE_SC_VERW_IDLE,
+ [sel] "m" (info->verw_sel));
+}
+
/* WARNING! `ret`, `call *`, `jmp *` not safe after this call. */
-static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
+static always_inline void __spec_ctrl_enter_idle(struct cpu_info *info, bool verw)
{
uint32_t val = 0;
@@ -135,21 +149,8 @@ static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
"a" (val), "c" (MSR_SPEC_CTRL), "d" (0));
barrier();
- /*
- * Microarchitectural Store Buffer Data Sampling:
- *
- * On vulnerable systems, store buffer entries are statically partitioned
- * between active threads. When entering idle, our store buffer entries
- * are re-partitioned to allow the other threads to use them.
- *
- * Flush the buffers to ensure that no sensitive data of ours can be
- * leaked by a sibling after it gets our store buffer entries.
- *
- * Note: VERW must be encoded with a memory operand, as it is only that
- * form which causes a flush.
- */
- alternative_input("", "verw %[sel]", X86_FEATURE_SC_VERW_IDLE,
- [sel] "m" (info->verw_sel));
+ if ( verw ) /* Expected to be const-propagated. */
+ __spec_ctrl_enter_idle_verw(info);
/*
* Cross-Thread Return Address Predictions:
@@ -167,6 +168,12 @@ static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
: "rax", "rcx");
}
+/* WARNING! `ret`, `call *`, `jmp *` not safe after this call. */
+static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
+{
+ __spec_ctrl_enter_idle(info, true /* VERW */);
+}
+
/* WARNING! `ret`, `call *`, `jmp *` not safe before this call. */
static always_inline void spec_ctrl_exit_idle(struct cpu_info *info)
{
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Thu, 29 Aug 2024 17:36:11 +0100
Subject: x86/spec-ctrl: Mitigate Transitive Scheduler Attacks
TSA affects AMD Fam19h CPUs (Zen3 and 4 microarchitectures).
Three new CPUID bits have been defined. Two (TSA_SQ_NO and TSA_L1_NO)
indicate that the system is unaffected, and must be synthesised by Xen on
unaffected parts to date.
A third new bit indicates that VERW now has a flushing side effect. Xen
must synthesise this bit on affected systems based on microcode version.
As with other VERW-based flushing features, VERW_CLEAR needs OR-ing across
a resource pool, and guests which have seen it can safely migrate in.
This is part of XSA-471 / CVE-2024-36350 / CVE-2024-36357.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
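For reference, the new enumerations land in CPUID leaf 0x80000021:
VERW_CLEAR is bit 5 of %eax (word 11), while TSA_SQ_NO and TSA_L1_NO are
bits 1 and 2 of %ecx (word 18), matching the cpufeatureset additions in
the diff below. A minimal userspace sketch to inspect them (illustrative
only, using GCC/Clang's <cpuid.h>; inside a guest this reflects the CPU
policy rather than raw hardware):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

        if ( !__get_cpuid(0x80000021U, &eax, &ebx, &ecx, &edx) )
        {
            puts("CPUID leaf 0x80000021 not available");
            return 1;
        }

        printf("VERW_CLEAR: %u\n", (eax >> 5) & 1);
        printf("TSA_SQ_NO:  %u\n", (ecx >> 1) & 1);
        printf("TSA_L1_NO:  %u\n", (ecx >> 2) & 1);

        return 0;
    }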
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index f40b25c91681..c594f05ea9b2 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -477,6 +477,17 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
if ( test_bit(X86_FEATURE_RTM, fs) )
__set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
break;
+
+ case X86_VENDOR_AMD:
+ /*
+ * This bit indicates that the VERW instruction may have gained
+ * scrubbing side effects. With pooling, it means "you might migrate
+ * somewhere where scrubbing is necessary", and may need exposing on
+ * unaffected hardware. This is fine, because the VERW instruction
+ * has been around since the 286.
+ */
+ __set_bit(X86_FEATURE_VERW_CLEAR, fs);
+ break;
}
/*
@@ -555,6 +566,17 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
__set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
}
break;
+
+ case X86_VENDOR_AMD:
+ /*
+ * This bit indicates that the VERW instruction may have gained
+ * scrubbing side effects. The max policy has it set for migration
+ * reasons, so reset the default policy back to the host value in case
+ * we're unaffected.
+ */
+ if ( !cpu_has_verw_clear )
+ __clear_bit(X86_FEATURE_VERW_CLEAR, fs);
+ break;
}
/*
diff --git a/xen/arch/x86/hvm/svm/entry.S b/xen/arch/x86/hvm/svm/entry.S
index 91edb3345938..610c64bf4c97 100644
--- a/xen/arch/x86/hvm/svm/entry.S
+++ b/xen/arch/x86/hvm/svm/entry.S
@@ -99,6 +99,8 @@ __UNLIKELY_END(nsvm_hap)
pop %rsi
pop %rdi
+ SPEC_CTRL_COND_VERW /* Req: %rsp=eframe Clob: efl */
+
vmrun
SAVE_ALL
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index 5e1090a5470b..ad50e5356a49 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -196,6 +196,7 @@ static inline bool boot_cpu_has(unsigned int feat)
/* CPUID level 0x80000021.eax */
#define cpu_has_lfence_dispatch boot_cpu_has(X86_FEATURE_LFENCE_DISPATCH)
+#define cpu_has_verw_clear boot_cpu_has(X86_FEATURE_VERW_CLEAR)
#define cpu_has_nscb boot_cpu_has(X86_FEATURE_NSCB)
/* CPUID level 0x00000007:1.edx */
@@ -223,6 +224,10 @@ static inline bool boot_cpu_has(unsigned int feat)
#define cpu_has_pb_opt_ctrl boot_cpu_has(X86_FEATURE_PB_OPT_CTRL)
#define cpu_has_its_no boot_cpu_has(X86_FEATURE_ITS_NO)
+/* CPUID level 0x80000021.ecx */
+#define cpu_has_tsa_sq_no boot_cpu_has(X86_FEATURE_TSA_SQ_NO)
+#define cpu_has_tsa_l1_no boot_cpu_has(X86_FEATURE_TSA_L1_NO)
+
/* Synthesized. */
#define cpu_has_arch_perfmon boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
#define cpu_has_cpuid_faulting boot_cpu_has(X86_FEATURE_CPUID_FAULTING)
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index fa444caabb09..ef198c221139 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -492,7 +492,7 @@ custom_param("pv-l1tf", parse_pv_l1tf);
static void __init print_details(enum ind_thunk thunk)
{
- unsigned int _7d0 = 0, _7d2 = 0, e8b = 0, e21a = 0, max = 0, tmp;
+ unsigned int _7d0 = 0, _7d2 = 0, e8b = 0, e21a = 0, e21c = 0, max = 0, tmp;
uint64_t caps = 0;
/* Collect diagnostics about available mitigations. */
@@ -503,7 +503,7 @@ static void __init print_details(enum ind_thunk thunk)
if ( boot_cpu_data.extended_cpuid_level >= 0x80000008U )
cpuid(0x80000008U, &tmp, &e8b, &tmp, &tmp);
if ( boot_cpu_data.extended_cpuid_level >= 0x80000021U )
- cpuid(0x80000021U, &e21a, &tmp, &tmp, &tmp);
+ cpuid(0x80000021U, &e21a, &tmp, &e21c, &tmp);
if ( cpu_has_arch_caps )
rdmsrl(MSR_ARCH_CAPABILITIES, caps);
@@ -513,7 +513,7 @@ static void __init print_details(enum ind_thunk thunk)
* Hardware read-only information, stating immunity to certain issues, or
* suggestions of which mitigation to use.
*/
- printk(" Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+ printk(" Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
(caps & ARCH_CAPS_RDCL_NO) ? " RDCL_NO" : "",
(caps & ARCH_CAPS_EIBRS) ? " EIBRS" : "",
(caps & ARCH_CAPS_RSBA) ? " RSBA" : "",
@@ -538,10 +538,12 @@ static void __init print_details(enum ind_thunk thunk)
(e8b & cpufeat_mask(X86_FEATURE_IBPB_RET)) ? " IBPB_RET" : "",
(e21a & cpufeat_mask(X86_FEATURE_IBPB_BRTYPE)) ? " IBPB_BRTYPE" : "",
(e21a & cpufeat_mask(X86_FEATURE_SRSO_NO)) ? " SRSO_NO" : "",
- (e21a & cpufeat_mask(X86_FEATURE_SRSO_US_NO)) ? " SRSO_US_NO" : "");
+ (e21a & cpufeat_mask(X86_FEATURE_SRSO_US_NO)) ? " SRSO_US_NO" : "",
+ (e21c & cpufeat_mask(X86_FEATURE_TSA_SQ_NO)) ? " TSA_SQ_NO" : "",
+ (e21c & cpufeat_mask(X86_FEATURE_TSA_L1_NO)) ? " TSA_L1_NO" : "");
/* Hardware features which need driving to mitigate issues. */
- printk(" Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+ printk(" Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
(e8b & cpufeat_mask(X86_FEATURE_IBPB)) ||
(_7d0 & cpufeat_mask(X86_FEATURE_IBRSB)) ? " IBPB" : "",
(e8b & cpufeat_mask(X86_FEATURE_IBRS)) ||
@@ -561,7 +563,8 @@ static void __init print_details(enum ind_thunk thunk)
(caps & ARCH_CAPS_GDS_CTRL) ? " GDS_CTRL" : "",
(caps & ARCH_CAPS_RFDS_CLEAR) ? " RFDS_CLEAR" : "",
(e21a & cpufeat_mask(X86_FEATURE_SBPB)) ? " SBPB" : "",
- (e21a & cpufeat_mask(X86_FEATURE_SRSO_MSR_FIX)) ? " SRSO_MSR_FIX" : "");
+ (e21a & cpufeat_mask(X86_FEATURE_SRSO_MSR_FIX)) ? " SRSO_MSR_FIX" : "",
+ (e21a & cpufeat_mask(X86_FEATURE_VERW_CLEAR)) ? " VERW_CLEAR" : "");
/* Compiled-in support which pertains to mitigations. */
if ( IS_ENABLED(CONFIG_INDIRECT_THUNK) || IS_ENABLED(CONFIG_SHADOW_PAGING) ||
@@ -1545,6 +1548,77 @@ static void __init rfds_calculations(void)
setup_force_cpu_cap(X86_FEATURE_RFDS_NO);
}
+/*
+ * Transient Scheduler Attacks
+ *
+ * https://www.amd.com/content/dam/amd/en/documents/resources/bulletin/technical-guidance-for-mitigating-transient-scheduler-attacks.pdf
+ */
+static void __init tsa_calculations(void)
+{
+ unsigned int curr_rev, min_rev;
+
+ /* TSA is only known to affect AMD processors at this time. */
+ if ( boot_cpu_data.x86_vendor != X86_VENDOR_AMD )
+ return;
+
+ /* If we're virtualised, don't attempt to synthesise anything. */
+ if ( cpu_has_hypervisor )
+ return;
+
+ /*
+ * According to the whitepaper, some Fam1A CPUs (Models 0x00...0x4f,
+ * 0x60...0x7f) are not vulnerable but don't enumerate TSA_{SQ,L1}_NO. If
+ * we see either enumerated, assume both are correct ...
+ */
+ if ( cpu_has_tsa_sq_no || cpu_has_tsa_l1_no )
+ return;
+
+ /*
+ * ... otherwise, synthesise them. CPUs other than Fam19 (Zen3/4) are
+ * stated to be not vulnerable.
+ */
+ if ( boot_cpu_data.x86 != 0x19 )
+ {
+ setup_force_cpu_cap(X86_FEATURE_TSA_SQ_NO);
+ setup_force_cpu_cap(X86_FEATURE_TSA_L1_NO);
+ return;
+ }
+
+ /*
+ * Fam19 CPUs get VERW_CLEAR with new enough microcode, but must
+ * synthesise the CPUID bit.
+ */
+ curr_rev = this_cpu(cpu_sig).rev;
+ switch ( curr_rev >> 8 )
+ {
+ case 0x0a0011: min_rev = 0x0a0011d7; break;
+ case 0x0a0012: min_rev = 0x0a00123b; break;
+ case 0x0a0082: min_rev = 0x0a00820d; break;
+ case 0x0a1011: min_rev = 0x0a10114c; break;
+ case 0x0a1012: min_rev = 0x0a10124c; break;
+ case 0x0a1081: min_rev = 0x0a108109; break;
+ case 0x0a2010: min_rev = 0x0a20102e; break;
+ case 0x0a2012: min_rev = 0x0a201211; break;
+ case 0x0a4041: min_rev = 0x0a404108; break;
+ case 0x0a5000: min_rev = 0x0a500012; break;
+ case 0x0a6012: min_rev = 0x0a60120a; break;
+ case 0x0a7041: min_rev = 0x0a704108; break;
+ case 0x0a7052: min_rev = 0x0a705208; break;
+ case 0x0a7080: min_rev = 0x0a708008; break;
+ case 0x0a70c0: min_rev = 0x0a70c008; break;
+ case 0x0aa002: min_rev = 0x0aa00216; break;
+ default:
+ printk(XENLOG_WARNING
+ "Unrecognised CPU %02x-%02x-%02x, ucode 0x%08x for TSA mitigation\n",
+ boot_cpu_data.x86, boot_cpu_data.x86_model,
+ boot_cpu_data.x86_mask, curr_rev);
+ return;
+ }
+
+ if ( curr_rev >= min_rev )
+ setup_force_cpu_cap(X86_FEATURE_VERW_CLEAR);
+}
+
static bool __init cpu_has_gds(void)
{
/*
@@ -2238,6 +2312,7 @@ void __init init_speculation_mitigations(void)
* https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-analysis-microarchitectural-data-sampling.html
* https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
* https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/register-file-data-sampling.html
+ * https://www.amd.com/content/dam/amd/en/documents/resources/bulletin/technical-guidance-for-mitigating-transient-scheduler-attacks.pdf
*
* Relevant ucodes:
*
@@ -2270,9 +2345,18 @@ void __init init_speculation_mitigations(void)
*
* - March 2023, for RFDS. Enumerate RFDS_CLEAR to mean that VERW now
* scrubs non-architectural entries from certain register files.
+ *
+ * - July 2025, for TSA. Introduces VERW side effects to mitigate
+ * TSA_{SQ/L1}. Xen must synthesise the VERW_CLEAR feature based on
+ * microcode version.
+ *
+ * Note, these microcode updates were produced before the remediation of
+ * the microcode signature issues, and are included in the firmware
+ * updates fixing the entrysign vulnerability from ~December 2024.
*/
mds_calculations();
rfds_calculations();
+ tsa_calculations();
/*
* Parts which enumerate FB_CLEAR are those with now-updated microcode
@@ -2304,21 +2388,27 @@ void __init init_speculation_mitigations(void)
* MLPDS/MFBDS when SMT is enabled.
*/
if ( opt_verw_pv == -1 )
- opt_verw_pv = cpu_has_useful_md_clear || cpu_has_rfds_clear;
+ opt_verw_pv = (cpu_has_useful_md_clear || cpu_has_rfds_clear ||
+ cpu_has_verw_clear);
if ( opt_verw_hvm == -1 )
- opt_verw_hvm = cpu_has_useful_md_clear || cpu_has_rfds_clear;
+ opt_verw_hvm = (cpu_has_useful_md_clear || cpu_has_rfds_clear ||
+ cpu_has_verw_clear);
/*
- * If SMT is active, and we're protecting against MDS or MMIO stale data,
+ * If SMT is active, and we're protecting against any of:
+ * - MSBDS
+ * - MMIO stale data
+ * - TSA-SQ
* we need to scrub before going idle as well as on return to guest.
* Various pipeline resources are repartitioned amongst non-idle threads.
*
- * We don't need to scrub on idle for RFDS. There are no affected cores
- * which support SMT, despite there being affected cores in hybrid systems
- * which have SMT elsewhere in the platform.
+ * We don't need to scrub on idle for:
+ * - RFDS (no SMT affected cores)
+ * - TSA-L1 (utags never shared between threads)
*/
if ( ((cpu_has_useful_md_clear && (opt_verw_pv || opt_verw_hvm)) ||
+ (cpu_has_verw_clear && !cpu_has_tsa_sq_no) ||
opt_verw_mmio) && hw_smt_enabled )
setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index 03acd49387aa..4ea6d95c7ac6 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -302,6 +302,7 @@ XEN_CPUFEATURE(AVX_IFMA, 10*32+23) /*A AVX-IFMA Instructions */
XEN_CPUFEATURE(NO_NEST_BP, 11*32+ 0) /*A No Nested Data Breakpoints */
XEN_CPUFEATURE(FS_GS_NS, 11*32+ 1) /*S| FS/GS base MSRs non-serialising */
XEN_CPUFEATURE(LFENCE_DISPATCH, 11*32+ 2) /*A LFENCE always serializing */
+XEN_CPUFEATURE(VERW_CLEAR, 11*32+ 5) /*!A| VERW clears microarchitectural buffers */
XEN_CPUFEATURE(NSCB, 11*32+ 6) /*A Null Selector Clears Base (and limit too) */
XEN_CPUFEATURE(AUTO_IBRS, 11*32+ 8) /*S Automatic IBRS */
XEN_CPUFEATURE(AMD_FSRS, 11*32+10) /*A Fast Short REP STOSB */
@@ -370,6 +371,8 @@ XEN_CPUFEATURE(PB_OPT_CTRL, 16*32+32) /* MSR_PB_OPT_CTRL.IBPB_ALT */
XEN_CPUFEATURE(ITS_NO, 16*32+62) /*!A No Indirect Target Selection */
/* AMD-defined CPU features, CPUID level 0x80000021.ecx, word 18 */
+XEN_CPUFEATURE(TSA_SQ_NO, 18*32+ 1) /*A No Store Queue Transitive Scheduler Attacks */
+XEN_CPUFEATURE(TSA_L1_NO, 18*32+ 2) /*A No L1D Transitive Scheduler Attacks */
#endif /* XEN_CPUFEATURE */
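As a worked example of the tsa_calculations() table above (illustrative
only): the switch keys on the revision with its low byte dropped, so a
Fam19h part reporting microcode 0x0a201210 matches the 0x0a2012 row,
whose minimum is 0x0a201211. VERW_CLEAR is therefore only synthesised
once the 0x0a201211 (or later) update is loaded. The same comparison,
restated as a standalone helper:

    #include <stdbool.h>

    /* Illustrative restatement of the check in tsa_calculations(). */
    static bool tsa_ucode_ok(unsigned int curr_rev, unsigned int min_rev)
    {
        /* Table rows are keyed on the revision with the low byte dropped. */
        if ( (curr_rev >> 8) != (min_rev >> 8) )
            return false;               /* different family/model/stepping  */

        return curr_rev >= min_rev;     /* new enough to gain VERW_CLEAR    */
    }

    /* e.g. tsa_ucode_ok(0x0a201210, 0x0a201211) == false
     *      tsa_ucode_ok(0x0a201211, 0x0a201211) == true */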
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Apr 2025 14:59:01 +0100
Subject: x86/idle: Move monitor()/mwait() wrappers into cpu-idle.c
They're not used by any other translation unit, so shouldn't live in
asm/processor.h, which is included almost everywhere.
Our new toolchain baseline knows the MONITOR/MWAIT instructions, so use them
directly rather than using raw hex.
Change the hint/extension parameters from long to int. They're specified to
remain 32-bit operands even in 64-bit mode.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 61e10fc28ccddff7c72c14acec56dc7ef2b155d1)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 1dbf15b01ed7..40af42a18fb8 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -60,6 +60,19 @@
/*#define DEBUG_PM_CX*/
+static always_inline void monitor(
+ const void *addr, unsigned int ecx, unsigned int edx)
+{
+ asm volatile ( "monitor"
+ :: "a" (addr), "c" (ecx), "d" (edx) );
+}
+
+static always_inline void mwait(unsigned int eax, unsigned int ecx)
+{
+ asm volatile ( "mwait"
+ :: "a" (eax), "c" (ecx) );
+}
+
#define GET_HW_RES_IN_NS(msr, val) \
do { rdmsrl(msr, val); val = tsc_ticks2ns(val); } while( 0 )
#define GET_MC6_RES(val) GET_HW_RES_IN_NS(0x664, val)
@@ -470,7 +483,7 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
mb();
}
- __monitor(monitor_addr, 0, 0);
+ monitor(monitor_addr, 0, 0);
smp_mb();
/*
@@ -484,7 +497,7 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
cpumask_set_cpu(cpu, &cpuidle_mwait_flags);
spec_ctrl_enter_idle(info);
- __mwait(eax, ecx);
+ mwait(eax, ecx);
spec_ctrl_exit_idle(info);
cpumask_clear_cpu(cpu, &cpuidle_mwait_flags);
@@ -915,9 +928,9 @@ void cf_check acpi_dead_idle(void)
*/
mb();
clflush(mwait_ptr);
- __monitor(mwait_ptr, 0, 0);
+ monitor(mwait_ptr, 0, 0);
mb();
- __mwait(cx->address, 0);
+ mwait(cx->address, 0);
}
}
else if ( (current_cpu_data.x86_vendor &
diff --git a/xen/arch/x86/include/asm/processor.h b/xen/arch/x86/include/asm/processor.h
index c3cc527f2e73..1aec6691c9ff 100644
--- a/xen/arch/x86/include/asm/processor.h
+++ b/xen/arch/x86/include/asm/processor.h
@@ -315,23 +315,6 @@ static always_inline void set_in_cr4 (unsigned long mask)
cr4_pv32_mask |= (mask & XEN_CR4_PV32_BITS);
}
-static always_inline void __monitor(const void *eax, unsigned long ecx,
- unsigned long edx)
-{
- /* "monitor %eax,%ecx,%edx;" */
- asm volatile (
- ".byte 0x0f,0x01,0xc8;"
- : : "a" (eax), "c" (ecx), "d"(edx) );
-}
-
-static always_inline void __mwait(unsigned long eax, unsigned long ecx)
-{
- /* "mwait %eax,%ecx;" */
- asm volatile (
- ".byte 0x0f,0x01,0xc9;"
- : : "a" (eax), "c" (ecx) );
-}
-
#define IOBMP_BYTES 8192
#define IOBMP_INVALID_OFFSET 0x8000
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Apr 2025 15:55:29 +0100
Subject: x86/idle: Remove MFENCEs for CLFLUSH_MONITOR
Commit 48d32458bcd4 ("x86, idle: add barriers to CLFLUSH workaround") was
inherited from Linux and added MFENCEs around the AAI65 errata fix.
The SDM now states:
Executions of the CLFLUSH instruction are ordered with respect to each
other and with respect to writes, locked read-modify-write instructions,
and fence instructions[1].
with footnote 1 reading:
Earlier versions of this manual specified that executions of the CLFLUSH
instruction were ordered only by the MFENCE instruction. All processors
implementing the CLFLUSH instruction also order it relative to the other
operations enumerated above.
I.e. the MFENCEs came about because of an incorrect statement in the SDM.
The Spec Update (no longer available on Intel's website) simply says "issue a
CLFLUSH", with no mention of MFENCEs.
As this erratum is specific to Intel, it's fine to remove the MFENCEs; AMD
CPUs of a similar vintage do sport otherwise-unordered CLFLUSHs.
Move the feature bit into the BUG range (rather than FEATURE), and move the
workaround into monitor() itself.
The erratum check itself must use setup_force_cpu_cap(). It needs activating
if any CPU needs it, not if all of them need it.
Fixes: 48d32458bcd4 ("x86, idle: add barriers to CLFLUSH workaround")
Fixes: 96d1b237ae9b ("x86/Intel: work around Xeon 7400 series erratum AAI65")
Link: https://web.archive.org/web/20090219054841/http://download.intel.com/design/xeon/specupdt/32033601.pdf
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f77ef3443542a2c2bbd59ee66178287d4fa5b43f)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 40af42a18fb8..e9493f7f577f 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -63,6 +63,9 @@
static always_inline void monitor(
const void *addr, unsigned int ecx, unsigned int edx)
{
+ alternative_input("", "clflush (%[addr])", X86_BUG_CLFLUSH_MONITOR,
+ [addr] "a" (addr));
+
asm volatile ( "monitor"
:: "a" (addr), "c" (ecx), "d" (edx) );
}
@@ -476,13 +479,6 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
s_time_t expires = per_cpu(timer_deadline, cpu);
const void *monitor_addr = &mwait_wakeup(cpu);
- if ( boot_cpu_has(X86_FEATURE_CLFLUSH_MONITOR) )
- {
- mb();
- clflush(monitor_addr);
- mb();
- }
-
monitor(monitor_addr, 0, 0);
smp_mb();
@@ -917,19 +913,7 @@ void cf_check acpi_dead_idle(void)
while ( 1 )
{
- /*
- * 1. The CLFLUSH is a workaround for erratum AAI65 for
- * the Xeon 7400 series.
- * 2. The WBINVD is insufficient due to the spurious-wakeup
- * case where we return around the loop.
- * 3. Unlike wbinvd, clflush is a light weight but not serializing
- * instruction, hence memory fence is necessary to make sure all
- * load/store visible before flush cache line.
- */
- mb();
- clflush(mwait_ptr);
monitor(mwait_ptr, 0, 0);
- mb();
mwait(cx->address, 0);
}
}
diff --git a/xen/arch/x86/cpu/intel.c b/xen/arch/x86/cpu/intel.c
index 7eaa20ece18c..9f8115008b67 100644
--- a/xen/arch/x86/cpu/intel.c
+++ b/xen/arch/x86/cpu/intel.c
@@ -446,6 +446,7 @@ static void __init probe_mwait_errata(void)
*
* Xeon 7400 erratum AAI65 (and further newer Xeons)
* MONITOR/MWAIT may have excessive false wakeups
+ * https://web.archive.org/web/20090219054841/http://download.intel.com/design/xeon/specupdt/32033601.pdf
*/
static void Intel_errata_workarounds(struct cpuinfo_x86 *c)
{
@@ -463,7 +464,7 @@ static void Intel_errata_workarounds(struct cpuinfo_x86 *c)
if (c->x86 == 6 && cpu_has_clflush &&
(c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47))
- __set_bit(X86_FEATURE_CLFLUSH_MONITOR, c->x86_capability);
+ setup_force_cpu_cap(X86_BUG_CLFLUSH_MONITOR);
probe_c3_errata(c);
if (system_state < SYS_STATE_smp_boot)
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index 9e3ed21c026d..84c93292c80c 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -19,7 +19,7 @@ XEN_CPUFEATURE(ARCH_PERFMON, X86_SYNTH( 3)) /* Intel Architectural PerfMon
XEN_CPUFEATURE(TSC_RELIABLE, X86_SYNTH( 4)) /* TSC is known to be reliable */
XEN_CPUFEATURE(XTOPOLOGY, X86_SYNTH( 5)) /* cpu topology enum extensions */
XEN_CPUFEATURE(CPUID_FAULTING, X86_SYNTH( 6)) /* cpuid faulting */
-XEN_CPUFEATURE(CLFLUSH_MONITOR, X86_SYNTH( 7)) /* clflush reqd with monitor */
+/* Bit 7 unused */
XEN_CPUFEATURE(APERFMPERF, X86_SYNTH( 8)) /* APERFMPERF */
XEN_CPUFEATURE(MFENCE_RDTSC, X86_SYNTH( 9)) /* MFENCE synchronizes RDTSC */
XEN_CPUFEATURE(XEN_SMEP, X86_SYNTH(10)) /* SMEP gets used by Xen itself */
@@ -52,6 +52,7 @@ XEN_CPUFEATURE(USE_VMCALL, X86_SYNTH(30)) /* Use VMCALL instead of VMMCAL
#define X86_BUG_NULL_SEG X86_BUG( 1) /* NULL-ing a selector preserves the base and limit. */
#define X86_BUG_CLFLUSH_MFENCE X86_BUG( 2) /* MFENCE needed to serialise CLFLUSH */
#define X86_BUG_IBPB_NO_RET X86_BUG( 3) /* IBPB doesn't flush the RSB/RAS */
+#define X86_BUG_CLFLUSH_MONITOR X86_BUG( 4) /* MONITOR requires CLFLUSH */
#define X86_SPEC_NO_LFENCE_ENTRY_PV X86_BUG(16) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_PV. */
#define X86_SPEC_NO_LFENCE_ENTRY_INTR X86_BUG(17) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_INTR. */
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 24 Jun 2025 15:20:52 +0100
Subject: Revert part of "x86/mwait-idle: disable IBRS during long idle"
Most of the patch (handling of CPUIDLE_FLAG_IBRS) is fine, but the
adjustements to mwait_idle() are not; spec_ctrl_enter_idle() does more than
just alter MSR_SPEC_CTRL.IBRS.
The only reason this doesn't need an XSA is because the unconditional
spec_ctrl_{enter,exit}_idle() in mwait_idle_with_hints() were left unaltered,
and thus the MWAIT remained properly protected.
There (would have been) two problems. In the ibrs_disable (== deep C) case:
* On entry, VERW and RSB-stuffing are architecturally skipped.
* On exit, there's a branch crossing the WRMSR which reinstates the
speculative safety for indirect branches.
All this change did was double up the expensive operations in the deep C case,
and fail to optimise the intended case.
I have an idea of how to plumb this more nicely, but it requires larger
changes to legacy IBRS handling to not make spec_ctrl_enter_idle() vulnerable
in other ways. In the short term, simply take out the perf hit.
Fixes: 08acdf9a2615 ("x86/mwait-idle: disable IBRS during long idle")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 07d7163334a7507d329958b19d976be769580999)
diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index 9c16cc166a14..5c16f5ad3a82 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -875,7 +875,6 @@ static const struct cpuidle_state snr_cstates[] = {
static void cf_check mwait_idle(void)
{
unsigned int cpu = smp_processor_id();
- struct cpu_info *info = get_cpu_info();
struct acpi_processor_power *power = processor_powers[cpu];
struct acpi_processor_cx *cx = NULL;
unsigned int next_state;
@@ -902,6 +901,8 @@ static void cf_check mwait_idle(void)
pm_idle_save();
else
{
+ struct cpu_info *info = get_cpu_info();
+
spec_ctrl_enter_idle(info);
safe_halt();
spec_ctrl_exit_idle(info);
@@ -928,11 +929,6 @@ static void cf_check mwait_idle(void)
if ((cx->type >= 3) && errata_c6_workaround())
cx = power->safe_state;
- if (cx->ibrs_disable) {
- ASSERT(!cx->irq_enable_early);
- spec_ctrl_enter_idle(info);
- }
-
#if 0 /* XXX Can we/do we need to do something similar on Xen? */
/*
* leave_mm() to avoid costly and often unnecessary wakeups
@@ -964,10 +960,6 @@ static void cf_check mwait_idle(void)
/* Now back in C0. */
update_idle_stats(power, cx, before, after);
-
- if (cx->ibrs_disable)
- spec_ctrl_exit_idle(info);
-
local_irq_enable();
TRACE_TIME(TRC_PM_IDLE_EXIT, cx->type, after,
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Fri, 27 Jun 2025 14:46:01 +0100
Subject: x86/cpu-policy: Simplify logic in
guest_common_default_feature_adjustments()
For features which are unconditionally set in the max policies, making the
default policy to match the host can be done with a conditional clear.
This is simpler than the unconditional clear, conditional set currently
performed.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 30f8fed68f3c2e63594ff9202b3d05b971781e36)
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index d696960b1887..c3aaac861d15 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -518,17 +518,14 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
* reasons, so reset the default policy back to the host values in
* case we're unaffected.
*/
- __clear_bit(X86_FEATURE_MD_CLEAR, fs);
- if ( cpu_has_md_clear )
- __set_bit(X86_FEATURE_MD_CLEAR, fs);
+ if ( !cpu_has_md_clear )
+ __clear_bit(X86_FEATURE_MD_CLEAR, fs);
- __clear_bit(X86_FEATURE_FB_CLEAR, fs);
- if ( cpu_has_fb_clear )
- __set_bit(X86_FEATURE_FB_CLEAR, fs);
+ if ( !cpu_has_fb_clear )
+ __clear_bit(X86_FEATURE_FB_CLEAR, fs);
- __clear_bit(X86_FEATURE_RFDS_CLEAR, fs);
- if ( cpu_has_rfds_clear )
- __set_bit(X86_FEATURE_RFDS_CLEAR, fs);
+ if ( !cpu_has_rfds_clear )
+ __clear_bit(X86_FEATURE_RFDS_CLEAR, fs);
/*
* The Gather Data Sampling microcode mitigation (August 2023) has an
@@ -548,13 +545,11 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
* Topology information is at the toolstack's discretion so these are
* unconditionally set in max, but pick a default which matches the host.
*/
- __clear_bit(X86_FEATURE_HTT, fs);
- if ( cpu_has_htt )
- __set_bit(X86_FEATURE_HTT, fs);
+ if ( !cpu_has_htt )
+ __clear_bit(X86_FEATURE_HTT, fs);
- __clear_bit(X86_FEATURE_CMP_LEGACY, fs);
- if ( cpu_has_cmp_legacy )
- __set_bit(X86_FEATURE_CMP_LEGACY, fs);
+ if ( !cpu_has_cmp_legacy )
+ __clear_bit(X86_FEATURE_CMP_LEGACY, fs);
/*
* On certain hardware, speculative or errata workarounds can result in
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 15:51:53 +0100
Subject: x86/idle: Remove broken MWAIT implementation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
cpuidle_wakeup_mwait() is a TOCTOU race. The cpumask_and() sampling
cpuidle_mwait_flags can take an arbitrary period of time, and there's no
guarantee that the target CPUs are still in MWAIT when writing into
mwait_wakeup(cpu).
The consequence of the race is that we'll fail to IPI certain targets. Also,
there's no guarantee that mwait_idle_with_hints() will raise a TIMER_SOFTIRQ
on its way out.
The fundamental bug is that the "in_mwait" variable needs to be in the
monitored line, and not in a separate cpuidle_mwait_flags variable, in order
to do this in a race-free way.
Arranging to fix this while keeping the old implementation is prohibitive, so
strip the current one out in order to implement the new one cleanly. In the
interim, this causes IPIs to be used unconditionally which is safe albeit
suboptimal.
Fixes: 3d521e933e1b ("cpuidle: mwait on softirq_pending & remove wakeup ipis")
Fixes: 1adb34ea846d ("CPUIDLE: re-implement mwait wakeup process")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 3faf0866a33070b926ab78e6298290403f85e76c)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index e9493f7f577f..3101d5ce230d 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -436,27 +436,6 @@ static int __init cf_check cpu_idle_key_init(void)
}
__initcall(cpu_idle_key_init);
-/*
- * The bit is set iff cpu use monitor/mwait to enter C state
- * with this flag set, CPU can be waken up from C state
- * by writing to specific memory address, instead of sending an IPI.
- */
-static cpumask_t cpuidle_mwait_flags;
-
-void cpuidle_wakeup_mwait(cpumask_t *mask)
-{
- cpumask_t target;
- unsigned int cpu;
-
- cpumask_and(&target, mask, &cpuidle_mwait_flags);
-
- /* CPU is MWAITing on the cpuidle_mwait_wakeup flag. */
- for_each_cpu(cpu, &target)
- mwait_wakeup(cpu) = 0;
-
- cpumask_andnot(mask, mask, &target);
-}
-
/* Force sending of a wakeup IPI regardless of mwait usage. */
bool __ro_after_init force_mwait_ipi_wakeup;
@@ -465,42 +444,25 @@ bool arch_skip_send_event_check(unsigned int cpu)
if ( force_mwait_ipi_wakeup )
return false;
- /*
- * This relies on softirq_pending() and mwait_wakeup() to access data
- * on the same cache line.
- */
- smp_mb();
- return !!cpumask_test_cpu(cpu, &cpuidle_mwait_flags);
+ return false;
}
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
- s_time_t expires = per_cpu(timer_deadline, cpu);
- const void *monitor_addr = &mwait_wakeup(cpu);
+ const unsigned int *this_softirq_pending = &softirq_pending(cpu);
- monitor(monitor_addr, 0, 0);
+ monitor(this_softirq_pending, 0, 0);
smp_mb();
- /*
- * Timer deadline passing is the event on which we will be woken via
- * cpuidle_mwait_wakeup. So check it now that the location is armed.
- */
- if ( (expires > NOW() || expires == 0) && !softirq_pending(cpu) )
+ if ( !*this_softirq_pending )
{
struct cpu_info *info = get_cpu_info();
- cpumask_set_cpu(cpu, &cpuidle_mwait_flags);
-
spec_ctrl_enter_idle(info);
mwait(eax, ecx);
spec_ctrl_exit_idle(info);
-
- cpumask_clear_cpu(cpu, &cpuidle_mwait_flags);
}
-
- if ( expires <= NOW() && expires > 0 )
- raise_softirq(TIMER_SOFTIRQ);
}
static void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
@@ -901,7 +863,7 @@ void cf_check acpi_dead_idle(void)
if ( cx->entry_method == ACPI_CSTATE_EM_FFH )
{
- void *mwait_ptr = &mwait_wakeup(smp_processor_id());
+ void *mwait_ptr = &softirq_pending(smp_processor_id());
/*
* Cache must be flushed as the last operation before sleeping.
diff --git a/xen/arch/x86/hpet.c b/xen/arch/x86/hpet.c
index 51ff7f12f5c0..9290cf7a42a0 100644
--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -188,8 +188,6 @@ static void evt_do_broadcast(cpumask_t *mask)
if ( __cpumask_test_and_clear_cpu(cpu, mask) )
raise_softirq(TIMER_SOFTIRQ);
- cpuidle_wakeup_mwait(mask);
-
if ( !cpumask_empty(mask) )
cpumask_raise_softirq(mask, TIMER_SOFTIRQ);
}
diff --git a/xen/arch/x86/include/asm/hardirq.h b/xen/arch/x86/include/asm/hardirq.h
index 342361cb6fdd..f3e93cc9b507 100644
--- a/xen/arch/x86/include/asm/hardirq.h
+++ b/xen/arch/x86/include/asm/hardirq.h
@@ -5,11 +5,10 @@
#include <xen/types.h>
typedef struct {
- unsigned int __softirq_pending;
- unsigned int __local_irq_count;
- unsigned int nmi_count;
- unsigned int mce_count;
- bool __mwait_wakeup;
+ unsigned int __softirq_pending;
+ unsigned int __local_irq_count;
+ unsigned int nmi_count;
+ unsigned int mce_count;
} __cacheline_aligned irq_cpustat_t;
#include <xen/irq_cpustat.h> /* Standard mappings for irq_cpustat_t above */
diff --git a/xen/include/xen/cpuidle.h b/xen/include/xen/cpuidle.h
index 705d0c1135f0..120e354fe340 100644
--- a/xen/include/xen/cpuidle.h
+++ b/xen/include/xen/cpuidle.h
@@ -92,8 +92,6 @@ extern struct cpuidle_governor *cpuidle_current_governor;
bool cpuidle_using_deep_cstate(void);
void cpuidle_disable_deep_cstate(void);
-extern void cpuidle_wakeup_mwait(cpumask_t *mask);
-
#define CPUIDLE_DRIVER_STATE_START 1
extern void menu_get_trace_data(u32 *expected, u32 *pred);
diff --git a/xen/include/xen/irq_cpustat.h b/xen/include/xen/irq_cpustat.h
index b9629f25c266..5f039b4b9a76 100644
--- a/xen/include/xen/irq_cpustat.h
+++ b/xen/include/xen/irq_cpustat.h
@@ -24,6 +24,5 @@ extern irq_cpustat_t irq_stat[];
/* arch independent irq_stat fields */
#define softirq_pending(cpu) __IRQ_STAT((cpu), __softirq_pending)
#define local_irq_count(cpu) __IRQ_STAT((cpu), __local_irq_count)
-#define mwait_wakeup(cpu) __IRQ_STAT((cpu), __mwait_wakeup)
#endif /* __irq_cpustat_h */
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 18:13:27 +0100
Subject: x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
With the recent simplifications, it becomes obvious that smp_mb() isn't the
right barrier. Strictly speaking, MONITOR is ordered as a load, but smp_rmb()
isn't correct either, as this only pertains to local ordering. All we need is
a compiler barrier().
Merge the barrier() into the monitor() itself, along with an explanation.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit e7710dd843ba9d204f6ee2973d6120c1984958a6)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 3101d5ce230d..dfa6b93070dc 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -66,8 +66,12 @@ static always_inline void monitor(
alternative_input("", "clflush (%[addr])", X86_BUG_CLFLUSH_MONITOR,
[addr] "a" (addr));
+ /*
+ * The memory clobber is a compiler barrier. Subsequent reads from the
+ * monitored cacheline must not be reordered over MONITOR.
+ */
asm volatile ( "monitor"
- :: "a" (addr), "c" (ecx), "d" (edx) );
+ :: "a" (addr), "c" (ecx), "d" (edx) : "memory" );
}
static always_inline void mwait(unsigned int eax, unsigned int ecx)
@@ -453,7 +457,6 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
const unsigned int *this_softirq_pending = &softirq_pending(cpu);
monitor(this_softirq_pending, 0, 0);
- smp_mb();
if ( !*this_softirq_pending )
{
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 21:40:51 +0100
Subject: x86/idle: Convert force_mwait_ipi_wakeup to X86_BUG_MONITOR
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
We're going to want to use alternative patching based on it.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit b0ca0f93f47c43f8984981137af07ca3d161e3ec)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index dfa6b93070dc..4b8fb469381c 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -440,14 +440,8 @@ static int __init cf_check cpu_idle_key_init(void)
}
__initcall(cpu_idle_key_init);
-/* Force sending of a wakeup IPI regardless of mwait usage. */
-bool __ro_after_init force_mwait_ipi_wakeup;
-
bool arch_skip_send_event_check(unsigned int cpu)
{
- if ( force_mwait_ipi_wakeup )
- return false;
-
return false;
}
diff --git a/xen/arch/x86/cpu/intel.c b/xen/arch/x86/cpu/intel.c
index 9f8115008b67..98f1db234915 100644
--- a/xen/arch/x86/cpu/intel.c
+++ b/xen/arch/x86/cpu/intel.c
@@ -436,7 +436,7 @@ static void __init probe_mwait_errata(void)
{
printk(XENLOG_WARNING
"Forcing IPI MWAIT wakeup due to CPU erratum\n");
- force_mwait_ipi_wakeup = true;
+ setup_force_cpu_cap(X86_BUG_MONITOR);
}
}
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index 84c93292c80c..56231b00f15d 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -53,6 +53,7 @@ XEN_CPUFEATURE(USE_VMCALL, X86_SYNTH(30)) /* Use VMCALL instead of VMMCAL
#define X86_BUG_CLFLUSH_MFENCE X86_BUG( 2) /* MFENCE needed to serialise CLFLUSH */
#define X86_BUG_IBPB_NO_RET X86_BUG( 3) /* IBPB doesn't flush the RSB/RAS */
#define X86_BUG_CLFLUSH_MONITOR X86_BUG( 4) /* MONITOR requires CLFLUSH */
+#define X86_BUG_MONITOR X86_BUG( 5) /* MONITOR doesn't always notice writes (force IPIs) */
#define X86_SPEC_NO_LFENCE_ENTRY_PV X86_BUG(16) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_PV. */
#define X86_SPEC_NO_LFENCE_ENTRY_INTR X86_BUG(17) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_INTR. */
diff --git a/xen/arch/x86/include/asm/mwait.h b/xen/arch/x86/include/asm/mwait.h
index c52cd3f51011..000a692f6d19 100644
--- a/xen/arch/x86/include/asm/mwait.h
+++ b/xen/arch/x86/include/asm/mwait.h
@@ -13,9 +13,6 @@
#define MWAIT_ECX_INTERRUPT_BREAK 0x1
-/* Force sending of a wakeup IPI regardless of mwait usage. */
-extern bool force_mwait_ipi_wakeup;
-
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx);
#ifdef CONFIG_INTEL
bool mwait_pc10_supported(void);
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 21:04:17 +0100
Subject: xen/softirq: Rework arch_skip_send_event_check() into
arch_set_softirq()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
x86 is the only architecture wanting an optimisation here, but the
test_and_set_bit() is a store into the monitored line (i.e. will wake up the
target) and, prior to the removal of the broken IPI-elision algorithm, was
racy, causing unnecessary IPIs to be sent.
To do this in a race-free way, the store to the monitored line needs to also
sample the status of the target in one atomic action. Implement a new arch
helper with different semantics: make the softirq pending and decide about
IPIs together. For now, implement the default helper. It will be overridden
by x86 in a subsequent change.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit b473e5e212e445d3c193c1c83b52b129af571b19)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 4b8fb469381c..e4679a45b5a6 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -440,11 +440,6 @@ static int __init cf_check cpu_idle_key_init(void)
}
__initcall(cpu_idle_key_init);
-bool arch_skip_send_event_check(unsigned int cpu)
-{
- return false;
-}
-
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
diff --git a/xen/arch/x86/include/asm/softirq.h b/xen/arch/x86/include/asm/softirq.h
index 415ee866c79d..e4b194f069fb 100644
--- a/xen/arch/x86/include/asm/softirq.h
+++ b/xen/arch/x86/include/asm/softirq.h
@@ -9,6 +9,4 @@
#define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
#define NR_ARCH_SOFTIRQS 5
-bool arch_skip_send_event_check(unsigned int cpu);
-
#endif /* __ASM_SOFTIRQ_H__ */
diff --git a/xen/common/softirq.c b/xen/common/softirq.c
index 60f344e8425e..dc3aabce3330 100644
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -94,9 +94,7 @@ void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
raise_mask = &per_cpu(batch_mask, this_cpu);
for_each_cpu(cpu, mask)
- if ( !test_and_set_bit(nr, &softirq_pending(cpu)) &&
- cpu != this_cpu &&
- !arch_skip_send_event_check(cpu) )
+ if ( !arch_set_softirq(nr, cpu) && cpu != this_cpu )
__cpumask_set_cpu(cpu, raise_mask);
if ( raise_mask == &send_mask )
@@ -107,9 +105,7 @@ void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
{
unsigned int this_cpu = smp_processor_id();
- if ( test_and_set_bit(nr, &softirq_pending(cpu))
- || (cpu == this_cpu)
- || arch_skip_send_event_check(cpu) )
+ if ( arch_set_softirq(nr, cpu) || cpu == this_cpu )
return;
if ( !per_cpu(batching, this_cpu) || in_irq() )
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index 33d6f2ecd223..5c2361865b49 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -21,6 +21,22 @@ enum {
#define NR_SOFTIRQS (NR_COMMON_SOFTIRQS + NR_ARCH_SOFTIRQS)
+/*
+ * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
+ * skipped, false if the IPI cannot be skipped.
+ */
+#ifndef arch_set_softirq
+static always_inline bool arch_set_softirq(unsigned int nr, unsigned int cpu)
+{
+ /*
+ * Try to set the softirq pending. If we set the bit (i.e. the old bit
+ * was 0), we're responsible to send the IPI. If the softirq was already
+ * pending (i.e. the old bit was 1), no IPI is needed.
+ */
+ return test_and_set_bit(nr, &softirq_pending(cpu));
+}
+#endif
+
typedef void (*softirq_handler)(void);
void do_softirq(void);
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 1 Jul 2025 21:26:24 +0100
Subject: x86/idle: Implement a new MWAIT IPI-elision algorithm
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
In order to elide IPIs, we must be able to identify whether a target CPU is in
MWAIT at the point it is woken up. i.e. the store to wake it up must also
identify the state.
Create a new in_mwait variable beside __softirq_pending, so we can use a
CMPXCHG to set the softirq while also observing the status safely. Implement
an x86 version of arch_set_softirq() which does this.
In mwait_idle_with_hints(), advertise in_mwait, with an explanation of
precisely what it means. X86_BUG_MONITOR can be accounted for simply by not
advertising in_mwait.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 3e0bc4b50350bd357304fd79a5dc0472790dba91)
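The idea can also be illustrated outside Xen with plain C11 atomics
(illustrative only; Xen itself uses its own cmpxchg() as in the diff
below): pack the 32-bit pending word and the in_mwait flag into one
64-bit value, so a single compare-exchange both sets the softirq bit
and reports whether the target had advertised in_mwait.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Low 32 bits: softirq pending mask.  Bit 32: in_mwait. */
    static _Atomic uint64_t softirq_mwait_raw;

    /*
     * Returns true if the IPI can be skipped: either the softirq was
     * already pending, or the target promised (via in_mwait) to notice
     * the store through MONITOR/MWAIT.
     */
    static bool set_softirq_observe_mwait(unsigned int nr)
    {
        uint64_t old = atomic_load(&softirq_mwait_raw);
        uint64_t new;

        do {
            if ( old & (1ULL << nr) )
                return true;                      /* already pending */
            new = old | (1ULL << nr);
        } while ( !atomic_compare_exchange_weak(&softirq_mwait_raw,
                                                &old, new) );

        return new & (1ULL << 32);                /* in_mwait was set */
    }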
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index e4679a45b5a6..0d624b9aebb6 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -443,7 +443,21 @@ __initcall(cpu_idle_key_init);
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
- const unsigned int *this_softirq_pending = &softirq_pending(cpu);
+ irq_cpustat_t *stat = &irq_stat[cpu];
+ const unsigned int *this_softirq_pending = &stat->__softirq_pending;
+
+ /*
+ * By setting in_mwait, we promise to other CPUs that we'll notice changes
+ * to __softirq_pending without being sent an IPI. We achieve this by
+ * either not going to sleep, or by having hardware notice on our behalf.
+ *
+ * Some errata exist where MONITOR doesn't work properly, and the
+ * workaround is to force the use of an IPI. Cause this to happen by
+ * simply not advertising ourselves as being in_mwait.
+ */
+ alternative_io("movb $1, %[in_mwait]",
+ "", X86_BUG_MONITOR,
+ [in_mwait] "=m" (stat->in_mwait));
monitor(this_softirq_pending, 0, 0);
@@ -455,6 +469,10 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
mwait(eax, ecx);
spec_ctrl_exit_idle(info);
}
+
+ alternative_io("movb $0, %[in_mwait]",
+ "", X86_BUG_MONITOR,
+ [in_mwait] "=m" (stat->in_mwait));
}
static void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
diff --git a/xen/arch/x86/include/asm/hardirq.h b/xen/arch/x86/include/asm/hardirq.h
index f3e93cc9b507..1647cff04dc8 100644
--- a/xen/arch/x86/include/asm/hardirq.h
+++ b/xen/arch/x86/include/asm/hardirq.h
@@ -5,7 +5,19 @@
#include <xen/types.h>
typedef struct {
- unsigned int __softirq_pending;
+ /*
+ * The layout is important. Any CPU can set bits in __softirq_pending,
+ * but in_mwait is a status bit owned by the CPU. softirq_mwait_raw must
+ * cover both, and must be in a single cacheline.
+ */
+ union {
+ struct {
+ unsigned int __softirq_pending;
+ bool in_mwait;
+ };
+ uint64_t softirq_mwait_raw;
+ };
+
unsigned int __local_irq_count;
unsigned int nmi_count;
unsigned int mce_count;
diff --git a/xen/arch/x86/include/asm/softirq.h b/xen/arch/x86/include/asm/softirq.h
index e4b194f069fb..55b65c9747b1 100644
--- a/xen/arch/x86/include/asm/softirq.h
+++ b/xen/arch/x86/include/asm/softirq.h
@@ -1,6 +1,8 @@
#ifndef __ASM_SOFTIRQ_H__
#define __ASM_SOFTIRQ_H__
+#include <asm/system.h>
+
#define NMI_SOFTIRQ (NR_COMMON_SOFTIRQS + 0)
#define TIME_CALIBRATE_SOFTIRQ (NR_COMMON_SOFTIRQS + 1)
#define VCPU_KICK_SOFTIRQ (NR_COMMON_SOFTIRQS + 2)
@@ -9,4 +11,50 @@
#define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
#define NR_ARCH_SOFTIRQS 5
+/*
+ * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
+ * skipped, false if the IPI cannot be skipped.
+ *
+ * We use a CMPXCHG covering both __softirq_pending and in_mwait, in order to
+ * set softirq @nr while also observing in_mwait in a race-free way.
+ */
+static always_inline bool arch_set_softirq(unsigned int nr, unsigned int cpu)
+{
+ uint64_t *ptr = &irq_stat[cpu].softirq_mwait_raw;
+ uint64_t prev, old, new;
+ unsigned int softirq = 1U << nr;
+
+ old = ACCESS_ONCE(*ptr);
+
+ for ( ;; )
+ {
+ if ( old & softirq )
+ /* Softirq already pending, nothing to do. */
+ return true;
+
+ new = old | softirq;
+
+ prev = cmpxchg(ptr, old, new);
+ if ( prev == old )
+ break;
+
+ old = prev;
+ }
+
+ /*
+ * We have caused the softirq to become pending. If in_mwait was set, the
+ * target CPU will notice the modification and act on it.
+ *
+ * We can't access the in_mwait field nicely, so use some BUILD_BUG_ON()'s
+ * to cross-check the (1UL << 32) opencoding.
+ */
+ BUILD_BUG_ON(sizeof(irq_stat[0].softirq_mwait_raw) != 8);
+ BUILD_BUG_ON((offsetof(irq_cpustat_t, in_mwait) -
+ offsetof(irq_cpustat_t, softirq_mwait_raw)) != 4);
+
+ return new & (1UL << 32) /* in_mwait */;
+
+}
+#define arch_set_softirq arch_set_softirq
+
#endif /* __ASM_SOFTIRQ_H__ */
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Wed, 2 Jul 2025 14:51:38 +0100
Subject: x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on
Xeons"
The check of this_softirq_pending must be performed with irqs disabled, but
this property was broken by an attempt to optimise entry/exit latency.
Commit c227233ad64c in Linux (which we copied into Xen) was fixed up by
edc8fc01f608 in Linux, which we have so far missed.
Going to sleep without waking on interrupts is nonsensical outside of
play_dead(), so overload this to select between two possible MWAITs, the
second using the STI shadow to cover MWAIT for exactly the same reason as we
do in safe_halt().
Fixes: b17e0ec72ede ("x86/mwait-idle: enable interrupts before C1 on Xeons")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9b0f0f6e235618c2764e925b58c4bfe412730ced)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 0d624b9aebb6..c58a51a09f33 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -80,6 +80,13 @@ static always_inline void mwait(unsigned int eax, unsigned int ecx)
:: "a" (eax), "c" (ecx) );
}
+static always_inline void sti_mwait_cli(unsigned int eax, unsigned int ecx)
+{
+ /* STI shadow covers MWAIT. */
+ asm volatile ( "sti; mwait; cli"
+ :: "a" (eax), "c" (ecx) );
+}
+
#define GET_HW_RES_IN_NS(msr, val) \
do { rdmsrl(msr, val); val = tsc_ticks2ns(val); } while( 0 )
#define GET_MC6_RES(val) GET_HW_RES_IN_NS(0x664, val)
@@ -461,12 +468,19 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
monitor(this_softirq_pending, 0, 0);
+ ASSERT(!local_irq_is_enabled());
+
if ( !*this_softirq_pending )
{
struct cpu_info *info = get_cpu_info();
spec_ctrl_enter_idle(info);
- mwait(eax, ecx);
+
+ if ( ecx & MWAIT_ECX_INTERRUPT_BREAK )
+ mwait(eax, ecx);
+ else
+ sti_mwait_cli(eax, ecx);
+
spec_ctrl_exit_idle(info);
}
diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index 5c16f5ad3a82..5e98011bfd0c 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -946,12 +946,8 @@ static void cf_check mwait_idle(void)
update_last_cx_stat(power, cx, before);
- if (cx->irq_enable_early)
- local_irq_enable();
-
- mwait_idle_with_hints(cx->address, MWAIT_ECX_INTERRUPT_BREAK);
-
- local_irq_disable();
+ mwait_idle_with_hints(cx->address,
+ cx->irq_enable_early ? 0 : MWAIT_ECX_INTERRUPT_BREAK);
after = alternative_call(cpuidle_get_tick);
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Fri, 27 Jun 2025 17:19:19 +0100
Subject: x86/cpu-policy: Rearrange guest_common_*_feature_adjustments()
Turn the if()s into switch()es, as we're going to need AMD sections.
Move the RTM adjustments into the Intel section, where they ought to live.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index c3aaac861d15..47ee1ff47460 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -418,8 +418,9 @@ static void __init guest_common_default_leaves(struct cpu_policy *p)
static void __init guest_common_max_feature_adjustments(uint32_t *fs)
{
- if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL )
+ switch ( boot_cpu_data.x86_vendor )
{
+ case X86_VENDOR_INTEL:
/*
* MSR_ARCH_CAPS is just feature data, and we can offer it to guests
* unconditionally, although limit it to Intel systems as it is highly
@@ -464,6 +465,22 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
boot_cpu_data.x86_model == INTEL_FAM6_SKYLAKE_X &&
raw_cpu_policy.feat.clwb )
__set_bit(X86_FEATURE_CLWB, fs);
+
+ /*
+ * To mitigate Native-BHI, one option is to use a TSX Abort on capable
+ * systems. This is safe even if RTM has been disabled for other
+ * reasons via MSR_TSX_{CTRL,FORCE_ABORT}. However, a guest kernel
+ * doesn't get to know this type of information.
+ *
+ * Therefore the meaning of RTM_ALWAYS_ABORT has been adjusted, to
+ * instead mean "XBEGIN won't fault". This is enough for a guest
+ * kernel to make an informed choice WRT mitigating Native-BHI.
+ *
+ * If RTM-capable, we can run a VM which has seen RTM_ALWAYS_ABORT.
+ */
+ if ( test_bit(X86_FEATURE_RTM, fs) )
+ __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
+ break;
}
/*
@@ -475,27 +492,13 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
*/
__set_bit(X86_FEATURE_HTT, fs);
__set_bit(X86_FEATURE_CMP_LEGACY, fs);
-
- /*
- * To mitigate Native-BHI, one option is to use a TSX Abort on capable
- * systems. This is safe even if RTM has been disabled for other reasons
- * via MSR_TSX_{CTRL,FORCE_ABORT}. However, a guest kernel doesn't get to
- * know this type of information.
- *
- * Therefore the meaning of RTM_ALWAYS_ABORT has been adjusted, to instead
- * mean "XBEGIN won't fault". This is enough for a guest kernel to make
- * an informed choice WRT mitigating Native-BHI.
- *
- * If RTM-capable, we can run a VM which has seen RTM_ALWAYS_ABORT.
- */
- if ( test_bit(X86_FEATURE_RTM, fs) )
- __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
}
static void __init guest_common_default_feature_adjustments(uint32_t *fs)
{
- if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL )
+ switch ( boot_cpu_data.x86_vendor )
{
+ case X86_VENDOR_INTEL:
/*
* IvyBridge client parts suffer from leakage of RDRAND data due to SRBDS
* (XSA-320 / CVE-2020-0543), and won't be receiving microcode to
@@ -539,6 +542,23 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
boot_cpu_data.x86_model == INTEL_FAM6_SKYLAKE_X &&
raw_cpu_policy.feat.clwb )
__clear_bit(X86_FEATURE_CLWB, fs);
+
+ /*
+ * On certain hardware, speculative or errata workarounds can result
+ * in TSX being placed in "force-abort" mode, where it doesn't
+ * actually function as expected, but is technically compatible with
+ * the ISA.
+ *
+ * Do not advertise RTM to guests by default if it won't actually
+ * work. Instead, advertise RTM_ALWAYS_ABORT indicating that TSX
+ * Aborts are safe to use, e.g. for mitigating Native-BHI.
+ */
+ if ( rtm_disabled )
+ {
+ __clear_bit(X86_FEATURE_RTM, fs);
+ __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
+ }
+ break;
}
/*
@@ -550,21 +570,6 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
if ( !cpu_has_cmp_legacy )
__clear_bit(X86_FEATURE_CMP_LEGACY, fs);
-
- /*
- * On certain hardware, speculative or errata workarounds can result in
- * TSX being placed in "force-abort" mode, where it doesn't actually
- * function as expected, but is technically compatible with the ISA.
- *
- * Do not advertise RTM to guests by default if it won't actually work.
- * Instead, advertise RTM_ALWAYS_ABORT indicating that TSX Aborts are safe
- * to use, e.g. for mitigating Native-BHI.
- */
- if ( rtm_disabled )
- {
- __clear_bit(X86_FEATURE_RTM, fs);
- __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
- }
}
static void __init guest_common_feature_adjustments(uint32_t *fs)
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Tue, 10 Sep 2024 19:55:15 +0100
Subject: x86/cpu-policy: Infrastructure for CPUID leaf 0x80000021.ecx
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
diff --git a/tools/libs/light/libxl_cpuid.c b/tools/libs/light/libxl_cpuid.c
index 063fe86eb72f..f738e17b19e4 100644
--- a/tools/libs/light/libxl_cpuid.c
+++ b/tools/libs/light/libxl_cpuid.c
@@ -342,6 +342,7 @@ int libxl_cpuid_parse_config(libxl_cpuid_policy_list *policy, const char* str)
CPUID_ENTRY(0x00000007, 1, CPUID_REG_EDX),
MSR_ENTRY(0x10a, CPUID_REG_EAX),
MSR_ENTRY(0x10a, CPUID_REG_EDX),
+ CPUID_ENTRY(0x80000021, NA, CPUID_REG_ECX),
#undef MSR_ENTRY
#undef CPUID_ENTRY
};
diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
index 4c4593528dfe..8e36b8e69600 100644
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -37,6 +37,7 @@ static const struct {
{ "CPUID 0x00000007:1.edx", "7d1" },
{ "MSR_ARCH_CAPS.lo", "m10Al" },
{ "MSR_ARCH_CAPS.hi", "m10Ah" },
+ { "CPUID 0x80000021.ecx", "e21c" },
};
#define COL_ALIGN "24"
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 47ee1ff47460..9d1ff6268d79 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -330,7 +330,6 @@ static void recalculate_misc(struct cpu_policy *p)
p->extd.raw[0x1f] = EMPTY_LEAF; /* SEV */
p->extd.raw[0x20] = EMPTY_LEAF; /* Platform QoS */
p->extd.raw[0x21].b = 0;
- p->extd.raw[0x21].c = 0;
p->extd.raw[0x21].d = 0;
break;
}
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index 067d855badf0..4eacdaac0f4f 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -478,7 +478,9 @@ static void generic_identify(struct cpuinfo_x86 *c)
if (c->extended_cpuid_level >= 0x80000008)
c->x86_capability[FEATURESET_e8b] = cpuid_ebx(0x80000008);
if (c->extended_cpuid_level >= 0x80000021)
- c->x86_capability[FEATURESET_e21a] = cpuid_eax(0x80000021);
+ cpuid(0x80000021,
+ &c->x86_capability[FEATURESET_e21a], &tmp,
+ &c->x86_capability[FEATURESET_e21c], &tmp);
/* Intel-defined flags: level 0x00000007 */
if (c->cpuid_level >= 7) {
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index 10462771e52d..cb926782a8f7 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -383,6 +383,8 @@ XEN_CPUFEATURE(RFDS_CLEAR, 16*32+28) /*!A| Register File(s) cleared by V
XEN_CPUFEATURE(PB_OPT_CTRL, 16*32+32) /* MSR_PB_OPT_CTRL.IBPB_ALT */
XEN_CPUFEATURE(ITS_NO, 16*32+62) /*!A No Indirect Target Selection */
+/* AMD-defined CPU features, CPUID level 0x80000021.ecx, word 18 */
+
#endif /* XEN_CPUFEATURE */
/* Clean up from a default include. Close the enum (for C). */
diff --git a/xen/include/xen/lib/x86/cpu-policy.h b/xen/include/xen/lib/x86/cpu-policy.h
index f08f30afeca3..dd204a825b07 100644
--- a/xen/include/xen/lib/x86/cpu-policy.h
+++ b/xen/include/xen/lib/x86/cpu-policy.h
@@ -22,6 +22,7 @@
#define FEATURESET_7d1 15 /* 0x00000007:1.edx */
#define FEATURESET_m10Al 16 /* 0x0000010a.eax */
#define FEATURESET_m10Ah 17 /* 0x0000010a.edx */
+#define FEATURESET_e21c 18 /* 0x80000021.ecx */
struct cpuid_leaf
{
@@ -328,7 +329,11 @@ struct cpu_policy
uint16_t ucode_size; /* Units of 16 bytes */
uint8_t rap_size; /* Units of 8 entries */
uint8_t :8;
- uint32_t /* c */:32, /* d */:32;
+ union {
+ uint32_t e21c;
+ struct { DECL_BITFIELD(e21c); };
+ };
+ uint32_t /* d */:32;
};
} extd;
diff --git a/xen/lib/x86/cpuid.c b/xen/lib/x86/cpuid.c
index eb7698dc7325..6298d051f2a6 100644
--- a/xen/lib/x86/cpuid.c
+++ b/xen/lib/x86/cpuid.c
@@ -81,6 +81,7 @@ void x86_cpu_policy_to_featureset(
fs[FEATURESET_7d1] = p->feat._7d1;
fs[FEATURESET_m10Al] = p->arch_caps.lo;
fs[FEATURESET_m10Ah] = p->arch_caps.hi;
+ fs[FEATURESET_e21c] = p->extd.e21c;
}
void x86_cpu_featureset_to_policy(
@@ -104,6 +105,7 @@ void x86_cpu_featureset_to_policy(
p->feat._7d1 = fs[FEATURESET_7d1];
p->arch_caps.lo = fs[FEATURESET_m10Al];
p->arch_caps.hi = fs[FEATURESET_m10Ah];
+ p->extd.e21c = fs[FEATURESET_e21c];
}
void x86_cpu_policy_recalc_synth(struct cpu_policy *p)
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Fri, 27 Sep 2024 11:28:39 +0100
Subject: x86/ucode: Digests for TSA microcode
AMD are releasing microcode for TSA, so extend the known-provenance list with
their hashes. These were produced before the remediation of the microcode
signature issues (the entrysign vulnerability), so can be OS-loaded on
out-of-date firmware.
Include an off-by-default check for the sorted-ness of patch_digests[]. It's
not worth running generally under SELF_TESTS, but is useful when editing the
digest list.
This is part of XSA-471 / CVE-2024-36350 / CVE-2024-36357.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
diff --git a/xen/arch/x86/cpu/microcode/amd-patch-digests.c b/xen/arch/x86/cpu/microcode/amd-patch-digests.c
index d32761226712..d2c4e0178a1e 100644
--- a/xen/arch/x86/cpu/microcode/amd-patch-digests.c
+++ b/xen/arch/x86/cpu/microcode/amd-patch-digests.c
@@ -80,6 +80,15 @@
0x0d, 0x5b, 0x65, 0x34, 0x69, 0xb2, 0x62, 0x21,
},
},
+{
+ .patch_id = 0x0a0011d7,
+ .digest = {
+ 0x35, 0x07, 0xcd, 0x40, 0x94, 0xbc, 0x81, 0x6b,
+ 0xfc, 0x61, 0x56, 0x1a, 0xe2, 0xdb, 0x96, 0x12,
+ 0x1c, 0x1c, 0x31, 0xb1, 0x02, 0x6f, 0xe5, 0xd2,
+ 0xfe, 0x1b, 0x04, 0x03, 0x2c, 0x8f, 0x4c, 0x36,
+ },
+},
{
.patch_id = 0x0a001238,
.digest = {
@@ -89,6 +98,15 @@
0xc0, 0xcd, 0x33, 0xf2, 0x8d, 0xf9, 0xef, 0x59,
},
},
+{
+ .patch_id = 0x0a00123b,
+ .digest = {
+ 0xef, 0xa1, 0x1e, 0x71, 0xf1, 0xc3, 0x2c, 0xe2,
+ 0xc3, 0xef, 0x69, 0x41, 0x7a, 0x54, 0xca, 0xc3,
+ 0x8f, 0x62, 0x84, 0xee, 0xc2, 0x39, 0xd9, 0x28,
+ 0x95, 0xa7, 0x12, 0x49, 0x1e, 0x30, 0x71, 0x72,
+ },
+},
{
.patch_id = 0x0a00820c,
.digest = {
@@ -98,6 +116,15 @@
0xe1, 0x3b, 0x8d, 0xb2, 0xf8, 0x22, 0x03, 0xe2,
},
},
+{
+ .patch_id = 0x0a00820d,
+ .digest = {
+ 0xf9, 0x2a, 0xc0, 0xf4, 0x9e, 0xa4, 0x87, 0xa4,
+ 0x7d, 0x87, 0x00, 0xfd, 0xab, 0xda, 0x19, 0xca,
+ 0x26, 0x51, 0x32, 0xc1, 0x57, 0x91, 0xdf, 0xc1,
+ 0x05, 0xeb, 0x01, 0x7c, 0x5a, 0x95, 0x21, 0xb7,
+ },
+},
{
.patch_id = 0x0a101148,
.digest = {
@@ -107,6 +134,15 @@
0xf1, 0x5e, 0xb0, 0xde, 0xb4, 0x98, 0xae, 0xc4,
},
},
+{
+ .patch_id = 0x0a10114c,
+ .digest = {
+ 0x9e, 0xb6, 0xa2, 0xd9, 0x87, 0x38, 0xc5, 0x64,
+ 0xd8, 0x88, 0xfa, 0x78, 0x98, 0xf9, 0x6f, 0x74,
+ 0x39, 0x90, 0x1b, 0xa5, 0xcf, 0x5e, 0xb4, 0x2a,
+ 0x02, 0xff, 0xd4, 0x8c, 0x71, 0x8b, 0xe2, 0xc0,
+ },
+},
{
.patch_id = 0x0a101248,
.digest = {
@@ -116,6 +152,15 @@
0x1b, 0x7d, 0x64, 0x9d, 0x4b, 0x53, 0x13, 0x75,
},
},
+{
+ .patch_id = 0x0a10124c,
+ .digest = {
+ 0x29, 0xea, 0xf1, 0x2c, 0xb2, 0xe4, 0xef, 0x90,
+ 0xa4, 0xcd, 0x1d, 0x86, 0x97, 0x17, 0x61, 0x46,
+ 0xfc, 0x22, 0xcb, 0x57, 0x75, 0x19, 0xc8, 0xcc,
+ 0x0c, 0xf5, 0xbc, 0xac, 0x81, 0x9d, 0x9a, 0xd2,
+ },
+},
{
.patch_id = 0x0a108108,
.digest = {
@@ -125,6 +170,15 @@
0x28, 0x1e, 0x9c, 0x59, 0x69, 0x99, 0x4d, 0x16,
},
},
+{
+ .patch_id = 0x0a108109,
+ .digest = {
+ 0x85, 0xb4, 0xbd, 0x7c, 0x49, 0xa7, 0xbd, 0xfa,
+ 0x49, 0x36, 0x80, 0x81, 0xc5, 0xb7, 0x39, 0x1b,
+ 0x9a, 0xaa, 0x50, 0xde, 0x9b, 0xe9, 0x32, 0x35,
+ 0x42, 0x7e, 0x51, 0x4f, 0x52, 0x2c, 0x28, 0x59,
+ },
+},
{
.patch_id = 0x0a20102d,
.digest = {
@@ -134,6 +188,15 @@
0x8c, 0xe9, 0x19, 0x3e, 0xcc, 0x3f, 0x7b, 0xb4,
},
},
+{
+ .patch_id = 0x0a20102e,
+ .digest = {
+ 0xbe, 0x1f, 0x32, 0x04, 0x0d, 0x3c, 0x9c, 0xdd,
+ 0xe1, 0xa4, 0xbf, 0x76, 0x3a, 0xec, 0xc2, 0xf6,
+ 0x11, 0x00, 0xa7, 0xaf, 0x0f, 0xe5, 0x02, 0xc5,
+ 0x54, 0x3a, 0x1f, 0x8c, 0x16, 0xb5, 0xff, 0xbe,
+ },
+},
{
.patch_id = 0x0a201210,
.digest = {
@@ -143,6 +206,15 @@
0xf7, 0x55, 0xf0, 0x13, 0xbb, 0x22, 0xf6, 0x41,
},
},
+{
+ .patch_id = 0x0a201211,
+ .digest = {
+ 0x69, 0xa1, 0x17, 0xec, 0xd0, 0xf6, 0x6c, 0x95,
+ 0xe2, 0x1e, 0xc5, 0x59, 0x1a, 0x52, 0x0a, 0x27,
+ 0xc4, 0xed, 0xd5, 0x59, 0x1f, 0xbf, 0x00, 0xff,
+ 0x08, 0x88, 0xb5, 0xe1, 0x12, 0xb6, 0xcc, 0x27,
+ },
+},
{
.patch_id = 0x0a404107,
.digest = {
@@ -152,6 +224,15 @@
0x13, 0xbc, 0xc5, 0x25, 0xe4, 0xc5, 0xc3, 0x99,
},
},
+{
+ .patch_id = 0x0a404108,
+ .digest = {
+ 0x69, 0x67, 0x43, 0x06, 0xf8, 0x0c, 0x62, 0xdc,
+ 0xa4, 0x21, 0x30, 0x4f, 0x0f, 0x21, 0x2c, 0xcb,
+ 0xcc, 0x37, 0xf1, 0x1c, 0xc3, 0xf8, 0x2f, 0x19,
+ 0xdf, 0x53, 0x53, 0x46, 0xb1, 0x15, 0xea, 0x00,
+ },
+},
{
.patch_id = 0x0a500011,
.digest = {
@@ -161,6 +242,15 @@
0x11, 0x5e, 0x96, 0x7e, 0x71, 0xe9, 0xfc, 0x74,
},
},
+{
+ .patch_id = 0x0a500012,
+ .digest = {
+ 0xeb, 0x74, 0x0d, 0x47, 0xa1, 0x8e, 0x09, 0xe4,
+ 0x93, 0x4c, 0xad, 0x03, 0x32, 0x4c, 0x38, 0x16,
+ 0x10, 0x39, 0xdd, 0x06, 0xaa, 0xce, 0xd6, 0x0f,
+ 0x62, 0x83, 0x9d, 0x8e, 0x64, 0x55, 0xbe, 0x63,
+ },
+},
{
.patch_id = 0x0a601209,
.digest = {
@@ -170,6 +260,15 @@
0xe8, 0x73, 0xe2, 0xd6, 0xdb, 0xd2, 0x77, 0x1d,
},
},
+{
+ .patch_id = 0x0a60120a,
+ .digest = {
+ 0x0c, 0x8b, 0x3d, 0xfd, 0x52, 0x52, 0x85, 0x7d,
+ 0x20, 0x3a, 0xe1, 0x7e, 0xa4, 0x21, 0x3b, 0x7b,
+ 0x17, 0x86, 0xae, 0xac, 0x13, 0xb8, 0x63, 0x9d,
+ 0x06, 0x01, 0xd0, 0xa0, 0x51, 0x9a, 0x91, 0x2c,
+ },
+},
{
.patch_id = 0x0a704107,
.digest = {
@@ -179,6 +278,15 @@
0x64, 0x39, 0x71, 0x8c, 0xce, 0xe7, 0x41, 0x39,
},
},
+{
+ .patch_id = 0x0a704108,
+ .digest = {
+ 0xd7, 0x55, 0x15, 0x2b, 0xfe, 0xc4, 0xbc, 0x93,
+ 0xec, 0x91, 0xa0, 0xae, 0x45, 0xb7, 0xc3, 0x98,
+ 0x4e, 0xff, 0x61, 0x77, 0x88, 0xc2, 0x70, 0x49,
+ 0xe0, 0x3a, 0x1d, 0x84, 0x38, 0x52, 0xbf, 0x5a,
+ },
+},
{
.patch_id = 0x0a705206,
.digest = {
@@ -188,6 +296,15 @@
0x03, 0x35, 0xe9, 0xbe, 0xfb, 0x06, 0xdf, 0xfc,
},
},
+{
+ .patch_id = 0x0a705208,
+ .digest = {
+ 0x30, 0x1d, 0x55, 0x24, 0xbc, 0x6b, 0x5a, 0x19,
+ 0x0c, 0x7d, 0x1d, 0x74, 0xaa, 0xd1, 0xeb, 0xd2,
+ 0x16, 0x62, 0xf7, 0x5b, 0xe1, 0x1f, 0x18, 0x11,
+ 0x5c, 0xf0, 0x94, 0x90, 0x26, 0xec, 0x69, 0xff,
+ },
+},
{
.patch_id = 0x0a708007,
.digest = {
@@ -197,6 +314,15 @@
0xdf, 0x92, 0x73, 0x84, 0x87, 0x3c, 0x73, 0x93,
},
},
+{
+ .patch_id = 0x0a708008,
+ .digest = {
+ 0x08, 0x6e, 0xf0, 0x22, 0x4b, 0x8e, 0xc4, 0x46,
+ 0x58, 0x34, 0xe6, 0x47, 0xa2, 0x28, 0xfd, 0xab,
+ 0x22, 0x3d, 0xdd, 0xd8, 0x52, 0x9e, 0x1d, 0x16,
+ 0xfa, 0x01, 0x68, 0x14, 0x79, 0x3e, 0xe8, 0x6b,
+ },
+},
{
.patch_id = 0x0a70c005,
.digest = {
@@ -206,6 +332,15 @@
0xee, 0x49, 0xac, 0xe1, 0x8b, 0x13, 0xc5, 0x13,
},
},
+{
+ .patch_id = 0x0a70c008,
+ .digest = {
+ 0x0f, 0xdb, 0x37, 0xa1, 0x10, 0xaf, 0xd4, 0x21,
+ 0x94, 0x0d, 0xa4, 0xa2, 0xe9, 0x86, 0x6c, 0x0e,
+ 0x85, 0x7c, 0x36, 0x30, 0xa3, 0x3a, 0x78, 0x66,
+ 0x18, 0x10, 0x60, 0x0d, 0x78, 0x3d, 0x44, 0xd0,
+ },
+},
{
.patch_id = 0x0aa00116,
.digest = {
@@ -224,3 +359,12 @@
0x68, 0x2f, 0x46, 0xee, 0xfe, 0xc6, 0x6d, 0xef,
},
},
+{
+ .patch_id = 0x0aa00216,
+ .digest = {
+ 0x79, 0xfb, 0x5b, 0x9f, 0xb6, 0xe6, 0xa8, 0xf5,
+ 0x4e, 0x7c, 0x4f, 0x8e, 0x1d, 0xad, 0xd0, 0x08,
+ 0xc2, 0x43, 0x7c, 0x8b, 0xe6, 0xdb, 0xd0, 0xd2,
+ 0xe8, 0x39, 0x26, 0xc1, 0xe5, 0x5a, 0x48, 0xf1,
+ },
+},
diff --git a/xen/arch/x86/cpu/microcode/amd.c b/xen/arch/x86/cpu/microcode/amd.c
index a2860d8948a2..e43075dcb540 100644
--- a/xen/arch/x86/cpu/microcode/amd.c
+++ b/xen/arch/x86/cpu/microcode/amd.c
@@ -529,3 +529,18 @@ void __init ucode_probe_amd(struct microcode_ops *ops)
*ops = amd_ucode_ops;
}
+
+#if 0 /* Manual CONFIG_SELF_TESTS */
+static void __init __constructor test_digests_sorted(void)
+{
+ for ( unsigned int i = 1; i < ARRAY_SIZE(patch_digests); ++i )
+ {
+ if ( patch_digests[i - 1].patch_id < patch_digests[i].patch_id )
+ continue;
+
+ panic("patch_digests[] not sorted: %08x >= %08x\n",
+ patch_digests[i - 1].patch_id,
+ patch_digests[i].patch_id);
+ }
+}
+#endif /* CONFIG_SELF_TESTS */
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Wed, 2 Apr 2025 03:18:59 +0100
Subject: x86/idle: Rearrange VERW and MONITOR in mwait_idle_with_hints()
In order to mitigate TSA, Xen will need to issue VERW before going idle.
On AMD CPUs, the VERW scrubbing side effects cancel an active MONITOR, causing
the MWAIT to exit without entering an idle state. Therefore the VERW must be
ahead of MONITOR.
Split spec_ctrl_enter_idle() in two and allow the VERW aspect to be handled
separately. While adjusting, update a stale comment concerning MSBDS; more
issues have been mitigated using VERW since it was written.
By moving VERW earlier, it is ahead of the determination of whether to go
idle. We can't move the check on softirq_pending (for correctness reasons),
but we can duplicate it earlier as a best effort attempt to skip the
speculative overhead.
This is part of XSA-471 / CVE-2024-36350 / CVE-2024-36357.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
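Condensed, the resulting flow in mwait_idle_with_hints() looks as follows (a sketch of the ordering only, eliding the in_mwait bookkeeping, alternative patching and interrupt-break handling visible in the hunks below):

    if ( *this_softirq_pending )        /* best-effort early exit; performance only */
        return;

    /* ... set in_mwait, promising to notice __softirq_pending changes ... */

    __spec_ctrl_enter_idle_verw(info);  /* VERW first: on AMD it cancels MONITOR */

    monitor(this_softirq_pending, 0, 0);

    if ( !*this_softirq_pending )       /* the check which matters for correctness */
    {
        __spec_ctrl_enter_idle(info, false /* VERW handled above */);
        mwait(eax, ecx);
        spec_ctrl_exit_idle(info);
    }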
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index c58a51a09f33..a7253f145343 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -450,9 +450,18 @@ __initcall(cpu_idle_key_init);
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
+ struct cpu_info *info = get_cpu_info();
irq_cpustat_t *stat = &irq_stat[cpu];
const unsigned int *this_softirq_pending = &stat->__softirq_pending;
+ /*
+ * Heuristic: if we're definitely not going to idle, bail early as the
+ * speculative safety can be expensive. This is a performance
+ * consideration not a correctness issue.
+ */
+ if ( *this_softirq_pending )
+ return;
+
/*
* By setting in_mwait, we promise to other CPUs that we'll notice changes
* to __softirq_pending without being sent an IPI. We achieve this by
@@ -466,15 +475,19 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
"", X86_BUG_MONITOR,
[in_mwait] "=m" (stat->in_mwait));
+ /*
+ * On AMD systems, side effects from VERW cancel MONITOR, causing MWAIT to
+ * wake up immediately. Therefore, VERW must come ahead of MONITOR.
+ */
+ __spec_ctrl_enter_idle_verw(info);
+
monitor(this_softirq_pending, 0, 0);
ASSERT(!local_irq_is_enabled());
if ( !*this_softirq_pending )
{
- struct cpu_info *info = get_cpu_info();
-
- spec_ctrl_enter_idle(info);
+ __spec_ctrl_enter_idle(info, false /* VERW handled above */);
if ( ecx & MWAIT_ECX_INTERRUPT_BREAK )
mwait(eax, ecx);
diff --git a/xen/arch/x86/include/asm/spec_ctrl.h b/xen/arch/x86/include/asm/spec_ctrl.h
index 077225418956..6724d3812029 100644
--- a/xen/arch/x86/include/asm/spec_ctrl.h
+++ b/xen/arch/x86/include/asm/spec_ctrl.h
@@ -115,8 +115,22 @@ static inline void init_shadow_spec_ctrl_state(void)
info->verw_sel = __HYPERVISOR_DS32;
}
+static always_inline void __spec_ctrl_enter_idle_verw(struct cpu_info *info)
+{
+ /*
+ * Flush/scrub structures which are statically partitioned between active
+ * threads. Otherwise data of ours (of unknown sensitivity) will become
+ * available to our sibling when we go idle.
+ *
+ * Note: VERW must be encoded with a memory operand, as it is only that
+ * form with side effects.
+ */
+ alternative_input("", "verw %[sel]", X86_FEATURE_SC_VERW_IDLE,
+ [sel] "m" (info->verw_sel));
+}
+
/* WARNING! `ret`, `call *`, `jmp *` not safe after this call. */
-static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
+static always_inline void __spec_ctrl_enter_idle(struct cpu_info *info, bool verw)
{
uint32_t val = 0;
@@ -135,21 +149,8 @@ static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
"a" (val), "c" (MSR_SPEC_CTRL), "d" (0));
barrier();
- /*
- * Microarchitectural Store Buffer Data Sampling:
- *
- * On vulnerable systems, store buffer entries are statically partitioned
- * between active threads. When entering idle, our store buffer entries
- * are re-partitioned to allow the other threads to use them.
- *
- * Flush the buffers to ensure that no sensitive data of ours can be
- * leaked by a sibling after it gets our store buffer entries.
- *
- * Note: VERW must be encoded with a memory operand, as it is only that
- * form which causes a flush.
- */
- alternative_input("", "verw %[sel]", X86_FEATURE_SC_VERW_IDLE,
- [sel] "m" (info->verw_sel));
+ if ( verw ) /* Expected to be const-propagated. */
+ __spec_ctrl_enter_idle_verw(info);
/*
* Cross-Thread Return Address Predictions:
@@ -167,6 +168,12 @@ static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
: "rax", "rcx");
}
+/* WARNING! `ret`, `call *`, `jmp *` not safe after this call. */
+static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
+{
+ __spec_ctrl_enter_idle(info, true /* VERW */);
+}
+
/* WARNING! `ret`, `call *`, `jmp *` not safe before this call. */
static always_inline void spec_ctrl_exit_idle(struct cpu_info *info)
{
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Thu, 29 Aug 2024 17:36:11 +0100
Subject: x86/spec-ctrl: Mitigate Transitive Scheduler Attacks
TSA affects AMD Fam19h CPUs (Zen3 and 4 microarchitectures).
Three new CPUID bits have been defined. Two (TSA_SQ_NO and TSA_L1_NO)
indicate that the system is unaffected, and must be synthesised by Xen on
unaffected parts to date.
A third new bit indicates that VERW now has a flushing side effect. Xen
must synthesise this bit on affected systems based on microcode version.
As with other VERW-based flushing features, VERW_CLEAR needs OR-ing across
a resource pool, and guests which have seen it can safely migrate in.
This is part of XSA-471 / CVE-2024-36350 / CVE-2024-36357.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
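In terms of the final decision logic, the TSA contribution to scrub-on-idle condenses to the following (a sketch extracted from the init_speculation_mitigations() hunk below, not a separate helper added by the patch):

    /*
     * TSA-SQ leaks via the store queues, which are repartitioned when a
     * thread idles, so scrubbing on idle is needed whenever VERW_CLEAR is
     * present, TSA_SQ_NO is absent, and SMT is active.  TSA-L1 adds no idle
     * requirement, as the L1D utags are never shared between threads.
     */
    if ( cpu_has_verw_clear && !cpu_has_tsa_sq_no && hw_smt_enabled )
        setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);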
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 9d1ff6268d79..3e628e008e92 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -480,6 +480,17 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
if ( test_bit(X86_FEATURE_RTM, fs) )
__set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
break;
+
+ case X86_VENDOR_AMD:
+ /*
+ * This bit indicates that the VERW instruction may have gained
+ * scrubbing side effects. With pooling, it means "you might migrate
+ * somewhere where scrubbing is necessary", and may need exposing on
+ * unaffected hardware. This is fine, because the VERW instruction
+ * has been around since the 286.
+ */
+ __set_bit(X86_FEATURE_VERW_CLEAR, fs);
+ break;
}
/*
@@ -558,6 +569,17 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
__set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
}
break;
+
+ case X86_VENDOR_AMD:
+ /*
+ * This bit indicates that the VERW instruction may have gained
+ * scrubbing side effects. The max policy has it set for migration
+ * reasons, so reset the default policy back to the host value in case
+ * we're unaffected.
+ */
+ if ( !cpu_has_verw_clear )
+ __clear_bit(X86_FEATURE_VERW_CLEAR, fs);
+ break;
}
/*
diff --git a/xen/arch/x86/hvm/svm/entry.S b/xen/arch/x86/hvm/svm/entry.S
index 91edb3345938..610c64bf4c97 100644
--- a/xen/arch/x86/hvm/svm/entry.S
+++ b/xen/arch/x86/hvm/svm/entry.S
@@ -99,6 +99,8 @@ __UNLIKELY_END(nsvm_hap)
pop %rsi
pop %rdi
+ SPEC_CTRL_COND_VERW /* Req: %rsp=eframe Clob: efl */
+
vmrun
SAVE_ALL
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index 90d69999d183..431cf4a2a65d 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -194,6 +194,7 @@ static inline bool boot_cpu_has(unsigned int feat)
/* CPUID level 0x80000021.eax */
#define cpu_has_lfence_dispatch boot_cpu_has(X86_FEATURE_LFENCE_DISPATCH)
+#define cpu_has_verw_clear boot_cpu_has(X86_FEATURE_VERW_CLEAR)
#define cpu_has_nscb boot_cpu_has(X86_FEATURE_NSCB)
/* CPUID level 0x00000007:1.edx */
@@ -221,6 +222,10 @@ static inline bool boot_cpu_has(unsigned int feat)
#define cpu_has_pb_opt_ctrl boot_cpu_has(X86_FEATURE_PB_OPT_CTRL)
#define cpu_has_its_no boot_cpu_has(X86_FEATURE_ITS_NO)
+/* CPUID level 0x80000021.ecx */
+#define cpu_has_tsa_sq_no boot_cpu_has(X86_FEATURE_TSA_SQ_NO)
+#define cpu_has_tsa_l1_no boot_cpu_has(X86_FEATURE_TSA_L1_NO)
+
/* Synthesized. */
#define cpu_has_arch_perfmon boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
#define cpu_has_cpuid_faulting boot_cpu_has(X86_FEATURE_CPUID_FAULTING)
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 3027f1db6b70..bcdae1ed2377 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -494,7 +494,7 @@ custom_param("pv-l1tf", parse_pv_l1tf);
static void __init print_details(enum ind_thunk thunk)
{
- unsigned int _7d0 = 0, _7d2 = 0, e8b = 0, e21a = 0, max = 0, tmp;
+ unsigned int _7d0 = 0, _7d2 = 0, e8b = 0, e21a = 0, e21c = 0, max = 0, tmp;
uint64_t caps = 0;
/* Collect diagnostics about available mitigations. */
@@ -505,7 +505,7 @@ static void __init print_details(enum ind_thunk thunk)
if ( boot_cpu_data.extended_cpuid_level >= 0x80000008U )
cpuid(0x80000008U, &tmp, &e8b, &tmp, &tmp);
if ( boot_cpu_data.extended_cpuid_level >= 0x80000021U )
- cpuid(0x80000021U, &e21a, &tmp, &tmp, &tmp);
+ cpuid(0x80000021U, &e21a, &tmp, &e21c, &tmp);
if ( cpu_has_arch_caps )
rdmsrl(MSR_ARCH_CAPABILITIES, caps);
@@ -515,7 +515,7 @@ static void __init print_details(enum ind_thunk thunk)
* Hardware read-only information, stating immunity to certain issues, or
* suggestions of which mitigation to use.
*/
- printk(" Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+ printk(" Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
(caps & ARCH_CAPS_RDCL_NO) ? " RDCL_NO" : "",
(caps & ARCH_CAPS_EIBRS) ? " EIBRS" : "",
(caps & ARCH_CAPS_RSBA) ? " RSBA" : "",
@@ -540,10 +540,12 @@ static void __init print_details(enum ind_thunk thunk)
(e8b & cpufeat_mask(X86_FEATURE_IBPB_RET)) ? " IBPB_RET" : "",
(e21a & cpufeat_mask(X86_FEATURE_IBPB_BRTYPE)) ? " IBPB_BRTYPE" : "",
(e21a & cpufeat_mask(X86_FEATURE_SRSO_NO)) ? " SRSO_NO" : "",
- (e21a & cpufeat_mask(X86_FEATURE_SRSO_US_NO)) ? " SRSO_US_NO" : "");
+ (e21a & cpufeat_mask(X86_FEATURE_SRSO_US_NO)) ? " SRSO_US_NO" : "",
+ (e21c & cpufeat_mask(X86_FEATURE_TSA_SQ_NO)) ? " TSA_SQ_NO" : "",
+ (e21c & cpufeat_mask(X86_FEATURE_TSA_L1_NO)) ? " TSA_L1_NO" : "");
/* Hardware features which need driving to mitigate issues. */
- printk(" Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+ printk(" Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
(e8b & cpufeat_mask(X86_FEATURE_IBPB)) ||
(_7d0 & cpufeat_mask(X86_FEATURE_IBRSB)) ? " IBPB" : "",
(e8b & cpufeat_mask(X86_FEATURE_IBRS)) ||
@@ -563,7 +565,8 @@ static void __init print_details(enum ind_thunk thunk)
(caps & ARCH_CAPS_GDS_CTRL) ? " GDS_CTRL" : "",
(caps & ARCH_CAPS_RFDS_CLEAR) ? " RFDS_CLEAR" : "",
(e21a & cpufeat_mask(X86_FEATURE_SBPB)) ? " SBPB" : "",
- (e21a & cpufeat_mask(X86_FEATURE_SRSO_MSR_FIX)) ? " SRSO_MSR_FIX" : "");
+ (e21a & cpufeat_mask(X86_FEATURE_SRSO_MSR_FIX)) ? " SRSO_MSR_FIX" : "",
+ (e21a & cpufeat_mask(X86_FEATURE_VERW_CLEAR)) ? " VERW_CLEAR" : "");
/* Compiled-in support which pertains to mitigations. */
if ( IS_ENABLED(CONFIG_INDIRECT_THUNK) || IS_ENABLED(CONFIG_SHADOW_PAGING) ||
@@ -1524,6 +1527,77 @@ static void __init rfds_calculations(void)
setup_force_cpu_cap(X86_FEATURE_RFDS_NO);
}
+/*
+ * Transient Scheduler Attacks
+ *
+ * https://www.amd.com/content/dam/amd/en/documents/resources/bulletin/technical-guidance-for-mitigating-transient-scheduler-attacks.pdf
+ */
+static void __init tsa_calculations(void)
+{
+ unsigned int curr_rev, min_rev;
+
+ /* TSA is only known to affect AMD processors at this time. */
+ if ( boot_cpu_data.x86_vendor != X86_VENDOR_AMD )
+ return;
+
+ /* If we're virtualised, don't attempt to synthesise anything. */
+ if ( cpu_has_hypervisor )
+ return;
+
+ /*
+ * According to the whitepaper, some Fam1A CPUs (Models 0x00...0x4f,
+ * 0x60...0x7f) are not vulnerable but don't enumerate TSA_{SQ,L1}_NO. If
+ * we see either enumerated, assume both are correct ...
+ */
+ if ( cpu_has_tsa_sq_no || cpu_has_tsa_l1_no )
+ return;
+
+ /*
+ * ... otherwise, synthesise them. CPUs other than Fam19 (Zen3/4) are
+ * stated to be not vulnerable.
+ */
+ if ( boot_cpu_data.x86 != 0x19 )
+ {
+ setup_force_cpu_cap(X86_FEATURE_TSA_SQ_NO);
+ setup_force_cpu_cap(X86_FEATURE_TSA_L1_NO);
+ return;
+ }
+
+ /*
+ * Fam19 CPUs get VERW_CLEAR with new enough microcode, but must
+ * synthesise the CPUID bit.
+ */
+ curr_rev = this_cpu(cpu_sig).rev;
+ switch ( curr_rev >> 8 )
+ {
+ case 0x0a0011: min_rev = 0x0a0011d7; break;
+ case 0x0a0012: min_rev = 0x0a00123b; break;
+ case 0x0a0082: min_rev = 0x0a00820d; break;
+ case 0x0a1011: min_rev = 0x0a10114c; break;
+ case 0x0a1012: min_rev = 0x0a10124c; break;
+ case 0x0a1081: min_rev = 0x0a108109; break;
+ case 0x0a2010: min_rev = 0x0a20102e; break;
+ case 0x0a2012: min_rev = 0x0a201211; break;
+ case 0x0a4041: min_rev = 0x0a404108; break;
+ case 0x0a5000: min_rev = 0x0a500012; break;
+ case 0x0a6012: min_rev = 0x0a60120a; break;
+ case 0x0a7041: min_rev = 0x0a704108; break;
+ case 0x0a7052: min_rev = 0x0a705208; break;
+ case 0x0a7080: min_rev = 0x0a708008; break;
+ case 0x0a70c0: min_rev = 0x0a70c008; break;
+ case 0x0aa002: min_rev = 0x0aa00216; break;
+ default:
+ printk(XENLOG_WARNING
+ "Unrecognised CPU %02x-%02x-%02x, ucode 0x%08x for TSA mitigation\n",
+ boot_cpu_data.x86, boot_cpu_data.x86_model,
+ boot_cpu_data.x86_mask, curr_rev);
+ return;
+ }
+
+ if ( curr_rev >= min_rev )
+ setup_force_cpu_cap(X86_FEATURE_VERW_CLEAR);
+}
+
static bool __init cpu_has_gds(void)
{
/*
@@ -2221,6 +2295,7 @@ void __init init_speculation_mitigations(void)
* https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-analysis-microarchitectural-data-sampling.html
* https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
* https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/register-file-data-sampling.html
+ * https://www.amd.com/content/dam/amd/en/documents/resources/bulletin/technical-guidance-for-mitigating-transient-scheduler-attacks.pdf
*
* Relevant ucodes:
*
@@ -2253,9 +2328,18 @@ void __init init_speculation_mitigations(void)
*
* - March 2023, for RFDS. Enumerate RFDS_CLEAR to mean that VERW now
* scrubs non-architectural entries from certain register files.
+ *
+ * - July 2025, for TSA. Introduces VERW side effects to mitigate
+ * TSA_{SQ/L1}. Xen must synthesise the VERW_CLEAR feature based on
+ * microcode version.
+ *
+ * Note, these microcode updates were produced before the remediation of
+ * the microcode signature issues, and are included in the firmware
+ * updates fixing the entrysign vulnerability from ~December 2024.
*/
mds_calculations();
rfds_calculations();
+ tsa_calculations();
/*
* Parts which enumerate FB_CLEAR are those with now-updated microcode
@@ -2287,21 +2371,27 @@ void __init init_speculation_mitigations(void)
* MLPDS/MFBDS when SMT is enabled.
*/
if ( opt_verw_pv == -1 )
- opt_verw_pv = cpu_has_useful_md_clear || cpu_has_rfds_clear;
+ opt_verw_pv = (cpu_has_useful_md_clear || cpu_has_rfds_clear ||
+ cpu_has_verw_clear);
if ( opt_verw_hvm == -1 )
- opt_verw_hvm = cpu_has_useful_md_clear || cpu_has_rfds_clear;
+ opt_verw_hvm = (cpu_has_useful_md_clear || cpu_has_rfds_clear ||
+ cpu_has_verw_clear);
/*
- * If SMT is active, and we're protecting against MDS or MMIO stale data,
+ * If SMT is active, and we're protecting against any of:
+ * - MSBDS
+ * - MMIO stale data
+ * - TSA-SQ
* we need to scrub before going idle as well as on return to guest.
* Various pipeline resources are repartitioned amongst non-idle threads.
*
- * We don't need to scrub on idle for RFDS. There are no affected cores
- * which support SMT, despite there being affected cores in hybrid systems
- * which have SMT elsewhere in the platform.
+ * We don't need to scrub on idle for:
+ * - RFDS (no SMT affected cores)
+ * - TSA-L1 (utags never shared between threads)
*/
if ( ((cpu_has_useful_md_clear && (opt_verw_pv || opt_verw_hvm)) ||
+ (cpu_has_verw_clear && !cpu_has_tsa_sq_no) ||
opt_verw_mmio) && hw_smt_enabled )
setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index cb926782a8f7..4f94342ad633 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -315,6 +315,7 @@ XEN_CPUFEATURE(AVX_IFMA, 10*32+23) /*A AVX-IFMA Instructions */
XEN_CPUFEATURE(NO_NEST_BP, 11*32+ 0) /*A No Nested Data Breakpoints */
XEN_CPUFEATURE(FS_GS_NS, 11*32+ 1) /*S| FS/GS base MSRs non-serialising */
XEN_CPUFEATURE(LFENCE_DISPATCH, 11*32+ 2) /*A LFENCE always serializing */
+XEN_CPUFEATURE(VERW_CLEAR, 11*32+ 5) /*!A| VERW clears microarchitectural buffers */
XEN_CPUFEATURE(NSCB, 11*32+ 6) /*A Null Selector Clears Base (and limit too) */
XEN_CPUFEATURE(AUTO_IBRS, 11*32+ 8) /*S Automatic IBRS */
XEN_CPUFEATURE(AMD_FSRS, 11*32+10) /*A Fast Short REP STOSB */
@@ -384,6 +385,8 @@ XEN_CPUFEATURE(PB_OPT_CTRL, 16*32+32) /* MSR_PB_OPT_CTRL.IBPB_ALT */
XEN_CPUFEATURE(ITS_NO, 16*32+62) /*!A No Indirect Target Selection */
/* AMD-defined CPU features, CPUID level 0x80000021.ecx, word 18 */
+XEN_CPUFEATURE(TSA_SQ_NO, 18*32+ 1) /*A No Store Queue Transitive Scheduler Attacks */
+XEN_CPUFEATURE(TSA_L1_NO, 18*32+ 2) /*A No L1D Transitive Scheduler Attacks */
#endif /* XEN_CPUFEATURE */
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Wed, 2 Apr 2025 03:18:59 +0100
Subject: x86/idle: Rearrange VERW and MONITOR in mwait_idle_with_hints()
In order to mitigate TSA, Xen will need to issue VERW before going idle.
On AMD CPUs, the VERW scrubbing side effects cancel an active MONITOR, causing
the MWAIT to exit without entering an idle state. Therefore the VERW must be
ahead of MONITOR.
Split spec_ctrl_enter_idle() in two and allow the VERW aspect to be handled
separately. While adjusting, update a stale comment concerning MSBDS; more
issues have been mitigated using VERW since it was written.
By moving VERW earlier, it is ahead of the determination of whether to go
idle. We can't move the check on softirq_pending (for correctness reasons),
but we can duplicate it earlier as a best effort attempt to skip the
speculative overhead.
This is part of XSA-471 / CVE-2024-36350 / CVE-2024-36357.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index e50a9234a0d4..423df3d316ad 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -450,9 +450,18 @@ __initcall(cpu_idle_key_init);
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
+ struct cpu_info *info = get_cpu_info();
irq_cpustat_t *stat = &irq_stat[cpu];
const unsigned int *this_softirq_pending = &stat->__softirq_pending;
+ /*
+ * Heuristic: if we're definitely not going to idle, bail early as the
+ * speculative safety can be expensive. This is a performance
+ * consideration not a correctness issue.
+ */
+ if ( *this_softirq_pending )
+ return;
+
/*
* By setting in_mwait, we promise to other CPUs that we'll notice changes
* to __softirq_pending without being sent an IPI. We achieve this by
@@ -466,15 +475,19 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
"", X86_BUG_MONITOR,
[in_mwait] "=m" (stat->in_mwait));
+ /*
+ * On AMD systems, side effects from VERW cancel MONITOR, causing MWAIT to
+ * wake up immediately. Therefore, VERW must come ahead of MONITOR.
+ */
+ __spec_ctrl_enter_idle_verw(info);
+
monitor(this_softirq_pending, 0, 0);
ASSERT(!local_irq_is_enabled());
if ( !*this_softirq_pending )
{
- struct cpu_info *info = get_cpu_info();
-
- spec_ctrl_enter_idle(info);
+ __spec_ctrl_enter_idle(info, false /* VERW handled above */);
if ( ecx & MWAIT_ECX_INTERRUPT_BREAK )
mwait(eax, ecx);
diff --git a/xen/arch/x86/include/asm/spec_ctrl.h b/xen/arch/x86/include/asm/spec_ctrl.h
index 077225418956..6724d3812029 100644
--- a/xen/arch/x86/include/asm/spec_ctrl.h
+++ b/xen/arch/x86/include/asm/spec_ctrl.h
@@ -115,8 +115,22 @@ static inline void init_shadow_spec_ctrl_state(void)
info->verw_sel = __HYPERVISOR_DS32;
}
+static always_inline void __spec_ctrl_enter_idle_verw(struct cpu_info *info)
+{
+ /*
+ * Flush/scrub structures which are statically partitioned between active
+ * threads. Otherwise data of ours (of unknown sensitivity) will become
+ * available to our sibling when we go idle.
+ *
+ * Note: VERW must be encoded with a memory operand, as it is only that
+ * form with side effects.
+ */
+ alternative_input("", "verw %[sel]", X86_FEATURE_SC_VERW_IDLE,
+ [sel] "m" (info->verw_sel));
+}
+
/* WARNING! `ret`, `call *`, `jmp *` not safe after this call. */
-static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
+static always_inline void __spec_ctrl_enter_idle(struct cpu_info *info, bool verw)
{
uint32_t val = 0;
@@ -135,21 +149,8 @@ static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
"a" (val), "c" (MSR_SPEC_CTRL), "d" (0));
barrier();
- /*
- * Microarchitectural Store Buffer Data Sampling:
- *
- * On vulnerable systems, store buffer entries are statically partitioned
- * between active threads. When entering idle, our store buffer entries
- * are re-partitioned to allow the other threads to use them.
- *
- * Flush the buffers to ensure that no sensitive data of ours can be
- * leaked by a sibling after it gets our store buffer entries.
- *
- * Note: VERW must be encoded with a memory operand, as it is only that
- * form which causes a flush.
- */
- alternative_input("", "verw %[sel]", X86_FEATURE_SC_VERW_IDLE,
- [sel] "m" (info->verw_sel));
+ if ( verw ) /* Expected to be const-propagated. */
+ __spec_ctrl_enter_idle_verw(info);
/*
* Cross-Thread Return Address Predictions:
@@ -167,6 +168,12 @@ static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
: "rax", "rcx");
}
+/* WARNING! `ret`, `call *`, `jmp *` not safe after this call. */
+static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
+{
+ __spec_ctrl_enter_idle(info, true /* VERW */);
+}
+
/* WARNING! `ret`, `call *`, `jmp *` not safe before this call. */
static always_inline void spec_ctrl_exit_idle(struct cpu_info *info)
{
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Thu, 29 Aug 2024 17:36:11 +0100
Subject: x86/spec-ctrl: Mitigate Transitive Scheduler Attacks
TSA affects AMD Fam19h CPUs (Zen3 and 4 microarchitectures).
Three new CPUID bits have been defined. Two (TSA_SQ_NO and TSA_L1_NO)
indicate that the system is unaffected, and must be synthesised by Xen on
unaffected parts to date.
A third new bit indicates that VERW now has a flushing side effect. Xen
must synthesise this bit on affected systems based on microcode version.
As with other VERW-based flushing features, VERW_CLEAR needs OR-ing across
a resource pool, and guests which have seen it can safely migrate in.
This is part of XSA-471 / CVE-2024-36350 / CVE-2024-36357.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 9d1ff6268d79..3e628e008e92 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -480,6 +480,17 @@ static void __init guest_common_max_feature_adjustments(uint32_t *fs)
if ( test_bit(X86_FEATURE_RTM, fs) )
__set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
break;
+
+ case X86_VENDOR_AMD:
+ /*
+ * This bit indicates that the VERW instruction may have gained
+ * scrubbing side effects. With pooling, it means "you might migrate
+ * somewhere where scrubbing is necessary", and may need exposing on
+ * unaffected hardware. This is fine, because the VERW instruction
+ * has been around since the 286.
+ */
+ __set_bit(X86_FEATURE_VERW_CLEAR, fs);
+ break;
}
/*
@@ -558,6 +569,17 @@ static void __init guest_common_default_feature_adjustments(uint32_t *fs)
__set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, fs);
}
break;
+
+ case X86_VENDOR_AMD:
+ /*
+ * This bit indicates that the VERW instruction may have gained
+ * scrubbing side effects. The max policy has it set for migration
+ * reasons, so reset the default policy back to the host value in case
+ * we're unaffected.
+ */
+ if ( !cpu_has_verw_clear )
+ __clear_bit(X86_FEATURE_VERW_CLEAR, fs);
+ break;
}
/*
diff --git a/xen/arch/x86/hvm/svm/entry.S b/xen/arch/x86/hvm/svm/entry.S
index 91edb3345938..610c64bf4c97 100644
--- a/xen/arch/x86/hvm/svm/entry.S
+++ b/xen/arch/x86/hvm/svm/entry.S
@@ -99,6 +99,8 @@ __UNLIKELY_END(nsvm_hap)
pop %rsi
pop %rdi
+ SPEC_CTRL_COND_VERW /* Req: %rsp=eframe Clob: efl */
+
vmrun
SAVE_ALL
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index 6c5f5ce0cfc5..3c2ac964e410 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -195,6 +195,7 @@ static inline bool boot_cpu_has(unsigned int feat)
/* CPUID level 0x80000021.eax */
#define cpu_has_lfence_dispatch boot_cpu_has(X86_FEATURE_LFENCE_DISPATCH)
+#define cpu_has_verw_clear boot_cpu_has(X86_FEATURE_VERW_CLEAR)
#define cpu_has_nscb boot_cpu_has(X86_FEATURE_NSCB)
/* CPUID level 0x00000007:1.edx */
@@ -222,6 +223,10 @@ static inline bool boot_cpu_has(unsigned int feat)
#define cpu_has_pb_opt_ctrl boot_cpu_has(X86_FEATURE_PB_OPT_CTRL)
#define cpu_has_its_no boot_cpu_has(X86_FEATURE_ITS_NO)
+/* CPUID level 0x80000021.ecx */
+#define cpu_has_tsa_sq_no boot_cpu_has(X86_FEATURE_TSA_SQ_NO)
+#define cpu_has_tsa_l1_no boot_cpu_has(X86_FEATURE_TSA_L1_NO)
+
/* Synthesized. */
#define cpu_has_arch_perfmon boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
#define cpu_has_cpuid_faulting boot_cpu_has(X86_FEATURE_CPUID_FAULTING)
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 3027f1db6b70..bcdae1ed2377 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -494,7 +494,7 @@ custom_param("pv-l1tf", parse_pv_l1tf);
static void __init print_details(enum ind_thunk thunk)
{
- unsigned int _7d0 = 0, _7d2 = 0, e8b = 0, e21a = 0, max = 0, tmp;
+ unsigned int _7d0 = 0, _7d2 = 0, e8b = 0, e21a = 0, e21c = 0, max = 0, tmp;
uint64_t caps = 0;
/* Collect diagnostics about available mitigations. */
@@ -505,7 +505,7 @@ static void __init print_details(enum ind_thunk thunk)
if ( boot_cpu_data.extended_cpuid_level >= 0x80000008U )
cpuid(0x80000008U, &tmp, &e8b, &tmp, &tmp);
if ( boot_cpu_data.extended_cpuid_level >= 0x80000021U )
- cpuid(0x80000021U, &e21a, &tmp, &tmp, &tmp);
+ cpuid(0x80000021U, &e21a, &tmp, &e21c, &tmp);
if ( cpu_has_arch_caps )
rdmsrl(MSR_ARCH_CAPABILITIES, caps);
@@ -515,7 +515,7 @@ static void __init print_details(enum ind_thunk thunk)
* Hardware read-only information, stating immunity to certain issues, or
* suggestions of which mitigation to use.
*/
- printk(" Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+ printk(" Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
(caps & ARCH_CAPS_RDCL_NO) ? " RDCL_NO" : "",
(caps & ARCH_CAPS_EIBRS) ? " EIBRS" : "",
(caps & ARCH_CAPS_RSBA) ? " RSBA" : "",
@@ -540,10 +540,12 @@ static void __init print_details(enum ind_thunk thunk)
(e8b & cpufeat_mask(X86_FEATURE_IBPB_RET)) ? " IBPB_RET" : "",
(e21a & cpufeat_mask(X86_FEATURE_IBPB_BRTYPE)) ? " IBPB_BRTYPE" : "",
(e21a & cpufeat_mask(X86_FEATURE_SRSO_NO)) ? " SRSO_NO" : "",
- (e21a & cpufeat_mask(X86_FEATURE_SRSO_US_NO)) ? " SRSO_US_NO" : "");
+ (e21a & cpufeat_mask(X86_FEATURE_SRSO_US_NO)) ? " SRSO_US_NO" : "",
+ (e21c & cpufeat_mask(X86_FEATURE_TSA_SQ_NO)) ? " TSA_SQ_NO" : "",
+ (e21c & cpufeat_mask(X86_FEATURE_TSA_L1_NO)) ? " TSA_L1_NO" : "");
/* Hardware features which need driving to mitigate issues. */
- printk(" Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+ printk(" Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
(e8b & cpufeat_mask(X86_FEATURE_IBPB)) ||
(_7d0 & cpufeat_mask(X86_FEATURE_IBRSB)) ? " IBPB" : "",
(e8b & cpufeat_mask(X86_FEATURE_IBRS)) ||
@@ -563,7 +565,8 @@ static void __init print_details(enum ind_thunk thunk)
(caps & ARCH_CAPS_GDS_CTRL) ? " GDS_CTRL" : "",
(caps & ARCH_CAPS_RFDS_CLEAR) ? " RFDS_CLEAR" : "",
(e21a & cpufeat_mask(X86_FEATURE_SBPB)) ? " SBPB" : "",
- (e21a & cpufeat_mask(X86_FEATURE_SRSO_MSR_FIX)) ? " SRSO_MSR_FIX" : "");
+ (e21a & cpufeat_mask(X86_FEATURE_SRSO_MSR_FIX)) ? " SRSO_MSR_FIX" : "",
+ (e21a & cpufeat_mask(X86_FEATURE_VERW_CLEAR)) ? " VERW_CLEAR" : "");
/* Compiled-in support which pertains to mitigations. */
if ( IS_ENABLED(CONFIG_INDIRECT_THUNK) || IS_ENABLED(CONFIG_SHADOW_PAGING) ||
@@ -1524,6 +1527,77 @@ static void __init rfds_calculations(void)
setup_force_cpu_cap(X86_FEATURE_RFDS_NO);
}
+/*
+ * Transient Scheduler Attacks
+ *
+ * https://www.amd.com/content/dam/amd/en/documents/resources/bulletin/technical-guidance-for-mitigating-transient-scheduler-attacks.pdf
+ */
+static void __init tsa_calculations(void)
+{
+ unsigned int curr_rev, min_rev;
+
+ /* TSA is only known to affect AMD processors at this time. */
+ if ( boot_cpu_data.x86_vendor != X86_VENDOR_AMD )
+ return;
+
+ /* If we're virtualised, don't attempt to synthesise anything. */
+ if ( cpu_has_hypervisor )
+ return;
+
+ /*
+ * According to the whitepaper, some Fam1A CPUs (Models 0x00...0x4f,
+ * 0x60...0x7f) are not vulnerable but don't enumerate TSA_{SQ,L1}_NO. If
+ * we see either enumerated, assume both are correct ...
+ */
+ if ( cpu_has_tsa_sq_no || cpu_has_tsa_l1_no )
+ return;
+
+ /*
+ * ... otherwise, synthesise them. CPUs other than Fam19 (Zen3/4) are
+ * stated to be not vulnerable.
+ */
+ if ( boot_cpu_data.x86 != 0x19 )
+ {
+ setup_force_cpu_cap(X86_FEATURE_TSA_SQ_NO);
+ setup_force_cpu_cap(X86_FEATURE_TSA_L1_NO);
+ return;
+ }
+
+ /*
+ * Fam19 CPUs get VERW_CLEAR with new enough microcode, but must
+ * synthesise the CPUID bit.
+ */
+ curr_rev = this_cpu(cpu_sig).rev;
+ switch ( curr_rev >> 8 )
+ {
+ case 0x0a0011: min_rev = 0x0a0011d7; break;
+ case 0x0a0012: min_rev = 0x0a00123b; break;
+ case 0x0a0082: min_rev = 0x0a00820d; break;
+ case 0x0a1011: min_rev = 0x0a10114c; break;
+ case 0x0a1012: min_rev = 0x0a10124c; break;
+ case 0x0a1081: min_rev = 0x0a108109; break;
+ case 0x0a2010: min_rev = 0x0a20102e; break;
+ case 0x0a2012: min_rev = 0x0a201211; break;
+ case 0x0a4041: min_rev = 0x0a404108; break;
+ case 0x0a5000: min_rev = 0x0a500012; break;
+ case 0x0a6012: min_rev = 0x0a60120a; break;
+ case 0x0a7041: min_rev = 0x0a704108; break;
+ case 0x0a7052: min_rev = 0x0a705208; break;
+ case 0x0a7080: min_rev = 0x0a708008; break;
+ case 0x0a70c0: min_rev = 0x0a70c008; break;
+ case 0x0aa002: min_rev = 0x0aa00216; break;
+ default:
+ printk(XENLOG_WARNING
+ "Unrecognised CPU %02x-%02x-%02x, ucode 0x%08x for TSA mitigation\n",
+ boot_cpu_data.x86, boot_cpu_data.x86_model,
+ boot_cpu_data.x86_mask, curr_rev);
+ return;
+ }
+
+ if ( curr_rev >= min_rev )
+ setup_force_cpu_cap(X86_FEATURE_VERW_CLEAR);
+}
+
static bool __init cpu_has_gds(void)
{
/*
@@ -2221,6 +2295,7 @@ void __init init_speculation_mitigations(void)
* https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-analysis-microarchitectural-data-sampling.html
* https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
* https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/register-file-data-sampling.html
+ * https://www.amd.com/content/dam/amd/en/documents/resources/bulletin/technical-guidance-for-mitigating-transient-scheduler-attacks.pdf
*
* Relevant ucodes:
*
@@ -2253,9 +2328,18 @@ void __init init_speculation_mitigations(void)
*
* - March 2023, for RFDS. Enumerate RFDS_CLEAR to mean that VERW now
* scrubs non-architectural entries from certain register files.
+ *
+ * - July 2025, for TSA. Introduces VERW side effects to mitigate
+ * TSA_{SQ/L1}. Xen must synthesise the VERW_CLEAR feature based on
+ * microcode version.
+ *
+ * Note, these microcode updates were produced before the remediation of
+ * the microcode signature issues, and are included in the firmware
+ * updates fixing the entrysign vulnerability from ~December 2024.
*/
mds_calculations();
rfds_calculations();
+ tsa_calculations();
/*
* Parts which enumerate FB_CLEAR are those with now-updated microcode
@@ -2287,21 +2371,27 @@ void __init init_speculation_mitigations(void)
* MLPDS/MFBDS when SMT is enabled.
*/
if ( opt_verw_pv == -1 )
- opt_verw_pv = cpu_has_useful_md_clear || cpu_has_rfds_clear;
+ opt_verw_pv = (cpu_has_useful_md_clear || cpu_has_rfds_clear ||
+ cpu_has_verw_clear);
if ( opt_verw_hvm == -1 )
- opt_verw_hvm = cpu_has_useful_md_clear || cpu_has_rfds_clear;
+ opt_verw_hvm = (cpu_has_useful_md_clear || cpu_has_rfds_clear ||
+ cpu_has_verw_clear);
/*
- * If SMT is active, and we're protecting against MDS or MMIO stale data,
+ * If SMT is active, and we're protecting against any of:
+ * - MSBDS
+ * - MMIO stale data
+ * - TSA-SQ
* we need to scrub before going idle as well as on return to guest.
* Various pipeline resources are repartitioned amongst non-idle threads.
*
- * We don't need to scrub on idle for RFDS. There are no affected cores
- * which support SMT, despite there being affected cores in hybrid systems
- * which have SMT elsewhere in the platform.
+ * We don't need to scrub on idle for:
+ * - RFDS (no SMT affected cores)
+ * - TSA-L1 (utags never shared between threads)
*/
if ( ((cpu_has_useful_md_clear && (opt_verw_pv || opt_verw_hvm)) ||
+ (cpu_has_verw_clear && !cpu_has_tsa_sq_no) ||
opt_verw_mmio) && hw_smt_enabled )
setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index 480d5f58ce09..f7312e0b04e7 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -321,6 +321,7 @@ XEN_CPUFEATURE(NO_INVD, 10*32+30) /* INVD instruction unusable */
XEN_CPUFEATURE(NO_NEST_BP, 11*32+ 0) /*A No Nested Data Breakpoints */
XEN_CPUFEATURE(FS_GS_NS, 11*32+ 1) /*S| FS/GS base MSRs non-serialising */
XEN_CPUFEATURE(LFENCE_DISPATCH, 11*32+ 2) /*A LFENCE always serializing */
+XEN_CPUFEATURE(VERW_CLEAR, 11*32+ 5) /*!A| VERW clears microarchitectural buffers */
XEN_CPUFEATURE(NSCB, 11*32+ 6) /*A Null Selector Clears Base (and limit too) */
XEN_CPUFEATURE(AUTO_IBRS, 11*32+ 8) /*S Automatic IBRS */
XEN_CPUFEATURE(AMD_FSRS, 11*32+10) /*A Fast Short REP STOSB */
@@ -396,6 +397,8 @@ XEN_CPUFEATURE(PB_OPT_CTRL, 16*32+32) /* MSR_PB_OPT_CTRL.IBPB_ALT */
XEN_CPUFEATURE(ITS_NO, 16*32+62) /*!A No Indirect Target Selection */
/* AMD-defined CPU features, CPUID level 0x80000021.ecx, word 18 */
+XEN_CPUFEATURE(TSA_SQ_NO, 18*32+ 1) /*A No Store Queue Transitive Scheduler Attacks */
+XEN_CPUFEATURE(TSA_L1_NO, 18*32+ 2) /*A No L1D Transitive Scheduler Attacks */
#endif /* XEN_CPUFEATURE */