[PATCH] x86/microcode: Prevent attempting updates known to fail

Alejandro Vallejo posted 1 patch 11 months ago
Patches applied successfully (tree, apply log)
git fetch https://gitlab.com/xen-project/patchew/xen tags/patchew/20230531175119.10830-1-alejandro.vallejo@cloud.com
xen/arch/x86/cpu/microcode/core.c     | 27 +++++++++++++++++++++++++++
xen/arch/x86/include/asm/cpufeature.h |  1 +
xen/arch/x86/include/asm/msr-index.h  |  5 +++++
3 files changed, 33 insertions(+)
[PATCH] x86/microcode: Prevent attempting updates known to fail
Posted by Alejandro Vallejo 11 months ago
If IA32_MSR_MCU_CONTROL exists, then it's possible a CPU may be unable to
perform microcode updates. This is controlled through the DIS_MCU_LOAD bit.

This patch checks that the CPU that got the request is capable of doing an
update. If it is, then we let the procedure go through. While not enough
for the general case (different CPUs with different settings), this patch
copes with the far more common scenario of all CPUs being locked.

Note that for the uncommon general case, we already have some logic in
place to emit a message on xl-dmseg in order to notify the admin that they
should reboot the machine ASAP.

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
---
 xen/arch/x86/cpu/microcode/core.c     | 27 +++++++++++++++++++++++++++
 xen/arch/x86/include/asm/cpufeature.h |  1 +
 xen/arch/x86/include/asm/msr-index.h  |  5 +++++
 3 files changed, 33 insertions(+)

diff --git a/xen/arch/x86/cpu/microcode/core.c b/xen/arch/x86/cpu/microcode/core.c
index cd456c476f..e507945932 100644
--- a/xen/arch/x86/cpu/microcode/core.c
+++ b/xen/arch/x86/cpu/microcode/core.c
@@ -697,6 +697,17 @@ static long cf_check microcode_update_helper(void *data)
     return ret;
 }
 
+static bool this_cpu_can_install_update(void)
+{
+    uint64_t mcu_ctrl;
+
+    if ( !cpu_has_mcu_ctrl )
+        return true;
+
+    rdmsrl(MSR_MCU_CONTROL, mcu_ctrl);
+    return !(mcu_ctrl & MCU_CONTROL_DIS_MCU_LOAD);
+}
+
 int microcode_update(XEN_GUEST_HANDLE(const_void) buf, unsigned long len)
 {
     int ret;
@@ -708,6 +719,22 @@ int microcode_update(XEN_GUEST_HANDLE(const_void) buf, unsigned long len)
     if ( !ucode_ops.apply_microcode )
         return -EINVAL;
 
+    if ( !this_cpu_can_install_update() )
+    {
+        /*
+         * This CPU can't install microcode, so it makes no sense to try to
+         * go on. We're implicitly trusting firmware sanity in that all
+         * CPUs are expected to have a homogeneous setting. If, for some
+         * reason, another CPU happens to be locked down when this one
+         * isn't then unpleasantness will follow. In particular, some CPUs
+         * will be updated while others will not. A very stern message will
+         * be displayed in xl-dmesg that case, strongly advising to reboot the
+         * machine.
+         */
+        printk("WARNING: microcode not installed due to DIS_MCU_LOAD=1");
+        return -EACCES;
+    }
+
     buffer = xmalloc_flex_struct(struct ucode_buf, buffer, len);
     if ( !buffer )
         return -ENOMEM;
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index ace31e3b1f..0118171d7e 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -192,6 +192,7 @@ static inline bool boot_cpu_has(unsigned int feat)
 #define cpu_has_if_pschange_mc_no boot_cpu_has(X86_FEATURE_IF_PSCHANGE_MC_NO)
 #define cpu_has_tsx_ctrl        boot_cpu_has(X86_FEATURE_TSX_CTRL)
 #define cpu_has_taa_no          boot_cpu_has(X86_FEATURE_TAA_NO)
+#define cpu_has_mcu_ctrl        boot_cpu_has(X86_FEATURE_MCU_CTRL)
 #define cpu_has_fb_clear        boot_cpu_has(X86_FEATURE_FB_CLEAR)
 
 /* Synthesized. */
diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index 2749e433d2..5c1350b5f9 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -165,6 +165,11 @@
 #define  PASID_PASID_MASK                   0x000fffff
 #define  PASID_VALID                        (_AC(1, ULL) << 31)
 
+#define MSR_MCU_CONTROL                     0x00001406
+#define  MCU_CONTROL_LOCK                   (_AC(1, ULL) <<  0)
+#define  MCU_CONTROL_DIS_MCU_LOAD           (_AC(1, ULL) <<  1)
+#define  MCU_CONTROL_EN_SMM_BYPASS          (_AC(1, ULL) <<  2)
+
 #define MSR_UARCH_MISC_CTRL                 0x00001b01
 #define  UARCH_CTRL_DOITM                   (_AC(1, ULL) <<  0)
 
-- 
2.34.1
Re: [PATCH] x86/microcode: Prevent attempting updates known to fail
Posted by Andrew Cooper 11 months ago
On 31/05/2023 6:51 pm, Alejandro Vallejo wrote:
> diff --git a/xen/arch/x86/cpu/microcode/core.c b/xen/arch/x86/cpu/microcode/core.c
> index cd456c476f..e507945932 100644
> --- a/xen/arch/x86/cpu/microcode/core.c
> +++ b/xen/arch/x86/cpu/microcode/core.c
> @@ -697,6 +697,17 @@ static long cf_check microcode_update_helper(void *data)
>      return ret;
>  }
>  
> +static bool this_cpu_can_install_update(void)
> +{
> +    uint64_t mcu_ctrl;
> +
> +    if ( !cpu_has_mcu_ctrl )
> +        return true;
> +
> +    rdmsrl(MSR_MCU_CONTROL, mcu_ctrl);
> +    return !(mcu_ctrl & MCU_CONTROL_DIS_MCU_LOAD);
> +}
> +
>  int microcode_update(XEN_GUEST_HANDLE(const_void) buf, unsigned long len)
>  {
>      int ret;
> @@ -708,6 +719,22 @@ int microcode_update(XEN_GUEST_HANDLE(const_void) buf, unsigned long len)
>      if ( !ucode_ops.apply_microcode )
>          return -EINVAL;
>  
> +    if ( !this_cpu_can_install_update() )
> +    {
> +        /*
> +         * This CPU can't install microcode, so it makes no sense to try to
> +         * go on. We're implicitly trusting firmware sanity in that all
> +         * CPUs are expected to have a homogeneous setting. If, for some
> +         * reason, another CPU happens to be locked down when this one
> +         * isn't then unpleasantness will follow. In particular, some CPUs
> +         * will be updated while others will not. A very stern message will
> +         * be displayed in xl-dmesg that case, strongly advising to reboot the
> +         * machine.
> +         */
> +        printk("WARNING: microcode not installed due to DIS_MCU_LOAD=1");
> +        return -EACCES;
> +    }

I had something else in mind here.  Right now, this will read
MSR_MCU_CONTROL and emit a printk() on every microcode load, which will
be every AP, and every time the user uses the xen-ucode tool.

Instead, I recommend the following:

1) One patch moving the early-cpuid/msr read from tsx_init() into
early_microcode_init(), adjusting the comment as it goes.  No point
duplicating that logic, and we need it earlier on boot now.
2) This patch, adjusting early_microcode_init() only.  Have a printk()
saying "microcode loading disabled by firmware" and avoid filling in
ucode_ops.  Every other part of ucode handling understands "loading not
available".

In terms of the commit message, you should call out the usecase
explicitly.  This feature is intended for baremetal clouds where the
platform owner doesn't trust the tenant to choose the microcode version
in use.

Also, I'm really not sure what your 3rd paragraph is trying to say.

~Andrew

Re: [PATCH] x86/microcode: Prevent attempting updates known to fail
Posted by Andrew Cooper 10 months, 4 weeks ago
On 01/06/2023 11:54 am, Andrew Cooper wrote:
> Instead, I recommend the following:
>
> 1) One patch moving the early-cpuid/msr read from tsx_init() into
> early_microcode_init(), adjusting the comment as it goes.  No point
> duplicating that logic, and we need it earlier on boot now.
> 2) This patch, adjusting early_microcode_init() only.  Have a printk()
> saying "microcode loading disabled by firmware" and avoid filling in
> ucode_ops.  Every other part of ucode handling understands "loading not
> available".

So, having fallen over "x86/ucode: Exit early from early_update_cache()
if loading not available" for other reasons, I've realised that this
isn't a completely sensible suggestion.

By not filling in ucode_ops, nothing ever calls collect_info(), meaning
that external components which peek at this_cpu(cpu_sig).rev get 0's
back in place of the actual microcode revision.  That's probably the
best we can do for genuinely no ucode facilities available.

But there's another case we ought to cope with.  Some hypervisors
deliberately report a microcode revision of ~0, and we should take to
mean "no microcode loading available" too.

For this MCU_CONTROL_DIS_MCU_LOAD case, we don't want to be trying to
load new microcode because that's a waste of time, but we absolutely
should query the current microcode revision.  It is frequently relevant
for security reasons.

So I think we want to fine-grain things a little, and separate the
concepts of "ucode info available" and "ucode loading available".  Per
the current mechanism, that would involve supporting a case where
ucode_ops.collect_cpu_info() is available but
ucode_ops.apply_microcode() is not.

~Andrew

P.S. also in our copious free time, we need to start supporting the
Intel min_rev field, which is more complicated than it sounds.

min_rev is vaguely defined as being relevant to block updates "after
you've evaluated CPUID and made decisions based on it", but here in Xen
we do also do livepatching and late loading to explicitly make use of
newly enumerated features.

So we need a way of xen-ucode saying "please really do load this,
because I as the admin think it will be fine in combination with the
livepatch I'm about to apply".

My best idea for this is to have a `--force` option to pass to Xen to
skip the revision checks, which will require either a new hypercall, or
perhaps borrowing a high bit from the size field in the current hypercall.

With a force option in place, the boot time ucode=allow-same can go
away.  It has become distinctly less useful now that we were forced do
this unilaterally on AMD CPUs, and separating "allow same because of HW
bugs" from "the Admin promised they knew what they were doing" would be
better for testing.

Re: [PATCH] x86/microcode: Prevent attempting updates known to fail
Posted by Alejandro Vallejo 10 months, 3 weeks ago
On Fri, Jun 02, 2023 at 09:35:56PM +0100, Andrew Cooper wrote:
> For this MCU_CONTROL_DIS_MCU_LOAD case, we don't want to be trying to
> load new microcode because that's a waste of time, but we absolutely
> should query the current microcode revision.  It is frequently relevant
> for security reasons.
> 
> So I think we want to fine-grain things a little, and separate the
> concepts of "ucode info available" and "ucode loading available".  Per
> the current mechanism, that would involve supporting a case where
> ucode_ops.collect_cpu_info() is available but
> ucode_ops.apply_microcode() is not.
I was going after something to that effect, yes.

> 
> ~Andrew
> 
> P.S. also in our copious free time, we need to start supporting the
> Intel min_rev field, which is more complicated than it sounds.
> 
> min_rev is vaguely defined as being relevant to block updates "after
> you've evaluated CPUID and made decisions based on it", but here in Xen
> we do also do livepatching and late loading to explicitly make use of
> newly enumerated features.
> 
> So we need a way of xen-ucode saying "please really do load this,
> because I as the admin think it will be fine in combination with the
> livepatch I'm about to apply".
> 
> My best idea for this is to have a `--force` option to pass to Xen to
> skip the revision checks, which will require either a new hypercall, or
> perhaps borrowing a high bit from the size field in the current hypercall.
> 
> With a force option in place, the boot time ucode=allow-same can go
> away.  It has become distinctly less useful now that we were forced do
> this unilaterally on AMD CPUs, and separating "allow same because of HW
> bugs" from "the Admin promised they knew what they were doing" would be
> better for testing.
I've created a GitLab issue to keep track of that:

  https://gitlab.com/xen-project/xen/-/issues/164

There's also the case of downgrades. We probably want to at least avoid
going back to a microcode revision with different min_rev field.

Alejandro
Re: [PATCH] x86/microcode: Prevent attempting updates known to fail
Posted by Alejandro Vallejo 10 months, 4 weeks ago
On Thu, Jun 01, 2023 at 11:54:31AM +0100, Andrew Cooper wrote:
> I had something else in mind here.  Right now, this will read
> MSR_MCU_CONTROL and emit a printk() on every microcode load, which will
> be every AP, and every time the user uses the xen-ucode tool.
Not every AP. The hypercall would return with an error before the APs are
brought in. It is true that the error on dmesg would appear on every
microcode load attempt though.

> 
> Instead, I recommend the following:
> 
> 1) One patch moving the early-cpuid/msr read from tsx_init() into
> early_microcode_init(), adjusting the comment as it goes.  No point
> duplicating that logic, and we need it earlier on boot now.
> 2) This patch, adjusting early_microcode_init() only.  Have a printk()
> saying "microcode loading disabled by firmware" and avoid filling in
> ucode_ops.  Every other part of ucode handling understands "loading not
> available".
Sure. Going on a tangent though, I do wonder why tsx_init() is preceding
identify_cpu(). It's reading cpuid leaf 7d0 simply because it hasn't been
read yet, but it's not obvious why this rush in invoking tsx_init(). I
can't see any obvious marker that affect the following identify_cpu() call,
and swapping them gets rid of the cpuid read.

> 
> In terms of the commit message, you should call out the usecase
> explicitly.  This feature is intended for baremetal clouds where the
> platform owner doesn't trust the tenant to choose the microcode version
> in use.
> 
Sure.

> Also, I'm really not sure what your 3rd paragraph is trying to say.
That the case where CPU_i.DIS_MCU_LOAD != CPU_j.DIS_MCU_LOAD where i != j
is not specifically handled on the grounds that sane firmware ensures that
condition doesn't happen, and we already notify when the system reached a
nonsensical state with different CPUs having different microcode versions.

I'll rewrite it better.

Alejandro
Re: [PATCH] x86/microcode: Prevent attempting updates known to fail
Posted by Andrew Cooper 10 months, 4 weeks ago
On 02/06/2023 2:19 pm, Alejandro Vallejo wrote:
> On Thu, Jun 01, 2023 at 11:54:31AM +0100, Andrew Cooper wrote:
>> I had something else in mind here.  Right now, this will read
>> MSR_MCU_CONTROL and emit a printk() on every microcode load, which will
>> be every AP, and every time the user uses the xen-ucode tool.
> Not every AP. The hypercall would return with an error before the APs are
> brought in. It is true that the error on dmesg would appear on every
> microcode load attempt though.

I meant every AP on boot, where Xen initiates the ucode load.

>
>> Instead, I recommend the following:
>>
>> 1) One patch moving the early-cpuid/msr read from tsx_init() into
>> early_microcode_init(), adjusting the comment as it goes.  No point
>> duplicating that logic, and we need it earlier on boot now.
>> 2) This patch, adjusting early_microcode_init() only.  Have a printk()
>> saying "microcode loading disabled by firmware" and avoid filling in
>> ucode_ops.  Every other part of ucode handling understands "loading not
>> available".
> Sure. Going on a tangent though, I do wonder why tsx_init() is preceding
> identify_cpu(). It's reading cpuid leaf 7d0 simply because it hasn't been
> read yet, but it's not obvious why this rush in invoking tsx_init(). I
> can't see any obvious marker that affect the following identify_cpu() call,
> and swapping them gets rid of the cpuid read.

In __start_xen(),

    tsx_init(); /* Needs microcode.  May change HLE/RTM feature bits. */

If you were to test such a patch, the test-tsx ought to fail on SKL/KBL
amongst others.

One of the things that tsx_init() does is select TSX_CTRL_CPUID_CLEAR
and/or TSX_CPUID_CLEAR, which hides the HLE and RTM bits in regular
CPUID, so wants to run before the general CPUID scan.  This matters for
guest performance - if TSX is actually always aborting, but reported to
the guest, then any library using RTM will be less performant than using
the non-transactional path.

Conversely if the user wants to explicitly re-activate TSX despite the
firmware defaults, those bits need clearing before the CPUID scan for
anything to work.

~Andrew