x86/ucode: Parallel loading fixes

[PATCH 1/3] x86/ucode: Fix error handling during parallel ucode load

Posted by Andrew Cooper 1 day, 3 hours ago

wait_for_state() returns false on encountering LOADING_EXIT.
control_thread_fn() can move directly to this state in the case of an early
error.  It is not an error condition for APs, but right now the latest write
into stopmachine_data.fn_result wins, causing the real error, -EIO, to get
clobbered with -EBUSY.  e.g.:

  # xen-ucode /lib/firmware/amd-ucode/microcode_amd_fam17h.bin --force
  Failed to update microcode. (err: Device or resource busy)

  (XEN) 256 cores are to update their microcode
  (XEN) microcode: CPU0 update rev 0x830107d to 0x830107c failed, result 0x830107d
  (XEN) Late loading aborted: CPU0 failed to update ucode: -5

Drop all the -EBUSY's, and treat hitting LOADING_EXIT as a success case.  This
causes only a single error to be returned through stop_machine_run().  e.g.:

  # xen-ucode /lib/firmware/amd-ucode/microcode_amd_fam17h.bin --force
  Failed to update microcode. (err: Input/output error)

  (XEN) 256 cores are to update their microcode
  (XEN) microcode: CPU0 update rev 0x830107d to 0x830107c failed, result 0x830107d
  (XEN) Late loading aborted: CPU0 failed to update ucode: -5

Fixes: 5ed12565aa32 ("microcode: rendezvous CPUs in NMI handler and load ucode")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

Xen 4.19 and earlier hit this case naturally, without the need of --force.
---
 xen/arch/x86/cpu/microcode/core.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/cpu/microcode/core.c b/xen/arch/x86/cpu/microcode/core.c
index 1d1a5aa4b097..ecee400f28a6 100644
--- a/xen/arch/x86/cpu/microcode/core.c
+++ b/xen/arch/x86/cpu/microcode/core.c
@@ -260,7 +260,9 @@ static int secondary_nmi_work(void)
 {
     cpumask_set_cpu(smp_processor_id(), &cpu_callin_map);
 
-    return wait_for_state(LOADING_EXIT) ? 0 : -EBUSY;
+    wait_for_state(LOADING_EXIT);
+
+    return 0;
 }
 
 static int primary_thread_work(const struct microcode_patch *patch,
@@ -271,7 +273,7 @@ static int primary_thread_work(const struct microcode_patch *patch,
     cpumask_set_cpu(smp_processor_id(), &cpu_callin_map);
 
     if ( !wait_for_state(LOADING_ENTER) )
-        return -EBUSY;
+        return 0;
 
     ret = alternative_call(ucode_ops.apply_microcode, patch, flags);
     if ( !ret )
@@ -313,7 +315,7 @@ static int cf_check microcode_nmi_callback(
 static int secondary_thread_fn(void)
 {
     if ( !wait_for_state(LOADING_CALLIN) )
-        return -EBUSY;
+        return 0;
 
     self_nmi();
 
@@ -336,7 +338,7 @@ static int primary_thread_fn(const struct microcode_patch *patch,
                              unsigned int flags)
 {
     if ( !wait_for_state(LOADING_CALLIN) )
-        return -EBUSY;
+        return 0;
 
     if ( ucode_in_nmi )
     {
-- 
2.39.5

Re: [PATCH 1/3] x86/ucode: Fix error handling during parallel ucode load

Posted by Jan Beulich 18 hours ago

On 17.11.2025 23:21, Andrew Cooper wrote:
> wait_for_state() returns false on encountering LOADING_EXIT.
> control_thread_fn() can move directly to this state in the case of an early
> error.  It is not an error condition for APs, but right now the latest write
> into stopmachine_data.fn_result wins, causing the real error, -EIO, to get
> clobbered with -EBUSY.  e.g.:
> 
>   # xen-ucode /lib/firmware/amd-ucode/microcode_amd_fam17h.bin --force
>   Failed to update microcode. (err: Device or resource busy)
> 
>   (XEN) 256 cores are to update their microcode
>   (XEN) microcode: CPU0 update rev 0x830107d to 0x830107c failed, result 0x830107d
>   (XEN) Late loading aborted: CPU0 failed to update ucode: -5
> 
> Drop all the -EBUSY's, and treat hitting LOADING_EXIT as a success case.  This
> causes only a single error to be returned through stop_machine_run().  e.g.:

Why "single"? stop_machine_run() can't return multiple ones, having only a
scalar return type? Or do you mean "a single, consistent" or some such?

>   # xen-ucode /lib/firmware/amd-ucode/microcode_amd_fam17h.bin --force
>   Failed to update microcode. (err: Input/output error)
> 
>   (XEN) 256 cores are to update their microcode
>   (XEN) microcode: CPU0 update rev 0x830107d to 0x830107c failed, result 0x830107d
>   (XEN) Late loading aborted: CPU0 failed to update ucode: -5

The sole difference being which specific error is observed, which looks to
support the above interpretation. What I don't quite understand is ...

> Fixes: 5ed12565aa32 ("microcode: rendezvous CPUs in NMI handler and load ucode")

... this and the specific indication that this needs backporting: Why is
the particular error code this important here?

> --- a/xen/arch/x86/cpu/microcode/core.c
> +++ b/xen/arch/x86/cpu/microcode/core.c
> @@ -260,7 +260,9 @@ static int secondary_nmi_work(void)
>  {
>      cpumask_set_cpu(smp_processor_id(), &cpu_callin_map);
>  
> -    return wait_for_state(LOADING_EXIT) ? 0 : -EBUSY;
> +    wait_for_state(LOADING_EXIT);
> +
> +    return 0;
>  }

At which point the function could as well return void? Preferably with this
adjustment (and the knock-on one at the call site) and with the slight
clarification to the description
Reviewed-by: Jan Beulich <jbeulich@suse.com>

> @@ -271,7 +273,7 @@ static int primary_thread_work(const struct microcode_patch *patch,
>      cpumask_set_cpu(smp_processor_id(), &cpu_callin_map);
>  
>      if ( !wait_for_state(LOADING_ENTER) )
> -        return -EBUSY;
> +        return 0;
>  
>      ret = alternative_call(ucode_ops.apply_microcode, patch, flags);
>      if ( !ret )
> @@ -313,7 +315,7 @@ static int cf_check microcode_nmi_callback(
>  static int secondary_thread_fn(void)
>  {
>      if ( !wait_for_state(LOADING_CALLIN) )
> -        return -EBUSY;
> +        return 0;
>  
>      self_nmi();
>  
> @@ -336,7 +338,7 @@ static int primary_thread_fn(const struct microcode_patch *patch,
>                               unsigned int flags)
>  {
>      if ( !wait_for_state(LOADING_CALLIN) )
> -        return -EBUSY;
> +        return 0;
>  
>      if ( ucode_in_nmi )
>      {

Vaguely recalling the original intentions, these changes looked wrong to me at
first glance. But yes, an exit indication from the control thread isn't
really a separate error condition.
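
For readers following along, the exit handling discussed above can be modelled
roughly as below. This is only a sketch, not the code in core.c: the enum
contains just the three states named in this thread, and current_state and
primary_thread_work_model() are hypothetical names used for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Only the states named in this thread; the real set may be larger. */
enum loading_state { LOADING_CALLIN, LOADING_ENTER, LOADING_EXIT };

/* Hypothetical shared state, advanced by the control thread. */
static _Atomic enum loading_state current_state = LOADING_CALLIN;

/*
 * Spin until the shared state equals 'want'.  Returns false if the control
 * thread moved to LOADING_EXIT first (i.e. the load was aborted), unless
 * LOADING_EXIT itself is the state being waited for.
 */
static bool wait_for_state(enum loading_state want)
{
    enum loading_state cur;

    while ((cur = atomic_load(&current_state)) != want) {
        if (cur == LOADING_EXIT)
            return false;
    }

    return true;
}

/* Model of a worker after the patch: an abort seen here is not its error. */
static int primary_thread_work_model(int apply_result)
{
    assert(apply_result <= 0); /* errors are negative errno values */

    if (!wait_for_state(LOADING_ENTER))
        return 0; /* the CPU that triggered the abort reports the error */

    return apply_result;
}
```

With current_state set to LOADING_EXIT, the model returns 0 regardless of
apply_result, which is the behaviour the patch introduces for the -EBUSY paths.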

Jan

Re: [PATCH 1/3] x86/ucode: Fix error handling during parallel ucode load

Posted by Andrew Cooper 13 hours ago

On 18/11/2025 7:49 am, Jan Beulich wrote:
> On 17.11.2025 23:21, Andrew Cooper wrote:
>> wait_for_state() returns false on encountering LOADING_EXIT.
>> control_thread_fn() can move directly to this state in the case of an early
>> error.  It is not an error condition for APs, but right now the latest write
>> into stopmachine_data.fn_result wins, causing the real error, -EIO, to get
>> clobbered with -EBUSY.  e.g.:
>>
>>   # xen-ucode /lib/firmware/amd-ucode/microcode_amd_fam17h.bin --force
>>   Failed to update microcode. (err: Device or resource busy)
>>
>>   (XEN) 256 cores are to update their microcode
>>   (XEN) microcode: CPU0 update rev 0x830107d to 0x830107c failed, result 0x830107d
>>   (XEN) Late loading aborted: CPU0 failed to update ucode: -5
>>
>> Drop all the -EBUSY's, and treat hitting LOADING_EXIT as a success case.  This
>> causes only a single error to be returned through stop_machine_run().  e.g.:
> Why "single"? stop_machine_run() can't return multiple ones, having only a
> scalar return type? Or do you mean "a single, consistent" or some such?

stop_machine_run() has a data race on stopmachine_data.fn_result.

Any CPU returning any nonzero value back into the stop_machine machinery
will update the singleton result, and latest wins.

With this patch, the BSP returns -EIO, and all APs return 0 so they do not
interfere with the -EIO.
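
A simplified, single-threaded model of that "latest nonzero write wins"
behaviour is sketched below. It is an illustration only: struct result_slot,
record_result() and collect_results() are hypothetical names, and the array
order stands in for whatever order the CPUs happen to finish in.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared result; not Xen's actual struct. */
struct result_slot {
    int fn_result;
};

/* Record one CPU's return value: any nonzero value overwrites the slot. */
static void record_result(struct result_slot *slot, int ret)
{
    if (ret)
        slot->fn_result = ret;
}

/* Fold per-CPU return values in completion order; return the final value. */
static int collect_results(const int *rets, size_t n)
{
    struct result_slot slot = { .fn_result = 0 };

    assert(rets != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
        record_result(&slot, rets[i]);

    return slot.fn_result;
}
```

Feeding this model { -EIO, -EBUSY, -EBUSY } yields -EBUSY (the real error is
clobbered), whereas { -EIO, 0, 0 } yields -EIO, because zero results never
touch the slot.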

>
>>   # xen-ucode /lib/firmware/amd-ucode/microcode_amd_fam17h.bin --force
>>   Failed to update microcode. (err: Input/output error)
>>
>>   (XEN) 256 cores are to update their microcode
>>   (XEN) microcode: CPU0 update rev 0x830107d to 0x830107c failed, result 0x830107d
>>   (XEN) Late loading aborted: CPU0 failed to update ucode: -5
> The sole difference being which specific error is observed, which looks to
> support the above interpretation. What I don't quite understand is ...
>
>> Fixes: 5ed12565aa32 ("microcode: rendezvous CPUs in NMI handler and load ucode")
> ... this and the specific indication that this needs backporting: Why is
> the particular error code this important here?

Because userspace cares about -EEXIST as a special case for success.

Having -EEXIST clobbered with -EBUSY causes a false negative failure in
XenServer's testing.

As said in the cover letter, 4.19 and earlier now suffer this as a side
effect of e0bb712a28a9 ("x86/ucode: Abort parallel load early on any
control thread error") because out-of-date ucodes used to be passed into
stop_machine and cause every CPU to fail with -EEXIST.
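
To illustrate why the specific value matters, a caller that special-cases
-EEXIST might look roughly like the sketch below. ucode_load_ok() is a
hypothetical helper, not part of xen-ucode or any Xen library, and it assumes
the caller sees a positive errno value.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether a microcode load attempt should be
 * counted as a pass.  'err' is a positive errno value, or 0 on success.
 */
static bool ucode_load_ok(int err)
{
    assert(err >= 0); /* errno values are positive */

    /* EEXIST is the special case for success mentioned above. */
    return err == 0 || err == EEXIST;
}
```

Under this model an -EEXIST that gets overwritten with -EBUSY turns a pass
into a reported failure, which matches the false negative described above.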

>> --- a/xen/arch/x86/cpu/microcode/core.c
>> +++ b/xen/arch/x86/cpu/microcode/core.c
>> @@ -260,7 +260,9 @@ static int secondary_nmi_work(void)
>>  {
>>      cpumask_set_cpu(smp_processor_id(), &cpu_callin_map);
>>  
>> -    return wait_for_state(LOADING_EXIT) ? 0 : -EBUSY;
>> +    wait_for_state(LOADING_EXIT);
>> +
>> +    return 0;
>>  }
> At which point the function could as well return void? Preferably with this
> adjustment (and the knock-on one at the call site) and with the slight
> clarification to the description
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

I have a different series, but ucode_in_nmi needs untangling first.

Even changing this function to be void causes this patch to be dominated
by cleanup, which isn't appropriate for a bugfix.

~Andrew

[PATCH 1/3] x86/ucode: Fix error handling during parallel ucode load
[PATCH 2/3] x86/ucode: Drop structurally unreachable ASSERT()s
[PATCH 3/3] x86/ucode: Create a real type for loading_state