[v2] x86/AMD: deal with RDSEED issues

[PATCH v2 for-4.21 0/2] x86/AMD: deal with RDSEED issues

Posted by Jan Beulich 3 months, 1 week ago

Both patches also want 'x86/CPU: extend is_forced_cpu_cap()'s "reach"' in
place.

1: disable RDSEED on Fam17 model 47 stepping 0
2: disable RDSEED on most of Zen5

Jan
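
For readers unfamiliar with how a "family / model / stepping" match like
the one in patch 1's title is derived, here is a minimal sketch in C.  It
is not code from the series.  It ASSUMES the "17" and "47" in the title
are hexadecimal (0x17 / 0x47), and the names decode_sig() and
rdseed_quirk_fam17_m47_s0() are hypothetical.  The decoding follows the
AMD rule for CPUID leaf 1 EAX, where the extended family/model fields
only apply when the base family is 0xF.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded CPUID leaf 1 EAX signature. */
struct cpu_sig {
    unsigned int family, model, stepping;
};

/*
 * Decode family/model/stepping from a raw CPUID.1:EAX value, using the
 * AMD rule: extended fields are only applied when the base family is 0xF.
 */
static struct cpu_sig decode_sig(uint32_t eax)
{
    struct cpu_sig s;
    unsigned int base_family = (eax >> 8) & 0xf;
    unsigned int base_model  = (eax >> 4) & 0xf;

    s.stepping = eax & 0xf;
    s.family   = base_family;
    s.model    = base_model;

    if (base_family == 0xf) {
        s.family += (eax >> 20) & 0xff;        /* extended family, bits 27:20 */
        s.model  |= ((eax >> 16) & 0xf) << 4;  /* extended model, bits 19:16 */
    }

    return s;
}

/* Hypothetical helper: ASSUMES "Fam17 model 47" means 0x17 / 0x47 (hex). */
static bool rdseed_quirk_fam17_m47_s0(uint32_t eax)
{
    struct cpu_sig s = decode_sig(eax);

    return s.family == 0x17 && s.model == 0x47 && s.stepping == 0;
}
```

Under that assumption a raw signature of 0x00840F70 decodes to family
0x17, model 0x47, stepping 0 and would match, while any other stepping
of the same model would not.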

Re: [PATCH v2 for-4.21 0/2] x86/AMD: deal with RDSEED issues

Posted by Andrew Cooper 3 months, 1 week ago

On 28/10/2025 3:32 pm, Jan Beulich wrote:
> Both patches also want 'x86/CPU: extend is_forced_cpu_cap()'s "reach"' in
> place.
>
> 1: disable RDSEED on Fam17 model 47 stepping 0
> 2: disable RDSEED on most of Zen5

We have two existing cases for RDRAND issues in Xen:

1) IvyBridge SRBDS speculative vulnerability.  Here, the RNG is good,
but use of the RDRAND instruction can allow another entity on the system
to observe the random number.  RDRAND is off by default, but can be
opted in to.

2) AMD Fam15/16h Laptop.  Here, the RNG is fine, except after S3 on one
single OEM.  Use of RDRAND can be activated on the command line, but
there's no ability for individual VMs to opt in.  Being a laptop,
migration isn't a major concern.

For this series about RDSEED, we've got:

1) Cyan Skillfish, the PlayStation 5 CPU but also in one crypto-mining
rig.  Here, RDSEED is deterministically broken and not getting a fix.

The chances of Xen running on these systems are almost 0.  We should turn
off RDSEED and be done with it; it's not interesting in the slightest to
be able to turn back on.

2) Zen5.  Here, RDSEED gives a higher-than-expected rate of 0's for only
the 32bit and 16bit forms; the 64bit form is unaffected.

There is microcode to fix it, on server at least.  Firmware fixes for
client are rather further away.  64bit OSes are likely fine (using the
64bit instruction form).  Some Linux devs think that Linux would be safe
even using the 32bit form, if it really only has a 10% zeroes rate.

There is certainly a risk that software uses the 32b/16b forms, and does
not mix them properly with other entropy, but the common case these days
(64b) works just fine.  This means that blanket-disabling does more harm
than good.

This case does really want to be off by default (given no microcode),
but able to be opted in to.  At least one major class of OSes (Linux)
is safe despite the issue.

~Andrew

Re: [PATCH v2 for-4.21 0/2] x86/AMD: deal with RDSEED issues

Posted by Jan Beulich 3 months ago

On 03.11.2025 15:10, Andrew Cooper wrote:
> On 28/10/2025 3:32 pm, Jan Beulich wrote:
>> Both patches also want 'x86/CPU: extend is_forced_cpu_cap()'s "reach"' in
>> place.
>>
>> 1: disable RDSEED on Fam17 model 47 stepping 0
>> 2: disable RDSEED on most of Zen5
> 
> We have two existing cases for RDRAND issues in Xen:
> 
> 1) IvyBridge SRBDS speculative vulnerability.  Here, the RNG is good,
> but use of the RDRAND instruction can allow another entity on the system
> to observe the random number.  RDRAND is off by default, but can be
> opted in to.
> 
> 2) AMD Fam15/16h Laptop.  Here, the RNG is fine, except after S3 on one
> single OEM.  Use of RDRAND can be activated on the command line, but
> there's no ability for individual VMs to opt in.  Being a laptop,
> migration isn't a major concern.
> 
> 
> For this series about RDSEED, we've got:
> 
> 1) Cyan Skillfish, the PlayStation 5 CPU but also in one crypto-mining
> rig.  Here, RDSEED is deterministically broken and not getting a fix.
> 
> The chances of Xen running on these systems are almost 0.  We should turn
> off RDSEED and be done with it; it's not interesting in the slightest to
> be able to turn back on.

I disagree to some degree, but the code to allow re-enabling can certainly
be moved to the other patch. I don't view it as wrong to have it in the 1st
patch, though.

> 2) Zen5.  Here, RDSEED gives a higher-than-expected rate of 0's for only
> the 32bit and 16bit forms; the 64bit form is unaffected.
> 
> There is microcode to fix it, on server at least.  Firmware fixes for
> client are rather further away.  64bit OSes are likely fine (using the
> 64bit instruction form).  Some Linux devs think that Linux would be safe
> even using the 32bit form, if it really only has a 10% zeroes rate.

10% is a lot. IOW I find this dubious.

> There is certainly a risk that software uses the 32b/16b forms, and does
> not mix them properly with other entropy, but the common case these days
> (64b) works just fine.  This means that blanket-disabling does more harm
> than good.

That's guesswork. I don't see why 64-bit OSes should be expected to prefer
the 64-bit form over the 32-bit one. In fact, if one only needs 32 bits of
entropy, why would one even try to get 64? That's wasting a potentially
precious resource.

Furthermore mind me mentioning (again) that 32-bit OSes (including 32-bit
environments that may be active during boot) have no way of using the 64-
bit form?

Jan
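
To put rough numbers on the "10% zeroes rate" debated above, here is a
back-of-the-envelope calculation.  It ASSUMES, purely for illustration,
that a successful 32-bit RDSEED returns 0 with probability 0.1 and is
otherwise uniform over 32-bit values; the thread does not state the
actual distribution.  H_inf is the min-entropy (governed by the single
most likely output) and h(p) is the binary entropy function.

```latex
\begin{align*}
H_\infty &= -\log_2 \max_x \Pr[X = x] \approx -\log_2 0.1 \approx 3.32 \text{ bits},\\
H_{\text{Shannon}} &\approx 0.9 \cdot 32 + h(0.1) \approx 28.8 + 0.47 \approx 29.3 \text{ bits},\\
h(p) &= -p \log_2 p - (1 - p) \log_2 (1 - p).
\end{align*}
```

Under this model a consumer that uses a single 32-bit result directly
can be guessed correctly one time in ten, whereas a consumer that mixes
many results into a pool still collects most of the nominal entropy per
sample.  That gap may be one way to read the two positions taken above.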

Re: [PATCH v2 for-4.21 0/2] x86/AMD: deal with RDSEED issues

Posted by Roger Pau Monné 3 months, 1 week ago

On Tue, Oct 28, 2025 at 04:32:17PM +0100, Jan Beulich wrote:
> Both patches also want 'x86/CPU: extend is_forced_cpu_cap()'s "reach"' in
> place.
> 
> 1: disable RDSEED on Fam17 model 47 stepping 0
> 2: disable RDSEED on most of Zen5

For both patches: don't we need to set the feature in the max policy
to allow for incoming migrations of guests that have already seen the
feature?

Thanks, Roger.

Re: [PATCH v2 for-4.21 0/2] x86/AMD: deal with RDSEED issues

Posted by Jan Beulich 3 months, 1 week ago

On 31.10.2025 11:22, Roger Pau Monné wrote:
> On Tue, Oct 28, 2025 at 04:32:17PM +0100, Jan Beulich wrote:
>> Both patches also want 'x86/CPU: extend is_forced_cpu_cap()'s "reach"' in
>> place.
>>
>> 1: disable RDSEED on Fam17 model 47 stepping 0
>> 2: disable RDSEED on most of Zen5
> 
> For both patches: don't we need to set the feature in the max policy
> to allow for incoming migrations of guests that have already seen the
> feature?

No, such guests should not run on affected hosts (unless overrides are in place),
or else they'd face sudden malfunction of RDSEED. If an override was in place on
the source host, an override will also need to be put in place on the destination
one.

Jan

Re: [PATCH v2 for-4.21 0/2] x86/AMD: deal with RDSEED issues

Posted by Roger Pau Monné 3 months, 1 week ago

On Fri, Oct 31, 2025 at 11:29:44AM +0100, Jan Beulich wrote:
> On 31.10.2025 11:22, Roger Pau Monné wrote:
> > On Tue, Oct 28, 2025 at 04:32:17PM +0100, Jan Beulich wrote:
> >> Both patches also want 'x86/CPU: extend is_forced_cpu_cap()'s "reach"' in
> >> place.
> >>
> >> 1: disable RDSEED on Fam17 model 47 stepping 0
> >> 2: disable RDSEED on most of Zen5
> > 
> > For both patches: don't we need to set the feature in the max policy
> > to allow for incoming migrations of guests that have already seen the
> > feature?
> 
> No, such guests should not run on affected hosts (unless overrides are in place),
> or else they'd face sudden malfunction of RDSEED. If an override was in place on
> the source host, an override will also need to be put in place on the destination
> one.

But they may be malfunctioning before already, if started on a
vulnerable host without this fix and having seen RDSEED?

IMO after this fix is applied you should do pool leveling, at which
point RDSEED shouldn't be advertised anymore.  Having the feature in
the max policy allows to evacuate running guests while updating the
pool.  Otherwise those existing guests would be stuck to run on
non-updated hosts.

Thanks, Roger.
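
The max-vs-default distinction argued about above can be illustrated
with a deliberately simplified model in C.  This is NOT Xen's policy
code: struct policies, guest_fits(), apply_rdseed_quirk() and the
feature bit values are all hypothetical, and a real policy has many
more words and checks.  The only property modelled is that a guest is
accepted when everything it has seen is contained in the max policy,
while newly created guests get the default policy.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits; not Xen's real feature numbering. */
#define FEAT_RDSEED (UINT32_C(1) << 0)
#define FEAT_OTHER  (UINT32_C(1) << 1)

/* Simplified model: a single 32-bit feature word per policy. */
struct policies {
    uint32_t max;  /* upper bound a guest may be given, e.g. on migrate-in */
    uint32_t def;  /* what a newly created guest gets by default */
};

/* A guest is acceptable only if it has seen nothing outside max. */
static bool guest_fits(uint32_t guest, const struct policies *p)
{
    return (guest & ~p->max) == 0;
}

/*
 * Hide RDSEED from new guests.  With keep_in_max set, the bit stays in
 * the max policy so guests that already saw it can still migrate in;
 * without it, the bit is dropped from both policies.
 */
static struct policies apply_rdseed_quirk(struct policies p, bool keep_in_max)
{
    p.def &= ~FEAT_RDSEED;
    if (!keep_in_max)
        p.max &= ~FEAT_RDSEED;

    return p;
}
```

In this model, keep_in_max = true corresponds to the behaviour asked
for above (evacuation to an updated host stays possible), while
keep_in_max = false corresponds to refusing such guests unless an
override is in place.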

Re: [PATCH v2 for-4.21 0/2] x86/AMD: deal with RDSEED issues

Posted by Jan Beulich 3 months, 1 week ago

On 31.10.2025 11:54, Roger Pau Monné wrote:
> On Fri, Oct 31, 2025 at 11:29:44AM +0100, Jan Beulich wrote:
>> On 31.10.2025 11:22, Roger Pau Monné wrote:
>>> On Tue, Oct 28, 2025 at 04:32:17PM +0100, Jan Beulich wrote:
>>>> Both patches also want 'x86/CPU: extend is_forced_cpu_cap()'s "reach"' in
>>>> place.
>>>>
>>>> 1: disable RDSEED on Fam17 model 47 stepping 0
>>>> 2: disable RDSEED on most of Zen5
>>>
>>> For both patches: don't we need to set the feature in the max policy
>>> to allow for incoming migrations of guests that have already seen the
>>> feature?
>>
>> No, such guests should not run on affected hosts (unless overrides are in place),
>> or else they'd face sudden malfunction of RDSEED. If an override was in place on
>> the source host, an override will also need to be put in place on the destination
>> one.
> 
> But they may be malfunctioning before already, if started on a
> vulnerable host without this fix and having seen RDSEED?

Yes. But there could also be ones coming from good hosts. Imo ...

> IMO after this fix is applied you should do pool leveling, at which
> point RDSEED shouldn't be advertised anymore.  Having the feature in
> the max policy allows to evacuate running guests while updating the
> pool.  Otherwise those existing guests would be stuck to run on
> non-updated hosts.

... we need to err on the side of caution.

Jan

Re: [PATCH v2 for-4.21 0/2] x86/AMD: deal with RDSEED issues

Posted by Roger Pau Monné 3 months, 1 week ago

On Fri, Oct 31, 2025 at 12:47:51PM +0100, Jan Beulich wrote:
> On 31.10.2025 11:54, Roger Pau Monné wrote:
> > On Fri, Oct 31, 2025 at 11:29:44AM +0100, Jan Beulich wrote:
> >> On 31.10.2025 11:22, Roger Pau Monné wrote:
> >>> On Tue, Oct 28, 2025 at 04:32:17PM +0100, Jan Beulich wrote:
> >>>> Both patches also want 'x86/CPU: extend is_forced_cpu_cap()'s "reach"' in
> >>>> place.
> >>>>
> >>>> 1: disable RDSEED on Fam17 model 47 stepping 0
> >>>> 2: disable RDSEED on most of Zen5
> >>>
> >>> For both patches: don't we need to set the feature in the max policy
> >>> to allow for incoming migrations of guests that have already seen the
> >>> feature?
> >>
> >> No, such guests should not run on affected hosts (unless overrides are in place),
> >> or else they'd face sudden malfunction of RDSEED. If an override was in place on
> >> the source host, an override will also need to be put in place on the destination
> >> one.
> > 
> > But they may be malfunctioning before already, if started on a
> > vulnerable host without this fix and having seen RDSEED?
> 
> Yes. But there could also be ones coming from good hosts. Imo ...
> 
> > IMO after this fix is applied you should do pool leveling, at which
> > point RDSEED shouldn't be advertised anymore.  Having the feature in
> > the max policy allows to evacuate running guests while updating the
> > pool.  Otherwise those existing guests would be stuck to run on
> > non-updated hosts.
> 
> ... we need to err on the side of caution.

While I understand your concerns, this would cause failures in the
upgrade and migration model used by both XCP-ng and XenServer at
least, as it could prevent eviction of running VMs to updated hosts.

At a minimum we would need an option to allow the feature to be set on
the max policy.  Overall I think safety of migration (in this specific
regard) should be enforced by the toolstack (or orchestration layer),
rather than the hypervisor itself.  The hypervisor can reject
incompatible policies, but should leave the rest of the decisions to
higher layers as it doesn't have enough knowledge.

Thanks, Roger.

Re: [PATCH v2 for-4.21 0/2] x86/AMD: deal with RDSEED issues

Posted by Jan Beulich 3 months, 1 week ago

On 31.10.2025 13:14, Roger Pau Monné wrote:
> On Fri, Oct 31, 2025 at 12:47:51PM +0100, Jan Beulich wrote:
>> On 31.10.2025 11:54, Roger Pau Monné wrote:
>>> On Fri, Oct 31, 2025 at 11:29:44AM +0100, Jan Beulich wrote:
>>>> On 31.10.2025 11:22, Roger Pau Monné wrote:
>>>>> On Tue, Oct 28, 2025 at 04:32:17PM +0100, Jan Beulich wrote:
>>>>>> Both patches also want 'x86/CPU: extend is_forced_cpu_cap()'s "reach"' in
>>>>>> place.
>>>>>>
>>>>>> 1: disable RDSEED on Fam17 model 47 stepping 0
>>>>>> 2: disable RDSEED on most of Zen5
>>>>>
>>>>> For both patches: don't we need to set the feature in the max policy
>>>>> to allow for incoming migrations of guests that have already seen the
>>>>> feature?
>>>>
>>>> No, such guests should not run on affected hosts (unless overrides are in place),
>>>> or else they'd face sudden malfunction of RDSEED. If an override was in place on
>>>> the source host, an override will also need to be put in place on the destination
>>>> one.
>>>
>>> But they may be malfunctioning before already, if started on a
>>> vulnerable host without this fix and having seen RDSEED?
>>
>> Yes. But there could also be ones coming from good hosts. Imo ...
>>
>>> IMO after this fix is applied you should do pool leveling, at which
>>> point RDSEED shouldn't be advertised anymore.  Having the feature in
>>> the max policy allows to evacuate running guests while updating the
>>> pool.  Otherwise those existing guests would be stuck to run on
>>> non-updated hosts.
>>
>> ... we need to err on the side of caution.
> 
> While I understand your concerns, this would cause failures in the
> upgrade and migration model used by both XCP-ng and XenServer at
> least, as it could prevent eviction of running VMs to updated hosts.
> 
> At a minimum we would need an option to allow the feature to be set on
> the max policy.

That's where the 3rd patch comes into play. "cpuid=rdseed" is the respective
override. Just that it doesn't work correctly without that further patch.

>  Overall I think safety of migration (in this specific
> regard) should be enforced by the toolstack (or orchestration layer),
> rather than the hypervisor itself.  The hypervisor can reject
> incompatible policies, but should leave the rest of the decisions to
> higher layers as it doesn't have enough knowledge.

But without rendering guests vulnerable behind the admin's back.

Jan

Re: [PATCH v2 for-4.21 0/2] x86/AMD: deal with RDSEED issues

Posted by Roger Pau Monné 3 months, 1 week ago

On Fri, Oct 31, 2025 at 01:34:55PM +0100, Jan Beulich wrote:
> On 31.10.2025 13:14, Roger Pau Monné wrote:
> > On Fri, Oct 31, 2025 at 12:47:51PM +0100, Jan Beulich wrote:
> >> On 31.10.2025 11:54, Roger Pau Monné wrote:
> >>> On Fri, Oct 31, 2025 at 11:29:44AM +0100, Jan Beulich wrote:
> >>>> On 31.10.2025 11:22, Roger Pau Monné wrote:
> >>>>> On Tue, Oct 28, 2025 at 04:32:17PM +0100, Jan Beulich wrote:
> >>>>>> Both patches also want 'x86/CPU: extend is_forced_cpu_cap()'s "reach"' in
> >>>>>> place.
> >>>>>>
> >>>>>> 1: disable RDSEED on Fam17 model 47 stepping 0
> >>>>>> 2: disable RDSEED on most of Zen5
> >>>>>
> >>>>> For both patches: don't we need to set the feature in the max policy
> >>>>> to allow for incoming migrations of guests that have already seen the
> >>>>> feature?
> >>>>
> >>>> No, such guests should not run on affected hosts (unless overrides are in place),
> >>>> or else they'd face sudden malfunction of RDSEED. If an override was in place on
> >>>> the source host, an override will also need to be put in place on the destination
> >>>> one.
> >>>
> >>> But they may be malfunctioning before already, if started on a
> >>> vulnerable host without this fix and having seen RDSEED?
> >>
> >> Yes. But there could also be ones coming from good hosts. Imo ...
> >>
> >>> IMO after this fix is applied you should do pool leveling, at which
> >>> point RDSEED shouldn't be advertised anymore.  Having the feature in
> >>> the max policy allows to evacuate running guests while updating the
> >>> pool.  Otherwise those existing guests would be stuck to run on
> >>> non-updated hosts.
> >>
> >> ... we need to err on the side of caution.
> > 
> > While I understand your concerns, this would cause failures in the
> > upgrade and migration model used by both XCP-ng and XenServer at
> > least, as it could prevent eviction of running VMs to updated hosts.
> > 
> > At a minimum we would need an option to allow the feature to be set on
> > the max policy.
> 
> That's where the 3rd patch comes into play. "cpuid=rdseed" is the respective
> override. Just that it doesn't work correctly without that further patch.

Won't using "cpuid=rdseed" in the Xen command line result in RDSEED
getting exposed in the default policy also, which we want to avoid?

Or am I getting confused on where "cpuid=rdseed" should be used?

> >  Overall I think safety of migration (in this specific
> > regard) should be enforced by the toolstack (or orchestration layer),
> > rather than the hypervisor itself.  The hypervisor can reject
> > incompatible policies, but should leave the rest of the decisions to
> > higher layers as it doesn't have enough knowledge.
> 
> But without rendering guests vulnerable behind the admin's back.

I think that's part of the logic that should be implemented by the
orchestration layer, simply because it has all the data to make an
informed decision.  IMO it won't be behind the admin's back, or else
it's a bug in the higher layer toolstack.

Not putting rdseed in the max policy completely blocks the upgrade
path, even when a toolstack is possibly making the right informed
decisions.

I guess I need to see that 3rd patch.

Thanks, Roger.

Re: [PATCH v2 for-4.21 0/2] x86/AMD: deal with RDSEED issues

Posted by Jan Beulich 3 months ago

On 31.10.2025 14:15, Roger Pau Monné wrote:
> Not putting rdseed in the max policy completely blocks the upgrade
> path, even when a toolstack is possibly making the right informed
> decisions.

Why would that be? To evacuate guests, one would force-enable RDSEED on
an affected host. After updating of the original host (incl fixed ucode),
migrating back will be fine. The admin will thus be fully aware of where
guests run unsafely, while no un-safety is going to be introduced behind
the back of the admin and/or any guest.

Jan

Re: [PATCH v2 for-4.21 0/2] x86/AMD: deal with RDSEED issues

Posted by Jan Beulich 3 months ago

On 31.10.2025 14:15, Roger Pau Monné wrote:
> On Fri, Oct 31, 2025 at 01:34:55PM +0100, Jan Beulich wrote:
>> On 31.10.2025 13:14, Roger Pau Monné wrote:
>>> On Fri, Oct 31, 2025 at 12:47:51PM +0100, Jan Beulich wrote:
>>>> On 31.10.2025 11:54, Roger Pau Monné wrote:
>>>>> On Fri, Oct 31, 2025 at 11:29:44AM +0100, Jan Beulich wrote:
>>>>>> On 31.10.2025 11:22, Roger Pau Monné wrote:
>>>>>>> On Tue, Oct 28, 2025 at 04:32:17PM +0100, Jan Beulich wrote:
>>>>>>>> Both patches also want 'x86/CPU: extend is_forced_cpu_cap()'s "reach"' in
>>>>>>>> place.
>>>>>>>>
>>>>>>>> 1: disable RDSEED on Fam17 model 47 stepping 0
>>>>>>>> 2: disable RDSEED on most of Zen5
>>>>>>>
>>>>>>> For both patches: don't we need to set the feature in the max policy
>>>>>>> to allow for incoming migrations of guests that have already seen the
>>>>>>> feature?
>>>>>>
>>>>>> No, such guests should not run on affected hosts (unless overrides are in place),
>>>>>> or else they'd face sudden malfunction of RDSEED. If an override was in place on
>>>>>> the source host, an override will also need to be put in place on the destination
>>>>>> one.
>>>>>
>>>>> But they may be malfunctioning before already, if started on a
>>>>> vulnerable host without this fix and having seen RDSEED?
>>>>
>>>> Yes. But there could also be ones coming from good hosts. Imo ...
>>>>
>>>>> IMO after this fix is applied you should do pool leveling, at which
>>>>> point RDSEED shouldn't be advertised anymore.  Having the feature in
>>>>> the max policy allows to evacuate running guests while updating the
>>>>> pool.  Otherwise those existing guests would be stuck to run on
>>>>> non-updated hosts.
>>>>
>>>> ... we need to err on the side of caution.
>>>
>>> While I understand your concerns, this would cause failures in the
>>> upgrade and migration model used by both XCP-ng and XenServer at
>>> least, as it could prevent eviction of running VMs to updated hosts.
>>>
>>> At a minimum we would need an option to allow the feature to be set on
>>> the max policy.
>>
>> That's where the 3rd patch comes into play. "cpuid=rdseed" is the respective
>> override. Just that it doesn't work correctly without that further patch.
> 
> Won't using "cpuid=rdseed" in the Xen command line result in RDSEED
> getting exposed in the default policy also, which we want to avoid?
> 
> Or am I getting confused on where "cpuid=rdseed" should be used?

No, there's no way here to get max but not default.

>>>  Overall I think safety of migration (in this specific
>>> regard) should be enforced by the toolstack (or orchestration layer),
>>> rather than the hypervisor itself.  The hypervisor can reject
>>> incompatible policies, but should leave the rest of the decisions to
>>> higher layers as it doesn't have enough knowledge.
>>
>> But without rendering guests vulnerable behind the admin's back.
> 
> I think that's part of the logic that should be implemented by the
> orchestration layer, simply because it has all the data to make an
> informed decision.  IMO it won't be behind the admin's back, or else
> it's a bug in the higher layer toolstack.

I fear I simply don't see aspects like this to be exposed to a toolstack.
We didn't for RDRAND.

> Not putting rdseed in the max policy completely blocks the upgrade
> path, even when a toolstack is possibly making the right informed
> decisions.
> 
> I guess I need to see that 3rd patch.

https://lists.xen.org/archives/html/xen-devel/2025-08/msg00113.html

Jan

Re: [PATCH v2 for-4.21 0/2] x86/AMD: deal with RDSEED issues

Posted by Oleksii Kurochko 3 months, 1 week ago

On 10/28/25 4:32 PM, Jan Beulich wrote:
> Both patches also want 'x86/CPU: extend is_forced_cpu_cap()'s "reach"' in
> place.
>
> 1: disable RDSEED on Fam17 model 47 stepping 0
> 2: disable RDSEED on most of Zen5

Both patches LGTM to be in 4.21:
   Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>

Thanks.

~ Oleksii

Re: [PATCH v2 for-4.21 0/2] x86/AMD: deal with RDSEED issues

Posted by Jan Beulich 3 months, 1 week ago

On 31.10.2025 10:31, Oleksii Kurochko wrote:
> 
> On 10/28/25 4:32 PM, Jan Beulich wrote:
>> Both patches also want 'x86/CPU: extend is_forced_cpu_cap()'s "reach"' in
>> place.
>>
>> 1: disable RDSEED on Fam17 model 47 stepping 0
>> 2: disable RDSEED on most of Zen5
> 
> Both patches LGTM to be in 4.21:
>    Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>

Thanks, yet: What about the 3rd patch mentioned in the text above?

Jan

Re: [PATCH v2 for-4.21 0/2] x86/AMD: deal with RDSEED issues

Posted by Oleksii Kurochko 3 months, 1 week ago

On 10/31/25 10:34 AM, Jan Beulich wrote:
> On 31.10.2025 10:31, Oleksii Kurochko wrote:
>> On 10/28/25 4:32 PM, Jan Beulich wrote:
>>> Both patches also want 'x86/CPU: extend is_forced_cpu_cap()'s "reach"' in
>>> place.
>>>
>>> 1: disable RDSEED on Fam17 model 47 stepping 0
>>> 2: disable RDSEED on most of Zen5
>> Both patches LGTM to be in 4.21:
>>     Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
> Thanks, yet: What about the 3rd patch mentioned in the text above?

For 3rd patch, also:
  Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>

Thanks.

~ Oleksii