While the SDM isn't very clear about this, our present behavior make
Linux 5.19 unhappy. As of commit 8ad7e8f69695 ("x86/fpu/xsave: Support
XSAVEC in the kernel") they're using this CPUID output also to size
the compacted area used by XSAVEC. Getting back zero there isn't really
liked, yet fpr PV that's the default on capable hardware: XSAVES isn't
exposed to PV domains.
Considering that the size reported is that of the compacted save area,
I view Linux'es assumption as appropriate (short of the SDM properly
considering the case). Therefore we need to populate the field also when
only XSAVEC is supported for a guest.
Fixes: 460b9a4b3630 ("x86/xsaves: enable xsaves/xrstors for hvm guest")
Fixes: 8d050ed1097c ("x86: don't expose XSAVES capability to PV guests")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I actually wonder why we surface the XSAVES feature bit to HVM domains,
when we don't support any of the features. It's solely because of this
that by default only PV domains are affected by the issue (HVM would be
affected only when XSAVES was hidden via guest config settings).
Wouldn't we better mask the bit (e.g. in recalculate_xstate()) when we
find that no features requiring XSAVES are visible to the domain? That
would likely come closer to real hardware, which pretty certainly won't
offer XSAVES without also offering at least one dependent feature.
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -1142,7 +1142,7 @@ void guest_cpuid(const struct vcpu *v, u
switch ( subleaf )
{
case 1:
- if ( p->xstate.xsaves )
+ if ( p->xstate.xsavec || p->xstate.xsaves )
{
/*
* TODO: Figure out what to do for XSS state. VT-x manages
On 23/08/2022 07:42, Jan Beulich wrote: > While the SDM isn't very clear about this, our present behavior make > Linux 5.19 unhappy. As of commit 8ad7e8f69695 ("x86/fpu/xsave: Support > XSAVEC in the kernel") they're using this CPUID output also to size > the compacted area used by XSAVEC. Getting back zero there isn't really > liked, yet fpr PV that's the default on capable hardware: XSAVES isn't for. > exposed to PV domains. > > Considering that the size reported is that of the compacted save area, > I view Linux'es assumption as appropriate (short of the SDM properly > considering the case). Therefore we need to populate the field also when > only XSAVEC is supported for a guest. This is a mess. The SDM is fairly clear (but only in Vol1) that this leaf is specific to XSAVES. The APM has only an equation, which shows it as the compacted size without reference to instructions. Ideally I'd like the opinion from some architects and a clarification to the SDM... > Fixes: 460b9a4b3630 ("x86/xsaves: enable xsaves/xrstors for hvm guest") > Fixes: 8d050ed1097c ("x86: don't expose XSAVES capability to PV guests") > Signed-off-by: Jan Beulich <jbeulich@suse.com> CC Marek. Looks like Jan has found the issue you reported on IRC. Jan: Be aware that I submitted https://lore.kernel.org/lkml/20220810221909.12768-1-andrew.cooper3@citrix.com/ to Linux to correct some of the diagnostics. > --- > I actually wonder why we surface the XSAVES feature bit to HVM domains, > when we don't support any of the features. Because that's what was originally accepted into Xen, and I couldn't retract it when fixing CPUID handling at first because it would regress across migrate to a newer Xen. With CPUID data now in the migration stream, we could in principle fix it, but at this point it's definitely not worth the complexity or risk to adjust. > It's solely because of this > that by default only PV domains are affected by the issue (HVM would be > affected only when XSAVES was hidden via guest config settings). > Wouldn't we better mask the bit (e.g. in recalculate_xstate()) when we > find that no features requiring XSAVES are visible to the domain? That > would likely come closer to real hardware, which pretty certainly won't > offer XSAVES without also offering at least one dependent feature. > > --- a/xen/arch/x86/cpuid.c > +++ b/xen/arch/x86/cpuid.c > @@ -1142,7 +1142,7 @@ void guest_cpuid(const struct vcpu *v, u > switch ( subleaf ) > { > case 1: > - if ( p->xstate.xsaves ) > + if ( p->xstate.xsavec || p->xstate.xsaves ) If we're doing this, then it wants to be xsavec only, with the comment being extended to explain why. But this is going to further complicate my several-year-old series trying to get Xen's XSTATE handling into a position where we can start to offer supervisor states. > { > /* > * TODO: Figure out what to do for XSS state. VT-x manages
On 23.08.2022 10:59, Andrew Cooper wrote: > On 23/08/2022 07:42, Jan Beulich wrote: >> While the SDM isn't very clear about this, our present behavior make >> Linux 5.19 unhappy. As of commit 8ad7e8f69695 ("x86/fpu/xsave: Support >> XSAVEC in the kernel") they're using this CPUID output also to size >> the compacted area used by XSAVEC. Getting back zero there isn't really >> liked, yet fpr PV that's the default on capable hardware: XSAVES isn't > > for. > >> exposed to PV domains. >> >> Considering that the size reported is that of the compacted save area, >> I view Linux'es assumption as appropriate (short of the SDM properly >> considering the case). Therefore we need to populate the field also when >> only XSAVEC is supported for a guest. > > This is a mess. The SDM is fairly clear (but only in Vol1) that this > leaf is specific to XSAVES. The way it's written my assumption is that they simply didn't care about XSAVEC when writing this, or they were assuming that both features would always be supported together (yet even if they are in Intel's hardware, the architecture should spell out things as if both were entirely independent, or it should specify that one takes the other as a prereq). > The APM has only an equation, which shows > it as the compacted size without reference to instructions. > > Ideally I'd like the opinion from some architects and a clarification to > the SDM... I've made a request through my contact, but of course there's little hope for a quick technical reply. >> Fixes: 460b9a4b3630 ("x86/xsaves: enable xsaves/xrstors for hvm guest") >> Fixes: 8d050ed1097c ("x86: don't expose XSAVES capability to PV guests") >> Signed-off-by: Jan Beulich <jbeulich@suse.com> > > CC Marek. Looks like Jan has found the issue you reported on IRC. > > Jan: Be aware that I submitted > https://lore.kernel.org/lkml/20220810221909.12768-1-andrew.cooper3@citrix.com/ > to Linux to correct some of the diagnostics. > >> --- >> I actually wonder why we surface the XSAVES feature bit to HVM domains, >> when we don't support any of the features. > > Because that's what was originally accepted into Xen, and I couldn't > retract it when fixing CPUID handling at first because it would regress > across migrate to a newer Xen. With CPUID data now in the migration > stream, we could in principle fix it, but at this point it's definitely > not worth the complexity or risk to adjust. Hmm. >> --- a/xen/arch/x86/cpuid.c >> +++ b/xen/arch/x86/cpuid.c >> @@ -1142,7 +1142,7 @@ void guest_cpuid(const struct vcpu *v, u >> switch ( subleaf ) >> { >> case 1: >> - if ( p->xstate.xsaves ) >> + if ( p->xstate.xsavec || p->xstate.xsaves ) > > If we're doing this, then it wants to be xsavec only, with the comment > being extended to explain why. Why would that be? Both insns use compacted format, and neither is dependent upon the other in terms of being supported. IOW XSAVES alone and XSAVEC alone enabled for a domain should still lead through this path. > But this is going to further complicate my several-year-old series > trying to get Xen's XSTATE handling into a position where we can start > to offer supervisor states. Where do you see further complication? The necessary fiddling with XSS here would of course be dependent upon p->xstate.xsaves alone (or, maybe better, on the set of enabled features in XSS being non-empty), but that's simply another (inner) if(). As an aside, I actually wonder what use the supplied size is to user mode code when any XSS-controlled feature is enabled: They'd allocate a needlessly large block of memory, as they would only be able to use XSAVEC. Jan
On 23/08/2022 10:27, Jan Beulich wrote: > On 23.08.2022 10:59, Andrew Cooper wrote: >> On 23/08/2022 07:42, Jan Beulich wrote: >>> exposed to PV domains. >>> >>> Considering that the size reported is that of the compacted save area, >>> I view Linux'es assumption as appropriate (short of the SDM properly >>> considering the case). Therefore we need to populate the field also when >>> only XSAVEC is supported for a guest. >> This is a mess. The SDM is fairly clear (but only in Vol1) that this >> leaf is specific to XSAVES. > The way it's written my assumption is that they simply didn't care about > XSAVEC when writing this, or they were assuming that both features would > always be supported together (yet even if they are in Intel's hardware, > the architecture should spell out things as if both were entirely > independent, or it should specify that one takes the other as a prereq). Real hardware has XSAVEC == XSAVES on Intel (Skylake) and AMD (Zen1). Despite an attempt to separate the parts of the ISA, they are inextricably linked. It is only under virt that we get XSAVEC without XSAVES. >>> Fixes: 460b9a4b3630 ("x86/xsaves: enable xsaves/xrstors for hvm guest") >>> Fixes: 8d050ed1097c ("x86: don't expose XSAVES capability to PV guests") >>> Signed-off-by: Jan Beulich <jbeulich@suse.com> >> CC Marek. Looks like Jan has found the issue you reported on IRC. >> >> Jan: Be aware that I submitted >> https://lore.kernel.org/lkml/20220810221909.12768-1-andrew.cooper3@citrix.com/ >> to Linux to correct some of the diagnostics. >>> --- a/xen/arch/x86/cpuid.c >>> +++ b/xen/arch/x86/cpuid.c >>> @@ -1142,7 +1142,7 @@ void guest_cpuid(const struct vcpu *v, u >>> switch ( subleaf ) >>> { >>> case 1: >>> - if ( p->xstate.xsaves ) >>> + if ( p->xstate.xsavec || p->xstate.xsaves ) >> If we're doing this, then it wants to be xsavec only, with the comment >> being extended to explain why. > Why would that be? Both insns use compacted format, and neither is > dependent upon the other in terms of being supported. IOW XSAVES alone > and XSAVEC alone enabled for a domain should still lead through this > path. Hmm. Because my fixes to compaction handling haven't been committed yet, and in particular one the one which makes XSAVES strictly depend on XSAVEC. In which case this hunk is correct for Xen as it currently is, and will be need to be adjusted when I rebase the compaction series. >> But this is going to further complicate my several-year-old series >> trying to get Xen's XSTATE handling into a position where we can start >> to offer supervisor states. > Where do you see further complication? The necessary fiddling with XSS > here would of course be dependent upon p->xstate.xsaves alone (or, > maybe better, on the set of enabled features in XSS being non-empty), > but that's simply another (inner) if(). > > As an aside, I actually wonder what use the supplied size is to user > mode code when any XSS-controlled feature is enabled: They'd allocate > a needlessly large block of memory, as they would only be able to use > XSAVEC. This field is an already known kernel=>user infoleak. There are threads about it on LKML. But it does highlight another problem. This change does not fix Linux on AMD Zen3 hardware, where the kernel will find the CPUID value larger than it can calculate the size to be, because Xen's use of CET-SS will show up in the CPUID value. Linux needs an adjustment from != to <= for this check. ~Andrew
On 23.08.2022 12:48, Andrew Cooper wrote: > On 23/08/2022 10:27, Jan Beulich wrote: >> On 23.08.2022 10:59, Andrew Cooper wrote: >>> But this is going to further complicate my several-year-old series >>> trying to get Xen's XSTATE handling into a position where we can start >>> to offer supervisor states. >> Where do you see further complication? The necessary fiddling with XSS >> here would of course be dependent upon p->xstate.xsaves alone (or, >> maybe better, on the set of enabled features in XSS being non-empty), >> but that's simply another (inner) if(). >> >> As an aside, I actually wonder what use the supplied size is to user >> mode code when any XSS-controlled feature is enabled: They'd allocate >> a needlessly large block of memory, as they would only be able to use >> XSAVEC. > > This field is an already known kernel=>user infoleak. There are threads > about it on LKML. > > But it does highlight another problem. This change does not fix Linux > on AMD Zen3 hardware, where the kernel will find the CPUID value larger > than it can calculate the size to be, because Xen's use of CET-SS will > show up in the CPUID value. Why would that be? We don't even have CET related defines for XCR0, so we don't enable the states in XSS. And I don't see why we would. Even for other XSTATE-managed but not XSTATE-enabled features we could clear the respective bits around the CPUID invocation (just like we may need to set some in XSS). We'd be in trouble only for any XSTATE- enabled feature that we make use of ourselves. Jan
On 24/08/2022 07:00, Jan Beulich wrote: > On 23.08.2022 12:48, Andrew Cooper wrote: >> On 23/08/2022 10:27, Jan Beulich wrote: >>> On 23.08.2022 10:59, Andrew Cooper wrote: >>>> But this is going to further complicate my several-year-old series >>>> trying to get Xen's XSTATE handling into a position where we can start >>>> to offer supervisor states. >>> Where do you see further complication? The necessary fiddling with XSS >>> here would of course be dependent upon p->xstate.xsaves alone (or, >>> maybe better, on the set of enabled features in XSS being non-empty), >>> but that's simply another (inner) if(). >>> >>> As an aside, I actually wonder what use the supplied size is to user >>> mode code when any XSS-controlled feature is enabled: They'd allocate >>> a needlessly large block of memory, as they would only be able to use >>> XSAVEC. >> This field is an already known kernel=>user infoleak. There are threads >> about it on LKML. >> >> But it does highlight another problem. This change does not fix Linux >> on AMD Zen3 hardware, where the kernel will find the CPUID value larger >> than it can calculate the size to be, because Xen's use of CET-SS will >> show up in the CPUID value. > Why would that be? We don't even have CET related defines for XCR0, so > we don't enable the states in XSS. And I don't see why we would. Even > for other XSTATE-managed but not XSTATE-enabled features we could > clear the respective bits around the CPUID invocation (just like we > may need to set some in XSS). We'd be in trouble only for any XSTATE- > enabled feature that we make use of ourselves. It's not Xen's CPUID invocation which is relevant. It's the guest kernels, which goes straight to hardware because AMD still doesn't have CPUID faulting. And yes, right now none of Xen's CET state shows up in XSTATE, but that needs to change in order to support CET in HVM guests if we don't want an enormous extra overhead in the general context switch path. ~Andrew
On 24.08.2022 14:19, Andrew Cooper wrote: > On 24/08/2022 07:00, Jan Beulich wrote: >> On 23.08.2022 12:48, Andrew Cooper wrote: >>> On 23/08/2022 10:27, Jan Beulich wrote: >>>> On 23.08.2022 10:59, Andrew Cooper wrote: >>>>> But this is going to further complicate my several-year-old series >>>>> trying to get Xen's XSTATE handling into a position where we can start >>>>> to offer supervisor states. >>>> Where do you see further complication? The necessary fiddling with XSS >>>> here would of course be dependent upon p->xstate.xsaves alone (or, >>>> maybe better, on the set of enabled features in XSS being non-empty), >>>> but that's simply another (inner) if(). >>>> >>>> As an aside, I actually wonder what use the supplied size is to user >>>> mode code when any XSS-controlled feature is enabled: They'd allocate >>>> a needlessly large block of memory, as they would only be able to use >>>> XSAVEC. >>> This field is an already known kernel=>user infoleak. There are threads >>> about it on LKML. >>> >>> But it does highlight another problem. This change does not fix Linux >>> on AMD Zen3 hardware, where the kernel will find the CPUID value larger >>> than it can calculate the size to be, because Xen's use of CET-SS will >>> show up in the CPUID value. >> Why would that be? We don't even have CET related defines for XCR0, so >> we don't enable the states in XSS. And I don't see why we would. Even >> for other XSTATE-managed but not XSTATE-enabled features we could >> clear the respective bits around the CPUID invocation (just like we >> may need to set some in XSS). We'd be in trouble only for any XSTATE- >> enabled feature that we make use of ourselves. > > It's not Xen's CPUID invocation which is relevant. It's the guest > kernels, which goes straight to hardware because AMD still doesn't have > CPUID faulting. > > And yes, right now none of Xen's CET state shows up in XSTATE, but that > needs to change in order to support CET in HVM guests if we don't want > an enormous extra overhead in the general context switch path. Right. HVM guests, though, aren't affected by the CPUID limitation you name for AMD. And PV guests could continue to be run with the bits off in XSS (we need to context switch XSS and XCR0 anyway). Jan
On 23.08.2022 12:48, Andrew Cooper wrote: > On 23/08/2022 10:27, Jan Beulich wrote: >> On 23.08.2022 10:59, Andrew Cooper wrote: >>> On 23/08/2022 07:42, Jan Beulich wrote: >>>> exposed to PV domains. >>>> >>>> Considering that the size reported is that of the compacted save area, >>>> I view Linux'es assumption as appropriate (short of the SDM properly >>>> considering the case). Therefore we need to populate the field also when >>>> only XSAVEC is supported for a guest. >>> This is a mess. The SDM is fairly clear (but only in Vol1) that this >>> leaf is specific to XSAVES. >> The way it's written my assumption is that they simply didn't care about >> XSAVEC when writing this, or they were assuming that both features would >> always be supported together (yet even if they are in Intel's hardware, >> the architecture should spell out things as if both were entirely >> independent, or it should specify that one takes the other as a prereq). > > Real hardware has XSAVEC == XSAVES on Intel (Skylake) and AMD (Zen1). > Despite an attempt to separate the parts of the ISA, they are > inextricably linked. > > It is only under virt that we get XSAVEC without XSAVES. > >>>> Fixes: 460b9a4b3630 ("x86/xsaves: enable xsaves/xrstors for hvm guest") >>>> Fixes: 8d050ed1097c ("x86: don't expose XSAVES capability to PV guests") >>>> Signed-off-by: Jan Beulich <jbeulich@suse.com> >>> CC Marek. Looks like Jan has found the issue you reported on IRC. >>> >>> Jan: Be aware that I submitted >>> https://lore.kernel.org/lkml/20220810221909.12768-1-andrew.cooper3@citrix.com/ >>> to Linux to correct some of the diagnostics. >>>> --- a/xen/arch/x86/cpuid.c >>>> +++ b/xen/arch/x86/cpuid.c >>>> @@ -1142,7 +1142,7 @@ void guest_cpuid(const struct vcpu *v, u >>>> switch ( subleaf ) >>>> { >>>> case 1: >>>> - if ( p->xstate.xsaves ) >>>> + if ( p->xstate.xsavec || p->xstate.xsaves ) >>> If we're doing this, then it wants to be xsavec only, with the comment >>> being extended to explain why. >> Why would that be? Both insns use compacted format, and neither is >> dependent upon the other in terms of being supported. IOW XSAVES alone >> and XSAVEC alone enabled for a domain should still lead through this >> path. > > Hmm. Because my fixes to compaction handling haven't been committed > yet, and in particular one the one which makes XSAVES strictly depend on > XSAVEC. > > In which case this hunk is correct for Xen as it currently is, and will > be need to be adjusted when I rebase the compaction series. May I translate this to an Ack then? Iirc there were no other change requests. >>> But this is going to further complicate my several-year-old series >>> trying to get Xen's XSTATE handling into a position where we can start >>> to offer supervisor states. >> Where do you see further complication? The necessary fiddling with XSS >> here would of course be dependent upon p->xstate.xsaves alone (or, >> maybe better, on the set of enabled features in XSS being non-empty), >> but that's simply another (inner) if(). >> >> As an aside, I actually wonder what use the supplied size is to user >> mode code when any XSS-controlled feature is enabled: They'd allocate >> a needlessly large block of memory, as they would only be able to use >> XSAVEC. > > This field is an already known kernel=>user infoleak. There are threads > about it on LKML. > > But it does highlight another problem. This change does not fix Linux > on AMD Zen3 hardware, where the kernel will find the CPUID value larger > than it can calculate the size to be, because Xen's use of CET-SS will > show up in the CPUID value. > > Linux needs an adjustment from != to <= for this check. I was wondering about that too, but if I'm not mistaken the change you suggest is the opposite of what would be apparently safe there (against overrunning buffers). Hence it may take more than just the comparison type to be modified. Jan
On 23/08/2022 13:01, Jan Beulich wrote: > On 23.08.2022 12:48, Andrew Cooper wrote: >> On 23/08/2022 10:27, Jan Beulich wrote: >>> On 23.08.2022 10:59, Andrew Cooper wrote: >>>> On 23/08/2022 07:42, Jan Beulich wrote: >>>> But this is going to further complicate my several-year-old series >>>> trying to get Xen's XSTATE handling into a position where we can start >>>> to offer supervisor states. >>> Where do you see further complication? The necessary fiddling with XSS >>> here would of course be dependent upon p->xstate.xsaves alone (or, >>> maybe better, on the set of enabled features in XSS being non-empty), >>> but that's simply another (inner) if(). >>> >>> As an aside, I actually wonder what use the supplied size is to user >>> mode code when any XSS-controlled feature is enabled: They'd allocate >>> a needlessly large block of memory, as they would only be able to use >>> XSAVEC. >> This field is an already known kernel=>user infoleak. There are threads >> about it on LKML. >> >> But it does highlight another problem. This change does not fix Linux >> on AMD Zen3 hardware, where the kernel will find the CPUID value larger >> than it can calculate the size to be, because Xen's use of CET-SS will >> show up in the CPUID value. >> >> Linux needs an adjustment from != to <= for this check. > I was wondering about that too, but if I'm not mistaken the change you > suggest is the opposite of what would be apparently safe there (against > overrunning buffers). Hence it may take more than just the comparison > type to be modified. The issue is that the CPUID leaf reports the compressed size of XCR0|XSS, which is >= what the XSAVEC instruction will write when it's only operating on XCR0 states. So either Linux trusts what it calculates from the other CPUID leaves, and gets the compressed size right, or it needs to account for the fact that in XenPV at least (probably UML too), that the CPUID leaf over-reports. ~Andrew
On 23/08/2022 13:01, Jan Beulich wrote: > On 23.08.2022 12:48, Andrew Cooper wrote: >> On 23/08/2022 10:27, Jan Beulich wrote: >>> On 23.08.2022 10:59, Andrew Cooper wrote: >>>> On 23/08/2022 07:42, Jan Beulich wrote: >>>>> +++ b/xen/arch/x86/cpuid.c >>>>> @@ -1142,7 +1142,7 @@ void guest_cpuid(const struct vcpu *v, u >>>>> switch ( subleaf ) >>>>> { >>>>> case 1: >>>>> - if ( p->xstate.xsaves ) >>>>> + if ( p->xstate.xsavec || p->xstate.xsaves ) >>>> If we're doing this, then it wants to be xsavec only, with the comment >>>> being extended to explain why. >>> Why would that be? Both insns use compacted format, and neither is >>> dependent upon the other in terms of being supported. IOW XSAVES alone >>> and XSAVEC alone enabled for a domain should still lead through this >>> path. >> Hmm. Because my fixes to compaction handling haven't been committed >> yet, and in particular one the one which makes XSAVES strictly depend on >> XSAVEC. >> >> In which case this hunk is correct for Xen as it currently is, and will >> be need to be adjusted when I rebase the compaction series. > May I translate this to an Ack then? Iirc there were no other change > requests. I was hoping that Marek would have time to test it too, seeing as he reported the bug first. But seeing as he's busy... Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> (with the typo in the commit message fixed.) ~Andrew
© 2016 - 2024 Red Hat, Inc.