[PATCH 4/5] x86/cpuid: Simplify recalculate_xstate()

Andrew Cooper posted 5 patches 3 years, 7 months ago
There is a newer version of this series
[PATCH 4/5] x86/cpuid: Simplify recalculate_xstate()
Posted by Andrew Cooper 3 years, 7 months ago
Make use of the new xstate_uncompressed_size() helper rather than maintaining
the running calculation while accumulating feature components.

The rest of the CPUID data can come direct from the raw cpuid policy.  All
per-component data forms an ABI through the behaviour of the X{SAVE,RSTOR}*
instructions, and are constant.

Use for_each_set_bit() rather than opencoding a slightly awkward version of
it.  Mask the attributes in ecx down based on the visible features.  This
isn't actually necessary for any components or attributes defined at the time
of writing (up to AMX), but is added out of an abundance of caution.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>

Using min() in for_each_set_bit() leads to awful code generation, as it
prohibits the optimisations for spotting that the bitmap is <= BITS_PER_LONG.
As p->xstate is long enough already, use a BUILD_BUG_ON() instead.
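
For illustration, a sketch of the two loop shapes being compared (not extra
patch content; the whole-leaf copy used as the body here is the simplified
variant mentioned later in the thread):

    /* Bounding with min() stops the helpers from spotting that the bitmap
     * fits within BITS_PER_LONG bits: */
    for_each_set_bit ( i, &xstates, min(63ul, ARRAY_SIZE(p->xstate.comp)) )
        p->xstate.raw[i] = raw_cpuid_policy.xstate.raw[i];

    /* Asserting the array length at build time keeps the bound a plain
     * constant instead: */
    BUILD_BUG_ON(ARRAY_SIZE(p->xstate.comp) < 63);
    for_each_set_bit ( i, &xstates, 63 )
        p->xstate.raw[i] = raw_cpuid_policy.xstate.raw[i];
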
---
 xen/arch/x86/cpuid.c | 52 +++++++++++++++++-----------------------------------
 1 file changed, 17 insertions(+), 35 deletions(-)

diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c
index 752bf244ea..c7f8388e5d 100644
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -154,8 +154,7 @@ static void sanitise_featureset(uint32_t *fs)
 static void recalculate_xstate(struct cpuid_policy *p)
 {
     uint64_t xstates = XSTATE_FP_SSE;
-    uint32_t xstate_size = XSTATE_AREA_MIN_SIZE;
-    unsigned int i, Da1 = p->xstate.Da1;
+    unsigned int i, ecx_bits = 0, Da1 = p->xstate.Da1;
 
     /*
      * The Da1 leaf is the only piece of information preserved in the common
@@ -167,61 +166,44 @@ static void recalculate_xstate(struct cpuid_policy *p)
         return;
 
     if ( p->basic.avx )
-    {
         xstates |= X86_XCR0_YMM;
-        xstate_size = max(xstate_size,
-                          xstate_offsets[X86_XCR0_YMM_POS] +
-                          xstate_sizes[X86_XCR0_YMM_POS]);
-    }
 
     if ( p->feat.mpx )
-    {
         xstates |= X86_XCR0_BNDREGS | X86_XCR0_BNDCSR;
-        xstate_size = max(xstate_size,
-                          xstate_offsets[X86_XCR0_BNDCSR_POS] +
-                          xstate_sizes[X86_XCR0_BNDCSR_POS]);
-    }
 
     if ( p->feat.avx512f )
-    {
         xstates |= X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM;
-        xstate_size = max(xstate_size,
-                          xstate_offsets[X86_XCR0_HI_ZMM_POS] +
-                          xstate_sizes[X86_XCR0_HI_ZMM_POS]);
-    }
 
     if ( p->feat.pku )
-    {
         xstates |= X86_XCR0_PKRU;
-        xstate_size = max(xstate_size,
-                          xstate_offsets[X86_XCR0_PKRU_POS] +
-                          xstate_sizes[X86_XCR0_PKRU_POS]);
-    }
 
-    p->xstate.max_size  =  xstate_size;
+    /* Subleaf 0 */
+    p->xstate.max_size =
+        xstate_uncompressed_size(xstates & ~XSTATE_XSAVES_ONLY);
     p->xstate.xcr0_low  =  xstates & ~XSTATE_XSAVES_ONLY;
     p->xstate.xcr0_high = (xstates & ~XSTATE_XSAVES_ONLY) >> 32;
 
+    /* Subleaf 1 */
     p->xstate.Da1 = Da1;
     if ( p->xstate.xsaves )
     {
+        ecx_bits |= 3; /* Align64, XSS */
         p->xstate.xss_low   =  xstates & XSTATE_XSAVES_ONLY;
         p->xstate.xss_high  = (xstates & XSTATE_XSAVES_ONLY) >> 32;
     }
-    else
-        xstates &= ~XSTATE_XSAVES_ONLY;
 
-    for ( i = 2; i < min(63ul, ARRAY_SIZE(p->xstate.comp)); ++i )
+    /* Subleafs 2+ */
+    xstates &= ~XSTATE_FP_SSE;
+    BUILD_BUG_ON(ARRAY_SIZE(p->xstate.comp) < 63);
+    for_each_set_bit ( i, &xstates, 63 )
     {
-        uint64_t curr_xstate = 1ul << i;
-
-        if ( !(xstates & curr_xstate) )
-            continue;
-
-        p->xstate.comp[i].size   = xstate_sizes[i];
-        p->xstate.comp[i].offset = xstate_offsets[i];
-        p->xstate.comp[i].xss    = curr_xstate & XSTATE_XSAVES_ONLY;
-        p->xstate.comp[i].align  = curr_xstate & xstate_align;
+        /*
+         * Pass through size (eax) and offset (ebx) directly.  Visibility of
+         * attributes in ecx limited by visible features in Da1.
+         */
+        p->xstate.raw[i].a = raw_cpuid_policy.xstate.raw[i].a;
+        p->xstate.raw[i].b = raw_cpuid_policy.xstate.raw[i].b;
+        p->xstate.raw[i].c = raw_cpuid_policy.xstate.raw[i].c & ecx_bits;
     }
 }
 
-- 
2.11.0


Re: [PATCH 4/5] x86/cpuid: Simplify recalculate_xstate()
Posted by Jan Beulich 3 years, 7 months ago
On 03.05.2021 17:39, Andrew Cooper wrote:
> Make use of the new xstate_uncompressed_size() helper rather than maintaining
> the running calculation while accumulating feature components.
> 
> The rest of the CPUID data can come direct from the raw cpuid policy.  All
> per-component data forms an ABI through the behaviour of the X{SAVE,RSTOR}*
> instructions, and are constant.
> 
> Use for_each_set_bit() rather than opencoding a slightly awkward version of
> it.  Mask the attributes in ecx down based on the visible features.  This
> isn't actually necessary for any components or attributes defined at the time
> of writing (up to AMX), but is added out of an abundance of caution.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Wei Liu <wl@xen.org>
> 
> Using min() in for_each_set_bit() leads to awful code generation, as it
> prohibits the optimisations for spotting that the bitmap is <= BITS_PER_LONG.
> As p->xstate is long enough already, use a BUILD_BUG_ON() instead.
> ---
>  xen/arch/x86/cpuid.c | 52 +++++++++++++++++-----------------------------------
>  1 file changed, 17 insertions(+), 35 deletions(-)
> 
> diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c
> index 752bf244ea..c7f8388e5d 100644
> --- a/xen/arch/x86/cpuid.c
> +++ b/xen/arch/x86/cpuid.c
> @@ -154,8 +154,7 @@ static void sanitise_featureset(uint32_t *fs)
>  static void recalculate_xstate(struct cpuid_policy *p)
>  {
>      uint64_t xstates = XSTATE_FP_SSE;
> -    uint32_t xstate_size = XSTATE_AREA_MIN_SIZE;
> -    unsigned int i, Da1 = p->xstate.Da1;
> +    unsigned int i, ecx_bits = 0, Da1 = p->xstate.Da1;
>  
>      /*
>       * The Da1 leaf is the only piece of information preserved in the common
> @@ -167,61 +166,44 @@ static void recalculate_xstate(struct cpuid_policy *p)
>          return;
>  
>      if ( p->basic.avx )
> -    {
>          xstates |= X86_XCR0_YMM;
> -        xstate_size = max(xstate_size,
> -                          xstate_offsets[X86_XCR0_YMM_POS] +
> -                          xstate_sizes[X86_XCR0_YMM_POS]);
> -    }
>  
>      if ( p->feat.mpx )
> -    {
>          xstates |= X86_XCR0_BNDREGS | X86_XCR0_BNDCSR;
> -        xstate_size = max(xstate_size,
> -                          xstate_offsets[X86_XCR0_BNDCSR_POS] +
> -                          xstate_sizes[X86_XCR0_BNDCSR_POS]);
> -    }
>  
>      if ( p->feat.avx512f )
> -    {
>          xstates |= X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM;
> -        xstate_size = max(xstate_size,
> -                          xstate_offsets[X86_XCR0_HI_ZMM_POS] +
> -                          xstate_sizes[X86_XCR0_HI_ZMM_POS]);
> -    }
>  
>      if ( p->feat.pku )
> -    {
>          xstates |= X86_XCR0_PKRU;
> -        xstate_size = max(xstate_size,
> -                          xstate_offsets[X86_XCR0_PKRU_POS] +
> -                          xstate_sizes[X86_XCR0_PKRU_POS]);
> -    }
>  
> -    p->xstate.max_size  =  xstate_size;
> +    /* Subleaf 0 */
> +    p->xstate.max_size =
> +        xstate_uncompressed_size(xstates & ~XSTATE_XSAVES_ONLY);
>      p->xstate.xcr0_low  =  xstates & ~XSTATE_XSAVES_ONLY;
>      p->xstate.xcr0_high = (xstates & ~XSTATE_XSAVES_ONLY) >> 32;
>  
> +    /* Subleaf 1 */
>      p->xstate.Da1 = Da1;
>      if ( p->xstate.xsaves )
>      {
> +        ecx_bits |= 3; /* Align64, XSS */

Align64 is also needed for p->xstate.xsavec afaict. I'm not really
convinced to tie one to the other either. I would rather think this
is a per-state-component attribute independent of other features.
Those state components could in turn have a dependency (like XSS
ones on XSAVES).

I'm also not happy at all to see you use a literal 3 here. We have
a struct for this, after all.
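
As a rough sketch of that structured alternative, reusing the comp[] bitfield
names from the code being removed and gating each attribute on the feature it
depends on (whether this is exactly what is being asked for is an assumption):

    uint64_t curr_xstate = 1ul << i;

    /* Named per-component attribute fields rather than a raw ecx mask: */
    p->xstate.comp[i].xss   = p->xstate.xsaves &&
                              (curr_xstate & XSTATE_XSAVES_ONLY);
    p->xstate.comp[i].align = p->xstate.xsavec &&
                              (curr_xstate & xstate_align);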

>          p->xstate.xss_low   =  xstates & XSTATE_XSAVES_ONLY;
>          p->xstate.xss_high  = (xstates & XSTATE_XSAVES_ONLY) >> 32;
>      }
> -    else
> -        xstates &= ~XSTATE_XSAVES_ONLY;
>  
> -    for ( i = 2; i < min(63ul, ARRAY_SIZE(p->xstate.comp)); ++i )
> +    /* Subleafs 2+ */
> +    xstates &= ~XSTATE_FP_SSE;
> +    BUILD_BUG_ON(ARRAY_SIZE(p->xstate.comp) < 63);
> +    for_each_set_bit ( i, &xstates, 63 )
>      {
> -        uint64_t curr_xstate = 1ul << i;
> -
> -        if ( !(xstates & curr_xstate) )
> -            continue;
> -
> -        p->xstate.comp[i].size   = xstate_sizes[i];
> -        p->xstate.comp[i].offset = xstate_offsets[i];
> -        p->xstate.comp[i].xss    = curr_xstate & XSTATE_XSAVES_ONLY;
> -        p->xstate.comp[i].align  = curr_xstate & xstate_align;
> +        /*
> +         * Pass through size (eax) and offset (ebx) directly.  Visibility of
> +         * attributes in ecx limited by visible features in Da1.
> +         */
> +        p->xstate.raw[i].a = raw_cpuid_policy.xstate.raw[i].a;
> +        p->xstate.raw[i].b = raw_cpuid_policy.xstate.raw[i].b;
> +        p->xstate.raw[i].c = raw_cpuid_policy.xstate.raw[i].c & ecx_bits;

To me, going to raw[].{a,b,c,d} looks like a backwards move, to be
honest. Both this and the literal 3 above make it harder to locate
all the places that need changing if a new bit (like xfd) is to be
added. It would be better if grep-ing for an existing field name
(say "xss") would easily turn up all involved places.

Jan

Re: [PATCH 4/5] x86/cpuid: Simplify recalculate_xstate()
Posted by Andrew Cooper 3 years, 7 months ago
On 04/05/2021 13:43, Jan Beulich wrote:
> On 03.05.2021 17:39, Andrew Cooper wrote:
>> Make use of the new xstate_uncompressed_size() helper rather than maintaining
>> the running calculation while accumulating feature components.
>>
>> The rest of the CPUID data can come direct from the raw cpuid policy.  All
>> per-component data forms an ABI through the behaviour of the X{SAVE,RSTOR}*
>> instructions, and are constant.
>>
>> Use for_each_set_bit() rather than opencoding a slightly awkward version of
>> it.  Mask the attributes in ecx down based on the visible features.  This
>> isn't actually necessary for any components or attributes defined at the time
>> of writing (up to AMX), but is added out of an abundance of caution.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Roger Pau Monné <roger.pau@citrix.com>
>> CC: Wei Liu <wl@xen.org>
>>
>> Using min() in for_each_set_bit() leads to awful code generation, as it
>> prohibits the optimisations for spotting that the bitmap is <= BITS_PER_LONG.
>> As p->xstate is long enough already, use a BUILD_BUG_ON() instead.
>> ---
>>  xen/arch/x86/cpuid.c | 52 +++++++++++++++++-----------------------------------
>>  1 file changed, 17 insertions(+), 35 deletions(-)
>>
>> diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c
>> index 752bf244ea..c7f8388e5d 100644
>> --- a/xen/arch/x86/cpuid.c
>> +++ b/xen/arch/x86/cpuid.c
>> @@ -154,8 +154,7 @@ static void sanitise_featureset(uint32_t *fs)
>>  static void recalculate_xstate(struct cpuid_policy *p)
>>  {
>>      uint64_t xstates = XSTATE_FP_SSE;
>> -    uint32_t xstate_size = XSTATE_AREA_MIN_SIZE;
>> -    unsigned int i, Da1 = p->xstate.Da1;
>> +    unsigned int i, ecx_bits = 0, Da1 = p->xstate.Da1;
>>  
>>      /*
>>       * The Da1 leaf is the only piece of information preserved in the common
>> @@ -167,61 +166,44 @@ static void recalculate_xstate(struct cpuid_policy *p)
>>          return;
>>  
>>      if ( p->basic.avx )
>> -    {
>>          xstates |= X86_XCR0_YMM;
>> -        xstate_size = max(xstate_size,
>> -                          xstate_offsets[X86_XCR0_YMM_POS] +
>> -                          xstate_sizes[X86_XCR0_YMM_POS]);
>> -    }
>>  
>>      if ( p->feat.mpx )
>> -    {
>>          xstates |= X86_XCR0_BNDREGS | X86_XCR0_BNDCSR;
>> -        xstate_size = max(xstate_size,
>> -                          xstate_offsets[X86_XCR0_BNDCSR_POS] +
>> -                          xstate_sizes[X86_XCR0_BNDCSR_POS]);
>> -    }
>>  
>>      if ( p->feat.avx512f )
>> -    {
>>          xstates |= X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM;
>> -        xstate_size = max(xstate_size,
>> -                          xstate_offsets[X86_XCR0_HI_ZMM_POS] +
>> -                          xstate_sizes[X86_XCR0_HI_ZMM_POS]);
>> -    }
>>  
>>      if ( p->feat.pku )
>> -    {
>>          xstates |= X86_XCR0_PKRU;
>> -        xstate_size = max(xstate_size,
>> -                          xstate_offsets[X86_XCR0_PKRU_POS] +
>> -                          xstate_sizes[X86_XCR0_PKRU_POS]);
>> -    }
>>  
>> -    p->xstate.max_size  =  xstate_size;
>> +    /* Subleaf 0 */
>> +    p->xstate.max_size =
>> +        xstate_uncompressed_size(xstates & ~XSTATE_XSAVES_ONLY);
>>      p->xstate.xcr0_low  =  xstates & ~XSTATE_XSAVES_ONLY;
>>      p->xstate.xcr0_high = (xstates & ~XSTATE_XSAVES_ONLY) >> 32;
>>  
>> +    /* Subleaf 1 */
>>      p->xstate.Da1 = Da1;
>>      if ( p->xstate.xsaves )
>>      {
>> +        ecx_bits |= 3; /* Align64, XSS */
> Align64 is also needed for p->xstate.xsavec afaict. I'm not really
> convinced to tie one to the other either. I would rather think this
> is a per-state-component attribute independent of other features.
> Those state components could in turn have a dependency (like XSS
> ones on XSAVES).

There is no such thing as a system with xsavec != xsaves (although there
does appear to be one line of AMD CPU with xsaves and not xgetbv1).

Through some (likely unintentional) coupling of data in CPUID, the
compressed dynamic size (CPUID.0xd[1].ebx) is required for xsavec, and
is strictly defined as XCR0|XSS, which forces xsaves into the mix.

In fact, an error with the spec is that userspace can calculate the
kernel's choice of MSR_XSS using CPUID data alone - there is not
currently an ambiguous combination of sizes of supervisor state
components.  This fact also makes XSAVEC suboptimal even for userspace
to use, because it is forced to allocate larger-than-necessary buffers.
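
As a minimal sketch of the size arithmetic behind that point (a hypothetical
helper, assuming the usual compressed-format layout: the 576-byte legacy plus
header area, then each enabled component in turn, 64-byte aligned when its
align64 attribute is set; CPUID.0xd[1].ebx reports this for XCR0|XSS combined,
so an XSAVEC-only consumer still sizes its buffer for the supervisor
components):

    static unsigned int compressed_size(uint64_t states)
    {
        unsigned int size = XSTATE_AREA_MIN_SIZE; /* legacy area + header */
        unsigned int i;

        states &= ~XSTATE_FP_SSE;     /* x87/SSE live in the legacy area */
        for_each_set_bit ( i, &states, 63 )
        {
            if ( xstate_align & (1ul << i) )
                size = ROUNDUP(size, 64);
            size += xstate_sizes[i];
        }

        return size;
    }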

In principle, we could ignore the coupling and support xsavec without
xsaves, but given that XSAVES is strictly more useful than XSAVEC, I'm
not sure it is worth trying to support.

>
> I'm also not happy at all to see you use a literal 3 here. We have
> a struct for this, after all.
>
>>          p->xstate.xss_low   =  xstates & XSTATE_XSAVES_ONLY;
>>          p->xstate.xss_high  = (xstates & XSTATE_XSAVES_ONLY) >> 32;
>>      }
>> -    else
>> -        xstates &= ~XSTATE_XSAVES_ONLY;
>>  
>> -    for ( i = 2; i < min(63ul, ARRAY_SIZE(p->xstate.comp)); ++i )
>> +    /* Subleafs 2+ */
>> +    xstates &= ~XSTATE_FP_SSE;
>> +    BUILD_BUG_ON(ARRAY_SIZE(p->xstate.comp) < 63);
>> +    for_each_set_bit ( i, &xstates, 63 )
>>      {
>> -        uint64_t curr_xstate = 1ul << i;
>> -
>> -        if ( !(xstates & curr_xstate) )
>> -            continue;
>> -
>> -        p->xstate.comp[i].size   = xstate_sizes[i];
>> -        p->xstate.comp[i].offset = xstate_offsets[i];
>> -        p->xstate.comp[i].xss    = curr_xstate & XSTATE_XSAVES_ONLY;
>> -        p->xstate.comp[i].align  = curr_xstate & xstate_align;
>> +        /*
>> +         * Pass through size (eax) and offset (ebx) directly.  Visibility of
>> +         * attributes in ecx limited by visible features in Da1.
>> +         */
>> +        p->xstate.raw[i].a = raw_cpuid_policy.xstate.raw[i].a;
>> +        p->xstate.raw[i].b = raw_cpuid_policy.xstate.raw[i].b;
>> +        p->xstate.raw[i].c = raw_cpuid_policy.xstate.raw[i].c & ecx_bits;
> To me, going to raw[].{a,b,c,d} looks like a backwards move, to be
> honest. Both this and the literal 3 above make it harder to locate
> all the places that need changing if a new bit (like xfd) is to be
> added. It would be better if grep-ing for an existing field name
> (say "xss") would easily turn up all involved places.

It's specifically to reduce the number of areas needing editing when a new
state component is added, and therefore the number of opportunities to screw
things up.

As said in the commit message, I'm not even convinced that the ecx_bits
mask is necessary, as new attributes only come in with new behaviours of
new state components.

If we choose to skip the ecx masking, then this loop body becomes even
more simple.  Just p->xstate.raw[i] = raw_cpuid_policy.xstate.raw[i].

Even if Intel do break with tradition, and retrofit new attributes into
existing subleafs, leaking them to guests won't cause anything to
explode (the bits are still reserved after all), and we can fix anything
necessary at that point.

~Andrew


Re: [PATCH 4/5] x86/cpuid: Simplify recalculate_xstate()
Posted by Jan Beulich 3 years, 6 months ago
On 04.05.2021 15:58, Andrew Cooper wrote:
> On 04/05/2021 13:43, Jan Beulich wrote:
>> On 03.05.2021 17:39, Andrew Cooper wrote:
>>> Make use of the new xstate_uncompressed_size() helper rather than maintaining
>>> the running calculation while accumulating feature components.
>>>
>>> The rest of the CPUID data can come direct from the raw cpuid policy.  All
>>> per-component data forms an ABI through the behaviour of the X{SAVE,RSTOR}*
>>> instructions, and are constant.
>>>
>>> Use for_each_set_bit() rather than opencoding a slightly awkward version of
>>> it.  Mask the attributes in ecx down based on the visible features.  This
>>> isn't actually necessary for any components or attributes defined at the time
>>> of writing (up to AMX), but is added out of an abundance of caution.
>>>
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>> ---
>>> CC: Jan Beulich <JBeulich@suse.com>
>>> CC: Roger Pau Monné <roger.pau@citrix.com>
>>> CC: Wei Liu <wl@xen.org>
>>>
>>> Using min() in for_each_set_bit() leads to awful code generation, as it
>>> prohibits the optimisations for spotting that the bitmap is <= BITS_PER_LONG.
>>> As p->xstate is long enough already, use a BUILD_BUG_ON() instead.
>>> ---
>>>  xen/arch/x86/cpuid.c | 52 +++++++++++++++++-----------------------------------
>>>  1 file changed, 17 insertions(+), 35 deletions(-)
>>>
>>> diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c
>>> index 752bf244ea..c7f8388e5d 100644
>>> --- a/xen/arch/x86/cpuid.c
>>> +++ b/xen/arch/x86/cpuid.c
>>> @@ -154,8 +154,7 @@ static void sanitise_featureset(uint32_t *fs)
>>>  static void recalculate_xstate(struct cpuid_policy *p)
>>>  {
>>>      uint64_t xstates = XSTATE_FP_SSE;
>>> -    uint32_t xstate_size = XSTATE_AREA_MIN_SIZE;
>>> -    unsigned int i, Da1 = p->xstate.Da1;
>>> +    unsigned int i, ecx_bits = 0, Da1 = p->xstate.Da1;
>>>  
>>>      /*
>>>       * The Da1 leaf is the only piece of information preserved in the common
>>> @@ -167,61 +166,44 @@ static void recalculate_xstate(struct cpuid_policy *p)
>>>          return;
>>>  
>>>      if ( p->basic.avx )
>>> -    {
>>>          xstates |= X86_XCR0_YMM;
>>> -        xstate_size = max(xstate_size,
>>> -                          xstate_offsets[X86_XCR0_YMM_POS] +
>>> -                          xstate_sizes[X86_XCR0_YMM_POS]);
>>> -    }
>>>  
>>>      if ( p->feat.mpx )
>>> -    {
>>>          xstates |= X86_XCR0_BNDREGS | X86_XCR0_BNDCSR;
>>> -        xstate_size = max(xstate_size,
>>> -                          xstate_offsets[X86_XCR0_BNDCSR_POS] +
>>> -                          xstate_sizes[X86_XCR0_BNDCSR_POS]);
>>> -    }
>>>  
>>>      if ( p->feat.avx512f )
>>> -    {
>>>          xstates |= X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM;
>>> -        xstate_size = max(xstate_size,
>>> -                          xstate_offsets[X86_XCR0_HI_ZMM_POS] +
>>> -                          xstate_sizes[X86_XCR0_HI_ZMM_POS]);
>>> -    }
>>>  
>>>      if ( p->feat.pku )
>>> -    {
>>>          xstates |= X86_XCR0_PKRU;
>>> -        xstate_size = max(xstate_size,
>>> -                          xstate_offsets[X86_XCR0_PKRU_POS] +
>>> -                          xstate_sizes[X86_XCR0_PKRU_POS]);
>>> -    }
>>>  
>>> -    p->xstate.max_size  =  xstate_size;
>>> +    /* Subleaf 0 */
>>> +    p->xstate.max_size =
>>> +        xstate_uncompressed_size(xstates & ~XSTATE_XSAVES_ONLY);
>>>      p->xstate.xcr0_low  =  xstates & ~XSTATE_XSAVES_ONLY;
>>>      p->xstate.xcr0_high = (xstates & ~XSTATE_XSAVES_ONLY) >> 32;
>>>  
>>> +    /* Subleaf 1 */
>>>      p->xstate.Da1 = Da1;
>>>      if ( p->xstate.xsaves )
>>>      {
>>> +        ecx_bits |= 3; /* Align64, XSS */
>> Align64 is also needed for p->xstate.xsavec afaict. I'm not really
>> convinced to tie one to the other either. I would rather think this
>> is a per-state-component attribute independent of other features.
>> Those state components could in turn have a dependency (like XSS
>> ones on XSAVES).
> 
> There is no such thing as a system with xsavec != xsaves (although there
> does appear to be one line of AMD CPU with xsaves and not xgetbv1).

If we believed there was such a dependency, gen-cpuid.py should imo
already express it. At the latest, when we make ourselves depend on such
coupling (which I remain not fully convinced of), such a dependency would
need adding, such that it becomes impossible to turn off xsaves
without also turning off xsavec. (Of course, a way to express this
symbolically doesn't currently exist, and is only being added as a
"side effect" of "x86: XFD enabling".)

> Through some (likely unintentional) coupling of data in CPUID, the
> compressed dynamic size (CPUID.0xd[1].ebx) is required for xsavec, and
> is strictly defined as XCR0|XSS, which forces xsaves into the mix.
> 
> In fact, an error with the spec is that userspace can calculate the
> kernel's choice of MSR_XSS using CPUID data alone - there is not
> currently an ambiguous combination of sizes of supervisor state
> components.  This fact also makes XSAVEC suboptimal even for userspace
> to use, because it is forced to allocate larger-than-necessary buffers.

But space-wise it's still better that way than using the uncompressed
format.

> In principle, we could ignore the coupling and support xsavec without
> xsaves, but given that XSAVES is strictly more useful than XSAVEC, I'm
> not sure it is worth trying to support.

I think we should, but I'm not going to object to the alternative as
long as dependencies are properly (put) in place.

>> I'm also not happy at all to see you use a literal 3 here. We have
>> a struct for this, after all.
>>
>>>          p->xstate.xss_low   =  xstates & XSTATE_XSAVES_ONLY;
>>>          p->xstate.xss_high  = (xstates & XSTATE_XSAVES_ONLY) >> 32;
>>>      }
>>> -    else
>>> -        xstates &= ~XSTATE_XSAVES_ONLY;
>>>  
>>> -    for ( i = 2; i < min(63ul, ARRAY_SIZE(p->xstate.comp)); ++i )
>>> +    /* Subleafs 2+ */
>>> +    xstates &= ~XSTATE_FP_SSE;
>>> +    BUILD_BUG_ON(ARRAY_SIZE(p->xstate.comp) < 63);
>>> +    for_each_set_bit ( i, &xstates, 63 )
>>>      {
>>> -        uint64_t curr_xstate = 1ul << i;
>>> -
>>> -        if ( !(xstates & curr_xstate) )
>>> -            continue;
>>> -
>>> -        p->xstate.comp[i].size   = xstate_sizes[i];
>>> -        p->xstate.comp[i].offset = xstate_offsets[i];
>>> -        p->xstate.comp[i].xss    = curr_xstate & XSTATE_XSAVES_ONLY;
>>> -        p->xstate.comp[i].align  = curr_xstate & xstate_align;
>>> +        /*
>>> +         * Pass through size (eax) and offset (ebx) directly.  Visibility of
>>> +         * attributes in ecx limited by visible features in Da1.
>>> +         */
>>> +        p->xstate.raw[i].a = raw_cpuid_policy.xstate.raw[i].a;
>>> +        p->xstate.raw[i].b = raw_cpuid_policy.xstate.raw[i].b;
>>> +        p->xstate.raw[i].c = raw_cpuid_policy.xstate.raw[i].c & ecx_bits;
>> To me, going to raw[].{a,b,c,d} looks like a backwards move, to be
>> honest. Both this and the literal 3 above make it harder to locate
>> all the places that need changing if a new bit (like xfd) is to be
>> added. It would be better if grep-ing for an existing field name
>> (say "xss") would easily turn up all involved places.
> 
> It's specifically to reduce the number of areas needing editing when a new
> state component is added, and therefore the number of opportunities to screw
> things up.
> 
> As said in the commit message, I'm not even convinced that the ecx_bits
> mask is necessary, as new attributes only come in with new behaviours of
> new state components.
> 
> If we choose to skip the ecx masking, then this loop body becomes even
> more simple.  Just p->xstate.raw[i] = raw_cpuid_policy.xstate.raw[i].
> 
> Even if Intel do break with tradition, and retrofit new attributes into
> existing subleafs, leaking them to guests won't cause anything to
> explode (the bits are still reserved after all), and we can fix anything
> necessary at that point.

I don't think this would necessarily go without breakage. What if,
assuming XFD support is in, an existing component got XFD sensitivity
added to it? If, like you were suggesting elsewhere, and like I had
it initially, we used a build-time constant for XFD-affected
components, we'd break consuming guests. The per-component XFD bit
(just to again take as example) also isn't strictly speaking tied to
the general XFD feature flag (but to me it makes sense for us to
enforce respective consistency). Plus, in general, the moment a flag
is no longer reserved in the spec, it is not reserved anywhere
anymore: An aware (newer) guest running on unaware (older) Xen ought
to still function correctly.

Jan

Re: [PATCH 4/5] x86/cpuid: Simplify recalculate_xstate()
Posted by Andrew Cooper 3 years, 6 months ago
On 05/05/2021 09:19, Jan Beulich wrote:
> On 04.05.2021 15:58, Andrew Cooper wrote:
>> On 04/05/2021 13:43, Jan Beulich wrote:
>>> I'm also not happy at all to see you use a literal 3 here. We have
>>> a struct for this, after all.
>>>
>>>>          p->xstate.xss_low   =  xstates & XSTATE_XSAVES_ONLY;
>>>>          p->xstate.xss_high  = (xstates & XSTATE_XSAVES_ONLY) >> 32;
>>>>      }
>>>> -    else
>>>> -        xstates &= ~XSTATE_XSAVES_ONLY;
>>>>  
>>>> -    for ( i = 2; i < min(63ul, ARRAY_SIZE(p->xstate.comp)); ++i )
>>>> +    /* Subleafs 2+ */
>>>> +    xstates &= ~XSTATE_FP_SSE;
>>>> +    BUILD_BUG_ON(ARRAY_SIZE(p->xstate.comp) < 63);
>>>> +    for_each_set_bit ( i, &xstates, 63 )
>>>>      {
>>>> -        uint64_t curr_xstate = 1ul << i;
>>>> -
>>>> -        if ( !(xstates & curr_xstate) )
>>>> -            continue;
>>>> -
>>>> -        p->xstate.comp[i].size   = xstate_sizes[i];
>>>> -        p->xstate.comp[i].offset = xstate_offsets[i];
>>>> -        p->xstate.comp[i].xss    = curr_xstate & XSTATE_XSAVES_ONLY;
>>>> -        p->xstate.comp[i].align  = curr_xstate & xstate_align;
>>>> +        /*
>>>> +         * Pass through size (eax) and offset (ebx) directly.  Visibility of
>>>> +         * attributes in ecx limited by visible features in Da1.
>>>> +         */
>>>> +        p->xstate.raw[i].a = raw_cpuid_policy.xstate.raw[i].a;
>>>> +        p->xstate.raw[i].b = raw_cpuid_policy.xstate.raw[i].b;
>>>> +        p->xstate.raw[i].c = raw_cpuid_policy.xstate.raw[i].c & ecx_bits;
>>> To me, going to raw[].{a,b,c,d} looks like a backwards move, to be
>>> honest. Both this and the literal 3 above make it harder to locate
>>> all the places that need changing if a new bit (like xfd) is to be
>>> added. It would be better if grep-ing for an existing field name
>>> (say "xss") would easily turn up all involved places.
>> It's specifically to reduce the number of areas needing editing when a
>> new state, and therefore the number of opportunities to screw things up.
>>
>> As said in the commit message, I'm not even convinced that the ecx_bits
>> mask is necessary, as new attributes only come in with new behaviours of
>> new state components.
>>
>> If we choose to skip the ecx masking, then this loop body becomes even
>> more simple.  Just p->xstate.raw[i] = raw_cpuid_policy.xstate.raw[i].
>>
>> Even if Intel do break with tradition, and retrofit new attributes into
>> existing subleafs, leaking them to guests won't cause anything to
>> explode (the bits are still reserved after all), and we can fix anything
>> necessary at that point.
> I don't think this would necessarily go without breakage. What if,
> assuming XFD support is in, an existing component got XFD sensitivity
> added to it?

I think that is exceedingly unlikely to happen.

>  If, like you were suggesting elsewhere, and like I had
> it initially, we used a build-time constant for XFD-affected
> components, we'd break consuming guests. The per-component XFD bit
> (just to again take as example) also isn't strictly speaking tied to
> the general XFD feature flag (but to me it makes sense for us to
> enforce respective consistency). Plus, in general, the moment a flag
> is no longer reserved in the spec, it is not reserved anywhere
> anymore: An aware (newer) guest running on unaware (older) Xen ought
> to still function correctly.

They're still technically reserved, because of the masking of the XFD
bit in the feature leaf.

However, having pondered this for some time, we do need to retain the
attributes masking, because e.g. AMX && !XSAVEC is a legal (if very
unwise) combination to expose, and the align64 bits want to disappear
from the TILE state attributes.

Also, in terms of implementation, the easiest way to do something
plausible here is a dependency chain of XSAVE => XSAVEC (implies align64
masking) => XSAVES (implies xss masking) => CET_*.

XFD would depend on XSAVE, and would imply masking the xfd attribute.
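
A sketch of how the attribute masking might then be assembled (bits 0 and 1 of
each subleaf's ecx being XSS and align64 as earlier in the thread; the xfd
attribute's bit position and a p->xstate.xfd feature field are assumptions for
illustration only):

    uint32_t ecx_mask = 0;

    if ( p->xstate.xsavec )
        ecx_mask |= 1u << 1;           /* align64 */
    if ( p->xstate.xsaves )
        ecx_mask |= 1u << 0;           /* xss */
    if ( p->xstate.xfd )               /* assumed Da1 feature bit */
        ecx_mask |= 1u << 2;           /* xfd attribute, position assumed */

    /* ... then, per subleaf: */
    p->xstate.raw[i].c = raw_cpuid_policy.xstate.raw[i].c & ecx_mask;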

This still leaves the broken corner case of the dynamic compressed size,
but I think I can live with that to avoid the added complexity of trying
to force XSAVEC == XSAVES.

~Andrew


Re: [PATCH 4/5] x86/cpuid: Simplify recalculate_xstate()
Posted by Jan Beulich 3 years, 6 months ago
On 05.05.2021 16:53, Andrew Cooper wrote:
> On 05/05/2021 09:19, Jan Beulich wrote:
>> On 04.05.2021 15:58, Andrew Cooper wrote:
>>> On 04/05/2021 13:43, Jan Beulich wrote:
>>>> I'm also not happy at all to see you use a literal 3 here. We have
>>>> a struct for this, after all.
>>>>
>>>>>          p->xstate.xss_low   =  xstates & XSTATE_XSAVES_ONLY;
>>>>>          p->xstate.xss_high  = (xstates & XSTATE_XSAVES_ONLY) >> 32;
>>>>>      }
>>>>> -    else
>>>>> -        xstates &= ~XSTATE_XSAVES_ONLY;
>>>>>  
>>>>> -    for ( i = 2; i < min(63ul, ARRAY_SIZE(p->xstate.comp)); ++i )
>>>>> +    /* Subleafs 2+ */
>>>>> +    xstates &= ~XSTATE_FP_SSE;
>>>>> +    BUILD_BUG_ON(ARRAY_SIZE(p->xstate.comp) < 63);
>>>>> +    for_each_set_bit ( i, &xstates, 63 )
>>>>>      {
>>>>> -        uint64_t curr_xstate = 1ul << i;
>>>>> -
>>>>> -        if ( !(xstates & curr_xstate) )
>>>>> -            continue;
>>>>> -
>>>>> -        p->xstate.comp[i].size   = xstate_sizes[i];
>>>>> -        p->xstate.comp[i].offset = xstate_offsets[i];
>>>>> -        p->xstate.comp[i].xss    = curr_xstate & XSTATE_XSAVES_ONLY;
>>>>> -        p->xstate.comp[i].align  = curr_xstate & xstate_align;
>>>>> +        /*
>>>>> +         * Pass through size (eax) and offset (ebx) directly.  Visibility of
>>>>> +         * attributes in ecx limited by visible features in Da1.
>>>>> +         */
>>>>> +        p->xstate.raw[i].a = raw_cpuid_policy.xstate.raw[i].a;
>>>>> +        p->xstate.raw[i].b = raw_cpuid_policy.xstate.raw[i].b;
>>>>> +        p->xstate.raw[i].c = raw_cpuid_policy.xstate.raw[i].c & ecx_bits;
>>>> To me, going to raw[].{a,b,c,d} looks like a backwards move, to be
>>>> honest. Both this and the literal 3 above make it harder to locate
>>>> all the places that need changing if a new bit (like xfd) is to be
>>>> added. It would be better if grep-ing for an existing field name
>>>> (say "xss") would easily turn up all involved places.
>>> It's specifically to reduce the number of areas needing editing when a new
>>> state component is added, and therefore the number of opportunities to
>>> screw things up.
>>>
>>> As said in the commit message, I'm not even convinced that the ecx_bits
>>> mask is necessary, as new attributes only come in with new behaviours of
>>> new state components.
>>>
>>> If we choose to skip the ecx masking, then this loop body becomes even
>>> more simple.  Just p->xstate.raw[i] = raw_cpuid_policy.xstate.raw[i].
>>>
>>> Even if Intel do break with tradition, and retrofit new attributes into
>>> existing subleafs, leaking them to guests won't cause anything to
>>> explode (the bits are still reserved after all), and we can fix anything
>>> necessary at that point.
>> I don't think this would necessarily go without breakage. What if,
>> assuming XFD support is in, an existing component got XFD sensitivity
>> added to it?
> 
> I think that is exceedingly unlikely to happen.
> 
>>  If, like you were suggesting elsewhere, and like I had
>> it initially, we used a build-time constant for XFD-affected
>> components, we'd break consuming guests. The per-component XFD bit
>> (just to again take as example) also isn't strictly speaking tied to
>> the general XFD feature flag (but to me it makes sense for us to
>> enforce respective consistency). Plus, in general, the moment a flag
>> is no longer reserved in the spec, it is not reserved anywhere
>> anymore: An aware (newer) guest running on unaware (older) Xen ought
>> to still function correctly.
> 
> They're still technically reserved, because of the masking of the XFD
> bit in the feature leaf.
> 
> However, pondered this for some time, we do need to retain the
> attributes masking, because e.g. AMX && !XSAVEC is a legal (if very
> unwise) combination to expose, and the align64 bits want to disappear
> from the TILE state attributes.
> 
> Also, in terms of implementation, the easiest way to do something
> plausible here is a dependency chain of XSAVE => XSAVEC (implies align64
> masking) => XSAVES (implies xss masking) => CET_*.
> 
> XFD would depend on XSAVE, and would imply masking the xfd attribute.

Okay, let's go with this then.

Jan