[PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents"

Andrew Cooper posted 1 patch 2 years, 4 months ago
Test gitlab-ci failed
Patches applied successfully (tree, apply log)
git fetch https://gitlab.com/xen-project/patchew/xen tags/patchew/20211124211152.1142-1-andrew.cooper3@citrix.com
tools/libs/guest/xg_cpuid_x86.c          |   7 ---
tools/tests/cpu-policy/test-cpu-policy.c | 101 -------------------------------
xen/arch/x86/cpuid.c                     |  10 ---
xen/include/xen/lib/x86/cpuid.h          |   7 ---
xen/lib/x86/cpuid.c                      |  39 ------------
5 files changed, 164 deletions(-)
[PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents"
Posted by Andrew Cooper 2 years, 4 months ago
OSSTest has identified a 3rd regression caused by this change.  Migration
between Xen 4.15 and 4.16 on the nocera pair of machines (AMD Opteron 4133)
fails with:

  xc: error: Failed to set CPUID policy: leaf 00000000, subleaf ffffffff, msr ffffffff (22 = Invalid argument): Internal error
  xc: error: Restore failed (22 = Invalid argument): Internal error

which is a safety check to prevent resuming the guest when the CPUID data has
been truncated.  The problem is caused by shrinking of the max policies, which
is an ABI that needs handling compatibly between different versions of Xen.

Furthermore, shrinking of the default policies also breaks things in some
cases, because certain cpuid= settings in a VM config file which used to have
an effect will now be silently discarded.

This reverts commit 540d911c2813c3d8f4cdbb3f5672119e5e768a3d, as well as the
partial fix attempt in 81da2b544cbb003a5447c9b14d275746ad22ab37 (which added
one new case where cpuid= settings might not apply correctly) and restores the
same behaviour as Xen 4.15.

Fixes: 540d911c2813 ("x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents")
Fixes: 81da2b544cbb ("x86/cpuid: prevent shrinking migrated policies max leaves")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Ian Jackson <iwj@xenproject.org>

Passing CI runs:
  https://gitlab.com/xen-project/people/andyhhp/xen/-/pipelines/415822110
  https://cirrus-ci.com/build/4915643318272000

This supercedes
https://lore.kernel.org/xen-devel/20211124161649.83189-1-roger.pau@citrix.com/
which is a step in the right direction (it fixes the OSSTest breakage), but
incomplete (doesn't fix the potential cpuid= breakage).

For 4.16.  This is the 3rd breakage caused by 540d911c2813, and there is no
reasonable workaround.  Obviously, this revert isn't completely without risk,
but going wholesale back to the 4.15 behaviour is by far the safest option at
this point.
---
 tools/libs/guest/xg_cpuid_x86.c          |   7 ---
 tools/tests/cpu-policy/test-cpu-policy.c | 101 -------------------------------
 xen/arch/x86/cpuid.c                     |  10 ---
 xen/include/xen/lib/x86/cpuid.h          |   7 ---
 xen/lib/x86/cpuid.c                      |  39 ------------
 5 files changed, 164 deletions(-)

diff --git a/tools/libs/guest/xg_cpuid_x86.c b/tools/libs/guest/xg_cpuid_x86.c
index 3ffd5f683bec..198892ebdf45 100644
--- a/tools/libs/guest/xg_cpuid_x86.c
+++ b/tools/libs/guest/xg_cpuid_x86.c
@@ -638,13 +638,6 @@ int xc_cpuid_apply_policy(xc_interface *xch, uint32_t domid, bool restore,
         }
     }
 
-    /*
-     * Do not try to shrink the policy if restoring, as that could cause
-     * guest visible changes in the maximum leaf fields.
-     */
-    if ( !restore )
-        x86_cpuid_policy_shrink_max_leaves(p);
-
     rc = x86_cpuid_copy_to_buffer(p, leaves, &nr_leaves);
     if ( rc )
     {
diff --git a/tools/tests/cpu-policy/test-cpu-policy.c b/tools/tests/cpu-policy/test-cpu-policy.c
index 75973298dfd5..ed450a099709 100644
--- a/tools/tests/cpu-policy/test-cpu-policy.c
+++ b/tools/tests/cpu-policy/test-cpu-policy.c
@@ -8,13 +8,10 @@
 #include <err.h>
 
 #include <xen-tools/libs.h>
-#include <xen/asm/x86-defns.h>
 #include <xen/asm/x86-vendors.h>
 #include <xen/lib/x86/cpu-policy.h>
 #include <xen/domctl.h>
 
-#define XSTATE_FP_SSE  (X86_XCR0_FP | X86_XCR0_SSE)
-
 static unsigned int nr_failures;
 #define fail(fmt, ...)                          \
 ({                                              \
@@ -573,103 +570,6 @@ static void test_cpuid_out_of_range_clearing(void)
     }
 }
 
-static void test_cpuid_maximum_leaf_shrinking(void)
-{
-    static const struct test {
-        const char *name;
-        struct cpuid_policy p;
-    } tests[] = {
-        {
-            .name = "basic",
-            .p = {
-                /* Very basic information only. */
-                .basic.max_leaf = 1,
-                .basic.raw_fms = 0xc2,
-            },
-        },
-        {
-            .name = "cache",
-            .p = {
-                /* Cache subleaves present. */
-                .basic.max_leaf = 4,
-                .cache.subleaf[0].type = 1,
-            },
-        },
-        {
-            .name = "feat#0",
-            .p = {
-                /* Subleaf 0 only with some valid bit. */
-                .basic.max_leaf = 7,
-                .feat.max_subleaf = 0,
-                .feat.fsgsbase = 1,
-            },
-        },
-        {
-            .name = "feat#1",
-            .p = {
-                /* Subleaf 1 only with some valid bit. */
-                .basic.max_leaf = 7,
-                .feat.max_subleaf = 1,
-                .feat.avx_vnni = 1,
-            },
-        },
-        {
-            .name = "topo",
-            .p = {
-                /* Topology subleaves present. */
-                .basic.max_leaf = 0xb,
-                .topo.subleaf[0].type = 1,
-            },
-        },
-        {
-            .name = "xstate",
-            .p = {
-                /* First subleaf always valid (and then non-zero). */
-                .basic.max_leaf = 0xd,
-                .xstate.xcr0_low = XSTATE_FP_SSE,
-            },
-        },
-        {
-            .name = "extd",
-            .p = {
-                /* Commonly available information only. */
-                .extd.max_leaf = 0x80000008,
-                .extd.maxphysaddr = 0x28,
-                .extd.maxlinaddr = 0x30,
-            },
-        },
-    };
-
-    printf("Testing CPUID maximum leaf shrinking:\n");
-
-    for ( size_t i = 0; i < ARRAY_SIZE(tests); ++i )
-    {
-        const struct test *t = &tests[i];
-        struct cpuid_policy *p = memdup(&t->p);
-
-        p->basic.max_leaf = ARRAY_SIZE(p->basic.raw) - 1;
-        p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
-        p->extd.max_leaf = 0x80000000 | (ARRAY_SIZE(p->extd.raw) - 1);
-
-        x86_cpuid_policy_shrink_max_leaves(p);
-
-        /* Check the the resulting max (sub)leaf values against expecations. */
-        if ( p->basic.max_leaf != t->p.basic.max_leaf )
-             fail("  Test %s basic fail - expected %#x, got %#x\n",
-                  t->name, t->p.basic.max_leaf, p->basic.max_leaf);
-
-        if ( p->extd.max_leaf != t->p.extd.max_leaf )
-             fail("  Test %s extd fail - expected %#x, got %#x\n",
-                  t->name, t->p.extd.max_leaf, p->extd.max_leaf);
-
-        if ( p->feat.max_subleaf != t->p.feat.max_subleaf )
-             fail("  Test %s feat fail - expected %#x, got %#x\n",
-                  t->name, t->p.feat.max_subleaf, p->feat.max_subleaf);
-
-        free(p);
-    }
-}
-
 static void test_is_compatible_success(void)
 {
     static struct test {
@@ -785,7 +685,6 @@ int main(int argc, char **argv)
     test_cpuid_serialise_success();
     test_cpuid_deserialise_failure();
     test_cpuid_out_of_range_clearing();
-    test_cpuid_maximum_leaf_shrinking();
 
     test_msr_serialise_success();
     test_msr_deserialise_failure();
diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c
index 8ac55f0806d0..151944f65702 100644
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -353,8 +353,6 @@ static void __init calculate_host_policy(void)
         p->extd.raw[0xa].d |= ((1u << SVM_FEATURE_VMCBCLEAN) |
                                (1u << SVM_FEATURE_TSCRATEMSR));
     }
-
-    x86_cpuid_policy_shrink_max_leaves(p);
 }
 
 static void __init guest_common_default_feature_adjustments(uint32_t *fs)
@@ -434,8 +432,6 @@ static void __init calculate_pv_max_policy(void)
     recalculate_xstate(p);
 
     p->extd.raw[0xa] = EMPTY_LEAF; /* No SVM for PV guests. */
-
-    x86_cpuid_policy_shrink_max_leaves(p);
 }
 
 static void __init calculate_pv_def_policy(void)
@@ -456,8 +452,6 @@ static void __init calculate_pv_def_policy(void)
     sanitise_featureset(pv_featureset);
     cpuid_featureset_to_policy(pv_featureset, p);
     recalculate_xstate(p);
-
-    x86_cpuid_policy_shrink_max_leaves(p);
 }
 
 static void __init calculate_hvm_max_policy(void)
@@ -523,8 +517,6 @@ static void __init calculate_hvm_max_policy(void)
     sanitise_featureset(hvm_featureset);
     cpuid_featureset_to_policy(hvm_featureset, p);
     recalculate_xstate(p);
-
-    x86_cpuid_policy_shrink_max_leaves(p);
 }
 
 static void __init calculate_hvm_def_policy(void)
@@ -549,8 +541,6 @@ static void __init calculate_hvm_def_policy(void)
     sanitise_featureset(hvm_featureset);
     cpuid_featureset_to_policy(hvm_featureset, p);
     recalculate_xstate(p);
-
-    x86_cpuid_policy_shrink_max_leaves(p);
 }
 
 void __init init_guest_cpuid(void)
diff --git a/xen/include/xen/lib/x86/cpuid.h b/xen/include/xen/lib/x86/cpuid.h
index 2300faf03e2b..a4d254ea96e0 100644
--- a/xen/include/xen/lib/x86/cpuid.h
+++ b/xen/include/xen/lib/x86/cpuid.h
@@ -386,13 +386,6 @@ void x86_cpuid_policy_fill_native(struct cpuid_policy *p);
  */
 void x86_cpuid_policy_clear_out_of_range_leaves(struct cpuid_policy *p);
 
-/**
- * Shrink max leaf/subleaf values such that the last respective valid entry
- * isn't all blank.  While permitted by the spec, such extraneous leaves may
- * provide undue "hints" to guests.
- */
-void x86_cpuid_policy_shrink_max_leaves(struct cpuid_policy *p);
-
 #ifdef __XEN__
 #include <public/arch-x86/xen.h>
 typedef XEN_GUEST_HANDLE_64(xen_cpuid_leaf_t) cpuid_leaf_buffer_t;
diff --git a/xen/lib/x86/cpuid.c b/xen/lib/x86/cpuid.c
index 1409c254c8ea..8eb88314f53c 100644
--- a/xen/lib/x86/cpuid.c
+++ b/xen/lib/x86/cpuid.c
@@ -236,45 +236,6 @@ void x86_cpuid_policy_clear_out_of_range_leaves(struct cpuid_policy *p)
                 ARRAY_SIZE(p->extd.raw) - 1);
 }
 
-void x86_cpuid_policy_shrink_max_leaves(struct cpuid_policy *p)
-{
-    unsigned int i;
-
-    p->basic.raw[0x4] = p->cache.raw[0];
-
-    for ( i = p->feat.max_subleaf; i; --i )
-        if ( p->feat.raw[i].a | p->feat.raw[i].b |
-             p->feat.raw[i].c | p->feat.raw[i].d )
-            break;
-    p->feat.max_subleaf = i;
-    p->basic.raw[0x7] = p->feat.raw[i];
-
-    p->basic.raw[0xb] = p->topo.raw[0];
-
-    /*
-     * Due to the way xstate gets handled in the hypervisor (see
-     * recalculate_xstate()) there is (for now at least) no need to fiddle
-     * with the xstate subleaves (IOW we assume they're already in consistent
-     * shape, for coming from either hardware or recalculate_xstate()).
-     */
-    p->basic.raw[0xd] = p->xstate.raw[0];
-
-    for ( i = p->basic.max_leaf; i; --i )
-        if ( p->basic.raw[i].a | p->basic.raw[i].b |
-             p->basic.raw[i].c | p->basic.raw[i].d )
-            break;
-    p->basic.max_leaf = i;
-
-    for ( i = p->extd.max_leaf & 0xffff; i; --i )
-        if ( p->extd.raw[i].a | p->extd.raw[i].b |
-             p->extd.raw[i].c | p->extd.raw[i].d )
-            break;
-    if ( i | p->extd.raw[0].b | p->extd.raw[0].c | p->extd.raw[0].d )
-        p->extd.max_leaf = 0x80000000 | i;
-    else
-        p->extd.max_leaf = 0;
-}
-
 const uint32_t *x86_cpuid_lookup_deep_deps(uint32_t feature)
 {
     static const uint32_t deep_features[] = INIT_DEEP_FEATURES;
-- 
2.11.0


Re: [PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents"
Posted by Jan Beulich 2 years, 4 months ago
On 24.11.2021 22:11, Andrew Cooper wrote:
> OSSTest has identified a 3rd regression caused by this change.  Migration
> between Xen 4.15 and 4.16 on the nocera pair of machines (AMD Opteron 4133)
> fails with:
> 
>   xc: error: Failed to set CPUID policy: leaf 00000000, subleaf ffffffff, msr ffffffff (22 = Invalid argument): Internal error
>   xc: error: Restore failed (22 = Invalid argument): Internal error
> 
> which is a safety check to prevent resuming the guest when the CPUID data has
> been truncated.  The problem is caused by shrinking of the max policies, which
> is an ABI that needs handling compatibly between different versions of Xen.

Would you mind pointing me to the flight which has hit this problem? I
don't think I've seen any respective failure. I also don't think I've
seen any respective discussion on irc, so I really wonder where all
this is coming from all of the sudden. It's not like the change in
question would have gone in just yesterday.

> Furthermore, shrinking of the default policies also breaks things in some
> cases, because certain cpuid= settings in a VM config file which used to have
> an effect will now be silently discarded.

I'm afraid I don't see what you're talking about here. Could you give
an example? Is this about features the flags of which live in the
higher leaves, which would have got stripped from the default policies?
Which would mean the stripping really should happen on the max policies
only (where it may not have much of an effect).

> This reverts commit 540d911c2813c3d8f4cdbb3f5672119e5e768a3d, as well as the
> partial fix attempt in 81da2b544cbb003a5447c9b14d275746ad22ab37 (which added
> one new case where cpuid= settings might not apply correctly) and restores the
> same behaviour as Xen 4.15.
> 
> Fixes: 540d911c2813 ("x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents")
> Fixes: 81da2b544cbb ("x86/cpuid: prevent shrinking migrated policies max leaves")
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

I'll be okay ack-ing this revert, but I'd really like to fully
understand things first.

Jan


Re: [PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents"
Posted by Roger Pau Monné 2 years, 4 months ago
On Thu, Nov 25, 2021 at 09:57:31AM +0100, Jan Beulich wrote:
> On 24.11.2021 22:11, Andrew Cooper wrote:
> > OSSTest has identified a 3rd regression caused by this change.  Migration
> > between Xen 4.15 and 4.16 on the nocera pair of machines (AMD Opteron 4133)
> > fails with:
> > 
> >   xc: error: Failed to set CPUID policy: leaf 00000000, subleaf ffffffff, msr ffffffff (22 = Invalid argument): Internal error
> >   xc: error: Restore failed (22 = Invalid argument): Internal error
> > 
> > which is a safety check to prevent resuming the guest when the CPUID data has
> > been truncated.  The problem is caused by shrinking of the max policies, which
> > is an ABI that needs handling compatibly between different versions of Xen.
> 
> Would you mind pointing me to the flight which has hit this problem? I
> don't think I've seen any respective failure. I also don't think I've
> seen any respective discussion on irc, so I really wonder where all
> this is coming from all of the sudden. It's not like the change in
> question would have gone in just yesterday.

It's from a pair of newish boxes that Credativ and Ian where
attempting to commission yesterday. Since the boxes are not yet in
production Ian wasn't sure if the issue could be on the testing or
hardware side, so emailed the details in private for us to provide
some feedback on the issue. The error is at:

http://logs.test-lab.xenproject.org/osstest/logs/166296/test-amd64-amd64-migrupgrade/info.html

> > Furthermore, shrinking of the default policies also breaks things in some
> > cases, because certain cpuid= settings in a VM config file which used to have
> > an effect will now be silently discarded.
> 
> I'm afraid I don't see what you're talking about here. Could you give
> an example? Is this about features the flags of which live in the
> higher leaves, which would have got stripped from the default policies?
> Which would mean the stripping really should happen on the max policies
> only (where it may not have much of an effect).

I think there are two separate issues, which I tried to clarify in my
reply to this same patch.

Options set using cpuid= with xl could now be rejected when in
previous versions they were accepted just fine. That's because the
shrinking to the policies can now cause find_leaf calls in
xc_cpuid_xend_policy to fail, and thus the whole operation would
fail.

There's another likely error that only affects callers of
xc_cpuid_apply_policy that pass a featureset (so not the upstream
toolstack), where some leaves of the featureset could now be ignored
by the guest if the max leaves value doesn't cover them anymore. Note
this was already an issue even before 540d911c2813, as applying the
featureset doesn't check that the set feature leaves are below the max
leaf.

Roger.

Re: [PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents"
Posted by Jan Beulich 2 years, 4 months ago
On 25.11.2021 10:24, Roger Pau Monné wrote:
> On Thu, Nov 25, 2021 at 09:57:31AM +0100, Jan Beulich wrote:
>> On 24.11.2021 22:11, Andrew Cooper wrote:
>>> OSSTest has identified a 3rd regression caused by this change.  Migration
>>> between Xen 4.15 and 4.16 on the nocera pair of machines (AMD Opteron 4133)
>>> fails with:
>>>
>>>   xc: error: Failed to set CPUID policy: leaf 00000000, subleaf ffffffff, msr ffffffff (22 = Invalid argument): Internal error
>>>   xc: error: Restore failed (22 = Invalid argument): Internal error
>>>
>>> which is a safety check to prevent resuming the guest when the CPUID data has
>>> been truncated.  The problem is caused by shrinking of the max policies, which
>>> is an ABI that needs handling compatibly between different versions of Xen.
>>
>> Would you mind pointing me to the flight which has hit this problem? I
>> don't think I've seen any respective failure. I also don't think I've
>> seen any respective discussion on irc, so I really wonder where all
>> this is coming from all of the sudden. It's not like the change in
>> question would have gone in just yesterday.
> 
> It's from a pair of newish boxes that Credativ and Ian where
> attempting to commission yesterday. Since the boxes are not yet in
> production Ian wasn't sure if the issue could be on the testing or
> hardware side, so emailed the details in private for us to provide
> some feedback on the issue. The error is at:
> 
> http://logs.test-lab.xenproject.org/osstest/logs/166296/test-amd64-amd64-migrupgrade/info.html

I see. Quite lucky timing then.

>>> Furthermore, shrinking of the default policies also breaks things in some
>>> cases, because certain cpuid= settings in a VM config file which used to have
>>> an effect will now be silently discarded.
>>
>> I'm afraid I don't see what you're talking about here. Could you give
>> an example? Is this about features the flags of which live in the
>> higher leaves, which would have got stripped from the default policies?
>> Which would mean the stripping really should happen on the max policies
>> only (where it may not have much of an effect).
> 
> I think there are two separate issues, which I tried to clarify in my
> reply to this same patch.
> 
> Options set using cpuid= with xl could now be rejected when in
> previous versions they were accepted just fine. That's because the
> shrinking to the policies can now cause find_leaf calls in
> xc_cpuid_xend_policy to fail, and thus the whole operation would
> fail.

Okay, this could be addressed by merely dropping the calls from
calculate_{pv,hvm}_def_policy(). Thinking about it, I can surely
agree they shouldn't have been put there in the first place.
Which would be quite the opposite of your initial proposal, where
you did drop them from calculate_{pv,hvm}_max_policy(). A guest
migrating in with a larger max leaf value should merely have that
max leaf value retained, but that ought to be possible without
dropping the shrinking from calculate_{pv,hvm}_max_policy(). Even
leaving aside migration, I guess an explicit request for a large
max leaf value should be honored; those possibly many trailing
leaves then would simply all be blank.

> There's another likely error that only affects callers of
> xc_cpuid_apply_policy that pass a featureset (so not the upstream
> toolstack), where some leaves of the featureset could now be ignored
> by the guest if the max leaves value doesn't cover them anymore. Note
> this was already an issue even before 540d911c2813, as applying the
> featureset doesn't check that the set feature leaves are below the max
> leaf.

If this was an issue before the commit to be reverted, I take it
the revert isn't going to help it? In which case this information
is interesting, but not applicable as justification for the
revert?

Jan


Re: [PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents"
Posted by Roger Pau Monné 2 years, 4 months ago
On Thu, Nov 25, 2021 at 10:52:12AM +0100, Jan Beulich wrote:
> On 25.11.2021 10:24, Roger Pau Monné wrote:
> > On Thu, Nov 25, 2021 at 09:57:31AM +0100, Jan Beulich wrote:
> >> On 24.11.2021 22:11, Andrew Cooper wrote:
> >>> OSSTest has identified a 3rd regression caused by this change.  Migration
> >>> between Xen 4.15 and 4.16 on the nocera pair of machines (AMD Opteron 4133)
> >>> fails with:
> >>>
> >>>   xc: error: Failed to set CPUID policy: leaf 00000000, subleaf ffffffff, msr ffffffff (22 = Invalid argument): Internal error
> >>>   xc: error: Restore failed (22 = Invalid argument): Internal error
> >>>
> >>> which is a safety check to prevent resuming the guest when the CPUID data has
> >>> been truncated.  The problem is caused by shrinking of the max policies, which
> >>> is an ABI that needs handling compatibly between different versions of Xen.
> >>
> >> Would you mind pointing me to the flight which has hit this problem? I
> >> don't think I've seen any respective failure. I also don't think I've
> >> seen any respective discussion on irc, so I really wonder where all
> >> this is coming from all of the sudden. It's not like the change in
> >> question would have gone in just yesterday.
> > 
> > It's from a pair of newish boxes that Credativ and Ian where
> > attempting to commission yesterday. Since the boxes are not yet in
> > production Ian wasn't sure if the issue could be on the testing or
> > hardware side, so emailed the details in private for us to provide
> > some feedback on the issue. The error is at:
> > 
> > http://logs.test-lab.xenproject.org/osstest/logs/166296/test-amd64-amd64-migrupgrade/info.html
> 
> I see. Quite lucky timing then.

Indeed, it was pure luck that we got this just yesterday.

> 
> >>> Furthermore, shrinking of the default policies also breaks things in some
> >>> cases, because certain cpuid= settings in a VM config file which used to have
> >>> an effect will now be silently discarded.
> >>
> >> I'm afraid I don't see what you're talking about here. Could you give
> >> an example? Is this about features the flags of which live in the
> >> higher leaves, which would have got stripped from the default policies?
> >> Which would mean the stripping really should happen on the max policies
> >> only (where it may not have much of an effect).
> > 
> > I think there are two separate issues, which I tried to clarify in my
> > reply to this same patch.
> > 
> > Options set using cpuid= with xl could now be rejected when in
> > previous versions they were accepted just fine. That's because the
> > shrinking to the policies can now cause find_leaf calls in
> > xc_cpuid_xend_policy to fail, and thus the whole operation would
> > fail.
> 
> Okay, this could be addressed by merely dropping the calls from
> calculate_{pv,hvm}_def_policy(). Thinking about it, I can surely
> agree they shouldn't have been put there in the first place.
> Which would be quite the opposite of your initial proposal, where
> you did drop them from calculate_{pv,hvm}_max_policy(). A guest
> migrating in with a larger max leaf value should merely have that
> max leaf value retained, but that ought to be possible without
> dropping the shrinking from calculate_{pv,hvm}_max_policy().

I won't argue it's not possible to do that without dropping the shrink
from calculate_{pv,hvm}_max_policy(), but given the point we are on
the release we should consider the safest option, and IMO that's the
revert of the shrinking from there in order to restore the previous
behavior and have working migrations from 4.15 -> 4.16.

We can discuss other likely better approaches to solve this issue
after the release.

> Even
> leaving aside migration, I guess an explicit request for a large
> max leaf value should be honored; those possibly many trailing
> leaves then would simply all be blank.
> 
> > There's another likely error that only affects callers of
> > xc_cpuid_apply_policy that pass a featureset (so not the upstream
> > toolstack), where some leaves of the featureset could now be ignored
> > by the guest if the max leaves value doesn't cover them anymore. Note
> > this was already an issue even before 540d911c2813, as applying the
> > featureset doesn't check that the set feature leaves are below the max
> > leaf.
> 
> If this was an issue before the commit to be reverted, I take it
> the revert isn't going to help it?

I think the commit makes it more likely to hit the above scenario by
shrinking max leaves.

> In which case this information
> is interesting, but not applicable as justification for the
> revert?

As said above, while the commit at hand is not introducing the issue
with the featuresets, it makes it more likely by shrinking the max
leaves, and IMO it's a regression from behavior in 4.15.

Ie: options set on the featureset on 4.15 would be exposed, while the
same options could be hidden in 4.16 because of the shrinking to the
default domain policies, if the user happens to set an option that's
on an empty trailing featureset with a tail of zeroed leaves.

Thanks, Roger.

Re: [PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents"
Posted by Jan Beulich 2 years, 4 months ago
On 25.11.2021 11:15, Roger Pau Monné wrote:
> On Thu, Nov 25, 2021 at 10:52:12AM +0100, Jan Beulich wrote:
>> On 25.11.2021 10:24, Roger Pau Monné wrote:
>>> On Thu, Nov 25, 2021 at 09:57:31AM +0100, Jan Beulich wrote:
>>>> On 24.11.2021 22:11, Andrew Cooper wrote:
>>>>> OSSTest has identified a 3rd regression caused by this change.  Migration
>>>>> between Xen 4.15 and 4.16 on the nocera pair of machines (AMD Opteron 4133)
>>>>> fails with:
>>>>>
>>>>>   xc: error: Failed to set CPUID policy: leaf 00000000, subleaf ffffffff, msr ffffffff (22 = Invalid argument): Internal error
>>>>>   xc: error: Restore failed (22 = Invalid argument): Internal error
>>>>>
>>>>> which is a safety check to prevent resuming the guest when the CPUID data has
>>>>> been truncated.  The problem is caused by shrinking of the max policies, which
>>>>> is an ABI that needs handling compatibly between different versions of Xen.
>>>>
>>>> Would you mind pointing me to the flight which has hit this problem? I
>>>> don't think I've seen any respective failure. I also don't think I've
>>>> seen any respective discussion on irc, so I really wonder where all
>>>> this is coming from all of the sudden. It's not like the change in
>>>> question would have gone in just yesterday.
>>>
>>> It's from a pair of newish boxes that Credativ and Ian where
>>> attempting to commission yesterday. Since the boxes are not yet in
>>> production Ian wasn't sure if the issue could be on the testing or
>>> hardware side, so emailed the details in private for us to provide
>>> some feedback on the issue. The error is at:
>>>
>>> http://logs.test-lab.xenproject.org/osstest/logs/166296/test-amd64-amd64-migrupgrade/info.html
>>
>> I see. Quite lucky timing then.
> 
> Indeed, it was pure luck that we got this just yesterday.
> 
>>
>>>>> Furthermore, shrinking of the default policies also breaks things in some
>>>>> cases, because certain cpuid= settings in a VM config file which used to have
>>>>> an effect will now be silently discarded.
>>>>
>>>> I'm afraid I don't see what you're talking about here. Could you give
>>>> an example? Is this about features the flags of which live in the
>>>> higher leaves, which would have got stripped from the default policies?
>>>> Which would mean the stripping really should happen on the max policies
>>>> only (where it may not have much of an effect).
>>>
>>> I think there are two separate issues, which I tried to clarify in my
>>> reply to this same patch.
>>>
>>> Options set using cpuid= with xl could now be rejected when in
>>> previous versions they were accepted just fine. That's because the
>>> shrinking to the policies can now cause find_leaf calls in
>>> xc_cpuid_xend_policy to fail, and thus the whole operation would
>>> fail.
>>
>> Okay, this could be addressed by merely dropping the calls from
>> calculate_{pv,hvm}_def_policy(). Thinking about it, I can surely
>> agree they shouldn't have been put there in the first place.
>> Which would be quite the opposite of your initial proposal, where
>> you did drop them from calculate_{pv,hvm}_max_policy(). A guest
>> migrating in with a larger max leaf value should merely have that
>> max leaf value retained, but that ought to be possible without
>> dropping the shrinking from calculate_{pv,hvm}_max_policy().
> 
> I won't argue it's not possible to do that without dropping the shrink
> from calculate_{pv,hvm}_max_policy(), but given the point we are on
> the release we should consider the safest option, and IMO that's the
> revert of the shrinking from there in order to restore the previous
> behavior and have working migrations from 4.15 -> 4.16.
> 
> We can discuss other likely better approaches to solve this issue
> after the release.

Sure; as said earlier on I merely would like to understand things
sufficiently well before giving my ack.

Jan

>> Even
>> leaving aside migration, I guess an explicit request for a large
>> max leaf value should be honored; those possibly many trailing
>> leaves then would simply all be blank.
>>
>>> There's another likely error that only affects callers of
>>> xc_cpuid_apply_policy that pass a featureset (so not the upstream
>>> toolstack), where some leaves of the featureset could now be ignored
>>> by the guest if the max leaves value doesn't cover them anymore. Note
>>> this was already an issue even before 540d911c2813, as applying the
>>> featureset doesn't check that the set feature leaves are below the max
>>> leaf.
>>
>> If this was an issue before the commit to be reverted, I take it
>> the revert isn't going to help it?
> 
> I think the commit makes it more likely to hit the above scenario by
> shrinking max leaves.
> 
>> In which case this information
>> is interesting, but not applicable as justification for the
>> revert?
> 
> As said above, while the commit at hand is not introducing the issue
> with the featuresets, it makes it more likely by shrinking the max
> leaves, and IMO it's a regression from behavior in 4.15.
> 
> Ie: options set on the featureset on 4.15 would be exposed, while the
> same options could be hidden in 4.16 because of the shrinking to the
> default domain policies, if the user happens to set an option that's
> on an empty trailing featureset with a tail of zeroed leaves.
> 
> Thanks, Roger.
> 


Re: [PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents"
Posted by Roger Pau Monné 2 years, 4 months ago
On Wed, Nov 24, 2021 at 09:11:52PM +0000, Andrew Cooper wrote:
> OSSTest has identified a 3rd regression caused by this change.  Migration
> between Xen 4.15 and 4.16 on the nocera pair of machines (AMD Opteron 4133)
> fails with:
> 
>   xc: error: Failed to set CPUID policy: leaf 00000000, subleaf ffffffff, msr ffffffff (22 = Invalid argument): Internal error
>   xc: error: Restore failed (22 = Invalid argument): Internal error
> 
> which is a safety check to prevent resuming the guest when the CPUID data has
> been truncated.  The problem is caused by shrinking of the max policies, which
> is an ABI that needs handling compatibly between different versions of Xen.
> 
> Furthermore, shrinking of the default policies also breaks things in some
> cases, because certain cpuid= settings in a VM config file which used to have
> an effect will now be silently discarded.

I don't think they will be silently discarded. xc_cpuid_xend_policy
will attempt to get the CPUID leaf from the current/host/default
policies and fail because the leaf couldn't be found.

There's an issue with callers of xc_cpuid_apply_policy that pass a
featureset (which is not used by the upstream tools) as in that case
the featureset is translated to a CPUID policy without checking that
the set features are correctly exposed regarding the maximum leaves
set in CPUID, and thus features could be silently dropped, but as said
that's not used by the `cpuid=` config file option.

This possibly needs fixing anyway after the release, in order to
assert that the bits set in the featureset are correctly exposed given
the max leaves reported in the policy.

I think the above paragraph should be rewored as:

"Furthermore, shrinking of the default policies also breaks things in
some cases, because certain cpuid= settings in a VM config file which
used to work will now be refused. Also external toolstacks that
attempt to set the CPUID policy from a featureset might now see some
filled leaves not reachable due to the shrinking done to the default
domain policy before applying the featureset."

> This reverts commit 540d911c2813c3d8f4cdbb3f5672119e5e768a3d, as well as the
> partial fix attempt in 81da2b544cbb003a5447c9b14d275746ad22ab37 (which added
> one new case where cpuid= settings might not apply correctly) and restores the
> same behaviour as Xen 4.15.
> 
> Fixes: 540d911c2813 ("x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents")
> Fixes: 81da2b544cbb ("x86/cpuid: prevent shrinking migrated policies max leaves")
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Likely with the paragraph adjusted if agreed.

> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Wei Liu <wl@xen.org>
> CC: Ian Jackson <iwj@xenproject.org>
> 
> Passing CI runs:
>   https://gitlab.com/xen-project/people/andyhhp/xen/-/pipelines/415822110
>   https://cirrus-ci.com/build/4915643318272000
> 
> This supercedes
> https://lore.kernel.org/xen-devel/20211124161649.83189-1-roger.pau@citrix.com/
> which is a step in the right direction (it fixes the OSSTest breakage), but
> incomplete (doesn't fix the potential cpuid= breakage).
> 
> For 4.16.  This is the 3rd breakage caused by 540d911c2813, and there is no
> reasonable workaround.  Obviously, this revert isn't completely without risk,
> but going wholesale back to the 4.15 behaviour is by far the safest option at
> this point.

I agree. Also the shrinking done by 540d911c2813 is not fixing any
issue we know about, so it's safer to just keep the previous behavior
of likely reporting empty leaves rather than risk more regressions
caused by the change.

Thanks, Roger.

Re: [PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents"
Posted by Andrew Cooper 2 years, 4 months ago
On 25/11/2021 09:12, Roger Pau Monné wrote:
> On Wed, Nov 24, 2021 at 09:11:52PM +0000, Andrew Cooper wrote:
>> OSSTest has identified a 3rd regression caused by this change.  Migration
>> between Xen 4.15 and 4.16 on the nocera pair of machines (AMD Opteron 4133)
>> fails with:
>>
>>   xc: error: Failed to set CPUID policy: leaf 00000000, subleaf ffffffff, msr ffffffff (22 = Invalid argument): Internal error
>>   xc: error: Restore failed (22 = Invalid argument): Internal error
>>
>> which is a safety check to prevent resuming the guest when the CPUID data has
>> been truncated.  The problem is caused by shrinking of the max policies, which
>> is an ABI that needs handling compatibly between different versions of Xen.
>>
>> Furthermore, shrinking of the default policies also breaks things in some
>> cases, because certain cpuid= settings in a VM config file which used to have
>> an effect will now be silently discarded.
> I don't think they will be silently discarded. xc_cpuid_xend_policy
> will attempt to get the CPUID leaf from the current/host/default
> policies and fail because the leaf couldn't be found.

Hmm yes.  Now I look at the code again, the error handling is plumbed up
fully.

So yes - there will be a rejection now where previously it worked.

> There's an issue with callers of xc_cpuid_apply_policy that pass a
> featureset (which is not used by the upstream tools) as in that case
> the featureset is translated to a CPUID policy without checking that
> the set features are correctly exposed regarding the maximum leaves
> set in CPUID, and thus features could be silently dropped, but as said
> that's not used by the `cpuid=` config file option.

Good catch.  I'd neglected to consider this aspect.

> This possibly needs fixing anyway after the release, in order to
> assert that the bits set in the featureset are correctly exposed given
> the max leaves reported in the policy.
>
> I think the above paragraph should be rewored as:
>
> "Furthermore, shrinking of the default policies also breaks things in
> some cases, because certain cpuid= settings in a VM config file which
> used to work will now be refused. Also external toolstacks that
> attempt to set the CPUID policy from a featureset might now see some
> filled leaves not reachable due to the shrinking done to the default
> domain policy before applying the featureset."
>
>> This reverts commit 540d911c2813c3d8f4cdbb3f5672119e5e768a3d, as well as the
>> partial fix attempt in 81da2b544cbb003a5447c9b14d275746ad22ab37 (which added
>> one new case where cpuid= settings might not apply correctly) and restores the
>> same behaviour as Xen 4.15.
>>
>> Fixes: 540d911c2813 ("x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents")
>> Fixes: 81da2b544cbb ("x86/cpuid: prevent shrinking migrated policies max leaves")
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
>
> Likely with the paragraph adjusted if agreed.

Ok.

~Andrew

Re: [PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents"
Posted by Jan Beulich 2 years, 4 months ago
On 24.11.2021 22:11, Andrew Cooper wrote:
> OSSTest has identified a 3rd regression caused by this change.  Migration
> between Xen 4.15 and 4.16 on the nocera pair of machines (AMD Opteron 4133)
> fails with:
> 
>   xc: error: Failed to set CPUID policy: leaf 00000000, subleaf ffffffff, msr ffffffff (22 = Invalid argument): Internal error
>   xc: error: Restore failed (22 = Invalid argument): Internal error
> 
> which is a safety check to prevent resuming the guest when the CPUID data has
> been truncated.  The problem is caused by shrinking of the max policies, which
> is an ABI that needs handling compatibly between different versions of Xen.
> 
> Furthermore, shrinking of the default policies also breaks things in some
> cases, because certain cpuid= settings in a VM config file which used to have
> an effect will now be silently discarded.
> 
> This reverts commit 540d911c2813c3d8f4cdbb3f5672119e5e768a3d, as well as the
> partial fix attempt in 81da2b544cbb003a5447c9b14d275746ad22ab37 (which added
> one new case where cpuid= settings might not apply correctly) and restores the
> same behaviour as Xen 4.15.
> 
> Fixes: 540d911c2813 ("x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents")
> Fixes: 81da2b544cbb ("x86/cpuid: prevent shrinking migrated policies max leaves")
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

While not strictly needed with Roger having given his already,
Acked-by: Jan Beulich <jbeulich@suse.com>
to signal my (basic) agreement with the course of action taken.
Nevertheless I fear this is going to become yet one more case where
future action is promised, but things then die out.

Ian - I guess all this now needs is your R-a.

Jan


Re: [PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents"
Posted by Roger Pau Monné 2 years, 4 months ago
On Thu, Nov 25, 2021 at 11:25:36AM +0100, Jan Beulich wrote:
> On 24.11.2021 22:11, Andrew Cooper wrote:
> > OSSTest has identified a 3rd regression caused by this change.  Migration
> > between Xen 4.15 and 4.16 on the nocera pair of machines (AMD Opteron 4133)
> > fails with:
> > 
> >   xc: error: Failed to set CPUID policy: leaf 00000000, subleaf ffffffff, msr ffffffff (22 = Invalid argument): Internal error
> >   xc: error: Restore failed (22 = Invalid argument): Internal error
> > 
> > which is a safety check to prevent resuming the guest when the CPUID data has
> > been truncated.  The problem is caused by shrinking of the max policies, which
> > is an ABI that needs handling compatibly between different versions of Xen.
> > 
> > Furthermore, shrinking of the default policies also breaks things in some
> > cases, because certain cpuid= settings in a VM config file which used to have
> > an effect will now be silently discarded.
> > 
> > This reverts commit 540d911c2813c3d8f4cdbb3f5672119e5e768a3d, as well as the
> > partial fix attempt in 81da2b544cbb003a5447c9b14d275746ad22ab37 (which added
> > one new case where cpuid= settings might not apply correctly) and restores the
> > same behaviour as Xen 4.15.
> > 
> > Fixes: 540d911c2813 ("x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents")
> > Fixes: 81da2b544cbb ("x86/cpuid: prevent shrinking migrated policies max leaves")
> > Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> 
> While not strictly needed with Roger having given his already,
> Acked-by: Jan Beulich <jbeulich@suse.com>
> to signal my (basic) agreement with the course of action taken.
> Nevertheless I fear this is going to become yet one more case where
> future action is promised, but things then die out.

I'm certainly happy to look at newer versions of this patch, but I
think we should consider doing the shrinking only on the toolstack
said, and only after all the manipulations on the policy have been
performed.

Thanks, Roger.

Re: [PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents"
Posted by Andrew Cooper 2 years, 4 months ago
On 25/11/2021 10:43, Roger Pau Monné wrote:
> On Thu, Nov 25, 2021 at 11:25:36AM +0100, Jan Beulich wrote:
>> On 24.11.2021 22:11, Andrew Cooper wrote:
>>> OSSTest has identified a 3rd regression caused by this change.  Migration
>>> between Xen 4.15 and 4.16 on the nocera pair of machines (AMD Opteron 4133)
>>> fails with:
>>>
>>>   xc: error: Failed to set CPUID policy: leaf 00000000, subleaf ffffffff, msr ffffffff (22 = Invalid argument): Internal error
>>>   xc: error: Restore failed (22 = Invalid argument): Internal error
>>>
>>> which is a safety check to prevent resuming the guest when the CPUID data has
>>> been truncated.  The problem is caused by shrinking of the max policies, which
>>> is an ABI that needs handling compatibly between different versions of Xen.
>>>
>>> Furthermore, shrinking of the default policies also breaks things in some
>>> cases, because certain cpuid= settings in a VM config file which used to have
>>> an effect will now be silently discarded.
>>>
>>> This reverts commit 540d911c2813c3d8f4cdbb3f5672119e5e768a3d, as well as the
>>> partial fix attempt in 81da2b544cbb003a5447c9b14d275746ad22ab37 (which added
>>> one new case where cpuid= settings might not apply correctly) and restores the
>>> same behaviour as Xen 4.15.
>>>
>>> Fixes: 540d911c2813 ("x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents")
>>> Fixes: 81da2b544cbb ("x86/cpuid: prevent shrinking migrated policies max leaves")
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> While not strictly needed with Roger having given his already,
>> Acked-by: Jan Beulich <jbeulich@suse.com>
>> to signal my (basic) agreement with the course of action taken.
>> Nevertheless I fear this is going to become yet one more case where
>> future action is promised, but things then die out.
> I'm certainly happy to look at newer versions of this patch, but I
> think we should consider doing the shrinking only on the toolstack
> said, and only after all the manipulations on the policy have been
> performed.

Correct.

The max policies cannot be shrunk - they are, by definition, the upper
bounds that we audit against.  (More precisely, they must never end up
lower than an older Xen used to offer on the same configuration, and
must not be lower anything the user may opt in to.)

The default policies can in principle be shrunk, but only if the
toolstack learns to grow max leaf too (which it will need to). 
Nevertheless, actually shrinking the default policies is actively
unhelpful, because it is wasting time doing something which the
toolstack needs to undo later.

The policy for new domains should be shrunk, but only after every other
adjustment is made.  This is one small aspect of teaching the toolstack
to properly understand CPUID (and MSR) policies, and has always been on
the plan.

~Andrew

Re: [PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents"
Posted by Jan Beulich 2 years, 4 months ago
On 25.11.2021 18:28, Andrew Cooper wrote:
> On 25/11/2021 10:43, Roger Pau Monné wrote:
>> On Thu, Nov 25, 2021 at 11:25:36AM +0100, Jan Beulich wrote:
>>> On 24.11.2021 22:11, Andrew Cooper wrote:
>>>> OSSTest has identified a 3rd regression caused by this change.  Migration
>>>> between Xen 4.15 and 4.16 on the nocera pair of machines (AMD Opteron 4133)
>>>> fails with:
>>>>
>>>>   xc: error: Failed to set CPUID policy: leaf 00000000, subleaf ffffffff, msr ffffffff (22 = Invalid argument): Internal error
>>>>   xc: error: Restore failed (22 = Invalid argument): Internal error
>>>>
>>>> which is a safety check to prevent resuming the guest when the CPUID data has
>>>> been truncated.  The problem is caused by shrinking of the max policies, which
>>>> is an ABI that needs handling compatibly between different versions of Xen.
>>>>
>>>> Furthermore, shrinking of the default policies also breaks things in some
>>>> cases, because certain cpuid= settings in a VM config file which used to have
>>>> an effect will now be silently discarded.
>>>>
>>>> This reverts commit 540d911c2813c3d8f4cdbb3f5672119e5e768a3d, as well as the
>>>> partial fix attempt in 81da2b544cbb003a5447c9b14d275746ad22ab37 (which added
>>>> one new case where cpuid= settings might not apply correctly) and restores the
>>>> same behaviour as Xen 4.15.
>>>>
>>>> Fixes: 540d911c2813 ("x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents")
>>>> Fixes: 81da2b544cbb ("x86/cpuid: prevent shrinking migrated policies max leaves")
>>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>> While not strictly needed with Roger having given his already,
>>> Acked-by: Jan Beulich <jbeulich@suse.com>
>>> to signal my (basic) agreement with the course of action taken.
>>> Nevertheless I fear this is going to become yet one more case where
>>> future action is promised, but things then die out.
>> I'm certainly happy to look at newer versions of this patch, but I
>> think we should consider doing the shrinking only on the toolstack
>> said, and only after all the manipulations on the policy have been
>> performed.
> 
> Correct.
> 
> The max policies cannot be shrunk - they are, by definition, the upper
> bounds that we audit against.  (More precisely, they must never end up
> lower than an older Xen used to offer on the same configuration, and
> must not be lower anything the user may opt in to.)

I disagree: For one, the user cannot opt in to anything beyond max policy.
Or else that policy isn't "max" anymore. The user may opt in to a higher
than useful max (sub)leaf, but that's independent. I'm also not convinced
older Xen mistakenly offering too much can be taken as an excuse that we
can't ever go below that. We've done so in the past iirc, with workarounds
added elsewhere. Older Xen offering too high a max (sub)leaf again is
independent. Max (sub)leaf requests from the user should, contrary to my
view when submitting the original change, be honored. This would then
automatically include migrating-in guests.

> The default policies can in principle be shrunk, but only if the
> toolstack learns to grow max leaf too (which it will need to). 
> Nevertheless, actually shrinking the default policies is actively
> unhelpful, because it is wasting time doing something which the
> toolstack needs to undo later.

I agree.

> The policy for new domains should be shrunk, but only after every other
> adjustment is made.  This is one small aspect of teaching the toolstack
> to properly understand CPUID (and MSR) policies, and has always been on
> the plan.

And this not being the case yet is getting too prominent for my taste
with the need to raise the max Intel leaves we know of for things like
AMX or KeyLocker. I didn't get shrinking done right; apologies for that.
But I continue to think that proper shrinking ought to be a prereq to
that, without delaying that work (effectively complete for AMX, partial
for KeyLocker) almost indefinitely.

Jan


Re: [PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents"
Posted by Roger Pau Monné 2 years, 4 months ago
On Fri, Nov 26, 2021 at 09:22:50AM +0100, Jan Beulich wrote:
> On 25.11.2021 18:28, Andrew Cooper wrote:
> > On 25/11/2021 10:43, Roger Pau Monné wrote:
> >> On Thu, Nov 25, 2021 at 11:25:36AM +0100, Jan Beulich wrote:
> >>> On 24.11.2021 22:11, Andrew Cooper wrote:
> >>>> OSSTest has identified a 3rd regression caused by this change.  Migration
> >>>> between Xen 4.15 and 4.16 on the nocera pair of machines (AMD Opteron 4133)
> >>>> fails with:
> >>>>
> >>>>   xc: error: Failed to set CPUID policy: leaf 00000000, subleaf ffffffff, msr ffffffff (22 = Invalid argument): Internal error
> >>>>   xc: error: Restore failed (22 = Invalid argument): Internal error
> >>>>
> >>>> which is a safety check to prevent resuming the guest when the CPUID data has
> >>>> been truncated.  The problem is caused by shrinking of the max policies, which
> >>>> is an ABI that needs handling compatibly between different versions of Xen.
> >>>>
> >>>> Furthermore, shrinking of the default policies also breaks things in some
> >>>> cases, because certain cpuid= settings in a VM config file which used to have
> >>>> an effect will now be silently discarded.
> >>>>
> >>>> This reverts commit 540d911c2813c3d8f4cdbb3f5672119e5e768a3d, as well as the
> >>>> partial fix attempt in 81da2b544cbb003a5447c9b14d275746ad22ab37 (which added
> >>>> one new case where cpuid= settings might not apply correctly) and restores the
> >>>> same behaviour as Xen 4.15.
> >>>>
> >>>> Fixes: 540d911c2813 ("x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents")
> >>>> Fixes: 81da2b544cbb ("x86/cpuid: prevent shrinking migrated policies max leaves")
> >>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> >>> While not strictly needed with Roger having given his already,
> >>> Acked-by: Jan Beulich <jbeulich@suse.com>
> >>> to signal my (basic) agreement with the course of action taken.
> >>> Nevertheless I fear this is going to become yet one more case where
> >>> future action is promised, but things then die out.
> >> I'm certainly happy to look at newer versions of this patch, but I
> >> think we should consider doing the shrinking only on the toolstack
> >> said, and only after all the manipulations on the policy have been
> >> performed.
> > 
> > Correct.
> > 
> > The max policies cannot be shrunk - they are, by definition, the upper
> > bounds that we audit against.  (More precisely, they must never end up
> > lower than an older Xen used to offer on the same configuration, and
> > must not be lower anything the user may opt in to.)
> 
> I disagree: For one, the user cannot opt in to anything beyond max policy.
> Or else that policy isn't "max" anymore. The user may opt in to a higher
> than useful max (sub)leaf, but that's independent.

Wouldn't it be possible for Xen to offer certain features to guests
that are not part of the native CPUID policy, and that thus might
require setting bit(s) on leaves that could otherwise be empty? That
also requires some explicit checks in Xen in order to assert that
leaves above the max one are empty.

TBH I'm not sure what's the benefit of shrinking the max policies, as
those are not to be used as-is by guests. They are an upper bound, but
should be tailored for each guest usage, and should be shrunk before
being used by guests on a case-by-case basis (ie: by the toolstack).

> I'm also not convinced
> older Xen mistakenly offering too much can be taken as an excuse that we
> can't ever go below that. We've done so in the past iirc, with workarounds
> added elsewhere. Older Xen offering too high a max (sub)leaf again is
> independent. Max (sub)leaf requests from the user should, contrary to my
> view when submitting the original change, be honored. This would then
> automatically include migrating-in guests.
> 
> > The default policies can in principle be shrunk, but only if the
> > toolstack learns to grow max leaf too (which it will need to). 
> > Nevertheless, actually shrinking the default policies is actively
> > unhelpful, because it is wasting time doing something which the
> > toolstack needs to undo later.
> 
> I agree.
> 
> > The policy for new domains should be shrunk, but only after every other
> > adjustment is made.  This is one small aspect of teaching the toolstack
> > to properly understand CPUID (and MSR) policies, and has always been on
> > the plan.
> 
> And this not being the case yet is getting too prominent for my taste
> with the need to raise the max Intel leaves we know of for things like
> AMX or KeyLocker. I didn't get shrinking done right; apologies for that.
> But I continue to think that proper shrinking ought to be a prereq to
> that, without delaying that work (effectively complete for AMX, partial
> for KeyLocker) almost indefinitely.

I have to revisit the patches I have pending for CPUID/MSR interface
cleanup on my side, but I think that the shrinking could be easily
done at the tail of that that series.

Thanks, Roger.

Re: [PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents"
Posted by Jan Beulich 2 years, 4 months ago
On 26.11.2021 09:37, Roger Pau Monné wrote:
> On Fri, Nov 26, 2021 at 09:22:50AM +0100, Jan Beulich wrote:
>> On 25.11.2021 18:28, Andrew Cooper wrote:
>>> On 25/11/2021 10:43, Roger Pau Monné wrote:
>>>> On Thu, Nov 25, 2021 at 11:25:36AM +0100, Jan Beulich wrote:
>>>>> On 24.11.2021 22:11, Andrew Cooper wrote:
>>>>>> OSSTest has identified a 3rd regression caused by this change.  Migration
>>>>>> between Xen 4.15 and 4.16 on the nocera pair of machines (AMD Opteron 4133)
>>>>>> fails with:
>>>>>>
>>>>>>   xc: error: Failed to set CPUID policy: leaf 00000000, subleaf ffffffff, msr ffffffff (22 = Invalid argument): Internal error
>>>>>>   xc: error: Restore failed (22 = Invalid argument): Internal error
>>>>>>
>>>>>> which is a safety check to prevent resuming the guest when the CPUID data has
>>>>>> been truncated.  The problem is caused by shrinking of the max policies, which
>>>>>> is an ABI that needs handling compatibly between different versions of Xen.
>>>>>>
>>>>>> Furthermore, shrinking of the default policies also breaks things in some
>>>>>> cases, because certain cpuid= settings in a VM config file which used to have
>>>>>> an effect will now be silently discarded.
>>>>>>
>>>>>> This reverts commit 540d911c2813c3d8f4cdbb3f5672119e5e768a3d, as well as the
>>>>>> partial fix attempt in 81da2b544cbb003a5447c9b14d275746ad22ab37 (which added
>>>>>> one new case where cpuid= settings might not apply correctly) and restores the
>>>>>> same behaviour as Xen 4.15.
>>>>>>
>>>>>> Fixes: 540d911c2813 ("x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents")
>>>>>> Fixes: 81da2b544cbb ("x86/cpuid: prevent shrinking migrated policies max leaves")
>>>>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>>> While not strictly needed with Roger having given his already,
>>>>> Acked-by: Jan Beulich <jbeulich@suse.com>
>>>>> to signal my (basic) agreement with the course of action taken.
>>>>> Nevertheless I fear this is going to become yet one more case where
>>>>> future action is promised, but things then die out.
>>>> I'm certainly happy to look at newer versions of this patch, but I
>>>> think we should consider doing the shrinking only on the toolstack
>>>> said, and only after all the manipulations on the policy have been
>>>> performed.
>>>
>>> Correct.
>>>
>>> The max policies cannot be shrunk - they are, by definition, the upper
>>> bounds that we audit against.  (More precisely, they must never end up
>>> lower than an older Xen used to offer on the same configuration, and
>>> must not be lower anything the user may opt in to.)
>>
>> I disagree: For one, the user cannot opt in to anything beyond max policy.
>> Or else that policy isn't "max" anymore. The user may opt in to a higher
>> than useful max (sub)leaf, but that's independent.
> 
> Wouldn't it be possible for Xen to offer certain features to guests
> that are not part of the native CPUID policy, and that thus might
> require setting bit(s) on leaves that could otherwise be empty? That
> also requires some explicit checks in Xen in order to assert that
> leaves above the max one are empty.

Extra available features imo still need representing in the max policies.

> TBH I'm not sure what's the benefit of shrinking the max policies, as
> those are not to be used as-is by guests. They are an upper bound, but
> should be tailored for each guest usage, and should be shrunk before
> being used by guests on a case-by-case basis (ie: by the toolstack).

Well, if on an AMX-capable system the admin passed "cpuid=no-amx-tile"
(or whatever better equivalent to mean "AMX as a whole"), I wouldn't
see why logic everywhere would need working with 30 leaves when 13
would do. But yes, beyond saving some effort the effect of shrinking
max leaves should be non-noticable in the guest visible result.

Anyway, the main reason I will want to continue carrying the patch in
some form is for the actual function, as that gets modified by later
patches. Where ultimately it gets called from is secondary from that
perspective.

Jan


Re: [PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents"
Posted by Ian Jackson 2 years, 4 months ago
Jan Beulich writes ("Re: [PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents""):
> While not strictly needed with Roger having given his already,
> Acked-by: Jan Beulich <jbeulich@suse.com>
> to signal my (basic) agreement with the course of action taken.
> Nevertheless I fear this is going to become yet one more case where
> future action is promised, but things then die out.
> 
> Ian - I guess all this now needs is your R-a.

Thanks everyone.  Yes.

For the record,

Release-Acked-by: Ian Jackson <iwj@xenproject.org>

I will commit it myself now.

Ian.

Re: [PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents"
Posted by Ian Jackson 2 years, 4 months ago
Ian Jackson writes ("Re: [PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents""):
> Jan Beulich writes ("Re: [PATCH for-4.16] Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents""):
> > While not strictly needed with Roger having given his already,
> > Acked-by: Jan Beulich <jbeulich@suse.com>
> > to signal my (basic) agreement with the course of action taken.
> > Nevertheless I fear this is going to become yet one more case where
> > future action is promised, but things then die out.
> > 
> > Ian - I guess all this now needs is your R-a.
> 
> Thanks everyone.  Yes.
> 
> For the record,
> 
> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
> 
> I will commit it myself now.

Reviewing the thread, there was some discussion over the commit
message.  It wasn't clear to me whether there was still a change
intended.

I decided not to wait for 4.16, so I have committed it there as-is.
The commit message is less important there as it won't end up in most
archeaology.

When the discussion about the commit message has converged, please
would someone commit the patch to staging too.

Thanks,
Ian.