x86/PV32: restore PAE-extended-CR3 logic

[PATCH] x86/PV32: restore PAE-extended-CR3 logic

Posted by Jan Beulich 1 year, 2 months ago

While the PAE-extended-CR3 VM assist is a 32-bit only concept, it still
applies to guests also when run on a 64-bit hypervisor: The "extended
CR3" format has to be used there as well, to fit the address in the only
32-bit wide register there. As a result it was a mistake that the check
was never enabled for that case, and was then mistakenly deleted in the
course of removal of 32-bit-Xen code (218adf199e68 ["x86: We can assume
CONFIG_PAGING_LEVELS==4"]).

Similarly during Dom0 construction kernel awareness needs to be taken
into account, and respective code was again mistakenly never enabled for
32-bit Dom0 when running on 64-bit Xen (and thus wrongly deleted by
5d1181a5ea5e ["xen: Remove x86_32 build target"]).

At the same time restrict enabling of the assist for Dom0 to just the
32-bit case. Furthermore there's no need for an atomic update there.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I was uncertain whether to add a check to the CR3 guest read path,
raising e.g. #GP(0) when the value read wouldn't fit but also may not
be converted to "extended" format (overflow is possible there in
principle because of the control tools "slack" in promote_l3_table()).

In that context I was puzzled to find no check on the CR3 guest write
path even in 4.2: A guest (bogusly) setting the PCD or PWT bits (or any
of the low reserved ones) could observe anomalous behavior rather than
plain failure.

As to a Fixes: tag - it's pretty unclear which of the many original
32-on-64 changes to blame. I don't think the two cited commits should
be referenced there, as they didn't break anything that wasn't already
broken.

--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1520,6 +1520,23 @@ static int promote_l3_table(struct page_
     unsigned int   partial_flags = page->partial_flags;
     l3_pgentry_t   l3e = l3e_empty();
 
+    /*
+     * PAE pgdirs above 4GB are unacceptable if a 32-bit guest does not
+     * understand the weird 'extended cr3' format for dealing with high-order
+     * address bits. We cut some slack for control tools (before vcpu0 is
+     * initialised).
+     */
+    if ( is_pv_32bit_domain(d) &&
+         unlikely(!VM_ASSIST(d, pae_extended_cr3)) &&
+         mfn_x(l3mfn) >= 0x100000 &&
+         d->vcpu[0] && d->vcpu[0]->is_initialised )
+    {
+        gdprintk(XENLOG_WARNING,
+                 "PAE pgd must be below 4GB (%#lx >= 0x100000)",
+                 mfn_x(l3mfn));
+        return -ERANGE;
+    }
+
     pl3e = map_domain_page(l3mfn);
 
     /*
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -490,12 +490,12 @@ int __init dom0_construct_pv(struct doma
 
     nr_pages = dom0_compute_nr_pages(d, &parms, initrd_len);
 
-    if ( parms.pae == XEN_PAE_EXTCR3 )
-            set_bit(VMASST_TYPE_pae_extended_cr3, &d->vm_assist);
-
 #ifdef CONFIG_PV32
     if ( elf_32bit(&elf) )
     {
+        if ( parms.pae == XEN_PAE_EXTCR3 )
+            __set_bit(VMASST_TYPE_pae_extended_cr3, &d->vm_assist);
+
         if ( !pv_shim && (parms.virt_hv_start_low != UNSET_ADDR) )
         {
             unsigned long value = ROUNDUP(parms.virt_hv_start_low,
@@ -594,7 +594,10 @@ int __init dom0_construct_pv(struct doma
         vphysmap_start = parms.p2m_base;
         vphysmap_end   = vphysmap_start + nr_pages * sizeof(unsigned long);
     }
-    page = alloc_domheap_pages(d, order, MEMF_no_scrub);
+    page = alloc_domheap_pages(d, order,
+                               MEMF_no_scrub |
+                               (VM_ASSIST(d, pae_extended_cr3) ||
+                                !compat ? 0 : MEMF_bits(32)));
     if ( page == NULL )
         panic("Not enough RAM for domain 0 allocation\n");
     alloc_spfn = mfn_x(page_to_mfn(page));

Re: [PATCH] x86/PV32: restore PAE-extended-CR3 logic

Posted by Roger Pau Monné 2 months, 2 weeks ago

On Wed, Feb 15, 2023 at 03:54:11PM +0100, Jan Beulich wrote:
> While the PAE-extended-CR3 VM assist is a 32-bit only concept, it still
> applies to guests also when run on a 64-bit hypervisor: The "extended
> CR3" format has to be used there as well, to fit the address in the only
> 32-bit wide register there. As a result it was a mistake that the check
> was never enabled for that case, and was then mistakenly deleted in the
> course of removal of 32-bit-Xen code (218adf199e68 ["x86: We can assume
> CONFIG_PAGING_LEVELS==4"]).
> 
> Similarly during Dom0 construction kernel awareness needs to be taken
> into account, and respective code was again mistakenly never enabled for
> 32-bit Dom0 when running on 64-bit Xen (and thus wrongly deleted by
> 5d1181a5ea5e ["xen: Remove x86_32 build target"]).
> 
> At the same time restrict enabling of the assist for Dom0 to just the
> 32-bit case. Furthermore there's no need for an atomic update there.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.

Re: [PATCH] x86/PV32: restore PAE-extended-CR3 logic

Posted by Andrew Cooper 1 year ago

On 15/02/2023 2:54 pm, Jan Beulich wrote:
> While the PAE-extended-CR3 VM assist is a 32-bit only concept, it still
> applies to guests also when run on a 64-bit hypervisor:

Is this really true?  Even when looking at Xen 4.2, 32bit guests are
required to pass a full 4k page, not a 32b quad.

Which makes complete sense.  It was a hard requirement of 32bit non-PAE
guests, so it was a natural restriction to maintain into 32bit PAE guests.

This is *only* a 32-on-64 issue, because this is the only case a 32bit
guest could in principle use an L3 placed above the 4G boundary.

>  The "extended
> CR3" format has to be used there as well, to fit the address in the only
> 32-bit wide register there. As a result it was a mistake that the check
> was never enabled for that case, and was then mistakenly deleted in the
> course of removal of 32-bit-Xen code (218adf199e68 ["x86: We can assume
> CONFIG_PAGING_LEVELS==4"]).
>
> Similarly during Dom0 construction kernel awareness needs to be taken
> into account, and respective code was again mistakenly never enabled for
> 32-bit Dom0 when running on 64-bit Xen (and thus wrongly deleted by
> 5d1181a5ea5e ["xen: Remove x86_32 build target"]).
>
> At the same time restrict enabling of the assist for Dom0 to just the
> 32-bit case. Furthermore there's no need for an atomic update there.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I was uncertain whether to add a check to the CR3 guest read path,
> raising e.g. #GP(0) when the value read wouldn't fit but also may not
> be converted to "extended" format (overflow is possible there in
> principle because of the control tools "slack" in promote_l3_table()).
>
> In that context I was puzzled to find no check on the CR3 guest write
> path even in 4.2: A guest (bogusly) setting the PCD or PWT bits (or any
> of the low reserved ones) could observe anomalous behavior rather than
> plain failure.
>
> As to a Fixes: tag - it's pretty unclear which of the many original
> 32-on-64 changes to blame. I don't think the two cited commits should
> be referenced there, as they didn't break anything that wasn't already
> broken.
>
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -1520,6 +1520,23 @@ static int promote_l3_table(struct page_
>      unsigned int   partial_flags = page->partial_flags;
>      l3_pgentry_t   l3e = l3e_empty();
>  
> +    /*
> +     * PAE pgdirs above 4GB are unacceptable if a 32-bit guest does not
> +     * understand the weird 'extended cr3' format for dealing with high-order
> +     * address bits. We cut some slack for control tools (before vcpu0 is
> +     * initialised).
> +     */
> +    if ( is_pv_32bit_domain(d) &&
> +         unlikely(!VM_ASSIST(d, pae_extended_cr3)) &&
> +         mfn_x(l3mfn) >= 0x100000 &&
> +         d->vcpu[0] && d->vcpu[0]->is_initialised )
> +    {
> +        gdprintk(XENLOG_WARNING,
> +                 "PAE pgd must be below 4GB (%#lx >= 0x100000)",
> +                 mfn_x(l3mfn));
> +        return -ERANGE;
> +    }

Having dug through source history, I see this is largely the form that
it used to be.

But I'm unconvinced by the "cut control tools some slack".  I'm quite
tired of different bits of Xen taking on unnecessary complexity because
people are unwilling to fix the problem at the correct layer.

A toolstack which has non-pae_extended_cr3 guest on its hand will know
this before any pagetables get allocated.

That said...

I don't recall encountering this in migration v2, and looking at the
logic now, I'm pretty sure it will malfunction with a
non-pae_extended_cr3 guest.  When interpreting the guest cr3 value, we
blindly make the transform on the save and restore side, and I can't
spot anything limiting the L3 tables to below the 4G boundary.

So I'm reasonably sure I accidentally broke such guests in Xen 4.6(?)
and the absence of complaints in the intervening 8(?) years shows how
many are in use in practice.

For this check specifically, I'd suggest prohibiting non-32p guests from
setting pae_extended_cr3 in the first place (I see no limit currently),
and then simplifying the check to just

if ( unlikely(!VM_ASSIST(d, pae_extended_cr3)) &&
     mfn_x(l3mfn) >= PFN_DOWN(GB(4)) )

And I suppose I need to make a non-pae_extended_cr3 XTF test which is
migrate-capable...

~Andrew

Re: [PATCH] x86/PV32: restore PAE-extended-CR3 logic

Posted by Jan Beulich 1 year ago

On 04.04.2023 15:08, Andrew Cooper wrote:
> On 15/02/2023 2:54 pm, Jan Beulich wrote:
>> While the PAE-extended-CR3 VM assist is a 32-bit only concept, it still
>> applies to guests also when run on a 64-bit hypervisor:
> 
> Is this really true?  Even when looking at Xen 4.2, 32bit guests are
> required to pass a full 4k page, not a 32b quad.

The full-page vs 32b-quad aspect is orthogonal. This VM-assist is solely
about where that data structure is, not what size it is.

> Which makes complete sense.  It was a hard requirement of 32bit non-PAE
> guests, so it was a natural restriction to maintain into 32bit PAE guests.
> 
> This is *only* a 32-on-64 issue, because this is the only case a 32bit
> guest could in principle use an L3 placed above the 4G boundary.

Not exactly. 32-bit Xen maintained a 4-entry "shadow" array below 4G
that it would copy (massage) the guest entries into upon CR3 reload
(just look for struct pae_l3_cache in the old sources). So above-4G
page table base was possible there as well.

>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -1520,6 +1520,23 @@ static int promote_l3_table(struct page_
>>      unsigned int   partial_flags = page->partial_flags;
>>      l3_pgentry_t   l3e = l3e_empty();
>>  
>> +    /*
>> +     * PAE pgdirs above 4GB are unacceptable if a 32-bit guest does not
>> +     * understand the weird 'extended cr3' format for dealing with high-order
>> +     * address bits. We cut some slack for control tools (before vcpu0 is
>> +     * initialised).
>> +     */
>> +    if ( is_pv_32bit_domain(d) &&
>> +         unlikely(!VM_ASSIST(d, pae_extended_cr3)) &&
>> +         mfn_x(l3mfn) >= 0x100000 &&
>> +         d->vcpu[0] && d->vcpu[0]->is_initialised )
>> +    {
>> +        gdprintk(XENLOG_WARNING,
>> +                 "PAE pgd must be below 4GB (%#lx >= 0x100000)",
>> +                 mfn_x(l3mfn));
>> +        return -ERANGE;
>> +    }
> 
> Having dug through source history, I see this is largely the form that
> it used to be.
> 
> But I'm unconvinced by the "cut control tools some slack".  I'm quite
> tired of different bits of Xen taking on unnecessary complexity because
> people are unwilling to fix the problem at the correct layer.

But anything tools do before having created the first vCPU would not
have had any means to engage the VM-assist. I.e. ...

> A toolstack which has non-pae_extended_cr3 guest on its hand will know
> this before any pagetables get allocated.

... this knowledge buys it nothing: It would need to move the table
to below 4G irrespective of knowing that the guest can deal with
bigger addresses, just to get past this check.

> For this check specifically, I'd suggest prohibiting non-32p guests from
> setting pae_extended_cr3 in the first place (I see no limit currently),
> and then simplifying the check to just
> 
> if ( unlikely(!VM_ASSIST(d, pae_extended_cr3)) &&
>      mfn_x(l3mfn) >= PFN_DOWN(GB(4)) )

Dropping the is_pv_32bit_domain() check isn't possible because we can't,
all of the sudden, fail 64-bit guests' requests to enable this VM-
assist (no matter that we know that it is of no use to them). Dropping
the control-tools part of the condition is at least problematic as well,
as per above. Albeit I'll admit I didn't check whether nowadays vCPU 0
is initialized before page tables are built. But I think it's more
sensible the other way around: CR3 setting (in the hypervisor) is less
involved when the page was already validated as an L3 one.

Jan

Re: [PATCH] x86/PV32: restore PAE-extended-CR3 logic

Posted by Andrew Cooper 1 year ago

On 04/04/2023 3:21 pm, Jan Beulich wrote:
> On 04.04.2023 15:08, Andrew Cooper wrote:
>> On 15/02/2023 2:54 pm, Jan Beulich wrote:
>>> While the PAE-extended-CR3 VM assist is a 32-bit only concept, it still
>>> applies to guests also when run on a 64-bit hypervisor:
>> Is this really true?  Even when looking at Xen 4.2, 32bit guests are
>> required to pass a full 4k page, not a 32b quad.
> The full-page vs 32b-quad aspect is orthogonal. This VM-assist is solely
> about where that data structure is, not what size it is.
>
>> Which makes complete sense.  It was a hard requirement of 32bit non-PAE
>> guests, so it was a natural restriction to maintain into 32bit PAE guests.
>>
>> This is *only* a 32-on-64 issue, because this is the only case a 32bit
>> guest could in principle use an L3 placed above the 4G boundary.
> Not exactly. 32-bit Xen maintained a 4-entry "shadow" array below 4G
> that it would copy (massage) the guest entries into upon CR3 reload
> (just look for struct pae_l3_cache in the old sources). So above-4G
> page table base was possible there as well.

Oh eww, so while Xen never gained an optimisation to permit only a 32b
quad in place of a full 4k L3 table, it did support having the full
tables higher.

(This code is especially hard to follow with #ifdefary in the common
mm.c when there are perfectly good x86_{32,64}/mm.c's to use for
differing function implementations...)

>>> --- a/xen/arch/x86/mm.c
>>> +++ b/xen/arch/x86/mm.c
>>> @@ -1520,6 +1520,23 @@ static int promote_l3_table(struct page_
>>>      unsigned int   partial_flags = page->partial_flags;
>>>      l3_pgentry_t   l3e = l3e_empty();
>>>  
>>> +    /*
>>> +     * PAE pgdirs above 4GB are unacceptable if a 32-bit guest does not
>>> +     * understand the weird 'extended cr3' format for dealing with high-order
>>> +     * address bits. We cut some slack for control tools (before vcpu0 is
>>> +     * initialised).
>>> +     */
>>> +    if ( is_pv_32bit_domain(d) &&
>>> +         unlikely(!VM_ASSIST(d, pae_extended_cr3)) &&
>>> +         mfn_x(l3mfn) >= 0x100000 &&
>>> +         d->vcpu[0] && d->vcpu[0]->is_initialised )
>>> +    {
>>> +        gdprintk(XENLOG_WARNING,
>>> +                 "PAE pgd must be below 4GB (%#lx >= 0x100000)",
>>> +                 mfn_x(l3mfn));
>>> +        return -ERANGE;
>>> +    }
>> Having dug through source history, I see this is largely the form that
>> it used to be.
>>
>> But I'm unconvinced by the "cut control tools some slack".  I'm quite
>> tired of different bits of Xen taking on unnecessary complexity because
>> people are unwilling to fix the problem at the correct layer.
> But anything tools do before having created the first vCPU would not
> have had any means to engage the VM-assist. I.e. ...
>
>> A toolstack which has non-pae_extended_cr3 guest on its hand will know
>> this before any pagetables get allocated.
> ... this knowledge buys it nothing: It would need to move the table
> to below 4G irrespective of knowing that the guest can deal with
> bigger addresses, just to get past this check.

This just goes from bad to worse.  It is mad that the VMASSIST flags
can't be set ahead of a vcpu initialise hypercall.

But.

The code in xg_dom_x86.c unconditionally moves the L3 below the 4G
boundary, so the thing actually pinned as an L3 will always pass the check.

Which is just as well because it too blindly applies the extended-cr3
transform momentarily after conditionally setting
VMASST_TYPE_pae_extended_cr3...

So a 32bit PV guests will pass the check irrespective of their
pae_extended_cr3 setting.

I note looking at this code that it's absurd.  The single L3 ought to be
allocated with memf(32), rather than being allocated regularly and then
reallocated lower if it happened to be too high (which will be the
increasingly common case as it's getting harder and harder to find
systems with <4G of RAM).  Making the memf conditional on the
pae_extended_cr3 needs to come with some better xen<->tools APIs.

>> For this check specifically, I'd suggest prohibiting non-32p guests from
>> setting pae_extended_cr3 in the first place (I see no limit currently),
>> and then simplifying the check to just
>>
>> if ( unlikely(!VM_ASSIST(d, pae_extended_cr3)) &&
>>      mfn_x(l3mfn) >= PFN_DOWN(GB(4)) )
> Dropping the is_pv_32bit_domain() check isn't possible because we can't,
> all of the sudden, fail 64-bit guests' requests to enable this VM-
> assist (no matter that we know that it is of no use to them).

I'm not so sure about this.  This VMASSIST cannot credibly be set at
runtime, and making a restriction here is not usefully different from
prior patches of yours that relax checks in Xen that still break on
older builds.

But as I know you're going to argue with that position, I'll at least
note that ignoring a 64bit guest's request to set that bit would be less
bad than the current behaviour.

> Dropping
> the control-tools part of the condition is at least problematic as well,
> as per above. Albeit I'll admit I didn't check whether nowadays vCPU 0
> is initialized before page tables are built. But I think it's more
> sensible the other way around: CR3 setting (in the hypervisor) is less
> involved when the page was already validated as an L3 one.

All of this is before the guest starts running, so it doesn't matter.

The most efficient way (from Xen's point of view) is to pin the L1s,
then L2s, then L3s and then set vCR3, because this is the only order
where we don't have to do do recursive type acquisition.

But, the most efficient way for the toolstack to do this is the opposite
way around, because making Xen do recursive type acquisition is faster
than other ways, and turns all subsequent hypercalls into almost no-ops.

I doubt there is a relevant difference between these two approaches.

And it doesn't matter either.  The check won't ever trip from domain
creation (see above), nor from migration (we set vcpu context before
pinning the pagetables, and a non pae_extended_cr3 will have exploded on
the source side).

So there really are no toolstack codepaths that can trip the check. 
Future improvements that might trip the check can come with a less
broken hypercall as a prerequisite.

~Andrew

Re: [PATCH] x86/PV32: restore PAE-extended-CR3 logic

Posted by Jan Beulich 1 year ago

On 04.04.2023 22:40, Andrew Cooper wrote:
> On 04/04/2023 3:21 pm, Jan Beulich wrote:
>> On 04.04.2023 15:08, Andrew Cooper wrote:
>>> On 15/02/2023 2:54 pm, Jan Beulich wrote:
>>>> While the PAE-extended-CR3 VM assist is a 32-bit only concept, it still
>>>> applies to guests also when run on a 64-bit hypervisor:
>>> Is this really true?  Even when looking at Xen 4.2, 32bit guests are
>>> required to pass a full 4k page, not a 32b quad.
>> The full-page vs 32b-quad aspect is orthogonal. This VM-assist is solely
>> about where that data structure is, not what size it is.
>>
>>> Which makes complete sense.  It was a hard requirement of 32bit non-PAE
>>> guests, so it was a natural restriction to maintain into 32bit PAE guests.
>>>
>>> This is *only* a 32-on-64 issue, because this is the only case a 32bit
>>> guest could in principle use an L3 placed above the 4G boundary.
>> Not exactly. 32-bit Xen maintained a 4-entry "shadow" array below 4G
>> that it would copy (massage) the guest entries into upon CR3 reload
>> (just look for struct pae_l3_cache in the old sources). So above-4G
>> page table base was possible there as well.
> 
> Oh eww, so while Xen never gained an optimisation to permit only a 32b
> quad in place of a full 4k L3 table, it did support having the full
> tables higher.
> 
> (This code is especially hard to follow with #ifdefary in the common
> mm.c when there are perfectly good x86_{32,64}/mm.c's to use for
> differing function implementations...)

Except that the #ifdef-ary was wrong, and should have been suitable if()
instead. Those having been #ifdef was why the respective code got
removed in the course of purging 32-bit Xen logic.

>>>> --- a/xen/arch/x86/mm.c
>>>> +++ b/xen/arch/x86/mm.c
>>>> @@ -1520,6 +1520,23 @@ static int promote_l3_table(struct page_
>>>>      unsigned int   partial_flags = page->partial_flags;
>>>>      l3_pgentry_t   l3e = l3e_empty();
>>>>  
>>>> +    /*
>>>> +     * PAE pgdirs above 4GB are unacceptable if a 32-bit guest does not
>>>> +     * understand the weird 'extended cr3' format for dealing with high-order
>>>> +     * address bits. We cut some slack for control tools (before vcpu0 is
>>>> +     * initialised).
>>>> +     */
>>>> +    if ( is_pv_32bit_domain(d) &&
>>>> +         unlikely(!VM_ASSIST(d, pae_extended_cr3)) &&
>>>> +         mfn_x(l3mfn) >= 0x100000 &&
>>>> +         d->vcpu[0] && d->vcpu[0]->is_initialised )
>>>> +    {
>>>> +        gdprintk(XENLOG_WARNING,
>>>> +                 "PAE pgd must be below 4GB (%#lx >= 0x100000)",
>>>> +                 mfn_x(l3mfn));
>>>> +        return -ERANGE;
>>>> +    }
>>> Having dug through source history, I see this is largely the form that
>>> it used to be.
>>>
>>> But I'm unconvinced by the "cut control tools some slack".  I'm quite
>>> tired of different bits of Xen taking on unnecessary complexity because
>>> people are unwilling to fix the problem at the correct layer.
>> But anything tools do before having created the first vCPU would not
>> have had any means to engage the VM-assist. I.e. ...
>>
>>> A toolstack which has non-pae_extended_cr3 guest on its hand will know
>>> this before any pagetables get allocated.
>> ... this knowledge buys it nothing: It would need to move the table
>> to below 4G irrespective of knowing that the guest can deal with
>> bigger addresses, just to get past this check.
> 
> This just goes from bad to worse.  It is mad that the VMASSIST flags
> can't be set ahead of a vcpu initialise hypercall.
> 
> But.
> 
> The code in xg_dom_x86.c unconditionally moves the L3 below the 4G
> boundary, so the thing actually pinned as an L3 will always pass the check.

Where do you see this being dome unconditionally? I only see this inside
a check of dom->parms->pae being XEN_PAE_YES.

> Which is just as well because it too blindly applies the extended-cr3
> transform momentarily after conditionally setting
> VMASST_TYPE_pae_extended_cr3...

Doing this "blindly" is (kind of) fine, isn't it? The transformation is
an identity one when extended-CR3 isn't in use.

> So a 32bit PV guests will pass the check irrespective of their
> pae_extended_cr3 setting.

As per above - I question this: dom->parms->pae isn't a simple boolean
(see enum xen_pae_type in libelf.h, and as can also be seen from the
condition around the enabling of the VM assist).

>>> For this check specifically, I'd suggest prohibiting non-32p guests from
>>> setting pae_extended_cr3 in the first place (I see no limit currently),
>>> and then simplifying the check to just
>>>
>>> if ( unlikely(!VM_ASSIST(d, pae_extended_cr3)) &&
>>>      mfn_x(l3mfn) >= PFN_DOWN(GB(4)) )
>> Dropping the is_pv_32bit_domain() check isn't possible because we can't,
>> all of the sudden, fail 64-bit guests' requests to enable this VM-
>> assist (no matter that we know that it is of no use to them).
> 
> I'm not so sure about this.  This VMASSIST cannot credibly be set at
> runtime,

Why not? A kernel may statically say XEN_PAE_YES, resulting in its
initial L3 to be relocated to below 4G, and then - before any further
page table creation - enable the assist.

> and making a restriction here is not usefully different from
> prior patches of yours that relax checks in Xen that still break on
> older builds.

How's this not "usefully different"? I replace potentially silent
misbehavior by the failing of a hypercall (which ought to be noticed)
plus a (debug build only) log message.

> But as I know you're going to argue with that position, I'll at least
> note that ignoring a 64bit guest's request to set that bit would be less
> bad than the current behaviour.

We already ignore this bit (as in: it has no effect). Yet just like we
can't fail the request all of the sudden, we also can't zap the bit
from the supplied mask, as kernels may legitimately check that what
they read is what they set. (That said - I'm unaware of such checking
anywhere.)

Jan

>> Dropping
>> the control-tools part of the condition is at least problematic as well,
>> as per above. Albeit I'll admit I didn't check whether nowadays vCPU 0
>> is initialized before page tables are built. But I think it's more
>> sensible the other way around: CR3 setting (in the hypervisor) is less
>> involved when the page was already validated as an L3 one.
> 
> All of this is before the guest starts running, so it doesn't matter.
> 
> The most efficient way (from Xen's point of view) is to pin the L1s,
> then L2s, then L3s and then set vCR3, because this is the only order
> where we don't have to do do recursive type acquisition.
> 
> But, the most efficient way for the toolstack to do this is the opposite
> way around, because making Xen do recursive type acquisition is faster
> than other ways, and turns all subsequent hypercalls into almost no-ops.
> 
> I doubt there is a relevant difference between these two approaches.
> 
> 
> And it doesn't matter either.  The check won't ever trip from domain
> creation (see above), nor from migration (we set vcpu context before
> pinning the pagetables, and a non pae_extended_cr3 will have exploded on
> the source side).
> 
> So there really are no toolstack codepaths that can trip the check. 
> Future improvements that might trip the check can come with a less
> broken hypercall as a prerequisite.
> 
> ~Andrew

Re: [PATCH] x86/PV32: restore PAE-extended-CR3 logic

Posted by Roger Pau Monné 1 year ago

On Wed, Feb 15, 2023 at 03:54:11PM +0100, Jan Beulich wrote:
> While the PAE-extended-CR3 VM assist is a 32-bit only concept, it still
> applies to guests also when run on a 64-bit hypervisor: The "extended
> CR3" format has to be used there as well, to fit the address in the only
> 32-bit wide register there. As a result it was a mistake that the check
> was never enabled for that case, and was then mistakenly deleted in the
> course of removal of 32-bit-Xen code (218adf199e68 ["x86: We can assume
> CONFIG_PAGING_LEVELS==4"]).
> 
> Similarly during Dom0 construction kernel awareness needs to be taken
> into account, and respective code was again mistakenly never enabled for
> 32-bit Dom0 when running on 64-bit Xen (and thus wrongly deleted by
> 5d1181a5ea5e ["xen: Remove x86_32 build target"]).
> 
> At the same time restrict enabling of the assist for Dom0 to just the
> 32-bit case. Furthermore there's no need for an atomic update there.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I was uncertain whether to add a check to the CR3 guest read path,
> raising e.g. #GP(0) when the value read wouldn't fit but also may not
> be converted to "extended" format (overflow is possible there in
> principle because of the control tools "slack" in promote_l3_table()).
> 
> In that context I was puzzled to find no check on the CR3 guest write
> path even in 4.2: A guest (bogusly) setting the PCD or PWT bits (or any
> of the low reserved ones) could observe anomalous behavior rather than
> plain failure.
> 
> As to a Fixes: tag - it's pretty unclear which of the many original
> 32-on-64 changes to blame. I don't think the two cited commits should
> be referenced there, as they didn't break anything that wasn't already
> broken.
> 
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -1520,6 +1520,23 @@ static int promote_l3_table(struct page_
>      unsigned int   partial_flags = page->partial_flags;
>      l3_pgentry_t   l3e = l3e_empty();
>  
> +    /*
> +     * PAE pgdirs above 4GB are unacceptable if a 32-bit guest does not
> +     * understand the weird 'extended cr3' format for dealing with high-order
> +     * address bits. We cut some slack for control tools (before vcpu0 is
> +     * initialised).

Don't we then need some check in the vCPU init path to assure that the
cr3 is < 32bits if we allow those to initially be set?

Or will the initialization unconditionally overwrite any previous cr3
value?

Thanks, Roger.

Re: [PATCH] x86/PV32: restore PAE-extended-CR3 logic

Posted by Jan Beulich 1 year ago

On 04.04.2023 12:12, Roger Pau Monné wrote:
> On Wed, Feb 15, 2023 at 03:54:11PM +0100, Jan Beulich wrote:
>> While the PAE-extended-CR3 VM assist is a 32-bit only concept, it still
>> applies to guests also when run on a 64-bit hypervisor: The "extended
>> CR3" format has to be used there as well, to fit the address in the only
>> 32-bit wide register there. As a result it was a mistake that the check
>> was never enabled for that case, and was then mistakenly deleted in the
>> course of removal of 32-bit-Xen code (218adf199e68 ["x86: We can assume
>> CONFIG_PAGING_LEVELS==4"]).
>>
>> Similarly during Dom0 construction kernel awareness needs to be taken
>> into account, and respective code was again mistakenly never enabled for
>> 32-bit Dom0 when running on 64-bit Xen (and thus wrongly deleted by
>> 5d1181a5ea5e ["xen: Remove x86_32 build target"]).
>>
>> At the same time restrict enabling of the assist for Dom0 to just the
>> 32-bit case. Furthermore there's no need for an atomic update there.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> I was uncertain whether to add a check to the CR3 guest read path,
>> raising e.g. #GP(0) when the value read wouldn't fit but also may not
>> be converted to "extended" format (overflow is possible there in
>> principle because of the control tools "slack" in promote_l3_table()).
>>
>> In that context I was puzzled to find no check on the CR3 guest write
>> path even in 4.2: A guest (bogusly) setting the PCD or PWT bits (or any
>> of the low reserved ones) could observe anomalous behavior rather than
>> plain failure.
>>
>> As to a Fixes: tag - it's pretty unclear which of the many original
>> 32-on-64 changes to blame. I don't think the two cited commits should
>> be referenced there, as they didn't break anything that wasn't already
>> broken.
>>
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -1520,6 +1520,23 @@ static int promote_l3_table(struct page_
>>      unsigned int   partial_flags = page->partial_flags;
>>      l3_pgentry_t   l3e = l3e_empty();
>>  
>> +    /*
>> +     * PAE pgdirs above 4GB are unacceptable if a 32-bit guest does not
>> +     * understand the weird 'extended cr3' format for dealing with high-order
>> +     * address bits. We cut some slack for control tools (before vcpu0 is
>> +     * initialised).
> 
> Don't we then need some check in the vCPU init path to assure that the
> cr3 is < 32bits if we allow those to initially be set?
> 
> Or will the initialization unconditionally overwrite any previous cr3
> value?

That's not the way I understand this "cut some slack". Instead I read it
to be meant to cover for the VM-assist bit not being set, yet. Beyond
that it is assumed to be tool stack's responsibility to constrain
addresses suitably. If it doesn't, it'll simply break the guest. (There
is some guessing on my part involved here, as the original introduction
of that code didn't further explain things.)

Nevertheless going beyond what was there originally might be desirable.
Yet it's not really clear to me when / how to carry out such further
checking. For example I don't fancy walking all of the domain's pages
when it's about to be unpaused for the first time.

Jan

Re: [PATCH] x86/PV32: restore PAE-extended-CR3 logic

Posted by Roger Pau Monné 1 year ago

On Tue, Apr 04, 2023 at 12:31:31PM +0200, Jan Beulich wrote:
> On 04.04.2023 12:12, Roger Pau Monné wrote:
> > On Wed, Feb 15, 2023 at 03:54:11PM +0100, Jan Beulich wrote:
> >> While the PAE-extended-CR3 VM assist is a 32-bit only concept, it still
> >> applies to guests also when run on a 64-bit hypervisor: The "extended
> >> CR3" format has to be used there as well, to fit the address in the only
> >> 32-bit wide register there. As a result it was a mistake that the check
> >> was never enabled for that case, and was then mistakenly deleted in the
> >> course of removal of 32-bit-Xen code (218adf199e68 ["x86: We can assume
> >> CONFIG_PAGING_LEVELS==4"]).
> >>
> >> Similarly during Dom0 construction kernel awareness needs to be taken
> >> into account, and respective code was again mistakenly never enabled for
> >> 32-bit Dom0 when running on 64-bit Xen (and thus wrongly deleted by
> >> 5d1181a5ea5e ["xen: Remove x86_32 build target"]).
> >>
> >> At the same time restrict enabling of the assist for Dom0 to just the
> >> 32-bit case. Furthermore there's no need for an atomic update there.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> I was uncertain whether to add a check to the CR3 guest read path,
> >> raising e.g. #GP(0) when the value read wouldn't fit but also may not
> >> be converted to "extended" format (overflow is possible there in
> >> principle because of the control tools "slack" in promote_l3_table()).
> >>
> >> In that context I was puzzled to find no check on the CR3 guest write
> >> path even in 4.2: A guest (bogusly) setting the PCD or PWT bits (or any
> >> of the low reserved ones) could observe anomalous behavior rather than
> >> plain failure.
> >>
> >> As to a Fixes: tag - it's pretty unclear which of the many original
> >> 32-on-64 changes to blame. I don't think the two cited commits should
> >> be referenced there, as they didn't break anything that wasn't already
> >> broken.
> >>
> >> --- a/xen/arch/x86/mm.c
> >> +++ b/xen/arch/x86/mm.c
> >> @@ -1520,6 +1520,23 @@ static int promote_l3_table(struct page_
> >>      unsigned int   partial_flags = page->partial_flags;
> >>      l3_pgentry_t   l3e = l3e_empty();
> >>  
> >> +    /*
> >> +     * PAE pgdirs above 4GB are unacceptable if a 32-bit guest does not
> >> +     * understand the weird 'extended cr3' format for dealing with high-order
> >> +     * address bits. We cut some slack for control tools (before vcpu0 is
> >> +     * initialised).
> > 
> > Don't we then need some check in the vCPU init path to assure that the
> > cr3 is < 32bits if we allow those to initially be set?
> > 
> > Or will the initialization unconditionally overwrite any previous cr3
> > value?
> 
> That's not the way I understand this "cut some slack". Instead I read it
> to be meant to cover for the VM-assist bit not being set, yet. Beyond
> that it is assumed to be tool stack's responsibility to constrain
> addresses suitably. If it doesn't, it'll simply break the guest. (There
> is some guessing on my part involved here, as the original introduction
> of that code didn't further explain things.)

If it's just the guest that's broken I would think it's fine.  As long
as such mismatch doesn't cause issues in the hypervisor internal state.

Did you see a toolstack setting such entries before pae_extended_cr3
is set?

Thanks, Roger.

Re: [PATCH] x86/PV32: restore PAE-extended-CR3 logic

Posted by Jan Beulich 1 year ago

On 04.04.2023 13:41, Roger Pau Monné wrote:
> On Tue, Apr 04, 2023 at 12:31:31PM +0200, Jan Beulich wrote:
>> On 04.04.2023 12:12, Roger Pau Monné wrote:
>>> On Wed, Feb 15, 2023 at 03:54:11PM +0100, Jan Beulich wrote:
>>>> While the PAE-extended-CR3 VM assist is a 32-bit only concept, it still
>>>> applies to guests also when run on a 64-bit hypervisor: The "extended
>>>> CR3" format has to be used there as well, to fit the address in the only
>>>> 32-bit wide register there. As a result it was a mistake that the check
>>>> was never enabled for that case, and was then mistakenly deleted in the
>>>> course of removal of 32-bit-Xen code (218adf199e68 ["x86: We can assume
>>>> CONFIG_PAGING_LEVELS==4"]).
>>>>
>>>> Similarly during Dom0 construction kernel awareness needs to be taken
>>>> into account, and respective code was again mistakenly never enabled for
>>>> 32-bit Dom0 when running on 64-bit Xen (and thus wrongly deleted by
>>>> 5d1181a5ea5e ["xen: Remove x86_32 build target"]).
>>>>
>>>> At the same time restrict enabling of the assist for Dom0 to just the
>>>> 32-bit case. Furthermore there's no need for an atomic update there.
>>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> ---
>>>> I was uncertain whether to add a check to the CR3 guest read path,
>>>> raising e.g. #GP(0) when the value read wouldn't fit but also may not
>>>> be converted to "extended" format (overflow is possible there in
>>>> principle because of the control tools "slack" in promote_l3_table()).
>>>>
>>>> In that context I was puzzled to find no check on the CR3 guest write
>>>> path even in 4.2: A guest (bogusly) setting the PCD or PWT bits (or any
>>>> of the low reserved ones) could observe anomalous behavior rather than
>>>> plain failure.
>>>>
>>>> As to a Fixes: tag - it's pretty unclear which of the many original
>>>> 32-on-64 changes to blame. I don't think the two cited commits should
>>>> be referenced there, as they didn't break anything that wasn't already
>>>> broken.
>>>>
>>>> --- a/xen/arch/x86/mm.c
>>>> +++ b/xen/arch/x86/mm.c
>>>> @@ -1520,6 +1520,23 @@ static int promote_l3_table(struct page_
>>>>      unsigned int   partial_flags = page->partial_flags;
>>>>      l3_pgentry_t   l3e = l3e_empty();
>>>>  
>>>> +    /*
>>>> +     * PAE pgdirs above 4GB are unacceptable if a 32-bit guest does not
>>>> +     * understand the weird 'extended cr3' format for dealing with high-order
>>>> +     * address bits. We cut some slack for control tools (before vcpu0 is
>>>> +     * initialised).
>>>
>>> Don't we then need some check in the vCPU init path to assure that the
>>> cr3 is < 32bits if we allow those to initially be set?
>>>
>>> Or will the initialization unconditionally overwrite any previous cr3
>>> value?
>>
>> That's not the way I understand this "cut some slack". Instead I read it
>> to be meant to cover for the VM-assist bit not being set, yet. Beyond
>> that it is assumed to be tool stack's responsibility to constrain
>> addresses suitably. If it doesn't, it'll simply break the guest. (There
>> is some guessing on my part involved here, as the original introduction
>> of that code didn't further explain things.)
> 
> If it's just the guest that's broken I would think it's fine.  As long
> as such mismatch doesn't cause issues in the hypervisor internal state.
> 
> Did you see a toolstack setting such entries before pae_extended_cr3
> is set?

To be honest - I didn't look. As said in the longer reply to Andrew, I
think it is more logical this way (the page table root already being
validated as an L3 table when vCPU 0 is inititalized, which includes
setting its CR3). Hence even if right now the order was the other way
around (which I doubt it is), I wouldn't want to make impossible to
restore the original ordering again.

Jan

Re: [PATCH] x86/PV32: restore PAE-extended-CR3 logic

Posted by Roger Pau Monné 1 year ago

On Tue, Apr 04, 2023 at 04:24:16PM +0200, Jan Beulich wrote:
> On 04.04.2023 13:41, Roger Pau Monné wrote:
> > On Tue, Apr 04, 2023 at 12:31:31PM +0200, Jan Beulich wrote:
> >> On 04.04.2023 12:12, Roger Pau Monné wrote:
> >>> On Wed, Feb 15, 2023 at 03:54:11PM +0100, Jan Beulich wrote:
> >>>> While the PAE-extended-CR3 VM assist is a 32-bit only concept, it still
> >>>> applies to guests also when run on a 64-bit hypervisor: The "extended
> >>>> CR3" format has to be used there as well, to fit the address in the only
> >>>> 32-bit wide register there. As a result it was a mistake that the check
> >>>> was never enabled for that case, and was then mistakenly deleted in the
> >>>> course of removal of 32-bit-Xen code (218adf199e68 ["x86: We can assume
> >>>> CONFIG_PAGING_LEVELS==4"]).
> >>>>
> >>>> Similarly during Dom0 construction kernel awareness needs to be taken
> >>>> into account, and respective code was again mistakenly never enabled for
> >>>> 32-bit Dom0 when running on 64-bit Xen (and thus wrongly deleted by
> >>>> 5d1181a5ea5e ["xen: Remove x86_32 build target"]).
> >>>>
> >>>> At the same time restrict enabling of the assist for Dom0 to just the
> >>>> 32-bit case. Furthermore there's no need for an atomic update there.
> >>>>
> >>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >>>> ---
> >>>> I was uncertain whether to add a check to the CR3 guest read path,
> >>>> raising e.g. #GP(0) when the value read wouldn't fit but also may not
> >>>> be converted to "extended" format (overflow is possible there in
> >>>> principle because of the control tools "slack" in promote_l3_table()).
> >>>>
> >>>> In that context I was puzzled to find no check on the CR3 guest write
> >>>> path even in 4.2: A guest (bogusly) setting the PCD or PWT bits (or any
> >>>> of the low reserved ones) could observe anomalous behavior rather than
> >>>> plain failure.
> >>>>
> >>>> As to a Fixes: tag - it's pretty unclear which of the many original
> >>>> 32-on-64 changes to blame. I don't think the two cited commits should
> >>>> be referenced there, as they didn't break anything that wasn't already
> >>>> broken.
> >>>>
> >>>> --- a/xen/arch/x86/mm.c
> >>>> +++ b/xen/arch/x86/mm.c
> >>>> @@ -1520,6 +1520,23 @@ static int promote_l3_table(struct page_
> >>>>      unsigned int   partial_flags = page->partial_flags;
> >>>>      l3_pgentry_t   l3e = l3e_empty();
> >>>>  
> >>>> +    /*
> >>>> +     * PAE pgdirs above 4GB are unacceptable if a 32-bit guest does not
> >>>> +     * understand the weird 'extended cr3' format for dealing with high-order
> >>>> +     * address bits. We cut some slack for control tools (before vcpu0 is
> >>>> +     * initialised).
> >>>
> >>> Don't we then need some check in the vCPU init path to assure that the
> >>> cr3 is < 32bits if we allow those to initially be set?
> >>>
> >>> Or will the initialization unconditionally overwrite any previous cr3
> >>> value?
> >>
> >> That's not the way I understand this "cut some slack". Instead I read it
> >> to be meant to cover for the VM-assist bit not being set, yet. Beyond
> >> that it is assumed to be tool stack's responsibility to constrain
> >> addresses suitably. If it doesn't, it'll simply break the guest. (There
> >> is some guessing on my part involved here, as the original introduction
> >> of that code didn't further explain things.)
> > 
> > If it's just the guest that's broken I would think it's fine.  As long
> > as such mismatch doesn't cause issues in the hypervisor internal state.
> > 
> > Did you see a toolstack setting such entries before pae_extended_cr3
> > is set?
> 
> To be honest - I didn't look. As said in the longer reply to Andrew, I
> think it is more logical this way (the page table root already being
> validated as an L3 table when vCPU 0 is inititalized, which includes
> setting its CR3). Hence even if right now the order was the other way
> around (which I doubt it is), I wouldn't want to make impossible to
> restore the original ordering again.

IMO I think it would be better if we could already report error at
domain creation time if the toolstack is attempting to create a domain
that the hypervisor knows is not going to work properly, rather than
allowing it and the guest failing in maybe non obvious ways.

It seems to me however that we would need to fix xc_dom_boot_image()
in order to setup the vCPU before creating the initial page-tables.
(->setup_pgtables() hook being called before ->vcpu() hook)

So I don't think this is strictly worse than what we have, but it
would also be nice to get things sorted out so the ability of the
toolstack to shot its own foot is limited.

Thanks, Roger.

Re: [PATCH] x86/PV32: restore PAE-extended-CR3 logic

Posted by Roger Pau Monné 1 year ago

On Tue, Apr 04, 2023 at 05:57:11PM +0200, Roger Pau Monné wrote:
> On Tue, Apr 04, 2023 at 04:24:16PM +0200, Jan Beulich wrote:
> > On 04.04.2023 13:41, Roger Pau Monné wrote:
> > > On Tue, Apr 04, 2023 at 12:31:31PM +0200, Jan Beulich wrote:
> > >> On 04.04.2023 12:12, Roger Pau Monné wrote:
> > >>> On Wed, Feb 15, 2023 at 03:54:11PM +0100, Jan Beulich wrote:
> > >>>> While the PAE-extended-CR3 VM assist is a 32-bit only concept, it still
> > >>>> applies to guests also when run on a 64-bit hypervisor: The "extended
> > >>>> CR3" format has to be used there as well, to fit the address in the only
> > >>>> 32-bit wide register there. As a result it was a mistake that the check
> > >>>> was never enabled for that case, and was then mistakenly deleted in the
> > >>>> course of removal of 32-bit-Xen code (218adf199e68 ["x86: We can assume
> > >>>> CONFIG_PAGING_LEVELS==4"]).
> > >>>>
> > >>>> Similarly during Dom0 construction kernel awareness needs to be taken
> > >>>> into account, and respective code was again mistakenly never enabled for
> > >>>> 32-bit Dom0 when running on 64-bit Xen (and thus wrongly deleted by
> > >>>> 5d1181a5ea5e ["xen: Remove x86_32 build target"]).
> > >>>>
> > >>>> At the same time restrict enabling of the assist for Dom0 to just the
> > >>>> 32-bit case. Furthermore there's no need for an atomic update there.
> > >>>>
> > >>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > >>>> ---
> > >>>> I was uncertain whether to add a check to the CR3 guest read path,
> > >>>> raising e.g. #GP(0) when the value read wouldn't fit but also may not
> > >>>> be converted to "extended" format (overflow is possible there in
> > >>>> principle because of the control tools "slack" in promote_l3_table()).
> > >>>>
> > >>>> In that context I was puzzled to find no check on the CR3 guest write
> > >>>> path even in 4.2: A guest (bogusly) setting the PCD or PWT bits (or any
> > >>>> of the low reserved ones) could observe anomalous behavior rather than
> > >>>> plain failure.
> > >>>>
> > >>>> As to a Fixes: tag - it's pretty unclear which of the many original
> > >>>> 32-on-64 changes to blame. I don't think the two cited commits should
> > >>>> be referenced there, as they didn't break anything that wasn't already
> > >>>> broken.
> > >>>>
> > >>>> --- a/xen/arch/x86/mm.c
> > >>>> +++ b/xen/arch/x86/mm.c
> > >>>> @@ -1520,6 +1520,23 @@ static int promote_l3_table(struct page_
> > >>>>      unsigned int   partial_flags = page->partial_flags;
> > >>>>      l3_pgentry_t   l3e = l3e_empty();
> > >>>>  
> > >>>> +    /*
> > >>>> +     * PAE pgdirs above 4GB are unacceptable if a 32-bit guest does not
> > >>>> +     * understand the weird 'extended cr3' format for dealing with high-order
> > >>>> +     * address bits. We cut some slack for control tools (before vcpu0 is
> > >>>> +     * initialised).
> > >>>
> > >>> Don't we then need some check in the vCPU init path to assure that the
> > >>> cr3 is < 32bits if we allow those to initially be set?
> > >>>
> > >>> Or will the initialization unconditionally overwrite any previous cr3
> > >>> value?
> > >>
> > >> That's not the way I understand this "cut some slack". Instead I read it
> > >> to be meant to cover for the VM-assist bit not being set, yet. Beyond
> > >> that it is assumed to be tool stack's responsibility to constrain
> > >> addresses suitably. If it doesn't, it'll simply break the guest. (There
> > >> is some guessing on my part involved here, as the original introduction
> > >> of that code didn't further explain things.)
> > > 
> > > If it's just the guest that's broken I would think it's fine.  As long
> > > as such mismatch doesn't cause issues in the hypervisor internal state.
> > > 
> > > Did you see a toolstack setting such entries before pae_extended_cr3
> > > is set?
> > 
> > To be honest - I didn't look. As said in the longer reply to Andrew, I
> > think it is more logical this way (the page table root already being
> > validated as an L3 table when vCPU 0 is inititalized, which includes
> > setting its CR3). Hence even if right now the order was the other way
> > around (which I doubt it is), I wouldn't want to make impossible to
> > restore the original ordering again.
> 
> IMO I think it would be better if we could already report error at
> domain creation time if the toolstack is attempting to create a domain
> that the hypervisor knows is not going to work properly, rather than
> allowing it and the guest failing in maybe non obvious ways.
> 
> It seems to me however that we would need to fix xc_dom_boot_image()
> in order to setup the vCPU before creating the initial page-tables.
> (->setup_pgtables() hook being called before ->vcpu() hook)
> 
> So I don't think this is strictly worse than what we have, but it
> would also be nice to get things sorted out so the ability of the
> toolstack to shot its own foot is limited.

Maybe I'm confused after all day, but isn't the hypercall used by the
toolstack to set CR3 the same one used to set the vm_assist bits?
(XEN_DOMCTL_setvcpucontext)

At which point we just need to make sure d->vm_assist gets set before
attempting to load the new CR3 (seems that way from the quick look
I've given at arch_set_info_guest()).

And so there should be no need to give extra slack to toolstack
operations.

Thanks, Roger.

Re: [PATCH] x86/PV32: restore PAE-extended-CR3 logic

Posted by Jan Beulich 1 year ago

On 04.04.2023 18:38, Roger Pau Monné wrote:
> On Tue, Apr 04, 2023 at 05:57:11PM +0200, Roger Pau Monné wrote:
>> On Tue, Apr 04, 2023 at 04:24:16PM +0200, Jan Beulich wrote:
>>> On 04.04.2023 13:41, Roger Pau Monné wrote:
>>>> On Tue, Apr 04, 2023 at 12:31:31PM +0200, Jan Beulich wrote:
>>>>> On 04.04.2023 12:12, Roger Pau Monné wrote:
>>>>>> On Wed, Feb 15, 2023 at 03:54:11PM +0100, Jan Beulich wrote:
>>>>>>> While the PAE-extended-CR3 VM assist is a 32-bit only concept, it still
>>>>>>> applies to guests also when run on a 64-bit hypervisor: The "extended
>>>>>>> CR3" format has to be used there as well, to fit the address in the only
>>>>>>> 32-bit wide register there. As a result it was a mistake that the check
>>>>>>> was never enabled for that case, and was then mistakenly deleted in the
>>>>>>> course of removal of 32-bit-Xen code (218adf199e68 ["x86: We can assume
>>>>>>> CONFIG_PAGING_LEVELS==4"]).
>>>>>>>
>>>>>>> Similarly during Dom0 construction kernel awareness needs to be taken
>>>>>>> into account, and respective code was again mistakenly never enabled for
>>>>>>> 32-bit Dom0 when running on 64-bit Xen (and thus wrongly deleted by
>>>>>>> 5d1181a5ea5e ["xen: Remove x86_32 build target"]).
>>>>>>>
>>>>>>> At the same time restrict enabling of the assist for Dom0 to just the
>>>>>>> 32-bit case. Furthermore there's no need for an atomic update there.
>>>>>>>
>>>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>>>>> ---
>>>>>>> I was uncertain whether to add a check to the CR3 guest read path,
>>>>>>> raising e.g. #GP(0) when the value read wouldn't fit but also may not
>>>>>>> be converted to "extended" format (overflow is possible there in
>>>>>>> principle because of the control tools "slack" in promote_l3_table()).
>>>>>>>
>>>>>>> In that context I was puzzled to find no check on the CR3 guest write
>>>>>>> path even in 4.2: A guest (bogusly) setting the PCD or PWT bits (or any
>>>>>>> of the low reserved ones) could observe anomalous behavior rather than
>>>>>>> plain failure.
>>>>>>>
>>>>>>> As to a Fixes: tag - it's pretty unclear which of the many original
>>>>>>> 32-on-64 changes to blame. I don't think the two cited commits should
>>>>>>> be referenced there, as they didn't break anything that wasn't already
>>>>>>> broken.
>>>>>>>
>>>>>>> --- a/xen/arch/x86/mm.c
>>>>>>> +++ b/xen/arch/x86/mm.c
>>>>>>> @@ -1520,6 +1520,23 @@ static int promote_l3_table(struct page_
>>>>>>>      unsigned int   partial_flags = page->partial_flags;
>>>>>>>      l3_pgentry_t   l3e = l3e_empty();
>>>>>>>  
>>>>>>> +    /*
>>>>>>> +     * PAE pgdirs above 4GB are unacceptable if a 32-bit guest does not
>>>>>>> +     * understand the weird 'extended cr3' format for dealing with high-order
>>>>>>> +     * address bits. We cut some slack for control tools (before vcpu0 is
>>>>>>> +     * initialised).
>>>>>>
>>>>>> Don't we then need some check in the vCPU init path to assure that the
>>>>>> cr3 is < 32bits if we allow those to initially be set?
>>>>>>
>>>>>> Or will the initialization unconditionally overwrite any previous cr3
>>>>>> value?
>>>>>
>>>>> That's not the way I understand this "cut some slack". Instead I read it
>>>>> to be meant to cover for the VM-assist bit not being set, yet. Beyond
>>>>> that it is assumed to be tool stack's responsibility to constrain
>>>>> addresses suitably. If it doesn't, it'll simply break the guest. (There
>>>>> is some guessing on my part involved here, as the original introduction
>>>>> of that code didn't further explain things.)
>>>>
>>>> If it's just the guest that's broken I would think it's fine.  As long
>>>> as such mismatch doesn't cause issues in the hypervisor internal state.
>>>>
>>>> Did you see a toolstack setting such entries before pae_extended_cr3
>>>> is set?
>>>
>>> To be honest - I didn't look. As said in the longer reply to Andrew, I
>>> think it is more logical this way (the page table root already being
>>> validated as an L3 table when vCPU 0 is inititalized, which includes
>>> setting its CR3). Hence even if right now the order was the other way
>>> around (which I doubt it is), I wouldn't want to make impossible to
>>> restore the original ordering again.
>>
>> IMO I think it would be better if we could already report error at
>> domain creation time if the toolstack is attempting to create a domain
>> that the hypervisor knows is not going to work properly, rather than
>> allowing it and the guest failing in maybe non obvious ways.
>>
>> It seems to me however that we would need to fix xc_dom_boot_image()
>> in order to setup the vCPU before creating the initial page-tables.
>> (->setup_pgtables() hook being called before ->vcpu() hook)

This might be a possibility, yes, but it's (imo severe) scope creep
in the context here. All I'm after is to restore code which was
delete in error (and which was, when it was still there, not
properly put in use in all cases it would have been needed).

>> So I don't think this is strictly worse than what we have, but it
>> would also be nice to get things sorted out so the ability of the
>> toolstack to shot its own foot is limited.
> 
> Maybe I'm confused after all day, but isn't the hypercall used by the
> toolstack to set CR3 the same one used to set the vm_assist bits?
> (XEN_DOMCTL_setvcpucontext)
> 
> At which point we just need to make sure d->vm_assist gets set before
> attempting to load the new CR3 (seems that way from the quick look
> I've given at arch_set_info_guest()).

Right, it is this way already. But that's not the point.
MMUEXT_PIN_L3_TABLE may (will?) come ahead of this.

Jan

Re: [PATCH] x86/PV32: restore PAE-extended-CR3 logic

Posted by Roger Pau Monné 1 year ago

On Wed, Apr 05, 2023 at 10:11:30AM +0200, Jan Beulich wrote:
> On 04.04.2023 18:38, Roger Pau Monné wrote:
> > On Tue, Apr 04, 2023 at 05:57:11PM +0200, Roger Pau Monné wrote:
> >> On Tue, Apr 04, 2023 at 04:24:16PM +0200, Jan Beulich wrote:
> >>> On 04.04.2023 13:41, Roger Pau Monné wrote:
> >>>> On Tue, Apr 04, 2023 at 12:31:31PM +0200, Jan Beulich wrote:
> >>>>> On 04.04.2023 12:12, Roger Pau Monné wrote:
> >>>>>> On Wed, Feb 15, 2023 at 03:54:11PM +0100, Jan Beulich wrote:
> >>>>>>> While the PAE-extended-CR3 VM assist is a 32-bit only concept, it still
> >>>>>>> applies to guests also when run on a 64-bit hypervisor: The "extended
> >>>>>>> CR3" format has to be used there as well, to fit the address in the only
> >>>>>>> 32-bit wide register there. As a result it was a mistake that the check
> >>>>>>> was never enabled for that case, and was then mistakenly deleted in the
> >>>>>>> course of removal of 32-bit-Xen code (218adf199e68 ["x86: We can assume
> >>>>>>> CONFIG_PAGING_LEVELS==4"]).
> >>>>>>>
> >>>>>>> Similarly during Dom0 construction kernel awareness needs to be taken
> >>>>>>> into account, and respective code was again mistakenly never enabled for
> >>>>>>> 32-bit Dom0 when running on 64-bit Xen (and thus wrongly deleted by
> >>>>>>> 5d1181a5ea5e ["xen: Remove x86_32 build target"]).
> >>>>>>>
> >>>>>>> At the same time restrict enabling of the assist for Dom0 to just the
> >>>>>>> 32-bit case. Furthermore there's no need for an atomic update there.
> >>>>>>>
> >>>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >>>>>>> ---
> >>>>>>> I was uncertain whether to add a check to the CR3 guest read path,
> >>>>>>> raising e.g. #GP(0) when the value read wouldn't fit but also may not
> >>>>>>> be converted to "extended" format (overflow is possible there in
> >>>>>>> principle because of the control tools "slack" in promote_l3_table()).
> >>>>>>>
> >>>>>>> In that context I was puzzled to find no check on the CR3 guest write
> >>>>>>> path even in 4.2: A guest (bogusly) setting the PCD or PWT bits (or any
> >>>>>>> of the low reserved ones) could observe anomalous behavior rather than
> >>>>>>> plain failure.
> >>>>>>>
> >>>>>>> As to a Fixes: tag - it's pretty unclear which of the many original
> >>>>>>> 32-on-64 changes to blame. I don't think the two cited commits should
> >>>>>>> be referenced there, as they didn't break anything that wasn't already
> >>>>>>> broken.
> >>>>>>>
> >>>>>>> --- a/xen/arch/x86/mm.c
> >>>>>>> +++ b/xen/arch/x86/mm.c
> >>>>>>> @@ -1520,6 +1520,23 @@ static int promote_l3_table(struct page_
> >>>>>>>      unsigned int   partial_flags = page->partial_flags;
> >>>>>>>      l3_pgentry_t   l3e = l3e_empty();
> >>>>>>>  
> >>>>>>> +    /*
> >>>>>>> +     * PAE pgdirs above 4GB are unacceptable if a 32-bit guest does not
> >>>>>>> +     * understand the weird 'extended cr3' format for dealing with high-order
> >>>>>>> +     * address bits. We cut some slack for control tools (before vcpu0 is
> >>>>>>> +     * initialised).
> >>>>>>
> >>>>>> Don't we then need some check in the vCPU init path to assure that the
> >>>>>> cr3 is < 32bits if we allow those to initially be set?
> >>>>>>
> >>>>>> Or will the initialization unconditionally overwrite any previous cr3
> >>>>>> value?
> >>>>>
> >>>>> That's not the way I understand this "cut some slack". Instead I read it
> >>>>> to be meant to cover for the VM-assist bit not being set, yet. Beyond
> >>>>> that it is assumed to be tool stack's responsibility to constrain
> >>>>> addresses suitably. If it doesn't, it'll simply break the guest. (There
> >>>>> is some guessing on my part involved here, as the original introduction
> >>>>> of that code didn't further explain things.)
> >>>>
> >>>> If it's just the guest that's broken I would think it's fine.  As long
> >>>> as such mismatch doesn't cause issues in the hypervisor internal state.
> >>>>
> >>>> Did you see a toolstack setting such entries before pae_extended_cr3
> >>>> is set?
> >>>
> >>> To be honest - I didn't look. As said in the longer reply to Andrew, I
> >>> think it is more logical this way (the page table root already being
> >>> validated as an L3 table when vCPU 0 is inititalized, which includes
> >>> setting its CR3). Hence even if right now the order was the other way
> >>> around (which I doubt it is), I wouldn't want to make impossible to
> >>> restore the original ordering again.
> >>
> >> IMO I think it would be better if we could already report error at
> >> domain creation time if the toolstack is attempting to create a domain
> >> that the hypervisor knows is not going to work properly, rather than
> >> allowing it and the guest failing in maybe non obvious ways.
> >>
> >> It seems to me however that we would need to fix xc_dom_boot_image()
> >> in order to setup the vCPU before creating the initial page-tables.
> >> (->setup_pgtables() hook being called before ->vcpu() hook)
> 
> This might be a possibility, yes, but it's (imo severe) scope creep
> in the context here. All I'm after is to restore code which was
> delete in error (and which was, when it was still there, not
> properly put in use in all cases it would have been needed).

I realize the check was wrongly removed in 218adf199e68, so I guess
it's fine to restore it, albeit it would be better if we could remove
the weirdness with setting up page tables before the hypervisor has
the full information about the domain capabilities.

> >> So I don't think this is strictly worse than what we have, but it
> >> would also be nice to get things sorted out so the ability of the
> >> toolstack to shot its own foot is limited.
> > 
> > Maybe I'm confused after all day, but isn't the hypercall used by the
> > toolstack to set CR3 the same one used to set the vm_assist bits?
> > (XEN_DOMCTL_setvcpucontext)
> > 
> > At which point we just need to make sure d->vm_assist gets set before
> > attempting to load the new CR3 (seems that way from the quick look
> > I've given at arch_set_info_guest()).
> 
> Right, it is this way already. But that's not the point.
> MMUEXT_PIN_L3_TABLE may (will?) come ahead of this.

Oh, I see.

We should likely move the setting of vm_assist to the domain create
hypercall, instead of doing it at vCPU initialization.

Thanks, Roger.

Re: [PATCH] x86/PV32: restore PAE-extended-CR3 logic

Posted by Jan Beulich 1 year ago

On 05.04.2023 10:59, Roger Pau Monné wrote:
> We should likely move the setting of vm_assist to the domain create
> hypercall, instead of doing it at vCPU initialization.

Perhaps, the more that setting the assist is limited to vCPU 0 init.
Which in a way makes sense when considering domain creation, but it
is odd for the case of vCPU 0 being brought down, reset, and then
re-initialized; IOW I think arch_set_info_guest() should further
have constrained the setting by a !d->creation_finished check.

Jan

Re: [PATCH] x86/PV32: restore PAE-extended-CR3 logic

Posted by Roger Pau Monné 1 year ago

On Wed, Apr 05, 2023 at 11:29:51AM +0200, Jan Beulich wrote:
> On 05.04.2023 10:59, Roger Pau Monné wrote:
> > We should likely move the setting of vm_assist to the domain create
> > hypercall, instead of doing it at vCPU initialization.
> 
> Perhaps, the more that setting the assist is limited to vCPU 0 init.
> Which in a way makes sense when considering domain creation, but it
> is odd for the case of vCPU 0 being brought down, reset, and then
> re-initialized; IOW I think arch_set_info_guest() should further
> have constrained the setting by a !d->creation_finished check.

Maybe, but still the right fix IMO would be to move this into
domain_create.  We could add the !d->creation_finished check but that
feels more like a bodge than a proper solution.

Restoring the previous check is better than nothing, but it would be
nice if long term we could get rid of the vpcu related conditions.

Thanks, Roger.