[PATCH v2] x86/domain: adjust limitation on shared_info allocation below 4G

Roger Pau Monne posted 1 patch 2 days, 17 hours ago
[PATCH v2] x86/domain: adjust limitation on shared_info allocation below 4G
Posted by Roger Pau Monne 2 days, 17 hours ago
The limitation of shared_info being allocated below 4G to fit in the
start_info field only applies to 32bit PV guests.  On 64bit PV guests the
start_info field is 64bits wide.  HVM guests don't use start_info at all.

Drop the restriction in arch_domain_create() and instead free and
re-allocate the page from memory below 4G if needed in switch_compat(),
when the guest is set to run in 32bit PV mode.

Fixes: 3cadc0469d5c ("x86_64: shared_info must be allocated below 4GB as it is advertised to 32-bit guests via a 32-bit machine address field in start_info.")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/domain.c    |  7 ++++---
 xen/arch/x86/pv/domain.c | 20 ++++++++++++++++++++
 xen/common/domain.c      |  2 +-
 xen/include/xen/domain.h |  2 ++
 4 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index edb76366b596..3e701f2146c9 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -882,10 +882,11 @@ int arch_domain_create(struct domain *d,
         goto fail;
 
     /*
-     * The shared_info machine address must fit in a 32-bit field within a
-     * 32-bit guest's start_info structure. Hence we specify MEMF_bits(32).
+     * For 32bit PV guests the shared_info machine address must fit in a 32-bit
+     * field within the guest's start_info structure.  We might need to free
+     * and allocate later if the guest turns out to be a 32bit PV one.
      */
-    if ( (d->shared_info = alloc_xenheap_pages(0, MEMF_bits(32))) == NULL )
+    if ( (d->shared_info = alloc_xenheap_page()) == NULL )
         goto fail;
 
     clear_page(d->shared_info);
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 01499582d2d6..8ced3d70a52f 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -247,6 +247,26 @@ int switch_compat(struct domain *d)
     d->arch.has_32bit_shinfo = 1;
     d->arch.pv.is_32bit = true;
 
+    /* Check whether the shared_info page needs to be moved below 4G. */
+    if ( virt_to_maddr(d->shared_info) >> 32 )
+    {
+        shared_info_t *prev = d->shared_info;
+
+        d->shared_info = alloc_xenheap_pages(0, MEMF_bits(32));
+        if ( !d->shared_info )
+        {
+            d->shared_info = prev;
+            rc = -ENOMEM;
+            goto undo_and_fail;
+        }
+        put_page(virt_to_page(prev));
+        clear_page(d->shared_info);
+        share_xen_page_with_guest(virt_to_page(d->shared_info), d, SHARE_rw);
+        /* Ensure all references to the old shared_info page are dropped. */
+        for_each_vcpu( d, v )
+            vcpu_info_reset(v);
+    }
+
     for_each_vcpu( d, v )
     {
         if ( (rc = setup_compat_arg_xlat(v)) ||
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 376351b528c9..11fed22455b9 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -305,7 +305,7 @@ static void vcpu_check_shutdown(struct vcpu *v)
     spin_unlock(&d->shutdown_lock);
 }
 
-static void vcpu_info_reset(struct vcpu *v)
+void vcpu_info_reset(struct vcpu *v)
 {
     struct domain *d = v->domain;
 
diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h
index 273717c31b3f..52cdf012c491 100644
--- a/xen/include/xen/domain.h
+++ b/xen/include/xen/domain.h
@@ -83,6 +83,8 @@ void cf_check free_pirq_struct(void *ptr);
 int  arch_vcpu_create(struct vcpu *v);
 void arch_vcpu_destroy(struct vcpu *v);
 
+void vcpu_info_reset(struct vcpu *v);
+
 int map_guest_area(struct vcpu *v, paddr_t gaddr, unsigned int size,
                    struct guest_area *area,
                    void (*populate)(void *dst, struct vcpu *v));
-- 
2.51.0


Re: [PATCH v2] x86/domain: adjust limitation on shared_info allocation below 4G
Posted by Andrew Cooper 2 days, 16 hours ago
On 04/02/2026 12:25 pm, Roger Pau Monne wrote:
> The limitation of shared_info being allocated below 4G to fit in the
> start_info field only applies to 32bit PV guests.  On 64bit PV guests the
> start_info field is 64bits wide.  HVM guests don't use start_info at all.

All shared info?  HVM does use it, but doesn't see the MFN.

>
> Drop the restriction in arch_domain_create() and instead free and
> re-allocate the page from memory below 4G if needed in switch_compat(),
> when the guest is set to run in 32bit PV mode.
>
> Fixes: 3cadc0469d5c ("x86_64: shared_info must be allocated below 4GB as it is advertised to 32-bit guests via a 32-bit machine address field in start_info.")
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  xen/arch/x86/domain.c    |  7 ++++---
>  xen/arch/x86/pv/domain.c | 20 ++++++++++++++++++++
>  xen/common/domain.c      |  2 +-
>  xen/include/xen/domain.h |  2 ++
>  4 files changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index edb76366b596..3e701f2146c9 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -882,10 +882,11 @@ int arch_domain_create(struct domain *d,
>          goto fail;
>  
>      /*
> -     * The shared_info machine address must fit in a 32-bit field within a
> -     * 32-bit guest's start_info structure. Hence we specify MEMF_bits(32).
> +     * For 32bit PV guests the shared_info machine address must fit in a 32-bit
> +     * field within the guest's start_info structure.  We might need to free
> +     * and allocate later if the guest turns out to be a 32bit PV one.
>       */
> -    if ( (d->shared_info = alloc_xenheap_pages(0, MEMF_bits(32))) == NULL )
> +    if ( (d->shared_info = alloc_xenheap_page()) == NULL )
>          goto fail;
>  

The comment is now out of place when the source is read naturally.  I'd
suggest dropping the comment entirely, and ...

>      clear_page(d->shared_info);
> diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
> index 01499582d2d6..8ced3d70a52f 100644
> --- a/xen/arch/x86/pv/domain.c
> +++ b/xen/arch/x86/pv/domain.c
> @@ -247,6 +247,26 @@ int switch_compat(struct domain *d)
>      d->arch.has_32bit_shinfo = 1;
>      d->arch.pv.is_32bit = true;
>  
> +    /* Check whether the shared_info page needs to be moved below 4G. */

... extending this one talking about the 32bit field.

> +    if ( virt_to_maddr(d->shared_info) >> 32 )
> +    {
> +        shared_info_t *prev = d->shared_info;
> +
> +        d->shared_info = alloc_xenheap_pages(0, MEMF_bits(32));
> +        if ( !d->shared_info )
> +        {
> +            d->shared_info = prev;
> +            rc = -ENOMEM;
> +            goto undo_and_fail;
> +        }
> +        put_page(virt_to_page(prev));
> +        clear_page(d->shared_info);

I think copy_page() would be more appropriate.  That way there are fewer
implicit ordering dependencies.

> +        share_xen_page_with_guest(virt_to_page(d->shared_info), d, SHARE_rw);
> +        /* Ensure all references to the old shared_info page are dropped. */
> +        for_each_vcpu( d, v )
> +            vcpu_info_reset(v);

switch_compat() can only occur on a domain with no memory.  How can we
have outstanding references?

~Andrew

Re: [PATCH v2] x86/domain: adjust limitation on shared_info allocation below 4G
Posted by Roger Pau Monné 2 days, 15 hours ago
On Wed, Feb 04, 2026 at 03:20:09PM +0100, Andrew Cooper wrote:
> On 04/02/2026 12:25 pm, Roger Pau Monne wrote:
> > The limitation of shared_info being allocated below 4G to fit in the
> > start_info field only applies to 32bit PV guests.  On 64bit PV guests the
> > start_info field is 64bits wide.  HVM guests don't use start_info at all.
> 
> All shared info?  HVM does use it, but doesn't see the MFN.

HVM guests use shared_info, but not start_info.  The issue is not with
shared_info itself, but with the size of the field used to store the
shared_info address in start_info.

HVM doesn't use start_info, and hence doesn't care about the position
of shared_info in that regard.

> >
> > Drop the restriction in arch_domain_create() and instead free and
> > re-allocate the page from memory below 4G if needed in switch_compat(),
> > when the guest is set to run in 32bit PV mode.
> >
> > Fixes: 3cadc0469d5c ("x86_64: shared_info must be allocated below 4GB as it is advertised to 32-bit guests via a 32-bit machine address field in start_info.")
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> >  xen/arch/x86/domain.c    |  7 ++++---
> >  xen/arch/x86/pv/domain.c | 20 ++++++++++++++++++++
> >  xen/common/domain.c      |  2 +-
> >  xen/include/xen/domain.h |  2 ++
> >  4 files changed, 27 insertions(+), 4 deletions(-)
> >
> > diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> > index edb76366b596..3e701f2146c9 100644
> > --- a/xen/arch/x86/domain.c
> > +++ b/xen/arch/x86/domain.c
> > @@ -882,10 +882,11 @@ int arch_domain_create(struct domain *d,
> >          goto fail;
> >  
> >      /*
> > -     * The shared_info machine address must fit in a 32-bit field within a
> > -     * 32-bit guest's start_info structure. Hence we specify MEMF_bits(32).
> > +     * For 32bit PV guests the shared_info machine address must fit in a 32-bit
> > +     * field within the guest's start_info structure.  We might need to free
> > +     * and allocate later if the guest turns out to be a 32bit PV one.
> >       */
> > -    if ( (d->shared_info = alloc_xenheap_pages(0, MEMF_bits(32))) == NULL )
> > +    if ( (d->shared_info = alloc_xenheap_page()) == NULL )
> >          goto fail;
> >  
> 
> The comment is now out of place when the source is read naturally.  I'd
> suggest dropping the comment entirely, and ...
> 
> >      clear_page(d->shared_info);
> > diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
> > index 01499582d2d6..8ced3d70a52f 100644
> > --- a/xen/arch/x86/pv/domain.c
> > +++ b/xen/arch/x86/pv/domain.c
> > @@ -247,6 +247,26 @@ int switch_compat(struct domain *d)
> >      d->arch.has_32bit_shinfo = 1;
> >      d->arch.pv.is_32bit = true;
> >  
> > +    /* Check whether the shared_info page needs to be moved below 4G. */
> 
> ... extending this one talking about the 32bit field.

Hm, yes, I've considered doing that.  Unless Jan objects I will move
the comment then, seeing as you also think it's best.

> > +    if ( virt_to_maddr(d->shared_info) >> 32 )
> > +    {
> > +        shared_info_t *prev = d->shared_info;
> > +
> > +        d->shared_info = alloc_xenheap_pages(0, MEMF_bits(32));
> > +        if ( !d->shared_info )
> > +        {
> > +            d->shared_info = prev;
> > +            rc = -ENOMEM;
> > +            goto undo_and_fail;
> > +        }
> > +        put_page(virt_to_page(prev));
> > +        clear_page(d->shared_info);
> 
> I think copy_page() would be more appropriate.  That way there are fewer
> implicit ordering dependencies.

Hm, I had a copy_page() initially, but I don't think it's appropriate
because (most of?) the fields will need translation.  I don't think
translation is feasible, but I could at least call
update_domain_wallclock_time(), which is what hvm_latch_shinfo_size()
does when changing bitness.

> > +        share_xen_page_with_guest(virt_to_page(d->shared_info), d, SHARE_rw);
> > +        /* Ensure all references to the old shared_info page are dropped. */
> > +        for_each_vcpu( d, v )
> > +            vcpu_info_reset(v);
> 
> switch_compat() can only occur on a domain with no memory.  How can we
> have outstanding references?

As Jan pointed out, it's not references, but stashed pointers to the
previous shared_info page.  I've used the wrong wording here.

Thanks, Roger.

Re: [PATCH v2] x86/domain: adjust limitation on shared_info allocation below 4G
Posted by Andrew Cooper 2 days, 15 hours ago
On 04/02/2026 3:01 pm, Roger Pau Monné wrote:
>>> +        share_xen_page_with_guest(virt_to_page(d->shared_info), d, SHARE_rw);
>>> +        /* Ensure all references to the old shared_info page are dropped. */
>>> +        for_each_vcpu( d, v )
>>> +            vcpu_info_reset(v);
>> switch_compat() can only occur on a domain with no memory.  How can we
>> have outstanding references?
> As Jan pointed out, it's not references, but stashed pointers to the
> previous shared_info page.  I've used the wrong wording here.

Yes, I saw that thread, but my question still stands.

How can there be any this early in the domain's lifecycle?

~Andrew

Re: [PATCH v2] x86/domain: adjust limitation on shared_info allocation below 4G
Posted by Jan Beulich 2 days, 14 hours ago
On 04.02.2026 16:12, Andrew Cooper wrote:
> On 04/02/2026 3:01 pm, Roger Pau Monné wrote:
>>>> +        share_xen_page_with_guest(virt_to_page(d->shared_info), d, SHARE_rw);
>>>> +        /* Ensure all references to the old shared_info page are dropped. */
>>>> +        for_each_vcpu( d, v )
>>>> +            vcpu_info_reset(v);
>>> switch_compat() can only occur on a domain with no memory.  How can we
>>> have outstanding references?
>> As Jan pointed out, it's not references, but stashed pointers to the
>> previous shared_info page.  I've used the wrong wording here.
> 
> Yes, I saw that thread, but my question still stands.
> 
> How can there be any this early in the domain's lifecycle?

Can't (aren't) vCPU-s added ahead of adding memory?

Jan

Re: [PATCH v2] x86/domain: adjust limitation on shared_info allocation below 4G
Posted by Roger Pau Monné 2 days, 13 hours ago
On Wed, Feb 04, 2026 at 04:32:25PM +0100, Jan Beulich wrote:
> On 04.02.2026 16:12, Andrew Cooper wrote:
> > On 04/02/2026 3:01 pm, Roger Pau Monné wrote:
> >>>> +        share_xen_page_with_guest(virt_to_page(d->shared_info), d, SHARE_rw);
> >>>> +        /* Ensure all references to the old shared_info page are dropped. */
> >>>> +        for_each_vcpu( d, v )
> >>>> +            vcpu_info_reset(v);
> >>> switch_compat() can only occur on a domain with no memory.  How can we
> >>> have outstanding references?
> >> As Jan pointed out, it's not references, but stashed pointers to the
> >> previous shared_info page.  I've used the wrong wording here.
> > 
> > Yes, I saw that thread, but my question still stands.
> > 
> > How can there be any this early in the domain's lifecycle?
> 
> Can't (aren't) vCPU-s added ahead of adding memory?

At least on x86 when using xl/libxl the call to
XEN_DOMCTL_set_address_size happens after the call to
XEN_DOMCTL_max_vcpus, and the latter calls vcpu_create(), which sets the
pointer into the shared_info page for legacy (< 32) vCPUs.

Even if we could invert those two calls, it's impossible to know what
other toolstacks might do.  I think we need to keep the
vcpu_info_reset() call.

Thanks, Roger.

Re: [PATCH v2] x86/domain: adjust limitation on shared_info allocation below 4G
Posted by Andrew Cooper 2 days, 12 hours ago
On 04/02/2026 4:54 pm, Roger Pau Monné wrote:
> On Wed, Feb 04, 2026 at 04:32:25PM +0100, Jan Beulich wrote:
>> On 04.02.2026 16:12, Andrew Cooper wrote:
>>> On 04/02/2026 3:01 pm, Roger Pau Monné wrote:
>>>>>> +        share_xen_page_with_guest(virt_to_page(d->shared_info), d, SHARE_rw);
>>>>>> +        /* Ensure all references to the old shared_info page are dropped. */
>>>>>> +        for_each_vcpu( d, v )
>>>>>> +            vcpu_info_reset(v);
>>>>> switch_compat() can only occur on a domain with no memory.  How can we
>>>>> have outstanding references?
>>>> As Jan pointed out, it's not references, but stashed pointers to the
>>>> previous shared_info page.  I've used the wrong wording here.
>>> Yes, I saw that thread, but my question still stands.
>>>
>>> How can there be any this early in the domain's lifecycle?
>> Can't (aren't) vCPU-s added ahead of adding memory?
> At least on x86 when using xl/libxl the call to
> XEN_DOMCTL_set_address_size happens after the call to
> XEN_DOMCTL_max_vcpus, and the latter calls vcpu_create(), which sets the
> pointer into the shared_info page for legacy (< 32) vCPUs.

Geez, that disaster of an interface still has spike traps we're falling
foul of.

Please extend the comment to note the first 32 vcpu compatibility case.

But, combined with the format change (clear vs copy page), doesn't it
mean there's an existing bug here?

Even without moving the shared_info page, aren't the cached pointers
made wrong by switch_compat() ?

This is one area where we've got no XTF testing at all, and I probably
ought to see about fixing that.

~Andrew

Re: [PATCH v2] x86/domain: adjust limitation on shared_info allocation below 4G
Posted by Roger Pau Monné 2 days, 12 hours ago
On Wed, Feb 04, 2026 at 05:23:21PM +0000, Andrew Cooper wrote:
> On 04/02/2026 4:54 pm, Roger Pau Monné wrote:
> > On Wed, Feb 04, 2026 at 04:32:25PM +0100, Jan Beulich wrote:
> >> On 04.02.2026 16:12, Andrew Cooper wrote:
> >>> On 04/02/2026 3:01 pm, Roger Pau Monné wrote:
> >>>>>> +        share_xen_page_with_guest(virt_to_page(d->shared_info), d, SHARE_rw);
> >>>>>> +        /* Ensure all references to the old shared_info page are dropped. */
> >>>>>> +        for_each_vcpu( d, v )
> >>>>>> +            vcpu_info_reset(v);
> >>>>> switch_compat() can only occur on a domain with no memory.  How can we
> >>>>> have outstanding references?
> >>>> As Jan pointed out, it's not references, but stashed pointers to the
> >>>> previous shared_info page.  I've used the wrong wording here.
> >>> Yes, I saw that thread, but my question still stands.
> >>>
> >>> How can there be any this early in the domain's lifecycle?
> >> Can't (aren't) vCPU-s added ahead of adding memory?
> > At least on x86 when using xl/libxl the call to
> > XEN_DOMCTL_set_address_size happens after the call to
> > XEN_DOMCTL_max_vcpus, and the latter calls vcpu_create(), which sets the
> > pointer into the shared_info page for legacy (< 32) vCPUs.
> 
> Geez, that disaster of an interface still has spike traps we're falling
> foul of.
> 
> Please extend the comment to note the first 32 vcpu compatibility case.
> 
> But, combined with the format change (clear vs copy page), doesn't it
> mean there's an existing bug here?
> 
> Even without moving the shared_info page, aren't the cached pointers
> made wrong by switch_compat() ?

No, they are not wrong, because the vcpu_info array is at the start of
shared_info and each element has the same size (64 bytes) on both 32bit
and 64bit.  Hence the offsets of the different vcpu_info array elements
are the same regardless of bitness.

> This is one area where we've got no XTF testing at all, and I probably
> ought to see about fixing that.

Yeah, it's not great.

Thanks, Roger.

Re: [PATCH v2] x86/domain: adjust limitation on shared_info allocation below 4G
Posted by Jan Beulich 2 days, 15 hours ago
On 04.02.2026 16:01, Roger Pau Monné wrote:
> On Wed, Feb 04, 2026 at 03:20:09PM +0100, Andrew Cooper wrote:
>> On 04/02/2026 12:25 pm, Roger Pau Monne wrote:
>>> --- a/xen/arch/x86/pv/domain.c
>>> +++ b/xen/arch/x86/pv/domain.c
>>> @@ -247,6 +247,26 @@ int switch_compat(struct domain *d)
>>>      d->arch.has_32bit_shinfo = 1;
>>>      d->arch.pv.is_32bit = true;
>>>  
>>> +    /* Check whether the shared_info page needs to be moved below 4G. */
>>
>> ... extending this one talking about the 32bit field.
> 
> Hm, yes, I've considered doing that.  Unless Jan objects I will move
> the comment then, seeing as you also think it's best.

Definitely no objection from me. My R-b stands with that movement.

Jan

Re: [PATCH v2] x86/domain: adjust limitation on shared_info allocation below 4G
Posted by Jan Beulich 2 days, 15 hours ago
On 04.02.2026 15:20, Andrew Cooper wrote:
> On 04/02/2026 12:25 pm, Roger Pau Monne wrote:
>> The limitation of shared_info being allocated below 4G to fit in the
>> start_info field only applies to 32bit PV guests.  On 64bit PV guests the
>> start_info field is 64bits wide.  HVM guests don't use start_info at all.
> 
> All shared info?  HVM does use it, but doesn't see the MFN.

I think it's really start_info that is meant here, as that's where the
restriction comes from.

>> +    if ( virt_to_maddr(d->shared_info) >> 32 )
>> +    {
>> +        shared_info_t *prev = d->shared_info;
>> +
>> +        d->shared_info = alloc_xenheap_pages(0, MEMF_bits(32));
>> +        if ( !d->shared_info )
>> +        {
>> +            d->shared_info = prev;
>> +            rc = -ENOMEM;
>> +            goto undo_and_fail;
>> +        }
>> +        put_page(virt_to_page(prev));
>> +        clear_page(d->shared_info);
> 
> I think copy_page() would be more appropriate.  That way there are fewer
> implicit ordering dependencies.

I'd strongly recommend against copy_page() here. If there was any data in
there that would need copying, it would need to be done field-wise, using
the compat xlat machinery. The layouts are different. It may be prudent to
assert that the original page is still completely zeroed.

Jan

Re: [PATCH v2] x86/domain: adjust limitation on shared_info allocation below 4G
Posted by Andrew Cooper 2 days, 15 hours ago
On 04/02/2026 2:40 pm, Jan Beulich wrote:
>>> +    if ( virt_to_maddr(d->shared_info) >> 32 )
>>> +    {
>>> +        shared_info_t *prev = d->shared_info;
>>> +
>>> +        d->shared_info = alloc_xenheap_pages(0, MEMF_bits(32));
>>> +        if ( !d->shared_info )
>>> +        {
>>> +            d->shared_info = prev;
>>> +            rc = -ENOMEM;
>>> +            goto undo_and_fail;
>>> +        }
>>> +        put_page(virt_to_page(prev));
>>> +        clear_page(d->shared_info);
>> I think copy_page() would be more appropriate.  That way there are fewer
>> implicit ordering dependencies.
> I'd strongly recommend against copy_page() here. If there was any data in
> there that would need copying, it would need to be done field-wise, using
> the compat xlat machinery. The layouts are different. It may be prudent to
> assert that the original page is still completely zeroed.

Oh, yes.  The layouts will be different.  Fine.

~Andrew

Re: [PATCH v2] x86/domain: adjust limitation on shared_info allocation below 4G
Posted by Jan Beulich 2 days, 16 hours ago
On 04.02.2026 13:25, Roger Pau Monne wrote:
> The limitation of shared_info being allocated below 4G to fit in the
> start_info field only applies to 32bit PV guests.  On 64bit PV guests the
> start_info field is 64bits wide.  HVM guests don't use start_info at all.
> 
> Drop the restriction in arch_domain_create() and instead free and
> re-allocate the page from memory below 4G if needed in switch_compat(),
> when the guest is set to run in 32bit PV mode.
> 
> Fixes: 3cadc0469d5c ("x86_64: shared_info must be allocated below 4GB as it is advertised to 32-bit guests via a 32-bit machine address field in start_info.")

The tag is here because there is the (largely theoretical?) possibility for
a system to have no memory at all left below 4Gb, in which case creation of
a PV64 or non-shadow HVM guest would needlessly fail?

> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>
with two nits:

> --- a/xen/arch/x86/pv/domain.c
> +++ b/xen/arch/x86/pv/domain.c
> @@ -247,6 +247,26 @@ int switch_compat(struct domain *d)
>      d->arch.has_32bit_shinfo = 1;
>      d->arch.pv.is_32bit = true;
>  
> +    /* Check whether the shared_info page needs to be moved below 4G. */
> +    if ( virt_to_maddr(d->shared_info) >> 32 )
> +    {
> +        shared_info_t *prev = d->shared_info;
> +
> +        d->shared_info = alloc_xenheap_pages(0, MEMF_bits(32));
> +        if ( !d->shared_info )
> +        {
> +            d->shared_info = prev;
> +            rc = -ENOMEM;
> +            goto undo_and_fail;
> +        }
> +        put_page(virt_to_page(prev));
> +        clear_page(d->shared_info);
> +        share_xen_page_with_guest(virt_to_page(d->shared_info), d, SHARE_rw);
> +        /* Ensure all references to the old shared_info page are dropped. */

"references to ... page are dropped" reads as if talk was of page references.
vcpu_info_reset() writes out pointers only, though. May I suggest

        /* Ensure all pointers to the old shared_info page are replaced. */

?

> +        for_each_vcpu( d, v )

A common oddity, where my personal take is that only one of two forms is
possible (without ./CODING_STYLE saying anything explicit): Either such a
for_each_... is deemed a (kind-of) keyword (like "for"), then it wants to be

        for_each_vcpu ( d, v )

or it is deemed not to be one, then it wants to be

        for_each_vcpu(d, v)

Jan

Re: [PATCH v2] x86/domain: adjust limitation on shared_info allocation below 4G
Posted by Roger Pau Monné 2 days, 15 hours ago
On Wed, Feb 04, 2026 at 03:06:52PM +0100, Jan Beulich wrote:
> On 04.02.2026 13:25, Roger Pau Monne wrote:
> > The limitation of shared_info being allocated below 4G to fit in the
> > start_info field only applies to 32bit PV guests.  On 64bit PV guests the
> > start_info field is 64bits wide.  HVM guests don't use start_info at all.
> > 
> > Drop the restriction in arch_domain_create() and instead free and
> > re-allocate the page from memory below 4G if needed in switch_compat(),
> > when the guest is set to run in 32bit PV mode.
> > 
> > Fixes: 3cadc0469d5c ("x86_64: shared_info must be allocated below 4GB as it is advertised to 32-bit guests via a 32-bit machine address field in start_info.")
> 
> The tag is here because there is the (largely theoretical?) possibility for
> a system to have no memory at all left below 4Gb, in which case creation of
> a PV64 or non-shadow HVM guest would needlessly fail?

It's kind of an issue we discovered when using strict domain NUMA node
placement.  At that point the toolstack would exhaust all memory on
node 0, and by doing that inadvertently consume all memory below 4G.

> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> with two nits:
> 
> > --- a/xen/arch/x86/pv/domain.c
> > +++ b/xen/arch/x86/pv/domain.c
> > @@ -247,6 +247,26 @@ int switch_compat(struct domain *d)
> >      d->arch.has_32bit_shinfo = 1;
> >      d->arch.pv.is_32bit = true;
> >  
> > +    /* Check whether the shared_info page needs to be moved below 4G. */
> > +    if ( virt_to_maddr(d->shared_info) >> 32 )
> > +    {
> > +        shared_info_t *prev = d->shared_info;
> > +
> > +        d->shared_info = alloc_xenheap_pages(0, MEMF_bits(32));
> > +        if ( !d->shared_info )
> > +        {
> > +            d->shared_info = prev;
> > +            rc = -ENOMEM;
> > +            goto undo_and_fail;
> > +        }
> > +        put_page(virt_to_page(prev));
> > +        clear_page(d->shared_info);
> > +        share_xen_page_with_guest(virt_to_page(d->shared_info), d, SHARE_rw);
> > +        /* Ensure all references to the old shared_info page are dropped. */
> 
> "references to ... page are dropped" reads as if talk was of page references.
> vcpu_info_reset() writes out pointers only, though. May I suggest
> 
>         /* Ensure all pointers to the old shared_info page are replaced. */

Fine with me.

> ?
> 
> > +        for_each_vcpu( d, v )
> 
> A common oddity, where my personal take is that only one of two forms are
> possible (without ./CODING_STYLE saying anything explicit): Either such a
> for_each_... is deemed a (kind-of) keyword (like "for"), then it wants to be
> 
>         for_each_vcpu ( d, v )

Oh right, I copied it from the instance below and didn't notice the
missing spaces.

Thanks, Roger.

Re: [PATCH v2] x86/domain: adjust limitation on shared_info allocation below 4G
Posted by Jan Beulich 2 days, 15 hours ago
On 04.02.2026 15:52, Roger Pau Monné wrote:
> On Wed, Feb 04, 2026 at 03:06:52PM +0100, Jan Beulich wrote:
>> On 04.02.2026 13:25, Roger Pau Monne wrote:
>>> The limitation of shared_info being allocated below 4G to fit in the
>>> start_info field only applies to 32bit PV guests.  On 64bit PV guests the
>>> start_info field is 64bits wide.  HVM guests don't use start_info at all.
>>>
>>> Drop the restriction in arch_domain_create() and instead free and
>>> re-allocate the page from memory below 4G if needed in switch_compat(),
>>> when the guest is set to run in 32bit PV mode.
>>>
>>> Fixes: 3cadc0469d5c ("x86_64: shared_info must be allocated below 4GB as it is advertised to 32-bit guests via a 32-bit machine address field in start_info.")
>>
>> The tag is here because there is the (largely theoretical?) possibility for
>> a system to have no memory at all left below 4Gb, in which case creation of
>> a PV64 or non-shadow HVM guest would needlessly fail?
> 
> It's kind of an issue we discovered when using strict domain NUMA node
> placement.  At that point the toolstack would exhaust all memory on
> node 0 and by doing that inadvertently consume all memory below 4G.

Right, and hence also my "memory: arrange to conserve on DMA reservation",
where I'm still fighting with myself as to what to do with the comments you
gave there.

Jan

Re: [PATCH v2] x86/domain: adjust limitation on shared_info allocation below 4G
Posted by Roger Pau Monné 2 days, 13 hours ago
On Wed, Feb 04, 2026 at 04:08:21PM +0100, Jan Beulich wrote:
> On 04.02.2026 15:52, Roger Pau Monné wrote:
> > On Wed, Feb 04, 2026 at 03:06:52PM +0100, Jan Beulich wrote:
> >> On 04.02.2026 13:25, Roger Pau Monne wrote:
> >>> The limitation of shared_info being allocated below 4G to fit in the
> >>> start_info field only applies to 32bit PV guests.  On 64bit PV guests the
> >>> start_info field is 64bits wide.  HVM guests don't use start_info at all.
> >>>
> >>> Drop the restriction in arch_domain_create() and instead free and
> >>> re-allocate the page from memory below 4G if needed in switch_compat(),
> >>> when the guest is set to run in 32bit PV mode.
> >>>
> >>> Fixes: 3cadc0469d5c ("x86_64: shared_info must be allocated below 4GB as it is advertised to 32-bit guests via a 32-bit machine address field in start_info.")
> >>
> >> The tag is here because there is the (largely theoretical?) possibility for
> >> a system to have no memory at all left below 4Gb, in which case creation of
> >> a PV64 or non-shadow HVM guest would needlessly fail?
> > 
> > It's kid of an issue we discovered when using strict domain NUMA node
> > placement.  At that point the toolstack would exhaust all memory on
> > node 0 and by doing that inadvertently consume all memory below 4G.
> 
> Right, and hence also my "memory: arrange to conserve on DMA reservation",
> where I'm still fighting with myself as to what to do with the comments you
> gave there.

Better fighting with yourself rather than fighting with me I guess ;).

That change would be controversial with what we currently do on
XenServer, because we don't (yet) special-case the memory below 4G so
as not to account for it in the per-node amount of free memory.

What would happen when you append the MEMF_no_dma flag as proposed in
your commit, but the caller is also passing MEMF_exact_node with
target node 0?  AFAICT the allocation would still refuse to use the
low 4G pool.

Your commit should also be expanded to avoid staking claims that
would drain the DMA pool, as then populate_physmap() won't be able to
allocate from there.  It would be weird for the toolstack to have
successfully made a claim, only for populate_physmap() to then return
-ENOMEM because (part of) the claim has been made against the DMA
pool, which populate_physmap() would be explicitly forbidden from
allocating out of.

Thanks, Roger.

Re: [PATCH v2] x86/domain: adjust limitation on shared_info allocation below 4G
Posted by Jan Beulich 1 day, 21 hours ago
On 04.02.2026 17:46, Roger Pau Monné wrote:
> [...]
> What would happen when you append the MEMF_no_dma flag as proposed in
> your commit, but the caller is also passing MEMF_exact_node with
> target node 0?  AFAICT the allocation would still refuse to use the
> low 4G pool.

Yes, DMA-ability is intended to take higher priority than exact-node
requests. Another node would then need choosing by the toolstack.

> Also, your commit should also be expanded to avoid staking claims that
> would drain the DMA pool, as then populate_physmap() won't be able to
> allocate from there?

Except that upstream claims aren't node-specific yet, so could be
fulfilled by taking memory from other nodes? Aiui the problem would
only occur if that DMA-able memory was the only memory left in the
system.

Jan



Re: [PATCH v2] x86/domain: adjust limitation on shared_info allocation below 4G
Posted by Roger Pau Monné 1 day, 16 hours ago
On Thu, Feb 05, 2026 at 09:29:53AM +0100, Jan Beulich wrote:
> On 04.02.2026 17:46, Roger Pau Monné wrote:
> > [...]
> > 
> > What would happen when you append the MEMF_no_dma flag as proposed in
> > your commit, but the caller is also passing MEMF_exact_node with
> > target node 0?  AFAICT the allocation would still refuse to use the
> > low 4G pool.
> 
> Yes, DMA-ability is intended to take higher priority than exact-node
> requests. Another node would then need choosing by the toolstack.
> 
> > Also, your commit should also be expanded to avoid staking claims that
> > would drain the DMA pool, as then populate_physmap() won't be able to
> > allocate from there?
> 
> Except that upstream claims aren't node-specific yet, so could be
> fulfilled by taking memory from other nodes?

That's likely to change at some point, but yes, they are not node
specific yet.

> Aiui the problem would
> only occur if that DMA-able memory was the only memory left in the
> system.

Indeed, in that scenario the toolstack will be allowed to make claims
that cover that DMA memory, yet populate_physmap() won't be able to
consume those claims.

I think there are two items that need to be done before we can append
MEMF_no_dma to populate_physmap() allocations:

 * DMA memory must not be reachable by claims.
 * DMA memory must be reported to the toolstack, so it can account for
   it separately from free memory.

The last point could also be solved by subtracting the DMA memory from
the `free_pages` value returned to the toolstack.

Thanks, Roger.

Re: [PATCH v2] x86/domain: adjust limitation on shared_info allocation below 4G
Posted by Jan Beulich 1 day, 16 hours ago
On 05.02.2026 15:08, Roger Pau Monné wrote:
> On Thu, Feb 05, 2026 at 09:29:53AM +0100, Jan Beulich wrote:
> [...]
>> Aiui the problem would
>> only occur if that DMA-able memory was the only memory left in the
>> system.
> 
> Indeed, in that scenario the toolstack will be allowed to make claims
> that cover that DMA memory, yet populate_physmap() won't be able to
> consume those claims.

It would be (following said patch of mine), but only in order-0 chunks.
Which would make ...

> I think there are two items that need to be done before we can append
> MEMF_no_dma to populate_physmap() allocations:
> 
>  * DMA memory is not reachable by claims.
>  * DMA memory must be reported to the toolstack, so it can account for
>    it separately from free memory.
> 
> Last point could also be solved by subtracting the DMA memory from the
> `free_pages` value returned to the toolstack.

... any of this more difficult. We don't want to completely prevent its
use; we only want to (heuristically) limit it.

Jan