[PATCH] x86: allow non-BIGMEM configs to boot on >= 16Tb systems

Jan Beulich posted 1 patch 11 months, 2 weeks ago
Failed in applying to current master (apply log)
[PATCH] x86: allow non-BIGMEM configs to boot on >= 16Tb systems
Posted by Jan Beulich 11 months, 2 weeks ago
While frame table setup, directmap init, and boot allocator population
respect all intended bounds, the logic passing memory to the heap
allocator which wasn't passed to the boot allocator fails to respect
max_{pdx,pfn}. This then typically triggers the BUG() in
free_heap_pages() after checking page state, because of hitting a struct
page_info instance which was set to all ~0.

Of course all the memory above the 16Tb boundary is still going to
remain unused; using it requires BIGMEM=y. And of course this fix
similarly ought to help BIGMEM=y configurations on >= 123Tb systems
(where all the memory beyond that boundary continues to be unused).

Fixes: bac2000063ba ("x86-64: reduce range spanned by 1:1 mapping and frame table indexes")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Sadly the people reporting the issue have decided to go with the 16Tb
limit, and hence the patch wasn't tested by them. I thought that I'd
still post it, though.

The "must not be passed to the boot allocator" for the range in question
may already not be applicable anymore, with all page tables now being
mapped via map_domain_page() (iirc this work has been completed). But of
course there would be a risk that something else is/was overlooked, and
hence the offending code is being fixed rather than purged (and the
purging should occur once the directmap is properly gone). (This also
seems preferable for potential backports of this change.)

--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1722,15 +1722,16 @@ void __init noreturn __start_xen(unsigne
 
     if ( max_page - 1 > virt_to_mfn(HYPERVISOR_VIRT_END - 1) )
     {
-        unsigned long limit = virt_to_mfn(HYPERVISOR_VIRT_END - 1);
+        unsigned long lo = virt_to_mfn(HYPERVISOR_VIRT_END - 1);
+        unsigned long hi = pdx_to_pfn(max_pdx - 1) + 1;
         uint64_t mask = PAGE_SIZE - 1;
 
         if ( !highmem_start )
-            xenheap_max_mfn(limit);
+            xenheap_max_mfn(lo);
 
         end_boot_allocator();
 
-        /* Pass the remaining memory to the allocator. */
+        /* Pass the remaining memory in the (lo, hi) range to the allocator. */
         for ( i = 0; i < boot_e820.nr_map; i++ )
         {
             uint64_t s, e;
@@ -1739,10 +1740,12 @@ void __init noreturn __start_xen(unsigne
                 continue;
             s = (boot_e820.map[i].addr + mask) & ~mask;
             e = (boot_e820.map[i].addr + boot_e820.map[i].size) & ~mask;
-            if ( PFN_DOWN(e) <= limit )
+            if ( PFN_DOWN(e) <= lo || PFN_DOWN(s) >= hi )
                 continue;
-            if ( PFN_DOWN(s) <= limit )
-                s = pfn_to_paddr(limit + 1);
+            if ( PFN_DOWN(s) <= lo )
+                s = pfn_to_paddr(lo + 1);
+            if ( PFN_DOWN(e) > hi )
+                e = pfn_to_paddr(hi);
             init_domheap_pages(s, e);
         }
     }
Re: [PATCH] x86: allow non-BIGMEM configs to boot on >= 16Tb systems
Posted by Roger Pau Monné 5 months, 1 week ago
On Wed, Jun 07, 2023 at 08:17:30AM +0200, Jan Beulich wrote:
> While frame table setup, directmap init, and boot allocator population
> respect all intended bounds, the logic passing memory to the heap
> allocator which wasn't passed to the boot allocator fails to respect
> max_{pdx,pfn}. This then typically triggers the BUG() in
> free_heap_pages() after checking page state, because of hitting a struct
> page_info instance which was set to all ~0.
> 
> Of course all the memory above the 16Tb boundary is still going to
> remain unused; using it requires BIGMEM=y. And of course this fix
> similarly ought to help BIGMEM=y configurations on >= 123Tb systems
> (where all the memory beyond that boundary continues to be unused).
> 
> Fixes: bac2000063ba ("x86-64: reduce range spanned by 1:1 mapping and frame table indexes")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> Sadly the people reporting the issue have decided to go with the 16Tb
> limit, and hence the patch wasn't tested by them. I thought that I'd
> still post it, though.

We should see about finding a way to test BIGMEM, maybe with PoD.

> The "must not be passed to the boot allocator" for the range in question
> may already not be applicable anymore, with all page tables now being
> mapped via map_domain_page() (iirc this work has been completed). But of
> course there would be a risk that something else is/was overlooked, and
> hence the offending code is being fixed rather than purged (and the
> purging should occur once the directmap is properly gone). (This also
> seems preferable for potential backports of this change.)
> 
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -1722,15 +1722,16 @@ void __init noreturn __start_xen(unsigne
>  
>      if ( max_page - 1 > virt_to_mfn(HYPERVISOR_VIRT_END - 1) )
>      {
> -        unsigned long limit = virt_to_mfn(HYPERVISOR_VIRT_END - 1);
> +        unsigned long lo = virt_to_mfn(HYPERVISOR_VIRT_END - 1);
> +        unsigned long hi = pdx_to_pfn(max_pdx - 1) + 1;

Maybe use max_page to avoid the pdx_to_pfn() call?  (And is also more
in context with the condition on the outside if).

Thanks, Roger.

Re: [PATCH] x86: allow non-BIGMEM configs to boot on >= 16Tb systems
Posted by Jan Beulich 5 months ago
On 15.12.2023 15:54, Roger Pau Monné wrote:
> On Wed, Jun 07, 2023 at 08:17:30AM +0200, Jan Beulich wrote:
>> While frame table setup, directmap init, and boot allocator population
>> respect all intended bounds, the logic passing memory to the heap
>> allocator which wasn't passed to the boot allocator fails to respect
>> max_{pdx,pfn}. This then typically triggers the BUG() in
>> free_heap_pages() after checking page state, because of hitting a struct
>> page_info instance which was set to all ~0.
>>
>> Of course all the memory above the 16Tb boundary is still going to
>> remain unused; using it requires BIGMEM=y. And of course this fix
>> similarly ought to help BIGMEM=y configurations on >= 123Tb systems
>> (where all the memory beyond that boundary continues to be unused).
>>
>> Fixes: bac2000063ba ("x86-64: reduce range spanned by 1:1 mapping and frame table indexes")
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -1722,15 +1722,16 @@ void __init noreturn __start_xen(unsigne
>>  
>>      if ( max_page - 1 > virt_to_mfn(HYPERVISOR_VIRT_END - 1) )
>>      {
>> -        unsigned long limit = virt_to_mfn(HYPERVISOR_VIRT_END - 1);
>> +        unsigned long lo = virt_to_mfn(HYPERVISOR_VIRT_END - 1);
>> +        unsigned long hi = pdx_to_pfn(max_pdx - 1) + 1;
> 
> Maybe use max_page to avoid the pdx_to_pfn() call?  (And is also more
> in context with the condition on the outside if).

You mean

        unsigned long hi = min(pdx_to_pfn(max_pdx - 1) + 1, max_page);

? I could switch to that, yes. I wouldn't feel well switching to using
just max_page, especially with me having nowhere to (reasonably) test.

Jan

Re: [PATCH] x86: allow non-BIGMEM configs to boot on >= 16Tb systems
Posted by Roger Pau Monné 5 months ago
On Mon, Dec 18, 2023 at 09:26:24AM +0100, Jan Beulich wrote:
> On 15.12.2023 15:54, Roger Pau Monné wrote:
> > On Wed, Jun 07, 2023 at 08:17:30AM +0200, Jan Beulich wrote:
> >> While frame table setup, directmap init, and boot allocator population
> >> respect all intended bounds, the logic passing memory to the heap
> >> allocator which wasn't passed to the boot allocator fails to respect
> >> max_{pdx,pfn}. This then typically triggers the BUG() in
> >> free_heap_pages() after checking page state, because of hitting a struct
> >> page_info instance which was set to all ~0.
> >>
> >> Of course all the memory above the 16Tb boundary is still going to
> >> remain unused; using it requires BIGMEM=y. And of course this fix
> >> similarly ought to help BIGMEM=y configurations on >= 123Tb systems
> >> (where all the memory beyond that boundary continues to be unused).
> >>
> >> Fixes: bac2000063ba ("x86-64: reduce range spanned by 1:1 mapping and frame table indexes")
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > Acked-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Thanks.
> 
> >> --- a/xen/arch/x86/setup.c
> >> +++ b/xen/arch/x86/setup.c
> >> @@ -1722,15 +1722,16 @@ void __init noreturn __start_xen(unsigne
> >>  
> >>      if ( max_page - 1 > virt_to_mfn(HYPERVISOR_VIRT_END - 1) )
> >>      {
> >> -        unsigned long limit = virt_to_mfn(HYPERVISOR_VIRT_END - 1);
> >> +        unsigned long lo = virt_to_mfn(HYPERVISOR_VIRT_END - 1);
> >> +        unsigned long hi = pdx_to_pfn(max_pdx - 1) + 1;
> > 
> > Maybe use max_page to avoid the pdx_to_pfn() call?  (And is also more
> > in context with the condition on the outside if).
> 
> You mean
> 
>         unsigned long hi = min(pdx_to_pfn(max_pdx - 1) + 1, max_page);
> 
> ? I could switch to that, yes. I wouldn't feel well switching to using
> just max_page, especially with me having nowhere to (reasonably) test.

Isn't max_page derived from max_pdx (see setup_max_pdx()), and
hence we could avoid the pdx_to_pfn() conversion by just using it?

max_page = pdx_to_pfn(max_pdx - 1) + 1;

So hi == max_page in your proposed code.

Maybe there are further restrictions applied to max_pdx that are not
propagated into max_page, the meaning of all those variables is very
opaque, and hard to follow in the source code.

Thanks, Roger.

Re: [PATCH] x86: allow non-BIGMEM configs to boot on >= 16Tb systems
Posted by Jan Beulich 5 months ago
On 18.12.2023 12:13, Roger Pau Monné wrote:
> On Mon, Dec 18, 2023 at 09:26:24AM +0100, Jan Beulich wrote:
>> On 15.12.2023 15:54, Roger Pau Monné wrote:
>>> On Wed, Jun 07, 2023 at 08:17:30AM +0200, Jan Beulich wrote:
>>>> While frame table setup, directmap init, and boot allocator population
>>>> respect all intended bounds, the logic passing memory to the heap
>>>> allocator which wasn't passed to the boot allocator fails to respect
>>>> max_{pdx,pfn}. This then typically triggers the BUG() in
>>>> free_heap_pages() after checking page state, because of hitting a struct
>>>> page_info instance which was set to all ~0.
>>>>
>>>> Of course all the memory above the 16Tb boundary is still going to
>>>> remain unused; using it requires BIGMEM=y. And of course this fix
>>>> similarly ought to help BIGMEM=y configurations on >= 123Tb systems
>>>> (where all the memory beyond that boundary continues to be unused).
>>>>
>>>> Fixes: bac2000063ba ("x86-64: reduce range spanned by 1:1 mapping and frame table indexes")
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>
>>> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
>>
>> Thanks.
>>
>>>> --- a/xen/arch/x86/setup.c
>>>> +++ b/xen/arch/x86/setup.c
>>>> @@ -1722,15 +1722,16 @@ void __init noreturn __start_xen(unsigne
>>>>  
>>>>      if ( max_page - 1 > virt_to_mfn(HYPERVISOR_VIRT_END - 1) )
>>>>      {
>>>> -        unsigned long limit = virt_to_mfn(HYPERVISOR_VIRT_END - 1);
>>>> +        unsigned long lo = virt_to_mfn(HYPERVISOR_VIRT_END - 1);
>>>> +        unsigned long hi = pdx_to_pfn(max_pdx - 1) + 1;
>>>
>>> Maybe use max_page to avoid the pdx_to_pfn() call?  (And is also more
>>> in context with the condition on the outside if).
>>
>> You mean
>>
>>         unsigned long hi = min(pdx_to_pfn(max_pdx - 1) + 1, max_page);
>>
>> ? I could switch to that, yes. I wouldn't feel well switching to using
>> just max_page, especially with me having nowhere to (reasonably) test.
> 
> Isn't max_page derived from max_pdx (see setup_max_pdx()), and
> hence we could avoid the pdx_to_pfn() conversion by just using it?
> 
> max_page = pdx_to_pfn(max_pdx - 1) + 1;
> 
> So hi == max_page in your proposed code.
> 
> Maybe there are further restrictions applied to max_pdx that are not
> propagated into max_page, the meaning of all those variables is very
> opaque, and hard to follow in the source code.

Looking more closely, the two appear to be properly in sync once
setup_max_pdx() was called the first time. I guess I was in part
mislead by

            e = (pdx_to_pfn(max_pdx - 1) + 1ULL) << PAGE_SHIFT;

just a few lines past an update to both variables. I'll switch to
max_page here, and I may also make a patch to tidy the line quoted
above.

Jan