There's a loop in arch_init_memory() that iterates over holes and non-RAM
regions to possibly mark any page_info structures matching those addresses
as I/O. The iteration there is done over the PFN space.
PFNs not covered by the PDX space will always fail the mfn_valid() check,
hence rewrite the loop to iterate over the PDX space and avoid checking
any holes that are not covered by the PDX translation.
On a system with a ~6TiB hole, this change together with using PDX
compression reduces boot time by approximately 20 seconds: Xen boot time
without the change is ~50s, with the change it's ~30s.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v3:
- Ensure parameter to pfn_to_pdx() is always RAM.
Changes since v2:
- New in this version.
---
xen/arch/x86/mm.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index e7fd56c7ce90..e27dc28cfdd6 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -275,7 +275,7 @@ static void __init assign_io_page(struct page_info *page)
void __init arch_init_memory(void)
{
- unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn;
+ unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn, pdx;
/*
* Basic guest-accessible flags:
@@ -328,9 +328,20 @@ void __init arch_init_memory(void)
destroy_xen_mappings((unsigned long)mfn_to_virt(iostart_pfn),
(unsigned long)mfn_to_virt(ioend_pfn));
- /* Mark as I/O up to next RAM region. */
- for ( ; pfn < rstart_pfn; pfn++ )
+ /*
+ * Mark as I/O up to next RAM region. Iterate over the PDX space to
+ * skip holes which would always fail the mfn_valid() check.
+ *
+ * pfn_to_pdx() requires a valid (iow: RAM) PFN to convert to PDX,
+     * hence provide pfn - 1, which is the trailing PFN from the last RAM
+ * range, or pdx 0 if the input pfn is 0.
+ */
+ for ( pdx = pfn ? pfn_to_pdx(pfn - 1) + 1 : 0;
+ pdx < pfn_to_pdx(rstart_pfn);
+ pdx++ )
{
+ pfn = pdx_to_pfn(pdx);
+
if ( !mfn_valid(_mfn(pfn)) )
continue;
--
2.49.0
On 05.08.2025 11:52, Roger Pau Monne wrote:
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -275,7 +275,7 @@ static void __init assign_io_page(struct page_info *page)
>
> void __init arch_init_memory(void)
> {
> - unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn;
> + unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn, pdx;
>
> /*
> * Basic guest-accessible flags:
> @@ -328,9 +328,20 @@ void __init arch_init_memory(void)
> destroy_xen_mappings((unsigned long)mfn_to_virt(iostart_pfn),
> (unsigned long)mfn_to_virt(ioend_pfn));
>
> - /* Mark as I/O up to next RAM region. */
> - for ( ; pfn < rstart_pfn; pfn++ )
> + /*
> + * Mark as I/O up to next RAM region. Iterate over the PDX space to
> + * skip holes which would always fail the mfn_valid() check.
> + *
> + * pfn_to_pdx() requires a valid (iow: RAM) PFN to convert to PDX,
> + * hence provide pfn - 1, which is the tailing PFN from the last RAM
> + * range, or pdx 0 if the input pfn is 0.
> + */
> + for ( pdx = pfn ? pfn_to_pdx(pfn - 1) + 1 : 0;
> + pdx < pfn_to_pdx(rstart_pfn);
> + pdx++ )
> {
> + pfn = pdx_to_pfn(pdx);
> +
> if ( !mfn_valid(_mfn(pfn)) )
> continue;
>
As much as I would have liked to ack this, I fear there's another caveat here:
At the top of the loop we check not only for RAM, but also for UNUSABLE. The
latter, like RAM, shouldn't be marked I/O, but we also can't use PFN <-> PDX
transformations on any such page.
Jan
On Tue, Aug 05, 2025 at 02:38:38PM +0200, Jan Beulich wrote:
> On 05.08.2025 11:52, Roger Pau Monne wrote:
> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -275,7 +275,7 @@ static void __init assign_io_page(struct page_info *page)
> >
> > void __init arch_init_memory(void)
> > {
> > - unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn;
> > + unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn, pdx;
> >
> > /*
> > * Basic guest-accessible flags:
> > @@ -328,9 +328,20 @@ void __init arch_init_memory(void)
> > destroy_xen_mappings((unsigned long)mfn_to_virt(iostart_pfn),
> > (unsigned long)mfn_to_virt(ioend_pfn));
> >
> > - /* Mark as I/O up to next RAM region. */
> > - for ( ; pfn < rstart_pfn; pfn++ )
> > + /*
> > + * Mark as I/O up to next RAM region. Iterate over the PDX space to
> > + * skip holes which would always fail the mfn_valid() check.
> > + *
> > + * pfn_to_pdx() requires a valid (iow: RAM) PFN to convert to PDX,
> > + * hence provide pfn - 1, which is the tailing PFN from the last RAM
> > + * range, or pdx 0 if the input pfn is 0.
> > + */
> > + for ( pdx = pfn ? pfn_to_pdx(pfn - 1) + 1 : 0;
> > + pdx < pfn_to_pdx(rstart_pfn);
> > + pdx++ )
> > {
> > + pfn = pdx_to_pfn(pdx);
> > +
> > if ( !mfn_valid(_mfn(pfn)) )
> > continue;
> >
>
> As much as I would have liked to ack this, I fear there's another caveat here:
> At the top of the loop we check not only for RAM, but also for UNUSABLE. The
> latter, like RAM, shouldn't be marked I/O, but we also can't use PFN <-> PDX
> transformations on any such page.
Right you are. I'm not sure however why we do this - wouldn't we want
the mappings of UNUSABLE regions to also be removed from the Xen
page-tables (but not marked as I/O)?
I could do something like:
/* Mark as I/O up to next RAM or UNUSABLE region. */
if ( (!pfn || pdx_is_region_compressible(pfn_to_paddr(pfn - 1), 1)) &&
pdx_is_region_compressible(pfn_to_paddr(rstart_pfn), 1) )
{
/*
* Iterate over the PDX space to skip holes which would always fail
* the mfn_valid() check.
*
* pfn_to_pdx() requires a valid (iow: RAM) PFN to convert to PDX,
* hence provide pfn - 1, which is the trailing PFN from the last
* RAM range, or pdx 0 if the input pfn is 0.
*/
for ( pdx = pfn ? pfn_to_pdx(pfn - 1) + 1 : 0;
pdx < pfn_to_pdx(rstart_pfn);
pdx++ )
{
pfn = pdx_to_pfn(pdx);
if ( !mfn_valid(_mfn(pfn)) )
continue;
assign_io_page(mfn_to_page(_mfn(pfn)));
}
}
else
{
/* Slow path, iterate over the PFN space. */
for ( ; pfn < rstart_pfn; pfn++ )
{
if ( !mfn_valid(_mfn(pfn)) )
continue;
assign_io_page(mfn_to_page(_mfn(pfn)));
}
}
But I find it a bit ugly - I might send v5 without this final patch
while I see if I can find a better alternative.
Thanks, Roger.
On 05.08.2025 17:27, Roger Pau Monné wrote:
> On Tue, Aug 05, 2025 at 02:38:38PM +0200, Jan Beulich wrote:
>> On 05.08.2025 11:52, Roger Pau Monne wrote:
>>> --- a/xen/arch/x86/mm.c
>>> +++ b/xen/arch/x86/mm.c
>>> @@ -275,7 +275,7 @@ static void __init assign_io_page(struct page_info *page)
>>>
>>> void __init arch_init_memory(void)
>>> {
>>> - unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn;
>>> + unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn, pdx;
>>>
>>> /*
>>> * Basic guest-accessible flags:
>>> @@ -328,9 +328,20 @@ void __init arch_init_memory(void)
>>> destroy_xen_mappings((unsigned long)mfn_to_virt(iostart_pfn),
>>> (unsigned long)mfn_to_virt(ioend_pfn));
>>>
>>> - /* Mark as I/O up to next RAM region. */
>>> - for ( ; pfn < rstart_pfn; pfn++ )
>>> + /*
>>> + * Mark as I/O up to next RAM region. Iterate over the PDX space to
>>> + * skip holes which would always fail the mfn_valid() check.
>>> + *
>>> + * pfn_to_pdx() requires a valid (iow: RAM) PFN to convert to PDX,
>>> + * hence provide pfn - 1, which is the tailing PFN from the last RAM
>>> + * range, or pdx 0 if the input pfn is 0.
>>> + */
>>> + for ( pdx = pfn ? pfn_to_pdx(pfn - 1) + 1 : 0;
>>> + pdx < pfn_to_pdx(rstart_pfn);
>>> + pdx++ )
>>> {
>>> + pfn = pdx_to_pfn(pdx);
>>> +
>>> if ( !mfn_valid(_mfn(pfn)) )
>>> continue;
>>>
>>
>> As much as I would have liked to ack this, I fear there's another caveat here:
>> At the top of the loop we check not only for RAM, but also for UNUSABLE. The
>> latter, like RAM, shouldn't be marked I/O, but we also can't use PFN <-> PDX
>> transformations on any such page.
>
> Right you are. I'm not sure however why we do this - won't we want
> the mappings of UNUSABLE regions also be removed from the Xen
> page-tables? (but not marked as IO)
Yes, I think this is a flaw in current code. Perhaps it was (wrongly) assumed
that no UNUSABLE regions would ever exist this low in a memory map? Imo we want
to deal with this in two steps - first sort the UNUSABLE issue, then improve
the dealing with what is passed to assign_io_page().
While there we may also want to find a way to tie together the 16Mb boundary
checks - the 16UL isn't properly connected to the BOOTSTRAP_MAP_BASE definition
in setup.c. Yet then: Am I overlooking something, or is the 16Mb boundary not
really special anymore? I.e. could e.g. BOOTSTRAP_MAP_BASE perhaps be moved (at
least in principle), either almost arbitrarily up (within the low 4Gb), or down
as much as to the 2Mb boundary? The relevant aspect here would be that the
comment saying "the statically-initialised 1-16MB mapping area" looks to be
stale, as of 7cd7f2f5e116 ("x86/boot: Remove the preconstructed low 16M
superpage mappings"). If there are excess mappings to worry about, those may
nowadays well live above the 16Mb boundary (because of it being 2Mb mappings
that head.S inserts into l2_directmap[]).
If I was to sensibly make changes to that code, I guess I first ought to find
a system which actually surfaces any (ideally "interesting") UNUSABLE ranges.
In any event, I think the cleanest thing we can do is split the loop in two,
one to deal with the removal of mappings between RAM regions up to the 4Gb
boundary, and the other to invoke assign_io_page() for holes between RAM ||
UNUSABLE ones. The latter would then be subject to your optimization. The
former could ignore the PDX aspect due to being limited to the low 4Gb, where
we never compress.
Jan
On 06.08.2025 10:11, Jan Beulich wrote:
> On 05.08.2025 17:27, Roger Pau Monné wrote:
>> On Tue, Aug 05, 2025 at 02:38:38PM +0200, Jan Beulich wrote:
>>> On 05.08.2025 11:52, Roger Pau Monne wrote:
>>>> --- a/xen/arch/x86/mm.c
>>>> +++ b/xen/arch/x86/mm.c
>>>> @@ -275,7 +275,7 @@ static void __init assign_io_page(struct page_info *page)
>>>>
>>>> void __init arch_init_memory(void)
>>>> {
>>>> - unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn;
>>>> + unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn, pdx;
>>>>
>>>> /*
>>>> * Basic guest-accessible flags:
>>>> @@ -328,9 +328,20 @@ void __init arch_init_memory(void)
>>>> destroy_xen_mappings((unsigned long)mfn_to_virt(iostart_pfn),
>>>> (unsigned long)mfn_to_virt(ioend_pfn));
>>>>
>>>> - /* Mark as I/O up to next RAM region. */
>>>> - for ( ; pfn < rstart_pfn; pfn++ )
>>>> + /*
>>>> + * Mark as I/O up to next RAM region. Iterate over the PDX space to
>>>> + * skip holes which would always fail the mfn_valid() check.
>>>> + *
>>>> + * pfn_to_pdx() requires a valid (iow: RAM) PFN to convert to PDX,
>>>> + * hence provide pfn - 1, which is the tailing PFN from the last RAM
>>>> + * range, or pdx 0 if the input pfn is 0.
>>>> + */
>>>> + for ( pdx = pfn ? pfn_to_pdx(pfn - 1) + 1 : 0;
>>>> + pdx < pfn_to_pdx(rstart_pfn);
>>>> + pdx++ )
>>>> {
>>>> + pfn = pdx_to_pfn(pdx);
>>>> +
>>>> if ( !mfn_valid(_mfn(pfn)) )
>>>> continue;
>>>>
>>>
>>> As much as I would have liked to ack this, I fear there's another caveat here:
>>> At the top of the loop we check not only for RAM, but also for UNUSABLE. The
>>> latter, like RAM, shouldn't be marked I/O, but we also can't use PFN <-> PDX
>>> transformations on any such page.
>>
>> Right you are. I'm not sure however why we do this - won't we want
>> the mappings of UNUSABLE regions also be removed from the Xen
>> page-tables? (but not marked as IO)
>
> Yes, I think this is a flaw in current code. Perhaps it was (wrongly) assumed
> that no UNUSABLE regions would ever exist this low in a memory map? Imo we want
> to deal with this in two steps - first sort the UNUSABLE issue, then improve
> the dealing with what is passed to assign_io_page().
>
> While there we may also want to find a way to tie together the 16Mb boundary
> checks - the 16UL isn't properly connected to the BOOTSTRAP_MAP_BASE definition
> in setup.c. Yet then: Am I overlooking something, or is the 16Mb boundary not
> really special anymore? I.e. could e.g. BOOTSTRAP_MAP_BASE perhaps be moved (at
> least in principle), either almost arbitrarily up (within the low 4Gb), or down
> as much as to the 2Mb boundary? The relevant aspect here would be that the
> comment saying "the statically-initialised 1-16MB mapping area" looks to be
> stale, as of 7cd7f2f5e116 ("x86/boot: Remove the preconstructed low 16M
> superpage mappings"). If there are excess mappings to worry about, those may
> nowadays well live above the 16Mb boundary (because of it being 2Mb mappings
> that head.S inserts into l2_directmap[]).
Hmm, extending this to beyond 16M collides with mappings done by acpi_dmar_init(),
erst_init(), and acpi_hest_init(). Luckily efi_init_memory() runs only afterwards.
While I guess I could limit this to the space 2M mappings were done for in head.S,
theoretically there could still be a collision afterwards. So I think we need to
either somehow exclude such mappings (might end up fragile), or stop (ab)using
the directmap there. Thoughts?
Jan