Assigning pages to a domain makes them a possible target of hypercalls
like XENMEM_decrease_reservation ahead of such pages being scrubbed in
populate_physmap() when the guest is running in PV mode. This might, for
example, allow pages to be freed before being scrubbed, as an
already-running stubdomain could target them by guessing their MFNs.
It's also possible other actions could set the page type ahead of
scrubbing, which would be problematic.
Prevent pages pending scrub from being assigned to the domain, and only
do the assignment once the scrubbing has finished. This has the
disadvantage that the allocated pages will be removed from the free pool,
but not yet accounted towards the domain's consumed page quota. However
there can only be one stashed page in that state, and its maximum size is
bounded by the memop-max-order option. This is not too different from the
current logic, where assigning pages to a domain (and thus checking
whether the domain overflows its quota) is also done after the memory has
been allocated and removed from the pool of free pages.
Fixes: 83a784a15b47 ("xen/mm: allow deferred scrub of physmap populate allocated pages")
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
I've attempted various different ways to solve this, but they all ended up
being impossible.
* Prevent non-scrubbed pages from getting extra refcounts (iow: make
get_page() fail for them). This seemed nice, but the cleanup using
put_page_alloc_ref() was impossible as non-scrubbed pages would return
failure in get_page(), and so I couldn't take the extra reference ahead
of calling put_page_alloc_ref().
* Disallowing XENMEM_decrease_reservation until the domain has finished
  creation would fix the issue of pages being freed while pending scrub,
  but it's not clear whether there might be other usages that would be
  problematic, as get_page() on non-scrubbed pages would still return
  success.
---
xen/common/memory.c | 6 ++++++
xen/common/page_alloc.c | 8 +++++++-
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/xen/common/memory.c b/xen/common/memory.c
index f0ff1311881c..1ad4b51c5b02 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -388,6 +388,12 @@ static void populate_physmap(struct memop_args *a)
goto out;
}
}
+
+ if ( assign_page(page, a->extent_order, d, memflags) )
+ {
+ free_domheap_pages(page, a->extent_order);
+ goto out;
+ }
}
if ( unlikely(a->memflags & MEMF_no_tlbflush) )
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 1316dfbd15ee..b72a74c705ba 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -2699,7 +2699,13 @@ struct page_info *alloc_domheap_pages(
memflags, d)) == NULL)) )
return NULL;
- if ( d && !(memflags & MEMF_no_owner) )
+ /*
+ * Don't add pages with the PGC_need_scrub bit set to the domain, the
+ * caller must clean the bit and then manually call assign_pages().
+ * Otherwise pages with the PGC_need_scrub would be reachable using
+ * get_page().
+ */
+ if ( d && !(memflags & MEMF_no_owner) && !(memflags & MEMF_keep_scrub) )
{
if ( memflags & MEMF_no_refcount )
{
--
2.51.0
On 25.03.2026 11:08, Roger Pau Monne wrote:
> ---
> I've attempted various different ways to solve this, but they all ended up
> being impossible.
>
> * Prevent non-scrubbed pages from getting extra refcounts (iow: make
> get_page() fail for them). This seemed nice, but the cleanup using
> put_page_alloc_ref() was impossible as non-scrubbed pages would return
> failure in get_page(), and so I couldn't take the extra reference ahead
> of calling put_page_alloc_ref().
A special-case variant of get_page() could be introduced, but maybe that
would still be overly fragile.
When we discussed this, what I had proposed didn't require use of get_page()
though. assign_pages() would install two general references (plus one type
ref for PGT_writable) in this special case. To free, you'd call
put_page_alloc_ref() followed by put_page_and_type().
That said, the patch here is still less intrusive than I feared, so I'm not
asking to re-work this again.
> * Disallow XENMEM_decrease_reservation until the domain has finished
> creation would fix the issue of pages being freed while pending scrub,
> but it's not clear there might be other usages that would be problematic,
> as get_page() on non-scrubbed pages would still return success.
I agree this is of concern.
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -388,6 +388,12 @@ static void populate_physmap(struct memop_args *a)
> goto out;
> }
> }
> +
> + if ( assign_page(page, a->extent_order, d, memflags) )
> + {
> + free_domheap_pages(page, a->extent_order);
The pages don't have an owner set yet, so that function will go straight
to free_heap_pages(), needlessly passing "true" as last argument. Correct,
but (for large pages, which the stashing is about) highly inefficient.
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -2699,7 +2699,13 @@ struct page_info *alloc_domheap_pages(
> memflags, d)) == NULL)) )
> return NULL;
>
> - if ( d && !(memflags & MEMF_no_owner) )
> + /*
> + * Don't add pages with the PGC_need_scrub bit set to the domain, the
> + * caller must clean the bit and then manually call assign_pages().
> + * Otherwise pages with the PGC_need_scrub would be reachable using
> + * get_page().
> + */
How about replacing the latter "with the PGC_need_scrub" by "still subject
to scrubbing"?
> + if ( d && !(memflags & MEMF_no_owner) && !(memflags & MEMF_keep_scrub) )
> {
> if ( memflags & MEMF_no_refcount )
> {
This no-refcount code isn't repeated at the new call site of assign_page().
It's not needed there, yes, but wouldn't we better allow this to be taken
care of right here, moving the MEMF_keep_scrub check immediately ahead of
the call to assign_page()?
Otherwise should we reject (much earlier) MEMF_no_refcount used together
with MEMF_keep_scrub?
Jan
On Wed, Mar 25, 2026 at 03:56:05PM +0100, Jan Beulich wrote:
> On 25.03.2026 11:08, Roger Pau Monne wrote:
> > ---
> > I've attempted various different ways to solve this, but they all ended up
> > being impossible.
> >
> > * Prevent non-scrubbed pages from getting extra refcounts (iow: make
> > get_page() fail for them). This seemed nice, but the cleanup using
> > put_page_alloc_ref() was impossible as non-scrubbed pages would return
> > failure in get_page(), and so I couldn't take the extra reference ahead
> > of calling put_page_alloc_ref().
>
> A special-case variant of get_page() could be introduced, but maybe that
> would still be overly fragile.
It seemed like too much complexity (and risk) just to deal with this
scenario.
> When we discussed this, what I had proposed didn't require use of get_page()
> though. assign_pages() would install two general references (plus one type
> ref for PGT_writable) in this special case. To free, you'd call
> put_page_alloc_ref() followed by put_page_and_type().
Doesn't that risk underflowing the page counter if there's a parallel
call to decrease_reservation() against this MFN beforehand?
How would the freeing done in populate_physmap() (in case of
concurrent calls) know whether already scrubbed pages have had their
PGC_allocated bit dropped?
> That said, the patch here is still less intrusive than I feared, so I'm not
> asking to re-work this again.
Thanks.
> > * Disallow XENMEM_decrease_reservation until the domain has finished
> > creation would fix the issue of pages being freed while pending scrub,
> > but it's not clear there might be other usages that would be problematic,
> > as get_page() on non-scrubbed pages would still return success.
>
> I agree this is of concern.
>
> > --- a/xen/common/memory.c
> > +++ b/xen/common/memory.c
> > @@ -388,6 +388,12 @@ static void populate_physmap(struct memop_args *a)
> > goto out;
> > }
> > }
> > +
> > + if ( assign_page(page, a->extent_order, d, memflags) )
> > + {
> > + free_domheap_pages(page, a->extent_order);
>
> The pages don't have an owner set yet, so that function will go straight
> to free_heap_pages(), needlessly passing "true" as last argument. Correct,
> but (for large pages, which the stashing is about) highly inefficient.
My bad, I was sure I was using the same freeing function as
alloc_domheap_pages() on failure to assign, but I clearly wasn't. I
will switch to using free_heap_pages().
> > --- a/xen/common/page_alloc.c
> > +++ b/xen/common/page_alloc.c
> > @@ -2699,7 +2699,13 @@ struct page_info *alloc_domheap_pages(
> > memflags, d)) == NULL)) )
> > return NULL;
> >
> > - if ( d && !(memflags & MEMF_no_owner) )
> > + /*
> > + * Don't add pages with the PGC_need_scrub bit set to the domain, the
> > + * caller must clean the bit and then manually call assign_pages().
> > + * Otherwise pages with the PGC_need_scrub would be reachable using
> > + * get_page().
> > + */
>
> How about replacing the latter "with the PGC_need_scrub" by "still subject
> to scrubbing"?
Sure.
> > + if ( d && !(memflags & MEMF_no_owner) && !(memflags & MEMF_keep_scrub) )
> > {
> > if ( memflags & MEMF_no_refcount )
> > {
>
> This no-refcount code isn't repeated at the new call site of assign_page().
> It's not needed there, yes, but wouldn't we better allow this to be taken
> care of right here, moving the MEMF_keep_scrub check immediately ahead of
> the call to assign_page()?
>
> Otherwise should we reject (much earlier) MEMF_no_refcount used together
> with MEMF_keep_scrub?
hm, I was likely too focused on the specific use-case of
populate_physmap(), which is the only user of MEMF_keep_scrub. I'd
noticed the MEMF_no_refcount bits here, but since populate_physmap()
never uses that flag I just left it as-is.
Will see about adjusting it.
Thanks, Roger.
On 25.03.2026 17:26, Roger Pau Monné wrote:
> On Wed, Mar 25, 2026 at 03:56:05PM +0100, Jan Beulich wrote:
>> On 25.03.2026 11:08, Roger Pau Monne wrote:
>>> ---
>>> I've attempted various different ways to solve this, but they all ended up
>>> being impossible.
>>>
>>> * Prevent non-scrubbed pages from getting extra refcounts (iow: make
>>> get_page() fail for them). This seemed nice, but the cleanup using
>>> put_page_alloc_ref() was impossible as non-scrubbed pages would return
>>> failure in get_page(), and so I couldn't take the extra reference ahead
>>> of calling put_page_alloc_ref().
>>
>> A special-case variant of get_page() could be introduced, but maybe that
>> would still be overly fragile.
>
> It seemed like too much complexity (and risk) just to deal with this
> scenario.
>
>> When we discussed this, what I had proposed didn't require use of get_page()
>> though. assign_pages() would install two general references (plus one type
>> ref for PGT_writable) in this special case. To free, you'd call
>> put_page_alloc_ref() followed by put_page_and_type().
>
> Doesn't that risk underflowing the page counter if there's a parallel
> call to decrease_reservation() against this MFN beforehand?
>
> How would the freeing done in populate_physmap() (in case of
> concurrent calls) know whether already scrubbed pages have had their
> PGC_allocated bit dropped?
In that case put_page_alloc_ref() simply does nothing. That's why we have
this wrapper: To avoid open-coding the same check in many places.
Jan
On Wed, Mar 25, 2026 at 05:26:01PM +0100, Roger Pau Monné wrote:
> On Wed, Mar 25, 2026 at 03:56:05PM +0100, Jan Beulich wrote:
> > On 25.03.2026 11:08, Roger Pau Monne wrote:
> > > * Disallow XENMEM_decrease_reservation until the domain has finished
> > > creation would fix the issue of pages being freed while pending scrub,
> > > but it's not clear there might be other usages that would be problematic,
> > > as get_page() on non-scrubbed pages would still return success.
> >
> > I agree this is of concern.
> >
> > > --- a/xen/common/memory.c
> > > +++ b/xen/common/memory.c
> > > @@ -388,6 +388,12 @@ static void populate_physmap(struct memop_args *a)
> > > goto out;
> > > }
> > > }
> > > +
> > > + if ( assign_page(page, a->extent_order, d, memflags) )
> > > + {
> > > + free_domheap_pages(page, a->extent_order);
> >
> > The pages don't have an owner set yet, so that function will go straight
> > to free_heap_pages(), needlessly passing "true" as last argument. Correct,
> > but (for large pages, which the stashing is about) highly inefficient.
>
> My bad, I was sure I was using the same freeing function as
> alloc_domheap_pages() on failure to assign, but I clearly wasn't. I
> will switch to using free_heap_pages().
Coming back to this, I can export free_heap_pages(), but then the call
would also unconditionally have need_scrub == true, as the pages have
been allocated without scrubbing. So I don't see much benefit in using
free_heap_pages(), the more so as it would require making that function
global.
Thanks, Roger.
On 25.03.2026 17:54, Roger Pau Monné wrote:
> On Wed, Mar 25, 2026 at 05:26:01PM +0100, Roger Pau Monné wrote:
>> On Wed, Mar 25, 2026 at 03:56:05PM +0100, Jan Beulich wrote:
>>> On 25.03.2026 11:08, Roger Pau Monne wrote:
>>>> * Disallow XENMEM_decrease_reservation until the domain has finished
>>>> creation would fix the issue of pages being freed while pending scrub,
>>>> but it's not clear there might be other usages that would be problematic,
>>>> as get_page() on non-scrubbed pages would still return success.
>>>
>>> I agree this is of concern.
>>>
>>>> --- a/xen/common/memory.c
>>>> +++ b/xen/common/memory.c
>>>> @@ -388,6 +388,12 @@ static void populate_physmap(struct memop_args *a)
>>>> goto out;
>>>> }
>>>> }
>>>> +
>>>> + if ( assign_page(page, a->extent_order, d, memflags) )
>>>> + {
>>>> + free_domheap_pages(page, a->extent_order);
>>>
>>> The pages don't have an owner set yet, so that function will go straight
>>> to free_heap_pages(), needlessly passing "true" as last argument. Correct,
>>> but (for large pages, which the stashing is about) highly inefficient.
>>
>> My bad, I was sure I was using the same freeing function as
>> alloc_domheap_pages() on failure to assign, but I clearly wasn't. I
>> will switch to using free_heap_pages().
>
> Coming back to this, I can export free_heap_pages(), but then the call
> would also unconditionally have need_scrub == true, as the pages have
> been allocated without scrubbing.
But the assign_page() call is here to have the scrubbing done ahead of
it, so re-scrubbing after freeing shouldn't be necessary?
Jan
On Thu, Mar 26, 2026 at 09:18:14AM +0100, Jan Beulich wrote:
> On 25.03.2026 17:54, Roger Pau Monné wrote:
> > On Wed, Mar 25, 2026 at 05:26:01PM +0100, Roger Pau Monné wrote:
> >> On Wed, Mar 25, 2026 at 03:56:05PM +0100, Jan Beulich wrote:
> >>> On 25.03.2026 11:08, Roger Pau Monne wrote:
> >>>> * Disallow XENMEM_decrease_reservation until the domain has finished
> >>>> creation would fix the issue of pages being freed while pending scrub,
> >>>> but it's not clear there might be other usages that would be problematic,
> >>>> as get_page() on non-scrubbed pages would still return success.
> >>>
> >>> I agree this is of concern.
> >>>
> >>>> --- a/xen/common/memory.c
> >>>> +++ b/xen/common/memory.c
> >>>> @@ -388,6 +388,12 @@ static void populate_physmap(struct memop_args *a)
> >>>> goto out;
> >>>> }
> >>>> }
> >>>> +
> >>>> + if ( assign_page(page, a->extent_order, d, memflags) )
> >>>> + {
> >>>> + free_domheap_pages(page, a->extent_order);
> >>>
> >>> The pages don't have an owner set yet, so that function will go straight
> >>> to free_heap_pages(), needlessly passing "true" as last argument. Correct,
> >>> but (for large pages, which the stashing is about) highly inefficient.
> >>
> >> My bad, I was sure I was using the same freeing function as
> >> alloc_domheap_pages() on failure to assign, but I clearly wasn't. I
> >> will switch to using free_heap_pages().
> >
> > Coming back to this, I can export free_heap_pages(), but then the call
> > would also unconditionally have need_scrub == true, as the pages have
> > been allocated without scrubbing.
>
> But the assign_page() call is here to have the scrubbing done ahead of
> it, so re-scrubbing after freeing shouldn't be necessary?
I think I've done what you suggested in patch 3 of the v2. For the
call here, yes, we could entirely avoid the scrubbing. For the other
free_domheap_pages() calls in stash_allocation() and
get_stashed_allocation() respectively we need to be more careful as
some pages will still be pending scrub.
Thanks, Roger.
On 25.03.2026 17:26, Roger Pau Monné wrote:
> On Wed, Mar 25, 2026 at 03:56:05PM +0100, Jan Beulich wrote:
>> On 25.03.2026 11:08, Roger Pau Monne wrote:
>>> --- a/xen/common/memory.c
>>> +++ b/xen/common/memory.c
>>> @@ -388,6 +388,12 @@ static void populate_physmap(struct memop_args *a)
>>> goto out;
>>> }
>>> }
>>> +
>>> + if ( assign_page(page, a->extent_order, d, memflags) )
>>> + {
>>> + free_domheap_pages(page, a->extent_order);
>>
>> The pages don't have an owner set yet, so that function will go straight
>> to free_heap_pages(), needlessly passing "true" as last argument. Correct,
>> but (for large pages, which the stashing is about) highly inefficient.
>
> My bad, I was sure I was using the same freeing function as
> alloc_domheap_pages() on failure to assign, but I clearly wasn't. I
> will switch to using free_heap_pages().
Well, not so much your bad, but likely a result of free_heap_pages()
being static in page_alloc.c.
Jan