[v4] xen/mm: limit in-place scrubbing

[PATCH v4 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages

Posted by Roger Pau Monne 1 week, 5 days ago

Physmap population has the need to use pages as big as possible to reduce
p2m shattering.  However that triggers issues when big enough pages are not
yet scrubbed, and so scrubbing must be done at allocation time.  On some
scenarios with added contention the watchdog can trigger:

Watchdog timer detects that CPU55 is stuck!
----[ Xen-4.17.5-21  x86_64  debug=n  Not tainted ]----
CPU:    55
RIP:    e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30
RFLAGS: 0000000000000202   CONTEXT: hypervisor (d0v12)
[...]
Xen call trace:
   [<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30
   [<ffff82d04022a121>] S clear_domain_page+0x11/0x20
   [<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0
   [<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180
   [<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0
   [<ffff82d040228ec8>] S do_memory_op+0x728/0x1970

Introduce a mechanism to preempt page scrubbing in populate_physmap().  It
relies on stashing the dirty page in the domain struct temporarily to
preempt to guest context, so the scrubbing can resume when the domain
re-enters the hypercall.  The added deferral mechanism will only be used for
domain construction, and is designed to be used with a single threaded
domain builder.  If the toolstack makes concurrent calls to
XENMEM_populate_physmap for the same target domain it will trash stashed
pages, resulting in slow domain physmap population.

Note a similar issue is present in increase reservation.  However that
hypercall is likely to only be used once the domain is already running and
the known implementations use 4K pages. It will be deal with in a separate
patch using a different approach, that will also take care of the
allocation in populate_physmap() once the domain is running.

Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v3:
 - Introduce helper to free stashed pages.
 - Attempt to free stashed pages from domain_unpause_by_systemcontroller()
   also.
 - Free stashed page in get_stashed_allocation() if it doesn't match the
   requested parameters.

Changes since v2:
 - Introduce FREE_DOMHEAP_PAGE{,S}().
 - Remove j local counter.
 - Free page pending scrub in domain_kill() also.
 - Remove BUG_ON().
 - Reorder get_stashed_allocation() flow.
 - s/dirty/unscrubbed/ in a printk message.

Changes since v1:
 - New in this version, different approach than v1.
---
 xen/common/domain.c     |  30 ++++++++++++
 xen/common/memory.c     | 101 +++++++++++++++++++++++++++++++++++++++-
 xen/common/page_alloc.c |   2 +-
 xen/include/xen/mm.h    |  10 ++++
 xen/include/xen/sched.h |   5 ++
 5 files changed, 146 insertions(+), 2 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 376351b528c9..123202f2c025 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -710,6 +710,23 @@ static int domain_teardown(struct domain *d)
     return 0;
 }
 
+/*
+ * Called multiple times during domain destruction, to attempt to early free
+ * any stashed pages to be scrubbed.  The call from _domain_destroy() is done
+ * when the toolstack can no longer stash any pages.
+ */
+static void domain_free_pending_scrub(struct domain *d)
+{
+    rspin_lock(&d->page_alloc_lock);
+    if ( d->pending_scrub )
+    {
+        FREE_DOMHEAP_PAGES(d->pending_scrub, d->pending_scrub_order);
+        d->pending_scrub_order = 0;
+        d->pending_scrub_index = 0;
+    }
+    rspin_unlock(&d->page_alloc_lock);
+}
+
 /*
  * Destroy a domain once all references to it have been dropped.  Used either
  * from the RCU path, or from the domain_create() error path before the domain
@@ -722,6 +739,8 @@ static void _domain_destroy(struct domain *d)
 
     XVFREE(d->console);
 
+    domain_free_pending_scrub(d);
+
     argo_destroy(d);
 
     rangeset_domain_destroy(d);
@@ -1286,6 +1305,8 @@ int domain_kill(struct domain *d)
         rspin_barrier(&d->domain_lock);
         argo_destroy(d);
         vnuma_destroy(d->vnuma);
+        domain_free_pending_scrub(d);
+        rspin_unlock(&d->page_alloc_lock);
         domain_set_outstanding_pages(d, 0);
         /* fallthrough */
     case DOMDYING_dying:
@@ -1678,6 +1699,15 @@ int domain_unpause_by_systemcontroller(struct domain *d)
      */
     if ( new == 0 && !d->creation_finished )
     {
+        if ( d->pending_scrub )
+        {
+            printk(XENLOG_ERR
+                   "%pd: cannot be started with pending unscrubbed pages, destroying\n",
+                   d);
+            domain_crash(d);
+            domain_free_pending_scrub(d);
+            return -EBUSY;
+        }
         d->creation_finished = true;
         arch_domain_creation_finished(d);
     }
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 10becf7c1f4c..1c48e99a6ab2 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -159,6 +159,70 @@ static void increase_reservation(struct memop_args *a)
     a->nr_done = i;
 }
 
+/*
+ * Temporary storage for a domain assigned page that's not been fully scrubbed.
+ * Stored pages must be domheap ones.
+ *
+ * The stashed page can be freed at any time by Xen, the caller must pass the
+ * order and NUMA node requirement to the fetch function to ensure the
+ * currently stashed page matches it's requirements.
+ */
+static void stash_allocation(struct domain *d, struct page_info *page,
+                             unsigned int order, unsigned int scrub_index)
+{
+    rspin_lock(&d->page_alloc_lock);
+
+    /*
+     * Drop the passed page in preference for the already stashed one.  This
+     * interface is designed to be used for single-threaded domain creation.
+     */
+    if ( d->pending_scrub )
+        free_domheap_pages(page, order);
+    else
+    {
+        d->pending_scrub_index = scrub_index;
+        d->pending_scrub_order = order;
+        d->pending_scrub = page;
+    }
+
+    rspin_unlock(&d->page_alloc_lock);
+}
+
+static struct page_info *get_stashed_allocation(struct domain *d,
+                                                unsigned int order,
+                                                nodeid_t node,
+                                                unsigned int *scrub_index)
+{
+    struct page_info *page = NULL;
+
+    rspin_lock(&d->page_alloc_lock);
+
+    /*
+     * If there's a pending page to scrub check if it satisfies the current
+     * request.  If it doesn't free it and return NULL.
+     */
+    if ( d->pending_scrub && d->pending_scrub_order == order &&
+         (node == NUMA_NO_NODE || node == page_to_nid(d->pending_scrub)) )
+    {
+        page = d->pending_scrub;
+        *scrub_index = d->pending_scrub_index;
+    }
+    else
+        free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
+
+    /*
+     * The caller now owns the page or it has been freed, clear stashed
+     * information.  Prevent concurrent usages of get_stashed_allocation()
+     * from returning the same page to different contexts.
+     */
+    d->pending_scrub_index = 0;
+    d->pending_scrub_order = 0;
+    d->pending_scrub = NULL;
+
+    rspin_unlock(&d->page_alloc_lock);
+    return page;
+}
+
 static void populate_physmap(struct memop_args *a)
 {
     struct page_info *page;
@@ -275,7 +339,18 @@ static void populate_physmap(struct memop_args *a)
             }
             else
             {
-                page = alloc_domheap_pages(d, a->extent_order, a->memflags);
+                unsigned int scrub_start = 0;
+                nodeid_t node =
+                    (a->memflags & MEMF_exact_node) ? MEMF_get_node(a->memflags)
+                                                    : NUMA_NO_NODE;
+
+                page = get_stashed_allocation(d, a->extent_order, node,
+                                              &scrub_start);
+
+                if ( !page )
+                    page = alloc_domheap_pages(d, a->extent_order,
+                        a->memflags | (d->creation_finished ? 0
+                                                            : MEMF_no_scrub));
 
                 if ( unlikely(!page) )
                 {
@@ -286,6 +361,30 @@ static void populate_physmap(struct memop_args *a)
                     goto out;
                 }
 
+                if ( !d->creation_finished )
+                {
+                    unsigned int dirty_cnt = 0;
+
+                    /* Check if there's anything to scrub. */
+                    for ( j = scrub_start; j < (1U << a->extent_order); j++ )
+                    {
+                        if ( !test_and_clear_bit(_PGC_need_scrub,
+                                                 &page[j].count_info) )
+                            continue;
+
+                        scrub_one_page(&page[j], true);
+
+                        if ( (j + 1) != (1U << a->extent_order) &&
+                             !(++dirty_cnt & 0xff) &&
+                             hypercall_preempt_check() )
+                        {
+                            a->preempted = 1;
+                            stash_allocation(d, page, a->extent_order, ++j);
+                            goto out;
+                        }
+                    }
+                }
+
                 if ( unlikely(a->memflags & MEMF_no_tlbflush) )
                 {
                     for ( j = 0; j < (1U << a->extent_order); j++ )
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index de1480316f05..c9e82fd7ab62 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -792,7 +792,7 @@ static void page_list_add_scrub(struct page_info *pg, unsigned int node,
 # define scrub_page_cold clear_page_cold
 #endif
 
-static void scrub_one_page(const struct page_info *pg, bool cold)
+void scrub_one_page(const struct page_info *pg, bool cold)
 {
     void *ptr;
 
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 426362adb2f4..d80bfba6d393 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -145,6 +145,16 @@ unsigned long avail_node_heap_pages(unsigned int nodeid);
 #define alloc_domheap_page(d,f) (alloc_domheap_pages(d,0,f))
 #define free_domheap_page(p)  (free_domheap_pages(p,0))
 
+/* Free an allocation, and zero the pointer to it. */
+#define FREE_DOMHEAP_PAGES(p, o) do { \
+    void *_ptr_ = (p);                \
+    (p) = NULL;                       \
+    free_domheap_pages(_ptr_, o);     \
+} while ( false )
+#define FREE_DOMHEAP_PAGE(p) FREE_DOMHEAP_PAGES(p, 0)
+
+void scrub_one_page(const struct page_info *pg, bool cold);
+
 int online_page(mfn_t mfn, uint32_t *status);
 int offline_page(mfn_t mfn, int broken, uint32_t *status);
 int query_page_offline(mfn_t mfn, uint32_t *status);
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 91d6a49daf16..735d5b76b411 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -661,6 +661,11 @@ struct domain
 
     /* Pointer to console settings; NULL for system domains. */
     struct domain_console *console;
+
+    /* Pointer to allocated domheap page that possibly needs scrubbing. */
+    struct page_info *pending_scrub;
+    unsigned int pending_scrub_order;
+    unsigned int pending_scrub_index;
 } __aligned(PAGE_SIZE);
 
 static inline struct page_list_head *page_to_list(
-- 
2.51.0

Re: [PATCH v4 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages

Posted by Jan Beulich 1 week, 4 days ago

On 28.01.2026 13:03, Roger Pau Monne wrote:
> @@ -275,7 +339,18 @@ static void populate_physmap(struct memop_args *a)
>              }
>              else
>              {
> -                page = alloc_domheap_pages(d, a->extent_order, a->memflags);
> +                unsigned int scrub_start = 0;
> +                nodeid_t node =
> +                    (a->memflags & MEMF_exact_node) ? MEMF_get_node(a->memflags)
> +                                                    : NUMA_NO_NODE;
> +
> +                page = get_stashed_allocation(d, a->extent_order, node,
> +                                              &scrub_start);
> +
> +                if ( !page )
> +                    page = alloc_domheap_pages(d, a->extent_order,
> +                        a->memflags | (d->creation_finished ? 0
> +                                                            : MEMF_no_scrub));

I fear there's a more basic issue here that so far we didn't pay attention to:
alloc_domheap_pages() is what invokes assign_page(), which in turn resets
->count_info for each of the pages. This reset includes setting PGC_allocated,
which ...

> @@ -286,6 +361,30 @@ static void populate_physmap(struct memop_args *a)
>                      goto out;
>                  }
>  
> +                if ( !d->creation_finished )
> +                {
> +                    unsigned int dirty_cnt = 0;
> +
> +                    /* Check if there's anything to scrub. */
> +                    for ( j = scrub_start; j < (1U << a->extent_order); j++ )
> +                    {
> +                        if ( !test_and_clear_bit(_PGC_need_scrub,
> +                                                 &page[j].count_info) )
> +                            continue;

... means we will now scrub every page in the block, not just those which weren't
scrubbed yet, and we end up clearing PGC_allocated. All because of PGC_need_scrub
aliasing PGC_allocated. I wonder how this didn't end up screwing any testing you
surely will have done. Or maybe I'm completely off here?

Jan

Re: [PATCH v4 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages

Posted by Roger Pau Monné 1 week, 4 days ago

On Wed, Jan 28, 2026 at 03:46:04PM +0100, Jan Beulich wrote:
> On 28.01.2026 13:03, Roger Pau Monne wrote:
> > @@ -275,7 +339,18 @@ static void populate_physmap(struct memop_args *a)
> >              }
> >              else
> >              {
> > -                page = alloc_domheap_pages(d, a->extent_order, a->memflags);
> > +                unsigned int scrub_start = 0;
> > +                nodeid_t node =
> > +                    (a->memflags & MEMF_exact_node) ? MEMF_get_node(a->memflags)
> > +                                                    : NUMA_NO_NODE;
> > +
> > +                page = get_stashed_allocation(d, a->extent_order, node,
> > +                                              &scrub_start);
> > +
> > +                if ( !page )
> > +                    page = alloc_domheap_pages(d, a->extent_order,
> > +                        a->memflags | (d->creation_finished ? 0
> > +                                                            : MEMF_no_scrub));
> 
> I fear there's a more basic issue here that so far we didn't pay attention to:
> alloc_domheap_pages() is what invokes assign_page(), which in turn resets
> ->count_info for each of the pages. This reset includes setting PGC_allocated,
> which ...
> 
> > @@ -286,6 +361,30 @@ static void populate_physmap(struct memop_args *a)
> >                      goto out;
> >                  }
> >  
> > +                if ( !d->creation_finished )
> > +                {
> > +                    unsigned int dirty_cnt = 0;
> > +
> > +                    /* Check if there's anything to scrub. */
> > +                    for ( j = scrub_start; j < (1U << a->extent_order); j++ )
> > +                    {
> > +                        if ( !test_and_clear_bit(_PGC_need_scrub,
> > +                                                 &page[j].count_info) )
> > +                            continue;
> 
> ... means we will now scrub every page in the block, not just those which weren't
> scrubbed yet, and we end up clearing PGC_allocated. All because of PGC_need_scrub
> aliasing PGC_allocated. I wonder how this didn't end up screwing any testing you
> surely will have done. Or maybe I'm completely off here?

Thanks for spotting this!  No, I didn't see any issues.  I don't see
any check for PGC_allocated in free_domheap_pages(), which could
explain the lack of failures?

I will have to allocate with MEMF_no_owner and then do the
assign_pages() call from populate_physmap() after the scrubbing is
done.  Maybe that would work.  Memory allocated using MEMF_no_owner
still consumes the claim pool if a domain parameter is passed to
alloc_heap_pages().

Roger.

Re: [PATCH v4 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages

Posted by Jan Beulich 1 week, 4 days ago

On 28.01.2026 20:06, Roger Pau Monné wrote:
> On Wed, Jan 28, 2026 at 03:46:04PM +0100, Jan Beulich wrote:
>> On 28.01.2026 13:03, Roger Pau Monne wrote:
>>> @@ -275,7 +339,18 @@ static void populate_physmap(struct memop_args *a)
>>>              }
>>>              else
>>>              {
>>> -                page = alloc_domheap_pages(d, a->extent_order, a->memflags);
>>> +                unsigned int scrub_start = 0;
>>> +                nodeid_t node =
>>> +                    (a->memflags & MEMF_exact_node) ? MEMF_get_node(a->memflags)
>>> +                                                    : NUMA_NO_NODE;
>>> +
>>> +                page = get_stashed_allocation(d, a->extent_order, node,
>>> +                                              &scrub_start);
>>> +
>>> +                if ( !page )
>>> +                    page = alloc_domheap_pages(d, a->extent_order,
>>> +                        a->memflags | (d->creation_finished ? 0
>>> +                                                            : MEMF_no_scrub));
>>
>> I fear there's a more basic issue here that so far we didn't pay attention to:
>> alloc_domheap_pages() is what invokes assign_page(), which in turn resets
>> ->count_info for each of the pages. This reset includes setting PGC_allocated,
>> which ...
>>
>>> @@ -286,6 +361,30 @@ static void populate_physmap(struct memop_args *a)
>>>                      goto out;
>>>                  }
>>>  
>>> +                if ( !d->creation_finished )
>>> +                {
>>> +                    unsigned int dirty_cnt = 0;
>>> +
>>> +                    /* Check if there's anything to scrub. */
>>> +                    for ( j = scrub_start; j < (1U << a->extent_order); j++ )
>>> +                    {
>>> +                        if ( !test_and_clear_bit(_PGC_need_scrub,
>>> +                                                 &page[j].count_info) )
>>> +                            continue;
>>
>> ... means we will now scrub every page in the block, not just those which weren't
>> scrubbed yet, and we end up clearing PGC_allocated. All because of PGC_need_scrub
>> aliasing PGC_allocated. I wonder how this didn't end up screwing any testing you
>> surely will have done. Or maybe I'm completely off here?
> 
> Thanks for spotting this!  No, I didn't see any issues.  I don't see
> any check for PGC_allocated in free_domheap_pages(), which could
> explain the lack of failures?

Maybe. PGC_allocated consumes a page ref, so I would have expected accounting
issues.

> I will have to allocate with MEMF_no_owner and then do the
> assign_pages() call from populate_physmap() after the scrubbing is
> done.  Maybe that would work.  Memory allocated using MEMF_no_owner
> still consumes the claim pool if a domain parameter is passed to
> alloc_heap_pages().

Technically this looks like it might work, but it's feeling as if this was
getting increasingly fragile. I'm also not quite sure whether MEMF_no_owner
allocations should consume claimed pages. Imo there are arguments both in
favor and against such behavior.

We may want to explore the alternative of un-aliasing the two PGC_*.

Jan

Re: [PATCH v4 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages

Posted by Roger Pau Monné 1 week, 4 days ago

On Thu, Jan 29, 2026 at 08:53:05AM +0100, Jan Beulich wrote:
> On 28.01.2026 20:06, Roger Pau Monné wrote:
> > On Wed, Jan 28, 2026 at 03:46:04PM +0100, Jan Beulich wrote:
> >> On 28.01.2026 13:03, Roger Pau Monne wrote:
> >>> @@ -275,7 +339,18 @@ static void populate_physmap(struct memop_args *a)
> >>>              }
> >>>              else
> >>>              {
> >>> -                page = alloc_domheap_pages(d, a->extent_order, a->memflags);
> >>> +                unsigned int scrub_start = 0;
> >>> +                nodeid_t node =
> >>> +                    (a->memflags & MEMF_exact_node) ? MEMF_get_node(a->memflags)
> >>> +                                                    : NUMA_NO_NODE;
> >>> +
> >>> +                page = get_stashed_allocation(d, a->extent_order, node,
> >>> +                                              &scrub_start);
> >>> +
> >>> +                if ( !page )
> >>> +                    page = alloc_domheap_pages(d, a->extent_order,
> >>> +                        a->memflags | (d->creation_finished ? 0
> >>> +                                                            : MEMF_no_scrub));
> >>
> >> I fear there's a more basic issue here that so far we didn't pay attention to:
> >> alloc_domheap_pages() is what invokes assign_page(), which in turn resets
> >> ->count_info for each of the pages. This reset includes setting PGC_allocated,
> >> which ...
> >>
> >>> @@ -286,6 +361,30 @@ static void populate_physmap(struct memop_args *a)
> >>>                      goto out;
> >>>                  }
> >>>  
> >>> +                if ( !d->creation_finished )
> >>> +                {
> >>> +                    unsigned int dirty_cnt = 0;
> >>> +
> >>> +                    /* Check if there's anything to scrub. */
> >>> +                    for ( j = scrub_start; j < (1U << a->extent_order); j++ )
> >>> +                    {
> >>> +                        if ( !test_and_clear_bit(_PGC_need_scrub,
> >>> +                                                 &page[j].count_info) )
> >>> +                            continue;
> >>
> >> ... means we will now scrub every page in the block, not just those which weren't
> >> scrubbed yet, and we end up clearing PGC_allocated. All because of PGC_need_scrub
> >> aliasing PGC_allocated. I wonder how this didn't end up screwing any testing you
> >> surely will have done. Or maybe I'm completely off here?
> > 
> > Thanks for spotting this!  No, I didn't see any issues.  I don't see
> > any check for PGC_allocated in free_domheap_pages(), which could
> > explain the lack of failures?
> 
> Maybe. PGC_allocated consumes a page ref, so I would have expected accounting
> issues.
> 
> > I will have to allocate with MEMF_no_owner and then do the
> > assign_pages() call from populate_physmap() after the scrubbing is
> > done.  Maybe that would work.  Memory allocated using MEMF_no_owner
> > still consumes the claim pool if a domain parameter is passed to
> > alloc_heap_pages().
> 
> Technically this looks like it might work, but it's feeling as if this was
> getting increasingly fragile. I'm also not quite sure whether MEMF_no_owner
> allocations should consume claimed pages. Imo there are arguments both in
> favor and against such behavior.
> 
> We may want to explore the alternative of un-aliasing the two PGC_*.

I expected the PGC_ bits would be all consumed, but I see there's a
range that are still empty, so it might indeed be easier to remove the
alias.  Let me give that a try.

Thanks, Roger.

Re: [PATCH v4 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages

Posted by Andrew Cooper 1 week, 5 days ago

On 28/01/2026 12:03 pm, Roger Pau Monne wrote:
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 376351b528c9..123202f2c025 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -710,6 +710,23 @@ static int domain_teardown(struct domain *d)
>      return 0;
>  }
>  
> +/*
> + * Called multiple times during domain destruction, to attempt to early free
> + * any stashed pages to be scrubbed.  The call from _domain_destroy() is done
> + * when the toolstack can no longer stash any pages.
> + */
> +static void domain_free_pending_scrub(struct domain *d)
> +{
> +    rspin_lock(&d->page_alloc_lock);
> +    if ( d->pending_scrub )
> +    {
> +        FREE_DOMHEAP_PAGES(d->pending_scrub, d->pending_scrub_order);
> +        d->pending_scrub_order = 0;
> +        d->pending_scrub_index = 0;
> +    }
> +    rspin_unlock(&d->page_alloc_lock);
> +}
> +
>  /*
>   * Destroy a domain once all references to it have been dropped.  Used either
>   * from the RCU path, or from the domain_create() error path before the domain
> @@ -722,6 +739,8 @@ static void _domain_destroy(struct domain *d)
>  
>      XVFREE(d->console);
>  
> +    domain_free_pending_scrub(d);
> +
>      argo_destroy(d);
>  
>      rangeset_domain_destroy(d);
> @@ -1286,6 +1305,8 @@ int domain_kill(struct domain *d)
>          rspin_barrier(&d->domain_lock);
>          argo_destroy(d);
>          vnuma_destroy(d->vnuma);
> +        domain_free_pending_scrub(d);
> +        rspin_unlock(&d->page_alloc_lock);

This is a double unlock, isn't it?


The freeing wants to be in domain_kill() (ish), or _domain_destroy() but
not both.

In this case we can't have anything using pending scrubbing until the
domain is the domlist (i.e. can be the target of other hypercalls), so
this wants to be in domain_relinquish_resources()  (rather than
domain_kill() which we're trying to empty).

In principle we could assert that it's already NULL in _domain_destroy()
which might help catch an error if it gets set early but the domain
destroyed before getting into the domlist, but that seems like a rather
slim case.

~Andrew

Re: [PATCH v4 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages

Posted by Roger Pau Monné 1 week, 4 days ago

On Wed, Jan 28, 2026 at 12:44:26PM +0000, Andrew Cooper wrote:
> On 28/01/2026 12:03 pm, Roger Pau Monne wrote:
> > diff --git a/xen/common/domain.c b/xen/common/domain.c
> > index 376351b528c9..123202f2c025 100644
> > --- a/xen/common/domain.c
> > +++ b/xen/common/domain.c
> > @@ -710,6 +710,23 @@ static int domain_teardown(struct domain *d)
> >      return 0;
> >  }
> >  
> > +/*
> > + * Called multiple times during domain destruction, to attempt to early free
> > + * any stashed pages to be scrubbed.  The call from _domain_destroy() is done
> > + * when the toolstack can no longer stash any pages.
> > + */
> > +static void domain_free_pending_scrub(struct domain *d)
> > +{
> > +    rspin_lock(&d->page_alloc_lock);
> > +    if ( d->pending_scrub )
> > +    {
> > +        FREE_DOMHEAP_PAGES(d->pending_scrub, d->pending_scrub_order);
> > +        d->pending_scrub_order = 0;
> > +        d->pending_scrub_index = 0;
> > +    }
> > +    rspin_unlock(&d->page_alloc_lock);
> > +}
> > +
> >  /*
> >   * Destroy a domain once all references to it have been dropped.  Used either
> >   * from the RCU path, or from the domain_create() error path before the domain
> > @@ -722,6 +739,8 @@ static void _domain_destroy(struct domain *d)
> >  
> >      XVFREE(d->console);
> >  
> > +    domain_free_pending_scrub(d);
> > +
> >      argo_destroy(d);
> >  
> >      rangeset_domain_destroy(d);
> > @@ -1286,6 +1305,8 @@ int domain_kill(struct domain *d)
> >          rspin_barrier(&d->domain_lock);
> >          argo_destroy(d);
> >          vnuma_destroy(d->vnuma);
> > +        domain_free_pending_scrub(d);
> > +        rspin_unlock(&d->page_alloc_lock);
> 
> This is a double unlock, isn't it?

Bah, this was a leftover from the previous version, sorry.

> 
> The freeing wants to be in domain_kill() (ish), or _domain_destroy() but
> not both.

Jan specifically asked for the cleanup to be in both.
_domain_destroy() is the must have one, as that's done once the domain
is unhooked from the domain list and hence can no longer be the target
of populate physmap hypercalls.  Jan asked for the domain_kill()
instance to attempt to free the pending page as early as possible.

> In this case we can't have anything using pending scrubbing until the
> domain is the domlist (i.e. can be the target of other hypercalls), so
> this wants to be in domain_relinquish_resources()  (rather than
> domain_kill() which we're trying to empty).

domain_relinquish_resources() is arch-specific, while the
pending_scrub stuff is common with all arches.  It seemed better
placed in domain_kill() because that's generic code shared by all
arches.

I could place it in domain_teardown() instead if that's better than
domain_kill().

> In principle we could assert that it's already NULL in _domain_destroy()
> which might help catch an error if it gets set early but the domain
> destroyed before getting into the domlist, but that seems like a rather
> slim case.

Given my understanding of the logic in the XENMEM_ hypercalls, I think
toolstack can still target domains in the process of being destroyed,
at which point we need to keep a final cleanup instance
_domain_destroy(), or otherwise adjust XENMEM_populate_physmap
hypercall (and others?) so they can't target dying domains.

Thanks, Roger.

Re: [PATCH v4 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages

Posted by Jan Beulich 1 week, 4 days ago

On 28.01.2026 14:33, Roger Pau Monné wrote:
> On Wed, Jan 28, 2026 at 12:44:26PM +0000, Andrew Cooper wrote:
>> In principle we could assert that it's already NULL in _domain_destroy()
>> which might help catch an error if it gets set early but the domain
>> destroyed before getting into the domlist, but that seems like a rather
>> slim case.
> 
> Given my understanding of the logic in the XENMEM_ hypercalls, I think
> toolstack can still target domains in the process of being destroyed,
> at which point we need to keep a final cleanup instance
> _domain_destroy(), or otherwise adjust XENMEM_populate_physmap
> hypercall (and others?) so they can't target dying domains.

Considering that these requests will fail for dying domains because of the
check in assign_pages(), it may indeed make sense to have another earlier
check for the purposes here. Otoh doing this early may not buy us very
much, as the domain may become "dying" immediately after the check. Whereas
switching stash_allocation()'s if() to

    if ( d->pending_scrub || d->is_dying )

looks like it might do.

Jan

Re: [PATCH v4 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages

Posted by Roger Pau Monné 1 week, 3 days ago

On Wed, Jan 28, 2026 at 03:34:11PM +0100, Jan Beulich wrote:
> On 28.01.2026 14:33, Roger Pau Monné wrote:
> > On Wed, Jan 28, 2026 at 12:44:26PM +0000, Andrew Cooper wrote:
> >> In principle we could assert that it's already NULL in _domain_destroy()
> >> which might help catch an error if it gets set early but the domain
> >> destroyed before getting into the domlist, but that seems like a rather
> >> slim case.
> > 
> > Given my understanding of the logic in the XENMEM_ hypercalls, I think
> > toolstack can still target domains in the process of being destroyed,
> > at which point we need to keep a final cleanup instance
> > _domain_destroy(), or otherwise adjust XENMEM_populate_physmap
> > hypercall (and others?) so they can't target dying domains.
> 
> Considering that these requests will fail for dying domains because of the
> check in assign_pages(), it may indeed make sense to have another earlier
> check for the purposes here. Otoh doing this early may not buy us very
> much, as the domain may become "dying" immediately after the check. Whereas
> switching stash_allocation()'s if() to
> 
>     if ( d->pending_scrub || d->is_dying )
> 
> looks like it might do.

Oh, I didn't notice the check in assign_pages().

I think a check in stash_allocation() together with a call to
domain_pending_scrub_free() in domain_teardown() (which is after
setting d->is_dying) should be enough to ensure the field is clear
when reaching _domain_destroy().  The call to
domain_pending_scrub_free() in _domain_destroy() can be replaced with
an ASSERT(!d->pending_scrub) then.

Thanks, Roger.

[PATCH v4 1/3] xen/mm: enforce SCRUB_DEBUG checks for MEMF_no_scrub allocations
[PATCH v4 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages
[PATCH v4 3/3] xen/mm: limit non-scrubbed allocations to a specific order