Physmap population needs to use pages as big as possible in order to reduce
p2m shattering. However, that triggers issues when big enough pages haven't
been scrubbed yet, and so scrubbing must be done at allocation time. In some
scenarios with added contention the watchdog can trigger:
Watchdog timer detects that CPU55 is stuck!
----[ Xen-4.17.5-21 x86_64 debug=n Not tainted ]----
CPU: 55
RIP: e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30
RFLAGS: 0000000000000202 CONTEXT: hypervisor (d0v12)
[...]
Xen call trace:
[<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30
[<ffff82d04022a121>] S clear_domain_page+0x11/0x20
[<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0
[<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180
[<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0
[<ffff82d040228ec8>] S do_memory_op+0x728/0x1970
Introduce a mechanism to preempt page scrubbing in populate_physmap(). It
relies on temporarily stashing the dirty page in the domain struct before
preempting to guest context, so that scrubbing can resume when the domain
re-enters the hypercall. The added deferral mechanism will only be used
during domain construction, and is designed to be used with a single-threaded
domain builder. If the toolstack makes concurrent calls to
XENMEM_populate_physmap for the same target domain it will trash the stashed
pages, resulting in slow domain physmap population.
Note that a similar issue is present in increase_reservation(). However,
that hypercall is likely to only be used once the domain is already running,
and the known implementations use 4K pages. It will be dealt with in a
separate patch using a different approach, which will also take care of the
allocation in populate_physmap() once the domain is running.
Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
- New in this version, different approach than v1.
---
xen/common/domain.c | 17 +++++++
xen/common/memory.c | 105 +++++++++++++++++++++++++++++++++++++++-
xen/common/page_alloc.c | 2 +-
xen/include/xen/mm.h | 1 +
xen/include/xen/sched.h | 5 ++
5 files changed, 128 insertions(+), 2 deletions(-)
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 376351b528c9..5bbbc7e1aada 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -722,6 +722,15 @@ static void _domain_destroy(struct domain *d)
XVFREE(d->console);
+ if ( d->pending_scrub )
+ {
+ BUG_ON(d->creation_finished);
+ free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
+ d->pending_scrub = NULL;
+ d->pending_scrub_order = 0;
+ d->pending_scrub_index = 0;
+ }
+
argo_destroy(d);
rangeset_domain_destroy(d);
@@ -1678,6 +1687,14 @@ int domain_unpause_by_systemcontroller(struct domain *d)
*/
if ( new == 0 && !d->creation_finished )
{
+ if ( d->pending_scrub )
+ {
+ printk(XENLOG_ERR
+ "%pd: cannot be started with pending dirty pages, destroying\n",
+ d);
+ domain_crash(d);
+ return -EBUSY;
+ }
d->creation_finished = true;
arch_domain_creation_finished(d);
}
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 10becf7c1f4c..4ad2cc6428d5 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -159,6 +159,74 @@ static void increase_reservation(struct memop_args *a)
a->nr_done = i;
}
+/*
+ * Temporary storage for a domain assigned page that's not been fully scrubbed.
+ * Stored pages must be domheap ones.
+ *
+ * The stashed page can be freed at any time by Xen, the caller must pass the
+ * order and NUMA node requirement to the fetch function to ensure the
+ * currently stashed page matches its requirements.
+ */
+static void stash_allocation(struct domain *d, struct page_info *page,
+ unsigned int order, unsigned int scrub_index)
+{
+ BUG_ON(d->creation_finished);
+
+ rspin_lock(&d->page_alloc_lock);
+
+ /*
+ * Drop any stashed allocation to accommodate the current one. This
+ * interface is designed to be used for single-threaded domain creation.
+ */
+ if ( d->pending_scrub )
+ free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
+
+ d->pending_scrub_index = scrub_index;
+ d->pending_scrub_order = order;
+ d->pending_scrub = page;
+
+ rspin_unlock(&d->page_alloc_lock);
+}
+
+static struct page_info *get_stashed_allocation(struct domain *d,
+ unsigned int order,
+ nodeid_t node,
+ unsigned int *scrub_index)
+{
+ struct page_info *page = NULL;
+
+ BUG_ON(d->creation_finished && d->pending_scrub);
+
+ rspin_lock(&d->page_alloc_lock);
+
+ /*
+ * If there's a pending page to scrub check it satisfies the current
+ * request. If it doesn't keep it stashed and return NULL.
+ */
+ if ( !d->pending_scrub || d->pending_scrub_order != order ||
+ (node != NUMA_NO_NODE && node != page_to_nid(d->pending_scrub)) )
+ goto done;
+ else
+ {
+ page = d->pending_scrub;
+ *scrub_index = d->pending_scrub_index;
+ }
+
+ /*
+ * The caller now owns the page, clear stashed information. Prevent
+ * concurrent usages of get_stashed_allocation() from returning the same
+ * page to different contexts.
+ */
+ d->pending_scrub_index = 0;
+ d->pending_scrub_order = 0;
+ d->pending_scrub = NULL;
+
+ done:
+ rspin_unlock(&d->page_alloc_lock);
+
+ return page;
+}
+
static void populate_physmap(struct memop_args *a)
{
struct page_info *page;
@@ -275,7 +343,18 @@ static void populate_physmap(struct memop_args *a)
}
else
{
- page = alloc_domheap_pages(d, a->extent_order, a->memflags);
+ unsigned int scrub_start = 0;
+ nodeid_t node =
+ (a->memflags & MEMF_exact_node) ? MEMF_get_node(a->memflags)
+ : NUMA_NO_NODE;
+
+ page = get_stashed_allocation(d, a->extent_order, node,
+ &scrub_start);
+
+ if ( !page )
+ page = alloc_domheap_pages(d, a->extent_order,
+ a->memflags | (d->creation_finished ? 0
+ : MEMF_no_scrub));
if ( unlikely(!page) )
{
@@ -286,6 +365,30 @@ static void populate_physmap(struct memop_args *a)
goto out;
}
+ if ( !d->creation_finished )
+ {
+ unsigned int dirty_cnt = 0, j;
+
+ /* Check if there's anything to scrub. */
+ for ( j = scrub_start; j < (1U << a->extent_order); j++ )
+ {
+ if ( !test_and_clear_bit(_PGC_need_scrub,
+ &page[j].count_info) )
+ continue;
+
+ scrub_one_page(&page[j], true);
+
+ if ( (j + 1) != (1U << a->extent_order) &&
+ !(++dirty_cnt & 0xff) &&
+ hypercall_preempt_check() )
+ {
+ a->preempted = 1;
+ stash_allocation(d, page, a->extent_order, ++j);
+ goto out;
+ }
+ }
+ }
+
if ( unlikely(a->memflags & MEMF_no_tlbflush) )
{
for ( j = 0; j < (1U << a->extent_order); j++ )
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index de1480316f05..c9e82fd7ab62 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -792,7 +792,7 @@ static void page_list_add_scrub(struct page_info *pg, unsigned int node,
# define scrub_page_cold clear_page_cold
#endif
-static void scrub_one_page(const struct page_info *pg, bool cold)
+void scrub_one_page(const struct page_info *pg, bool cold)
{
void *ptr;
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 426362adb2f4..f249c52cb7d8 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -144,6 +144,7 @@ unsigned long avail_domheap_pages_region(
unsigned long avail_node_heap_pages(unsigned int nodeid);
#define alloc_domheap_page(d,f) (alloc_domheap_pages(d,0,f))
#define free_domheap_page(p) (free_domheap_pages(p,0))
+void scrub_one_page(const struct page_info *pg, bool cold);
int online_page(mfn_t mfn, uint32_t *status);
int offline_page(mfn_t mfn, int broken, uint32_t *status);
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 91d6a49daf16..735d5b76b411 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -661,6 +661,11 @@ struct domain
/* Pointer to console settings; NULL for system domains. */
struct domain_console *console;
+
+ /* Pointer to allocated domheap page that possibly needs scrubbing. */
+ struct page_info *pending_scrub;
+ unsigned int pending_scrub_order;
+ unsigned int pending_scrub_index;
} __aligned(PAGE_SIZE);
static inline struct page_list_head *page_to_list(
--
2.51.0
On 15.01.2026 12:18, Roger Pau Monne wrote:
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -722,6 +722,15 @@ static void _domain_destroy(struct domain *d)
>
> XVFREE(d->console);
>
> + if ( d->pending_scrub )
> + {
> + BUG_ON(d->creation_finished);
> + free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
> + d->pending_scrub = NULL;
> + d->pending_scrub_order = 0;
> + d->pending_scrub_index = 0;
> + }
Because of the other zeroing wanted (it's not strictly needed, is it?),
it may be a little awkward to use FREE_DOMHEAP_PAGES() here. Yet I would
still have recommended to avoid its open-coding, if only we had such a
wrapper already.
Would this better be done earlier, in domain_kill(), to avoid needlessly
holding back memory that isn't going to be used by this domain anymore?
Would require the spinlock be acquired to guard against a racing
stash_allocation(), I suppose. In fact freeing right in
domain_unpause_by_systemcontroller() might be yet better (albeit without
eliminating the need to do it here or in domain_kill()).
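Just to illustrate the locking aspect, a hypothetical helper (name, placement
and exact shape entirely up for grabs, untested) could then be shared by the
various call sites:

/*
 * Hypothetical helper: drop a stashed, not-yet-scrubbed allocation while
 * holding the same lock stash_allocation() takes, so it can't race with a
 * concurrent stash.
 */
static void free_pending_scrub(struct domain *d)
{
    rspin_lock(&d->page_alloc_lock);

    if ( d->pending_scrub )
    {
        free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
        d->pending_scrub = NULL;
        d->pending_scrub_order = 0;
        d->pending_scrub_index = 0;
    }

    rspin_unlock(&d->page_alloc_lock);
}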
> @@ -1678,6 +1687,14 @@ int domain_unpause_by_systemcontroller(struct domain *d)
> */
> if ( new == 0 && !d->creation_finished )
> {
> + if ( d->pending_scrub )
> + {
> + printk(XENLOG_ERR
> + "%pd: cannot be started with pending dirty pages, destroying\n",
s/dirty/unscrubbed/ to avoid ambiguity with "dirty" as in "needing writeback"?
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -159,6 +159,74 @@ static void increase_reservation(struct memop_args *a)
> a->nr_done = i;
> }
>
> +/*
> + * Temporary storage for a domain assigned page that's not been fully scrubbed.
> + * Stored pages must be domheap ones.
> + *
> + * The stashed page can be freed at any time by Xen, the caller must pass the
> + * order and NUMA node requirement to the fetch function to ensure the
> + * currently stashed page matches its requirements.
> + */
> +static void stash_allocation(struct domain *d, struct page_info *page,
> + unsigned int order, unsigned int scrub_index)
> +{
> + BUG_ON(d->creation_finished);
Is this valid here and ...
> + rspin_lock(&d->page_alloc_lock);
> +
> + /*
> + * Drop any stashed allocation to accommodate the current one. This
> + * interface is designed to be used for single-threaded domain creation.
> + */
> + if ( d->pending_scrub )
> + free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
> +
> + d->pending_scrub_index = scrub_index;
> + d->pending_scrub_order = order;
> + d->pending_scrub = page;
> +
> + rspin_unlock(&d->page_alloc_lock);
> +}
> +
> +static struct page_info *get_stashed_allocation(struct domain *d,
> + unsigned int order,
> + nodeid_t node,
> + unsigned int *scrub_index)
> +{
> + struct page_info *page = NULL;
> +
> + BUG_ON(d->creation_finished && d->pending_scrub);
... here? A badly behaved toolstack could do a populate in parallel with
the initial unpause, couldn't it?
> + rspin_lock(&d->page_alloc_lock);
> +
> + /*
> + * If there's a pending page to scrub check it satisfies the current
> + * request. If it doesn't keep it stashed and return NULL.
> + */
> + if ( !d->pending_scrub || d->pending_scrub_order != order ||
> + (node != NUMA_NO_NODE && node != page_to_nid(d->pending_scrub)) )
Ah, and MEMF_exact_node is handled in the caller.
> + goto done;
> + else
> + {
> + page = d->pending_scrub;
> + *scrub_index = d->pending_scrub_index;
> + }
> +
> + /*
> + * The caller now owns the page, clear stashed information. Prevent
> + * concurrent usages of get_stashed_allocation() from returning the same
> + * page to different contexts.
> + */
> + d->pending_scrub_index = 0;
> + d->pending_scrub_order = 0;
> + d->pending_scrub = NULL;
> +
> + done:
> + rspin_unlock(&d->page_alloc_lock);
> +
> + return page;
> +}
Hmm, you free the earlier allocation only in stash_allocation(), i.e. that
memory isn't available to fulfill the present request. (I do understand
that the freeing there can't be dropped, to deal with possible races
caused by the toolstack.)
The use of "goto" here also looks a little odd, as it would be easy to get
away without. Or else I'd like to ask that the "else" be dropped.
> @@ -286,6 +365,30 @@ static void populate_physmap(struct memop_args *a)
> goto out;
> }
>
> + if ( !d->creation_finished )
> + {
> + unsigned int dirty_cnt = 0, j;
Declaring (another) j here is going to upset Eclair, I fear, as ...
> + /* Check if there's anything to scrub. */
> + for ( j = scrub_start; j < (1U << a->extent_order); j++ )
> + {
> + if ( !test_and_clear_bit(_PGC_need_scrub,
> + &page[j].count_info) )
> + continue;
> +
> + scrub_one_page(&page[j], true);
> +
> + if ( (j + 1) != (1U << a->extent_order) &&
> + !(++dirty_cnt & 0xff) &&
> + hypercall_preempt_check() )
> + {
> + a->preempted = 1;
> + stash_allocation(d, page, a->extent_order, ++j);
Better j + 1, as j's value isn't supposed to be used any further?
> + goto out;
> + }
> + }
> + }
> +
> if ( unlikely(a->memflags & MEMF_no_tlbflush) )
> {
> for ( j = 0; j < (1U << a->extent_order); j++ )
... for this to work there must already be one available from an outer scope.
Jan
On Mon, Jan 19, 2026 at 02:00:49PM +0100, Jan Beulich wrote:
> On 15.01.2026 12:18, Roger Pau Monne wrote:
> > --- a/xen/common/domain.c
> > +++ b/xen/common/domain.c
> > @@ -722,6 +722,15 @@ static void _domain_destroy(struct domain *d)
> >
> > XVFREE(d->console);
> >
> > + if ( d->pending_scrub )
> > + {
> > + BUG_ON(d->creation_finished);
> > + free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
> > + d->pending_scrub = NULL;
> > + d->pending_scrub_order = 0;
> > + d->pending_scrub_index = 0;
> > + }
>
> Because of the other zeroing wanted (it's not strictly needed, is it?),
> it may be a little awkward to use FREE_DOMHEAP_PAGES() here. Yet I would
> still have recommended to avoid its open-coding, if only we had such a
> wrapper already.
I don't mind introducing a FREE_DOMHEAP_PAGES() wrapper in this same
patch, if you are OK with it.
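Something along the lines of the existing FREE_XENHEAP_PAGES() pattern, I'd
assume (rough, untested sketch; exact shape to be settled):

/* Free a domheap allocation and zero the pointer to it. */
#define FREE_DOMHEAP_PAGES(p, o) do { \
    struct page_info *pg_ = (p);      \
    (p) = NULL;                       \
    free_domheap_pages(pg_, o);       \
} while ( false )
#define FREE_DOMHEAP_PAGE(p) FREE_DOMHEAP_PAGES(p, 0)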
> Would this better be done earlier, in domain_kill(), to avoid needlessly
> holding back memory that isn't going to be used by this domain anymore?
> Would require the spinlock be acquired to guard against a racing
> stash_allocation(), I suppose. In fact freeing right in
> domain_unpause_by_systemcontroller() might be yet better (albeit without
> eliminating the need to do it here or in domain_kill()).
Even with a lock taken, moving to domain_kill() would be racy. A rogue
toolstack could keep trying to issue populate_physmap hypercalls which
would fail in the assign_pages() call, but it could still leave
pending pages in d->pending_scrub, as the assign_pages() call happens
strictly after the scrubbing is done.
> > @@ -1678,6 +1687,14 @@ int domain_unpause_by_systemcontroller(struct domain *d)
> > */
> > if ( new == 0 && !d->creation_finished )
> > {
> > + if ( d->pending_scrub )
> > + {
> > + printk(XENLOG_ERR
> > + "%pd: cannot be started with pending dirty pages, destroying\n",
>
> s/dirty/unscrubbed/ to avoid ambiguity with "dirty" as in "needing writeback"?
>
> > --- a/xen/common/memory.c
> > +++ b/xen/common/memory.c
> > @@ -159,6 +159,74 @@ static void increase_reservation(struct memop_args *a)
> > a->nr_done = i;
> > }
> >
> > +/*
> > + * Temporary storage for a domain assigned page that's not been fully scrubbed.
> > + * Stored pages must be domheap ones.
> > + *
> > + * The stashed page can be freed at any time by Xen, the caller must pass the
> > + * order and NUMA node requirement to the fetch function to ensure the
> > + * currently stashed page matches its requirements.
> > + */
> > +static void stash_allocation(struct domain *d, struct page_info *page,
> > + unsigned int order, unsigned int scrub_index)
> > +{
> > + BUG_ON(d->creation_finished);
>
> Is this valid here and ...
>
> > + rspin_lock(&d->page_alloc_lock);
> > +
> > + /*
> > + * Drop any stashed allocation to accommodate the current one. This
> > + * interface is designed to be used for single-threaded domain creation.
> > + */
> > + if ( d->pending_scrub )
> > + free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
> > +
> > + d->pending_scrub_index = scrub_index;
> > + d->pending_scrub_order = order;
> > + d->pending_scrub = page;
> > +
> > + rspin_unlock(&d->page_alloc_lock);
> > +}
> > +
> > +static struct page_info *get_stashed_allocation(struct domain *d,
> > + unsigned int order,
> > + nodeid_t node,
> > + unsigned int *scrub_index)
> > +{
> > + struct page_info *page = NULL;
> > +
> > + BUG_ON(d->creation_finished && d->pending_scrub);
>
> ... here? A badly behaved toolstack could do a populate in parallel with
> the initial unpause, couldn't it?
Oh, I think I forgot to refresh before sending the patch. I have
those as ASSERTs in my local copy anyway. But yes, you are right,
populate_physmap() is not done while holding the domctl lock, so it
can race with an unpause. I will remove the ASSERTs.
> > + rspin_lock(&d->page_alloc_lock);
> > +
> > + /*
> > + * If there's a pending page to scrub check it satisfies the current
> > + * request. If it doesn't keep it stashed and return NULL.
> > + */
> > + if ( !d->pending_scrub || d->pending_scrub_order != order ||
> > + (node != NUMA_NO_NODE && node != page_to_nid(d->pending_scrub)) )
>
> Ah, and MEMF_exact_node is handled in the caller.
>
> > + goto done;
> > + else
> > + {
> > + page = d->pending_scrub;
> > + *scrub_index = d->pending_scrub_index;
> > + }
> > +
> > + /*
> > + * The caller now owns the page, clear stashed information. Prevent
> > + * concurrent usages of get_stashed_allocation() from returning the same
> > + * page to different contexts.
> > + */
> > + d->pending_scrub_index = 0;
> > + d->pending_scrub_order = 0;
> > + d->pending_scrub = NULL;
> > +
> > + done:
> > + rspin_unlock(&d->page_alloc_lock);
> > +
> > + return page;
> > +}
>
> Hmm, you free the earlier allocation only in stash_allocation(), i.e. that
> memory isn't available to fulfill the present request. (I do understand
> that the freeing there can't be dropped, to deal with possible races
> caused by the toolstack.)
Since we expect populate_physmap() to be executed sequentially by the
toolstack, I would argue it's fine to hold onto that memory. Otherwise
I could possibly free it in get_stashed_allocation() when the request
doesn't match what's stashed. I opted for freeing later in
stash_allocation() to maybe give time for the other parallel caller to
finish the scrubbing.
> The use of "goto" here also looks a little odd, as it would be easy to get
> away without. Or else I'd like to ask that the "else" be dropped.
Hm, OK, let me use an unlock + return and also drop the else then. I
think that's clearer.
> > @@ -286,6 +365,30 @@ static void populate_physmap(struct memop_args *a)
> > goto out;
> > }
> >
> > + if ( !d->creation_finished )
> > + {
> > + unsigned int dirty_cnt = 0, j;
>
> Declaring (another) j here is going to upset Eclair, I fear, as ...
>
> > + /* Check if there's anything to scrub. */
> > + for ( j = scrub_start; j < (1U << a->extent_order); j++ )
> > + {
> > + if ( !test_and_clear_bit(_PGC_need_scrub,
> > + &page[j].count_info) )
> > + continue;
> > +
> > + scrub_one_page(&page[j], true);
> > +
> > + if ( (j + 1) != (1U << a->extent_order) &&
> > + !(++dirty_cnt & 0xff) &&
> > + hypercall_preempt_check() )
> > + {
> > + a->preempted = 1;
> > + stash_allocation(d, page, a->extent_order, ++j);
>
> Better j + 1, as j's value isn't supposed to be used any further?
>
> > + goto out;
> > + }
> > + }
> > + }
> > +
> > if ( unlikely(a->memflags & MEMF_no_tlbflush) )
> > {
> > for ( j = 0; j < (1U << a->extent_order); j++ )
>
> ... for this to work there must already be one available from an outer scope.
Indeed, I've removed that extraneous j variable.
Thanks, Roger.
On 22.01.2026 13:48, Roger Pau Monné wrote:
> On Mon, Jan 19, 2026 at 02:00:49PM +0100, Jan Beulich wrote:
>> On 15.01.2026 12:18, Roger Pau Monne wrote:
>>> --- a/xen/common/domain.c
>>> +++ b/xen/common/domain.c
>>> @@ -722,6 +722,15 @@ static void _domain_destroy(struct domain *d)
>>>
>>> XVFREE(d->console);
>>>
>>> + if ( d->pending_scrub )
>>> + {
>>> + BUG_ON(d->creation_finished);
>>> + free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
>>> + d->pending_scrub = NULL;
>>> + d->pending_scrub_order = 0;
>>> + d->pending_scrub_index = 0;
>>> + }
>>
>> Because of the other zeroing wanted (it's not strictly needed, is it?),
>> it may be a little awkward to use FREE_DOMHEAP_PAGES() here. Yet I would
>> still have recommended to avoid its open-coding, if only we had such a
>> wrapper already.
>
> I don't mind introducing a FREE_DOMHEAP_PAGES() wrapper in this same
> patch, if you are OK with it.
I'd be fine with that.
>> Would this better be done earlier, in domain_kill(), to avoid needlessly
>> holding back memory that isn't going to be used by this domain anymore?
>> Would require the spinlock be acquired to guard against a racing
>> stash_allocation(), I suppose. In fact freeing right in
>> domain_unpause_by_systemcontroller() might be yet better (albeit without
>> eliminating the need to do it here or in domain_kill()).
>
> Even with a lock taken moving to domain_kill() would be racy. A rogue
> toolstack could keep trying to issue populate_physmap hypercalls which
> would fail in the assign_pages() call, but it could still leave
> pending pages in d->pending_scrub, as the assign_pages() call happens
> strictly after the scrubbing is done.
As indicated, the freeing here may need to stay. But making an attempt far
earlier may help the system overall.
>>> + /*
>>> + * If there's a pending page to scrub check it satisfies the current
>>> + * request. If it doesn't keep it stashed and return NULL.
>>> + */
>>> + if ( !d->pending_scrub || d->pending_scrub_order != order ||
>>> + (node != NUMA_NO_NODE && node != page_to_nid(d->pending_scrub)) )
>>
>> Ah, and MEMF_exact_node is handled in the caller.
>>
>>> + goto done;
>>> + else
>>> + {
>>> + page = d->pending_scrub;
>>> + *scrub_index = d->pending_scrub_index;
>>> + }
>>> +
>>> + /*
>>> + * The caller now owns the page, clear stashed information. Prevent
>>> + * concurrent usages of get_stashed_allocation() from returning the same
>>> + * page to different contexts.
>>> + */
>>> + d->pending_scrub_index = 0;
>>> + d->pending_scrub_order = 0;
>>> + d->pending_scrub = NULL;
>>> +
>>> + done:
>>> + rspin_unlock(&d->page_alloc_lock);
>>> +
>>> + return page;
>>> +}
>>
>> Hmm, you free the earlier allocation only in stash_allocation(), i.e. that
>> memory isn't available to fulfill the present request. (I do understand
>> that the freeing there can't be dropped, to deal with possible races
>> caused by the toolstack.)
>
> Since we expect populate_physmap() to be executed sequentially by the
> toolstack I would argue it's fine to hold onto that memory.
Here you say "sequentially", just to ...
> Otherwise
> I could possibly free in get_stashed_allocation() when the request
> doesn't match what's stashed. I opted for freeing later in
> stash_allocation() to maybe give time for the other parallel caller to
> finish the scrubbing.
... assume non-sequential behavior here. I guess I'm a little confused.
(Yes, freeing right in get_stashed_allocation() is what I'd expect.)
>> The use of "goto" here also looks a little odd, as it would be easy to get
>> away without. Or else I'd like to ask that the "else" be dropped.
>
> Hm, OK, let me use an unlock + return and also drop the else then. I
> think that's clearer.
I think if() with the condition inverted and a single unlock+return at
the end would be easiest to follow.
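FWIW, an untested sketch of that shape (only the structure rearranged;
whether to also free a mismatching stash here is left aside):

static struct page_info *get_stashed_allocation(struct domain *d,
                                                unsigned int order,
                                                nodeid_t node,
                                                unsigned int *scrub_index)
{
    struct page_info *page = NULL;

    rspin_lock(&d->page_alloc_lock);

    /* Only hand out the stashed page if it satisfies the current request. */
    if ( d->pending_scrub && d->pending_scrub_order == order &&
         (node == NUMA_NO_NODE || node == page_to_nid(d->pending_scrub)) )
    {
        page = d->pending_scrub;
        *scrub_index = d->pending_scrub_index;

        /* The caller now owns the page; clear the stashed information. */
        d->pending_scrub_index = 0;
        d->pending_scrub_order = 0;
        d->pending_scrub = NULL;
    }

    rspin_unlock(&d->page_alloc_lock);

    return page;
}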
Jan