Hello,

In XenServer we have seen the watchdog occasionally triggering during
domain creation if 1GB pages are scrubbed in-place during physmap
population.

The following series attempts to mitigate this by limiting the in-place
scrubbing during allocation to 2M pages, but it has some drawbacks; see
the post-commit remarks in patch 2.

I'm hoping someone might have a better idea, or that we agree we can't
do better than this for the time being.

Thanks, Roger.

Roger Pau Monne (2):
  xen/mm: add a NUMA node parameter to scrub_free_pages()
  xen/mm: limit non-scrubbed allocations to a specific order

 xen/arch/arm/domain.c   |  2 +-
 xen/arch/x86/domain.c   |  2 +-
 xen/common/memory.c     | 12 +++++++++
 xen/common/page_alloc.c | 54 +++++++++++++++++++++++++++++++++++++----
 xen/include/xen/mm.h    | 12 ++++++++-
 5 files changed, 74 insertions(+), 8 deletions(-)

-- 
2.51.0
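For illustration, a minimal sketch of the approach the series takes,
assuming a 4K base page size; the macro and helper names here are
hypothetical, not taken from the actual patch:

    /* Hypothetical cap on in-place scrubbing during allocation. */
    #define MAX_INPLACE_SCRUB_ORDER 9   /* 2M: 2^9 pages of 4K */

    static bool can_scrub_in_place(unsigned int order)
    {
        /*
         * Scrubbing a 1G chunk (order 18) in-place keeps the CPU busy
         * long enough to risk the watchdog; above 2M, make the caller
         * fall back to smaller allocations from pre-scrubbed memory.
         */
        return order <= MAX_INPLACE_SCRUB_ORDER;
    }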
On 08.01.2026 18:55, Roger Pau Monne wrote:
> In XenServer we have seen the watchdog occasionally triggering during
> domain creation if 1GB pages are scrubbed in-place during physmap
> population.

That's pretty extreme - writing to 1GB of memory can't really take over
5s, can it? Is there lock contention involved? Or is this when very many
CPUs try to do the same in parallel?

Jan

> The following series attempts to mitigate this by limiting the in-place
> scrubbing during allocation to 2M pages, but it has some drawbacks; see
> the post-commit remarks in patch 2.
>
> I'm hoping someone might have a better idea, or that we agree we can't
> do better than this for the time being.
>
> Thanks, Roger.
>
> Roger Pau Monne (2):
>   xen/mm: add a NUMA node parameter to scrub_free_pages()
>   xen/mm: limit non-scrubbed allocations to a specific order
>
>  xen/arch/arm/domain.c   |  2 +-
>  xen/arch/x86/domain.c   |  2 +-
>  xen/common/memory.c     | 12 +++++++++
>  xen/common/page_alloc.c | 54 +++++++++++++++++++++++++++++++++++++----
>  xen/include/xen/mm.h    | 12 ++++++++-
>  5 files changed, 74 insertions(+), 8 deletions(-)
>
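For scale, a back-of-envelope check of the 5s figure, assuming a nominal
sustained write bandwidth of ~10 GB/s (an assumption, not a measurement
from the reports):

    1 GiB at 10 GB/s            ~ 0.1s
    1 GiB / 5s watchdog window  ~ 215 MB/s effective rate

i.e. for the watchdog to fire, the effective scrub rate has to drop by
roughly 50x - from contention, remote-node traffic, or per-4K map/unmap
overhead - rather than from raw memory bandwidth alone.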
On 09/01/2026 10:15 am, Jan Beulich wrote:
> On 08.01.2026 18:55, Roger Pau Monne wrote:
>> In XenServer we have seen the watchdog occasionally triggering during
>> domain creation if 1GB pages are scrubbed in-place during physmap
>> population.
> That's pretty extreme - writing to 1GB of memory can't really take over
> 5s, can it?

Sure it can.

> Is there lock contention involved?

Almost certainly, and it's probably the more relevant aspect in this case.

> Or is this when very many CPUs
> try to do the same in parallel?

The scenario is reboot of a VM when Xapi is doing NUMA placement using
per-node claims.

In this case, even with sufficient scrubbed RAM on other nodes, you need
to take from the node you claimed on, which might need scrubbing.

The underlying problem is the need to do a long-running operation in a
context where you cannot continue, and cannot (reasonably) fail.

~Andrew
On Fri, Jan 09, 2026 at 10:29:20AM +0000, Andrew Cooper wrote:
> On 09/01/2026 10:15 am, Jan Beulich wrote:
> > On 08.01.2026 18:55, Roger Pau Monne wrote:
> >> In XenServer we have seen the watchdog occasionally triggering during
> >> domain creation if 1GB pages are scrubbed in-place during physmap
> >> population.
> > That's pretty extreme - writing to 1GB of memory can't really take over
> > 5s, can it?
>
> Sure it can.
>
> > Is there lock contention involved?
>
> Almost certainly, and it's probably the more relevant aspect in this case.

Possibly. I can ask Edwin for his reproduction.

There's also the map_domain_page() aspect of this operation. On big
enough systems this will cause a fair amount of stress on the map cache,
since each page is mapped, scrubbed and unmapped. I don't think, however,
that the systems on which we have seen this were using the map cache
(they were debug=n builds with less than 5TB of memory).

> > Or is this when very many CPUs
> > try to do the same in parallel?
>
> The scenario is reboot of a VM when Xapi is doing NUMA placement using
> per-node claims.

Not exclusively. We have reports of this also happening without any
claims or NUMA placement being used. AFAICT it's possibly triggered when
rebooting multiple VMs in parallel, and all the reports I've seen are on
multi-node NUMA systems. I wonder if scrubbing a 1G remote page in 4K
chunks is killing the inter-node bandwidth.

Thanks, Roger.
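For reference, the per-page path in question looks roughly like this - a
simplified sketch of scrub_one_page() from xen/common/page_alloc.c, not
the exact tree code:

    void scrub_one_page(const struct page_info *pg)
    {
        void *p = __map_domain_page(pg);  /* one mapping per 4K page */

        clear_page(p);                    /* the actual scrub */
        unmap_domain_page(p);
    }

    /* A 1G superpage means 262144 map/clear/unmap rounds. */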
On 09.01.2026 11:29, Andrew Cooper wrote:
> On 09/01/2026 10:15 am, Jan Beulich wrote:
>> On 08.01.2026 18:55, Roger Pau Monne wrote:
>>> In XenServer we have seen the watchdog occasionally triggering during
>>> domain creation if 1GB pages are scrubbed in-place during physmap
>>> population.
>> That's pretty extreme - writing to 1GB of memory can't really take over
>> 5s, can it?
>
> Sure it can.

Under what unusual circumstances, or on what extremely slow hardware?
(Of course improperly set MTRRs could cause such, for example.)

>> Is there lock contention involved?
>
> Almost certainly, and it's probably the more relevant aspect in this case.

Thing is - the scrubbing happens after alloc_heap_pages() has already
dropped the heap lock. And I can't spot the XENMEM_populate_physmap path
taking any locks outside of alloc_heap_pages(). The domain's page alloc
lock (which in principle should be uncontended anyway, unless the
toolstack tries to race with itself) is acquired only later. If this is
a lock contention problem, the first goal ought to be to move the
scrubbing outside of any (potentially contended) locks.

>> Or is this when very many CPUs
>> try to do the same in parallel?
>
> The scenario is reboot of a VM when Xapi is doing NUMA placement using
> per-node claims.
>
> In this case, even with sufficient scrubbed RAM on other nodes, you need
> to take from the node you claimed on, which might need scrubbing.

Much like if there was an exact-node request without involving claims.

> The underlying problem is the need to do a long-running operation in a
> context where you cannot continue, and cannot (reasonably) fail.

Right.

Jan
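A sketch of the ordering Jan describes - simplified, assuming the scrub
loop sits at the tail of alloc_heap_pages() after the lock is dropped,
as it does in current trees:

    spin_lock(&heap_lock);
    /* ... take the pages off the free lists ... */
    spin_unlock(&heap_lock);

    /* In-place scrubbing runs here, with no heap lock held. */
    for ( i = 0; i < (1u << order); i++ )
        if ( test_and_clear_bit(_PGC_need_scrub, &pg[i].count_info) )
            scrub_one_page(&pg[i]);

    /* d->page_alloc_lock is only taken later, in assign_pages(). */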
On 09/01/2026 11:32 am, Jan Beulich wrote:
>>> Or is this when very many CPUs
>>> try to do the same in parallel?
>> The scenario is reboot of a VM when Xapi is doing NUMA placement using
>> per-node claims.
>>
>> In this case, even with sufficient scrubbed RAM on other nodes, you need
>> to take from the node you claimed on, which might need scrubbing.
> Much like if there was an exact-node request without involving claims.
>
>> The underlying problem is the need to do a long-running operation in a
>> context where you cannot continue, and cannot (reasonably) fail.
> Right.

Yeah - I think this is a scenario that could happen without NUMA aspects,
if the system is almost full. I suspect we've just made it easier to hit,
or we've got better testing. Hard to say.

~Andrew