Hello,

In XenServer we have seen the watchdog occasionally triggering during
domain creation if 1GB pages are scrubbed in-place during physmap
population.

The following series attempts to mitigate this by limiting the in-place
scrubbing during allocation to 2M pages, but it has some drawbacks; see
the post-commit remarks in patch 2.

I'm hoping someone might have a better idea, or that we agree we can't
do better than this for the time being.

Thanks, Roger.

Roger Pau Monne (2):
  xen/mm: add a NUMA node parameter to scrub_free_pages()
  xen/mm: limit non-scrubbed allocations to a specific order

 xen/arch/arm/domain.c   |  2 +-
 xen/arch/x86/domain.c   |  2 +-
 xen/common/memory.c     | 12 +++++++++
 xen/common/page_alloc.c | 54 +++++++++++++++++++++++++++++++++++++----
 xen/include/xen/mm.h    | 12 ++++++++-
 5 files changed, 74 insertions(+), 8 deletions(-)

-- 
2.51.0
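For illustration, a minimal sketch of the approach the series takes,
assuming a 4K base page size; the macro and helper names here are
hypothetical, not taken from the actual patch:

    /* Hypothetical cap on in-place scrubbing during allocation. */
    #define MAX_INPLACE_SCRUB_ORDER 9   /* 2M: 2^9 pages of 4K */

    static bool can_scrub_in_place(unsigned int order)
    {
        /*
         * Scrubbing a 1G chunk (order 18) in-place keeps the CPU busy
         * long enough to risk the watchdog; above 2M, make the caller
         * fall back to smaller allocations from pre-scrubbed memory.
         */
        return order <= MAX_INPLACE_SCRUB_ORDER;
    }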
On 08.01.2026 18:55, Roger Pau Monne wrote:
> In XenServer we have seen the watchdog occasionally triggering during
> domain creation if 1GB pages are scrubbed in-place during physmap
> population.

That's pretty extreme - writing to 1GB of memory can't really take over
5s, can it? Is there lock contention involved? Or is this when very many
CPUs try to do the same in parallel?

Jan

> The following series attempts to mitigate this by limiting the in-place
> scrubbing during allocation to 2M pages, but it has some drawbacks; see
> the post-commit remarks in patch 2.
>
> I'm hoping someone might have a better idea, or that we agree we can't
> do better than this for the time being.
>
> Thanks, Roger.
>
> Roger Pau Monne (2):
>   xen/mm: add a NUMA node parameter to scrub_free_pages()
>   xen/mm: limit non-scrubbed allocations to a specific order
>
>  xen/arch/arm/domain.c   |  2 +-
>  xen/arch/x86/domain.c   |  2 +-
>  xen/common/memory.c     | 12 +++++++++
>  xen/common/page_alloc.c | 54 +++++++++++++++++++++++++++++++++++++----
>  xen/include/xen/mm.h    | 12 ++++++++-
>  5 files changed, 74 insertions(+), 8 deletions(-)
>
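For scale, a back-of-envelope check of the 5s figure, assuming a nominal
sustained write bandwidth of ~10 GB/s (an assumption, not a measurement
from the reports):

    1 GiB at 10 GB/s            ~ 0.1s
    1 GiB / 5s watchdog window  ~ 215 MB/s effective rate

i.e. for the watchdog to fire, the effective scrub rate has to drop by
roughly 50x - from contention, remote-node traffic, or per-4K map/unmap
overhead - rather than from raw memory bandwidth alone.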
On 09/01/2026 10:15 am, Jan Beulich wrote:
> On 08.01.2026 18:55, Roger Pau Monne wrote:
>> In XenServer we have seen the watchdog occasionally triggering during
>> domain creation if 1GB pages are scrubbed in-place during physmap
>> population.
> That's pretty extreme - writing to 1GB of memory can't really take over
> 5s, can it?

Sure it can.

> Is there lock contention involved?

Almost certainly, and it's probably the more relevant aspect in this case.

> Or is this when very many CPUs
> try to do the same in parallel?

The scenario is reboot of a VM when Xapi is doing NUMA placement using
per-node claims.

In this case, even with sufficient scrubbed RAM on other nodes, you need
to take from the node you claimed on, which might need scrubbing.

The underlying problem is the need to do a long-running operation in a
context where you cannot continue, and cannot (reasonably) fail.

~Andrew
On Fri, Jan 09, 2026 at 10:29:20AM +0000, Andrew Cooper wrote:
> On 09/01/2026 10:15 am, Jan Beulich wrote:
> > On 08.01.2026 18:55, Roger Pau Monne wrote:
> >> In XenServer we have seen the watchdog occasionally triggering during
> >> domain creation if 1GB pages are scrubbed in-place during physmap
> >> population.
> > That's pretty extreme - writing to 1GB of memory can't really take over
> > 5s, can it?
>
> Sure it can.
>
> > Is there lock contention involved?
>
> Almost certainly, and it's probably the more relevant aspect in this case.

Possibly. I can ask Edwin for his reproduction.

There's also the map_domain_page() aspect of this operation. On big
enough systems this will cause a fair amount of stress on the map cache,
since each page is mapped, scrubbed and unmapped. I don't think, however,
that the systems on which we have seen this were using the map cache
(they were debug=n builds with less than 5TB of memory).

> > Or is this when very many CPUs
> > try to do the same in parallel?
>
> The scenario is reboot of a VM when Xapi is doing NUMA placement using
> per-node claims.

Not exclusively. We have reports of this also happening without any
claims or NUMA placement being used. AFAICT it's possibly triggered when
rebooting multiple VMs in parallel, and all the reports I've seen are on
multi-node NUMA systems. I wonder if scrubbing a 1G remote page in 4K
chunks is killing the inter-node bandwidth.

Thanks, Roger.
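For reference, the per-page path in question looks roughly like this - a
simplified sketch of scrub_one_page() from xen/common/page_alloc.c, not
the exact tree code:

    void scrub_one_page(const struct page_info *pg)
    {
        void *p = __map_domain_page(pg);  /* one mapping per 4K page */

        clear_page(p);                    /* the actual scrub */
        unmap_domain_page(p);
    }

    /* A 1G superpage means 262144 map/clear/unmap rounds. */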
On 09.01.2026 11:29, Andrew Cooper wrote:
> On 09/01/2026 10:15 am, Jan Beulich wrote:
>> On 08.01.2026 18:55, Roger Pau Monne wrote:
>>> In XenServer we have seen the watchdog occasionally triggering during
>>> domain creation if 1GB pages are scrubbed in-place during physmap
>>> population.
>> That's pretty extreme - writing to 1GB of memory can't really take over
>> 5s, can it?
>
> Sure it can.

Under what unusual circumstances, or on what extremely slow hardware?
(Of course improperly set MTRRs could cause such, for example.)

>> Is there lock contention involved?
>
> Almost certainly, and it's probably the more relevant aspect in this case.

Thing is - the scrubbing happens after alloc_heap_pages() has already
dropped the heap lock. And I can't spot the XENMEM_populate_physmap path
taking any locks outside of alloc_heap_pages(). The domain's page alloc
lock (which in principle should be uncontended anyway, unless the
toolstack tries to race with itself) is acquired only later. If this is
a lock contention problem, the first goal ought to be to move the
scrubbing outside of any (potentially contended) locks.

>> Or is this when very many CPUs
>> try to do the same in parallel?
>
> The scenario is reboot of a VM when Xapi is doing NUMA placement using
> per-node claims.
>
> In this case, even with sufficient scrubbed RAM on other nodes, you need
> to take from the node you claimed on, which might need scrubbing.

Much like if there was an exact-node request without involving claims.

> The underlying problem is the need to do a long-running operation in a
> context where you cannot continue, and cannot (reasonably) fail.

Right.

Jan
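A sketch of the ordering Jan describes - simplified, assuming the scrub
loop sits at the tail of alloc_heap_pages() after the lock is dropped,
as it does in current trees:

    spin_lock(&heap_lock);
    /* ... take the pages off the free lists ... */
    spin_unlock(&heap_lock);

    /* In-place scrubbing runs here, with no heap lock held. */
    for ( i = 0; i < (1u << order); i++ )
        if ( test_and_clear_bit(_PGC_need_scrub, &pg[i].count_info) )
            scrub_one_page(&pg[i]);

    /* d->page_alloc_lock is only taken later, in assign_pages(). */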
On 09/01/2026 11:32 am, Jan Beulich wrote:
>>> Or is this when very many CPUs
>>> try to do the same in parallel?
>> The scenario is reboot of a VM when Xapi is doing NUMA placement using
>> per-node claims.
>>
>> In this case, even with sufficient scrubbed RAM on other nodes, you need
>> to take from the node you claimed on, which might need scrubbing.
> Much like if there was an exact-node request without involving claims.
>
>> The underlying problem is the need to do a long-running operation in a
>> context where you cannot continue, and cannot (reasonably) fail.
> Right.

Yeah - I think this is a scenario that could happen without NUMA aspects,
if the system is almost full. I suspect we've just made it easier to hit,
or we've got better testing. Hard to say.

~Andrew