Entities building domains are expected to deal with higher order
allocation attempts (for populating a new domain) failing. If we set
aside a reservation for DMA, try to avoid taking higher order pages from
that reserve pool. Instead favor order-0 ones which often can still be
supplied from higher addressed memory, even if we've run out of
large/huge pages there.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
RFC: More generally for any requests targeting remote domains?
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -192,6 +192,14 @@ static void populate_physmap(struct memo
* delayed.
*/
a->memflags |= MEMF_no_icache_flush;
+
+ /*
+ * Heuristically assume that during domain construction the caller is
+ * capable of falling back to order-0 allocations, allowing us to
+ * conserve on memory otherwise held back for DMA purposes.
+ */
+ if ( a->extent_order )
+ a->memflags |= MEMF_no_dma;
}
for ( i = a->nr_done; i < a->nr_extents; i++ )
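
[For context, not part of the patch: a minimal sketch of the kind of caller-side
fallback the commit message expects of a domain builder. populate_extent() and
populate_range() are hypothetical illustrations, not existing toolstack
functions; each populate_extent() call stands in for one XENMEM_populate_physmap
request at the given order.]

/* Hypothetical wrapper issuing one XENMEM_populate_physmap request at the
 * given order; returns 0 on success, non-zero on failure. */
static int populate_extent(unsigned long gfn, unsigned int order);

/*
 * Populate [gfn, gfn + count) preferring large extents, dropping to smaller
 * orders (ultimately order 0) whenever a higher-order request is refused,
 * e.g. because only DMA-reserve memory could have satisfied it.
 */
static int populate_range(unsigned long gfn, unsigned long count,
                          unsigned int max_order)
{
    while ( count )
    {
        unsigned int order = max_order;

        /* Clamp the order to the alignment and size of what's left. */
        while ( order && ((gfn & ((1UL << order) - 1)) ||
                          (1UL << order) > count) )
            --order;

        /* Retry with progressively smaller orders until one succeeds. */
        while ( populate_extent(gfn, order) )
        {
            if ( !order )
                return -1; /* even order-0 failed: out of memory */
            --order;
        }

        gfn += 1UL << order;
        count -= 1UL << order;
    }

    return 0;
}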
On Tue, Feb 25, 2025 at 03:58:34PM +0100, Jan Beulich wrote:
> Entities building domains are expected to deal with higher order
> allocation attempts (for populating a new domain) failing. If we set
> aside a reservation for DMA, try to avoid taking higher order pages from
> that reserve pool.
>
> Instead favor order-0 ones which often can still be
> supplied from higher addressed memory, even if we've run out of
> large/huge pages there.
I would maybe write that last sentence as: force non zero order
allocations to use the non-DMA region, and if the region cannot
fulfill the request return an error to the caller for it to retry with
a smaller order. Effectively this limits allocations from the DMA
region to only be of order 0 during physmap domain population.
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> ---
> RFC: More generally for any requests targeting remote domains?
I think doing the limitation just for domain creation is fine, the more
so that there are also other flags set there.
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -192,6 +192,14 @@ static void populate_physmap(struct memo
> * delayed.
> */
> a->memflags |= MEMF_no_icache_flush;
> +
> + /*
> + * Heuristically assume that during domain construction the caller is
> + * capable of falling back to order-0 allocations, allowing us to
> + * conserve on memory otherwise held back for DMA purposes.
> + */
> + if ( a->extent_order )
> + a->memflags |= MEMF_no_dma;
For PV domains: is it possible for toolstack to try to allocate a
certain amount of pages from the DMA pool for the benefit of the
domain?
I also wonder if it would make sense to attempt to implement the
logic on the toolstack side: meminit_{hvm,pv}()?
No strong opinion, but slightly less logic in the hypervisor, and
won't change the interface for possibly existing toolstacks that don't
pass MEMF_no_dma on purpose.
Thanks, Roger.
On 16.06.2025 15:27, Roger Pau Monné wrote:
> On Tue, Feb 25, 2025 at 03:58:34PM +0100, Jan Beulich wrote:
>> Entities building domains are expected to deal with higher order
>> allocation attempts (for populating a new domain) failing. If we set
>> aside a reservation for DMA, try to avoid taking higher order pages from
>> that reserve pool.
>>
>> Instead favor order-0 ones which often can still be
>> supplied from higher addressed memory, even if we've run out of
>> large/huge pages there.
>
> I would maybe write that last sentence as: force non zero order
> allocations to use the non-DMA region, and if the region cannot
> fulfill the request return an error to the caller for it to retry with
> a smaller order. Effectively this limits allocations from the DMA
> region to only be of order 0 during physmap domain population.
I can take this text, sure.
>> --- a/xen/common/memory.c
>> +++ b/xen/common/memory.c
>> @@ -192,6 +192,14 @@ static void populate_physmap(struct memo
>> * delayed.
>> */
>> a->memflags |= MEMF_no_icache_flush;
>> +
>> + /*
>> + * Heuristically assume that during domain construction the caller is
>> + * capable of falling back to order-0 allocations, allowing us to
>> + * conserve on memory otherwise held back for DMA purposes.
>> + */
>> + if ( a->extent_order )
>> + a->memflags |= MEMF_no_dma;
>
> For PV domains: is it possible for toolstack to try to allocate a
> certain amount of pages from the DMA pool for the benefit of the
> domain?
Not directly at least. To benefit the domain, it would also need to be
told where in PFN space those pages would have ended up.
> I also wonder if it would make sense to attempt to implement the
> logic on the toolstack side: meminit_{hvm,pv}()?
>
> No strong opinion, but slightly less logic in the hypervisor, and
> won't change the interface for possibly existing toolstacks that don't
> pass MEMF_no_dma on purpose.
MEMF_no_dma isn't exposed outside of the hypervisor.
Jan
On Mon, Jun 16, 2025 at 04:23:40PM +0200, Jan Beulich wrote:
> On 16.06.2025 15:27, Roger Pau Monné wrote:
> > On Tue, Feb 25, 2025 at 03:58:34PM +0100, Jan Beulich wrote:
> >> Entities building domains are expected to deal with higher order
> >> allocation attempts (for populating a new domain) failing. If we set
> >> aside a reservation for DMA, try to avoid taking higher order pages from
> >> that reserve pool.
> >>
> >> Instead favor order-0 ones which often can still be
> >> supplied from higher addressed memory, even if we've run out of
> >> large/huge pages there.
> >
> > I would maybe write that last sentence as: force non zero order
> > allocations to use the non-DMA region, and if the region cannot
> > fulfill the request return an error to the caller for it to retry with
> > a smaller order. Effectively this limits allocations from the DMA
> > region to only be of order 0 during physmap domain population.
>
> I can take this text, sure.
>
> >> --- a/xen/common/memory.c
> >> +++ b/xen/common/memory.c
> >> @@ -192,6 +192,14 @@ static void populate_physmap(struct memo
> >> * delayed.
> >> */
> >> a->memflags |= MEMF_no_icache_flush;
> >> +
> >> + /*
> >> + * Heuristically assume that during domain construction the caller is
> >> + * capable of falling back to order-0 allocations, allowing us to
> >> + * conserve on memory otherwise held back for DMA purposes.
> >> + */
> >> + if ( a->extent_order )
> >> + a->memflags |= MEMF_no_dma;
> >
> > For PV domains: is it possible for toolstack to try to allocate a
> > certain amount of pages from the DMA pool for the benefit of the
> > domain?
>
> Not directly at least. To benefit the domain, it would also need to be
> told where in PFN space those pages would have ended up.
My question makes no sense anyway if MEMF_no_dma isn't exposed to the
hypercall interface.
> > I also wonder if it would make sense to attempt to implement the
> > logic on the toolstack side: meminit_{hvm,pv}()?
> >
> > No strong opinion, but slightly less logic in the hypervisor, and
> > won't change the interface for possibly existing toolstacks that don't
> > pass MEMF_no_dma on purpose.
>
> MEMF_no_dma isn't exposed outside of the hypervisor.
Oh, I see.
One question I have though, on systems with a low amount of memory
(let's say 8GB), does this lead to an increase in domain construction
time due to having to fallback to order 0 allocations when running out
of non-DMA memory?
Thanks, Roger.
On 16.06.2025 16:46, Roger Pau Monné wrote:
> One question I have though, on systems with a low amount of memory
> (let's say 8GB), does this lead to an increase in domain construction
> time due to having to fallback to order 0 allocations when running out
> of non-DMA memory?

It'll likely be slower, yes, but I can't guesstimate by how much.

Jan
On Mon, Jun 16, 2025 at 05:20:45PM +0200, Jan Beulich wrote:
> On 16.06.2025 16:46, Roger Pau Monné wrote:
> > One question I have though, on systems with a low amount of memory
> > (let's say 8GB), does this lead to an increase in domain construction
> > time due to having to fallback to order 0 allocations when running out
> > of non-DMA memory?
>
> It'll likely be slower, yes, but I can't guesstimate by how much.

Should there be some way to control this behavior then? I'm mostly
thinking about client systems like Qubes where memory is likely
limited, and the extra slowness to create VMs could become
noticeable?

Thanks, Roger.
On 16.06.2025 17:41, Roger Pau Monné wrote:
> On Mon, Jun 16, 2025 at 05:20:45PM +0200, Jan Beulich wrote:
>> On 16.06.2025 16:46, Roger Pau Monné wrote:
>>> One question I have though, on systems with a low amount of memory
>>> (let's say 8GB), does this lead to an increase in domain construction
>>> time due to having to fallback to order 0 allocations when running out
>>> of non-DMA memory?
>>
>> It'll likely be slower, yes, but I can't guesstimate by how much.
>
> Should there be some way to control this behavior then? I'm mostly
> thinking about client systems like Qubes where memory is likely
> limited, and the extra slowness to create VMs could become
> noticeable?

What kind of control would you be thinking of here? Yet another command
line option?

Jan
On Mon, Jun 16, 2025 at 06:02:07PM +0200, Jan Beulich wrote:
> On 16.06.2025 17:41, Roger Pau Monné wrote:
> > On Mon, Jun 16, 2025 at 05:20:45PM +0200, Jan Beulich wrote:
> >> On 16.06.2025 16:46, Roger Pau Monné wrote:
> >>> One question I have though, on systems with a low amount of memory
> >>> (let's say 8GB), does this lead to an increase in domain construction
> >>> time due to having to fallback to order 0 allocations when running out
> >>> of non-DMA memory?
> >>
> >> It'll likely be slower, yes, but I can't guesstimate by how much.
> >
> > Should there be some way to control this behavior then? I'm mostly
> > thinking about client systems like Qubes where memory is likely
> > limited, and the extra slowness to create VMs could become
> > noticeable?
>
> What kind of control would you be thinking of here? Yet another command
> line option?

I guess that would be enough. I think we need a way to resort to the
previous behavior if required, and likely a CHANGELOG entry to notice
the change.

Overall, would it be possible to only include the flag if we know
there's non-DMA memory available to allocate? Otherwise we are
crippling allocation performance when there's only DMA memory left.

That also raises the question whether it's an acceptable trade-off to
possibly shatter p2m super pages (that could be used if allocating
from the DMA pool) at the expense of not allocating from the DMA pool
until there's non-DMA memory left.

Thanks, Roger.
On 16.06.2025 19:23, Roger Pau Monné wrote:
> On Mon, Jun 16, 2025 at 06:02:07PM +0200, Jan Beulich wrote:
>> On 16.06.2025 17:41, Roger Pau Monné wrote:
>>> On Mon, Jun 16, 2025 at 05:20:45PM +0200, Jan Beulich wrote:
>>>> On 16.06.2025 16:46, Roger Pau Monné wrote:
>>>>> One question I have though, on systems with a low amount of memory
>>>>> (let's say 8GB), does this lead to an increase in domain construction
>>>>> time due to having to fallback to order 0 allocations when running out
>>>>> of non-DMA memory?
>>>>
>>>> It'll likely be slower, yes, but I can't guesstimate by how much.
>>>
>>> Should there be some way to control this behavior then? I'm mostly
>>> thinking about client systems like Qubes where memory is likely
>>> limited, and the extra slowness to create VMs could become
>>> noticeable?
>>
>> What kind of control would you be thinking of here? Yet another command
>> line option?
>
> I guess that would be enough. I think we need a way to resort to the
> previous behavior if required,

Thinking about it, there already is "dma_bits=". Simply setting this low
enough would have largely the same effect as yet another new command line
option. Thoughts?

> and likely a CHANGELOG entry to notice the change.

Hmm, not sure here. This is too small imo, and really an implementation
detail.

> Overall, would it be possible to only include the flag if we know
> there's non-DMA memory available to allocate? Otherwise we are
> crippling allocation performance when there's only DMA memory left.

Imo trying to determine this would only make sense if the result can
then be relied upon. To determine we'd need to obtain the heap lock,
and we'd need to not drop it until after the allocation(s) were done.
I think that's far away from being a realistic option.

> That also raises the question whether it's an acceptable trade-off to
> possibly shatter p2m super pages (that could be used if allocating
> from the DMA pool) at the expense of not allocating from the DMA pool
> until there's non-DMA memory left.

This being an acceptable tradeoff is imo an implicit pre-condition of
adding such a heuristic. For the system as a whole, exhausting special
purpose memory is likely worse than some loss of performance. Plus as
said above, people valuing performance more can reduce the "DMA pool".

Jan
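
[Editor's illustration of the workaround mentioned above, assuming the
documented dma_bits= semantics where memory below 2^N is what feeds the DMA
reserve: an administrator who values higher-order allocations over the reserve
could shrink it on the Xen command line, e.g.]

    xen.gz ... dma_bits=24    # keep only memory below 16MiB in the DMA pool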
On Wed, Jun 18, 2025 at 02:04:12PM +0200, Jan Beulich wrote:
> On 16.06.2025 19:23, Roger Pau Monné wrote:
> > On Mon, Jun 16, 2025 at 06:02:07PM +0200, Jan Beulich wrote:
> >> On 16.06.2025 17:41, Roger Pau Monné wrote:
> >>> On Mon, Jun 16, 2025 at 05:20:45PM +0200, Jan Beulich wrote:
> >>>> On 16.06.2025 16:46, Roger Pau Monné wrote:
> >>>>> One question I have though, on systems with a low amount of memory
> >>>>> (let's say 8GB), does this lead to an increase in domain construction
> >>>>> time due to having to fallback to order 0 allocations when running out
> >>>>> of non-DMA memory?
> >>>>
> >>>> It'll likely be slower, yes, but I can't guesstimate by how much.
> >>>
> >>> Should there be some way to control this behavior then? I'm mostly
> >>> thinking about client systems like Qubes where memory is likely
> >>> limited, and the extra slowness to create VMs could become
> >>> noticeable?
> >>
> >> What kind of control would you be thinking of here? Yet another command
> >> line option?
> >
> > I guess that would be enough. I think we need a way to resort to the
> > previous behavior if required,
>
> Thinking about it, there already is "dma_bits=". Simply setting this low
> enough would have largely the same effect as yet another new command line
> option. Thoughts?
>
> > and likely a CHANGELOG entry to notice the change.
>
> Hmm, not sure here. This is too small imo, and really an implementation
> detail.

I think it's possible to have a noticeable effect for some use-cases,
so it should be listed in the CHANGELOG in case users see divergence
from previous Xen versions. That can give them a hint at what needs
adjusting.

It's not only the possible domain creation overhead: if the guest
being created is HVM with a p2m, the resulting p2m will also be
shattered, leading to worse guest performance.

> > Overall, would it be possible to only include the flag if we know
> > there's non-DMA memory available to allocate? Otherwise we are
> > crippling allocation performance when there's only DMA memory left.
>
> Imo trying to determine this would only make sense if the result can
> then be relied upon. To determine we'd need to obtain the heap lock,
> and we'd need to not drop it until after the allocation(s) were done.
> I think that's far away from being a realistic option.

Couldn't Xen do best effort? A mismatch won't be fatal anyway. Error
cases: you set MEMF_no_dma and there's no non-DMA memory to allocate
from, in which case it's the same that's unconditionally done. The
other possible failure is you don't set MEMF_no_dma and there's still
non-DMA memory that's not consumed, which is the current behavior.

IMO it would be safer in terms of behavior change to do:

    if ( a->extent_order && <non-DMA memory available> )
        a->memflags |= MEMF_no_dma;

I think that's likely to be less intrusive and won't lead to system
malfunctions, even if strictly it's (kind of?) a TOCTOU race.

> > That also raises the question whether it's an acceptable trade-off to
> > possibly shatter p2m super pages (that could be used if allocating
> > from the DMA pool) at the expense of not allocating from the DMA pool
> > until there's non-DMA memory left.
>
> This being an acceptable tradeoff is imo an implicit pre-condition of
> adding such a heuristic. For the system as a whole, exhausting special
> purpose memory is likely worse than some loss of performance. Plus as
> said above, people valuing performance more can reduce the "DMA pool".

That's an acceptable workaround, but it might be helpful to note this
in the commit message, as otherwise it will get lost in xen-devel.

Thanks, Roger.
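
[Purely as an illustration of the best-effort variant Roger suggests (not
something either participant posted as a patch), the check in
populate_physmap() might look roughly like the following.
heap_has_non_dma_pages() is a hypothetical predicate, not an existing Xen
helper, and the test is inherently racy as Jan notes.]

    /* Hypothetical, racy predicate: is any memory above the dma_bitsize
     * boundary currently free?  Would need a (brief) look at the heap. */
    bool heap_has_non_dma_pages(void);

    /*
     * Heuristically assume the domain builder can fall back to order-0
     * allocations, but only steer the request away from the DMA reserve
     * while there still appears to be non-DMA memory left (best effort;
     * the answer may be stale by the time the allocation happens).
     */
    if ( a->extent_order && heap_has_non_dma_pages() )
        a->memflags |= MEMF_no_dma;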