From: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
xen/common/vmap.c | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)
diff --git a/xen/common/vmap.c b/xen/common/vmap.c
index 37922f735b..8343460794 100644
--- a/xen/common/vmap.c
+++ b/xen/common/vmap.c
@@ -68,7 +68,7 @@ static void *vm_alloc(unsigned int nr, unsigned int align,
     spin_lock(&vm_lock);
     for ( ; ; )
     {
-        struct page_info *pg;
+        mfn_t mfn;

         ASSERT(vm_low[t] == vm_top[t] || !test_bit(vm_low[t], vm_bitmap(t)));
         for ( start = vm_low[t]; start < vm_top[t]; )
@@ -103,9 +103,17 @@ static void *vm_alloc(unsigned int nr, unsigned int align,
         if ( vm_top[t] >= vm_end[t] )
             return NULL;

-        pg = alloc_domheap_page(NULL, 0);
-        if ( !pg )
-            return NULL;
+        if ( system_state == SYS_STATE_early_boot )
+        {
+            mfn = alloc_boot_pages(1, 1);
+        }
+        else
+        {
+            struct page_info *pg = alloc_domheap_page(NULL, 0);
+            if ( !pg )
+                return NULL;
+            mfn = page_to_mfn(pg);
+        }

         spin_lock(&vm_lock);

@@ -113,7 +121,7 @@ static void *vm_alloc(unsigned int nr, unsigned int align,
         {
             unsigned long va = (unsigned long)vm_bitmap(t) + vm_top[t] / 8;

-            if ( !map_pages_to_xen(va, page_to_mfn(pg), 1, PAGE_HYPERVISOR) )
+            if ( !map_pages_to_xen(va, mfn, 1, PAGE_HYPERVISOR) )
             {
                 clear_page((void *)va);
                 vm_top[t] += PAGE_SIZE * 8;
@@ -123,7 +131,10 @@ static void *vm_alloc(unsigned int nr, unsigned int align,
             }
         }

-        free_domheap_page(pg);
+        if ( system_state == SYS_STATE_early_boot )
+            init_boot_pages(mfn_to_maddr(mfn), mfn_to_maddr(mfn) + PAGE_SIZE);
+        else
+            free_domheap_page(mfn_to_page(mfn));

         if ( start >= vm_top[t] )
         {
--
2.21.0
Hi David,

On 01/02/2020 00:33, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>

I am a bit concerned with this change, particularly the consequences it
has for the page-tables. There is an assumption that intermediate
page-tables allocated via the boot allocator will never be freed.

On x86, a call to vunmap() will not free page-tables, but a subsequent
call to vmap() may free them depending on the mapping size. So we would
call free_domheap_pages() rather than init_heap_pages().

I am not entirely sure what the full consequences are, but I think this
calls for investigation, and for a summary to be written down in the
commit message.

Cheers,

--
Julien Grall
On Mon, 2020-02-03 at 14:00 +0000, Julien Grall wrote:
> Hi David,
>
> On 01/02/2020 00:33, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
>
> I am a bit concerned with this change, particularly the consequences it
> has for the page-tables. There is an assumption that intermediate
> page-tables allocated via the boot allocator will never be freed.
>
> On x86, a call to vunmap() will not free page-tables, but a subsequent
> call to vmap() may free them depending on the mapping size. So we would
> call free_domheap_pages() rather than init_heap_pages().
>
> I am not entirely sure what the full consequences are, but I think this
> calls for investigation, and for a summary to be written down in the
> commit message.
This isn't just about page tables, right? It's about *any* allocation
given out by the boot allocator, being freed with free_heap_pages()?
Given the amount of code that has conditionals in both alloc and free
paths along the lines of…
    if (system_state > SYS_STATE_boot)
        use xenheap
    else
        use boot allocator
… I'm not sure I'd really trust the assumption that such a thing never
happens; that no pages are ever allocated from the boot allocator and
then freed into the heap.
In fact it does work fine except for some esoteric corner cases,
because init_heap_pages() is mostly just a trivial loop over
free_heap_pages().
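
For illustration, a minimal sketch of that relationship, with the
signatures simplified down to plain MFNs. This is a model of mine, not
the actual code in xen/common/page_alloc.c, which also has to handle
the corner cases listed below:

    void free_heap_pages(unsigned long mfn, unsigned int order);  /* simplified */

    /* Toy model: hand every page back to the buddy allocator, one
     * order-0 chunk at a time. */
    void init_heap_pages_model(unsigned long mfn, unsigned long nr_pages)
    {
        unsigned long i;

        for ( i = 0; i < nr_pages; i++ )
            free_heap_pages(mfn + i, 0);       /* one order-0 free per page */
    }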
The corner cases are if you call free_heap_pages() on boot-allocated
memory which matches one or more of the following criteria:
• Includes MFN #0,
• Includes the first page the heap has seen on a given node, so
init_node_heap() has to be called, or
• High-order allocations crossing from one node to another.
The first is trivial to fix by lifting the check for MFN#0 up into
free_heap_pages(). The second isn't hard either.
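
A rough sketch of what lifting that check might look like; hypothetical
code of mine, not a committed patch, again MFN-based for brevity:

    void free_heap_pages_sketch(unsigned long mfn, unsigned int order)
    {
        if ( mfn == 0 )
        {
            unsigned long i;

            /* Keep MFN 0 off the free lists: leak page 0 on purpose
             * and free the rest of the chunk as order-0 pages. */
            for ( i = 1; i < (1UL << order); i++ )
                free_heap_pages_sketch(mfn + i, 0);
            return;
        }

        /* ... usual buddy insertion and merging ... */
    }

The recursion never goes more than one level deep, since every page
other than MFN 0 takes the normal path.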
I'll have to think about the third a little longer... one option might
be to declare that it's OK to call free_heap_pages() on a *single* page
that was allocated from the boot allocator, but not higher orders (in
case they cross nodes). That would at least cover the specific concern
about page table pages and vm_init()/vmap().
Hi,

Answering on this thread as well, so it is easier to track.

On 03/02/2020 16:37, David Woodhouse wrote:
> On Mon, 2020-02-03 at 14:00 +0000, Julien Grall wrote:
>> Hi David,
>>
>> On 01/02/2020 00:33, David Woodhouse wrote:
>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>
>> I am a bit concerned with this change, particularly the consequences it
>> has for the page-tables. There is an assumption that intermediate
>> page-tables allocated via the boot allocator will never be freed.
>>
>> On x86, a call to vunmap() will not free page-tables, but a subsequent
>> call to vmap() may free them depending on the mapping size. So we would
>> call free_domheap_pages() rather than init_heap_pages().
>>
>> I am not entirely sure what the full consequences are, but I think this
>> calls for investigation, and for a summary to be written down in the
>> commit message.
>
> This isn't just about page tables, right? It's about *any* allocation
> given out by the boot allocator, being freed with free_heap_pages()?

That's correct.

> Given the amount of code that has conditionals in both alloc and free
> paths along the lines of…
>
>     if (system_state > SYS_STATE_boot)
>         use xenheap
>     else
>         use boot allocator
>
> … I'm not sure I'd really trust the assumption that such a thing never
> happens; that no pages are ever allocated from the boot allocator and
> then freed into the heap.

I believe this assumption holds on Arm because page-tables are never
freed so far. The only other case is in the ACPI code
(acpi_os_free_memory()), but I think it would result in a leak instead
(see is_xmalloc_memory()).

I hadn't realized that this assumption does not hold on x86 :(.
Actually, from the discussion on the MFN 0 thread, it looks like x86 is
abusing quite a few interfaces in page alloc. So as this is nothing
new, this is probably yet another good future clean-up/hardening, as
long as the assumption holds outside of x86.

Cheers,

--
Julien Grall
On Mon, Feb 3, 2020 at 4:37 PM David Woodhouse <dwmw2@infradead.org> wrote:
>
> On Mon, 2020-02-03 at 14:00 +0000, Julien Grall wrote:
> > Hi David,
> >
> > On 01/02/2020 00:33, David Woodhouse wrote:
> > > From: David Woodhouse <dwmw@amazon.co.uk>
> >
> > I am a bit concerned with this change, particularly the consequences it
> > has for the page-tables. There is an assumption that intermediate
> > page-tables allocated via the boot allocator will never be freed.
> >
> > On x86, a call to vunmap() will not free page-tables, but a subsequent
> > call to vmap() may free them depending on the mapping size. So we would
> > call free_domheap_pages() rather than init_heap_pages().
> >
> > I am not entirely sure what the full consequences are, but I think this
> > calls for investigation, and for a summary to be written down in the
> > commit message.
>
> This isn't just about page tables, right? It's about *any* allocation
> given out by the boot allocator, being freed with free_heap_pages()?
>
> Given the amount of code that has conditionals in both alloc and free
> paths along the lines of…
>
>     if (system_state > SYS_STATE_boot)
>         use xenheap
>     else
>         use boot allocator
>
> … I'm not sure I'd really trust the assumption that such a thing never
> happens; that no pages are ever allocated from the boot allocator and
> then freed into the heap.
>
> In fact it does work fine except for some esoteric corner cases,
> because init_heap_pages() is mostly just a trivial loop over
> free_heap_pages().
>
> The corner cases are if you call free_heap_pages() on boot-allocated
> memory which matches one or more of the following criteria:
>
> • Includes MFN #0,
>
> • Includes the first page the heap has seen on a given node, so
>   init_node_heap() has to be called, or
>
> • High-order allocations crossing from one node to another.

I was asked to forward a message relating to MFN 0 and allocations
crossing zones from a private discussion on the security list:

8<---

> I am having difficulty seeing how invalidating MFN0 would solve the issue here.
> The zone number for a specific page is calculated from the most significant bit
> position set in its MFN. As a result, each successive zone contains an order of
> magnitude more pages. You would need to invalidate the first or last MFN in each
> zone.

Because (unless Jan and I are reading the code wrong):

* Chunks can only be merged such that they end up on order-boundaries.
* Chunks can only be merged if they are the same order.
* Zone boundaries are on order boundaries.

So say you're freeing mfn 0x100, and mfn 0xff is free. In that loop,
(1 << order) & mfn will always be 0, so it will always only look
"forward" for things to merge, not backwards.

Suppose on the other hand, that you're freeing mfn 0x101, and 0x98
through 0x100 are free. The loop will look "backwards" and merge with
0x100; but then it will look "forwards" again.

Now suppose you've merged 0x100-0x1ff, and the order moves up to size
0x100. Now the mask becomes 0x1ff; so it can't merge with 0x200-0x2ff
(which would cross zones); instead it looks backwards to 0x0-0xff.

We don't think it's possible for things to be merged across zones
unless it can (say) start at 0xff, and merge all the way back to 0x0;
which can't be done if 0x0 is never on the free list.

That's the idea anyway. That would explain why we've never seen it on
x86 -- due to the way the architecture is, mfn 0 is never on the free
list.
--->8

-George
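
(To make the merging argument concrete, here is a compact model of the
coalescing loop described above. It is a simplified sketch of mine:
buddy_is_free(), the MAX_ORDER value and the exact checks are
assumptions, not Xen's real free_heap_pages() code.)

    #include <stdbool.h>

    #define MAX_ORDER 20            /* assumed upper bound on chunk order */

    bool buddy_is_free(unsigned long mfn, unsigned int order);

    /* Grow the just-freed chunk [*mfn, *mfn + 2^*order) by absorbing
     * equal-sized free buddies.  "*mfn & mask" picks the direction, so
     * the chunk always stays aligned to its own size. */
    void coalesce(unsigned long *mfn, unsigned int *order)
    {
        while ( *order < MAX_ORDER )
        {
            unsigned long mask = 1UL << *order;

            if ( *mfn & mask )
            {
                /* Buddy is the equal-sized chunk just below us. */
                if ( !buddy_is_free(*mfn - mask, *order) )
                    break;
                *mfn -= mask;                    /* merge backwards */
            }
            else
            {
                /* Buddy is the equal-sized chunk just above us. */
                if ( !buddy_is_free(*mfn + mask, *order) )
                    break;
                /* Merge forwards: the chunk keeps its own mfn. */
            }
            ++*order;
        }
    }

Note how freeing 0x100 has (mfn & mask) == 0 for every order below 8,
so the loop only probes forward; only once the chunk has grown to
order 8 (mask 0x100) can it look backwards, and the candidate buddy is
then 0x0-0xff, which can never be fully free if MFN 0 is kept off the
free lists.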
On Tue, 2020-02-04 at 11:00 +0000, George Dunlap wrote:
> On Mon, Feb 3, 2020 at 4:37 PM David Woodhouse <dwmw2@infradead.org> wrote:
> >
> > On Mon, 2020-02-03 at 14:00 +0000, Julien Grall wrote:
> > > Hi David,
> > >
> > > On 01/02/2020 00:33, David Woodhouse wrote:
> > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > >
> > > I am a bit concerned with this change, particularly the consequences it
> > > has for the page-tables. There is an assumption that intermediate
> > > page-tables allocated via the boot allocator will never be freed.
> > >
> > > On x86, a call to vunmap() will not free page-tables, but a subsequent
> > > call to vmap() may free them depending on the mapping size. So we would
> > > call free_domheap_pages() rather than init_heap_pages().
> > >
> > > I am not entirely sure what the full consequences are, but I think this
> > > calls for investigation, and for a summary to be written down in the
> > > commit message.
> >
> > This isn't just about page tables, right? It's about *any* allocation
> > given out by the boot allocator, being freed with free_heap_pages()?
> >
> > Given the amount of code that has conditionals in both alloc and free
> > paths along the lines of…
> >
> >     if (system_state > SYS_STATE_boot)
> >         use xenheap
> >     else
> >         use boot allocator
> >
> > … I'm not sure I'd really trust the assumption that such a thing never
> > happens; that no pages are ever allocated from the boot allocator and
> > then freed into the heap.
> >
> > In fact it does work fine except for some esoteric corner cases,
> > because init_heap_pages() is mostly just a trivial loop over
> > free_heap_pages().
> >
> > The corner cases are if you call free_heap_pages() on boot-allocated
> > memory which matches one or more of the following criteria:
> >
> > • Includes MFN #0,
> >
> > • Includes the first page the heap has seen on a given node, so
> >   init_node_heap() has to be called, or
> >
> > • High-order allocations crossing from one node to another.
>
> I was asked to forward a message relating to MFN 0 and allocations
> crossing zones from a private discussion on the security list:
>
> 8<---
>
> > I am having difficulty seeing how invalidating MFN0 would solve the issue here.
> > The zone number for a specific page is calculated from the most significant bit
> > position set in its MFN. As a result, each successive zone contains an order of
> > magnitude more pages. You would need to invalidate the first or last MFN in each
> > zone.
>
> Because (unless Jan and I are reading the code wrong):
>
> * Chunks can only be merged such that they end up on order-boundaries.
> * Chunks can only be merged if they are the same order.
> * Zone boundaries are on order boundaries.
>
> So say you're freeing mfn 0x100, and mfn 0xff is free. In that loop,
> (1 << order) & mfn will always be 0, so it will always only look
> "forward" for things to merge, not backwards.
>
> Suppose on the other hand, that you're freeing mfn 0x101, and 0x98
> through 0x100 are free. The loop will look "backwards" and merge with
> 0x100; but then it will look "forwards" again.
>
> Now suppose you've merged 0x100-0x1ff, and the order moves up to size
> 0x100. Now the mask becomes 0x1ff; so it can't merge with 0x200-0x2ff
> (which would cross zones); instead it looks backwards to 0x0-0xff.
>
> We don't think it's possible for things to be merged across zones
> unless it can (say) start at 0xff, and merge all the way back to 0x0;
> which can't be done if 0x0 is never on the free list.
>
> That's the idea anyway. That would explain why we've never seen it on
> x86 -- due to the way the architecture is, mfn 0 is never on the free
> list.
>
> --->8

Thanks.

I still don't really get it. What if the zone boundary is at MFN 0x300?

What prevents the buddy allocator from merging a range at 0x200-0x2FF
with another from 0x300-0x3FF, creating a single range 0x200-0x400
which crosses nodes?

The MFN0 trick only works if all zone boundaries must be at an address
which is 2ⁿ, doesn't it? Is that always true?
On Tue, 2020-02-04 at 11:06 +0000, David Woodhouse wrote:
>
> I still don't really get it. What if the zone boundary is at MFN 0x300?
>
> What prevents the buddy allocator from merging a range at 0x200-0x2FF
> with another from 0x300-0x3FF, creating a single range 0x200-0x400
> which crosses nodes?
>
> The MFN0 trick only works if all zone boundaries must be at an address
> which is 2ⁿ, doesn't it? Is that always true?

Ah yes, it is. And you called it out explicitly. Sorry. I think I was
conflating NUMA nodes with the allocator zones.

On the plus side, this is a whole lot easier for the boot allocator to
cope with. It's easy enough to avoid handing out any high-order
allocations from the boot allocator which would cross a *zone*
boundary. It's crossing *node* boundaries which I was thinking of,
which would be harder before we've even parsed the SRAT...
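(For reference, the power-of-two property follows from how the zone is
derived from the most significant set bit of a page's MFN, as stated in
the forwarded message above. A sketch of my own of that idea; the real
page_to_zone() macro in xen/common/page_alloc.c differs in detail:)

    unsigned int zone_of_mfn(unsigned long mfn)
    {
        unsigned int z = 0;

        while ( mfn )          /* effectively flsl(mfn) */
        {
            mfn >>= 1;
            z++;
        }
        return z ? z : 1;      /* MFN 0 would share zone 1 with MFN 1 */
    }

    /* Zone z covers MFNs [2^(z-1), 2^z), so every boundary is a power
     * of two and a size-aligned buddy chunk cannot straddle it.  A
     * boundary at MFN 0x300 cannot happen: 0x200-0x3ff all have the
     * same top bit and therefore land in the same zone. */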
On 01.02.2020 01:33, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
A word on the "why" would be nice, just like Wei has in the previous
patch. Besides that, just two cosmetic requests:
> --- a/xen/common/vmap.c
> +++ b/xen/common/vmap.c
> @@ -68,7 +68,7 @@ static void *vm_alloc(unsigned int nr, unsigned int align,
>      spin_lock(&vm_lock);
>      for ( ; ; )
>      {
> -        struct page_info *pg;
> +        mfn_t mfn;
>
>          ASSERT(vm_low[t] == vm_top[t] || !test_bit(vm_low[t], vm_bitmap(t)));
>          for ( start = vm_low[t]; start < vm_top[t]; )
> @@ -103,9 +103,17 @@ static void *vm_alloc(unsigned int nr, unsigned int align,
>          if ( vm_top[t] >= vm_end[t] )
>              return NULL;
>
> -        pg = alloc_domheap_page(NULL, 0);
> -        if ( !pg )
> -            return NULL;
> +        if ( system_state == SYS_STATE_early_boot )
> +        {
> +            mfn = alloc_boot_pages(1, 1);
> +        }

Please omit the braces in cases like this one.

> +        else
> +        {
> +            struct page_info *pg = alloc_domheap_page(NULL, 0);
> +            if ( !pg )

Please have a blank line between declaration(s) and statement(s).
Jan