[Xen-devel] [PATCH v2] x86/dom0: improve PVH initrd and metadata placement

Roger Pau Monne posted 1 patch 4 years, 1 month ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/xen tags/patchew/20200303115253.47449-1-roger.pau@citrix.com
There is a newer version of this series
xen/arch/x86/hvm/dom0_build.c | 58 ++++++++++++++++++++++++++++++++++-
1 file changed, 57 insertions(+), 1 deletion(-)
[Xen-devel] [PATCH v2] x86/dom0: improve PVH initrd and metadata placement
Posted by Roger Pau Monne 4 years, 1 month ago
Don't assume there's going to be enough space at the tail of the
loaded kernel and instead try to find a suitable memory area where the
initrd and metadata can be loaded.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Calculate end of e820 entry earlier.
 - Only check if the end of the region is < 1MB.
 - Check for range overlaps with the kernel region.
 - Check the region is of type RAM.
 - Fix off-by-one checks in range overlaps.
 - Add a comment about why initrd and metadata is placed together.
 - Add parentheses around size calculations.
---
 xen/arch/x86/hvm/dom0_build.c | 58 ++++++++++++++++++++++++++++++++++-
 1 file changed, 57 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index eded87eaf5..33520ec1bc 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d)
 #undef MB1_PAGES
 }
 
+static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf,
+                           size_t size)
+{
+    paddr_t kernel_start = (paddr_t)elf->dest_base;
+    paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size);
+    unsigned int i;
+
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size;
+
+        /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */
+        if ( end <= MB(1) || d->arch.e820[i].type != E820_RAM )
+            continue;
+
+        start = MAX(ROUNDUP(d->arch.e820[i].addr, PAGE_SIZE), MB(1));
+
+        if ( end <= kernel_start || start >= kernel_end )
+            ; /* No overlap, nothing to do. */
+        /* Deal with the kernel already being loaded in the region. */
+        else if ( kernel_start <= start && kernel_end > start )
+            /* Truncate the start of the region. */
+            start = ROUNDUP(kernel_end, PAGE_SIZE);
+        else if ( kernel_start <= end && kernel_end > end )
+            /* Truncate the end of the region. */
+            end = kernel_start;
+        /* Pick the biggest of the split regions. */
+        else if ( kernel_start - start > end - kernel_end )
+            end = kernel_start;
+        else
+            start = ROUNDUP(kernel_end, PAGE_SIZE);
+
+        if ( end - start >= size )
+            return start;
+    }
+
+    return INVALID_PADDR;
+}
+
 static int __init pvh_load_kernel(struct domain *d, const module_t *image,
                                   unsigned long image_headroom,
                                   module_t *initrd, void *image_base,
@@ -546,7 +585,24 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image,
         return rc;
     }
 
-    last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE);
+    /*
+     * Find a RAM region big enough (and that doesn't overlap with the loaded
+     * kernel) in order to load the initrd and the metadata. Note it could be
+     * split into smaller allocations, done it as a single region in order to
+     * simplify it.
+     */
+    last_addr = find_memory(d, &elf, sizeof(start_info) +
+                            (initrd ? ROUNDUP(initrd->mod_end, PAGE_SIZE) +
+                                      sizeof(mod)
+                                    : 0) +
+                            (cmdline ? ROUNDUP(strlen(cmdline) + 1,
+                                               elf_64bit(&elf) ? 8 : 4)
+                                     : 0));
+    if ( last_addr == INVALID_PADDR )
+    {
+        printk("Unable to find a memory region to load initrd and metadata\n");
+        return -ENOMEM;
+    }
 
     if ( initrd != NULL )
     {
-- 
2.25.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH v2] x86/dom0: improve PVH initrd and metadata placement
Posted by Jan Beulich 4 years, 1 month ago
On 03.03.2020 12:52, Roger Pau Monne wrote:
> --- a/xen/arch/x86/hvm/dom0_build.c
> +++ b/xen/arch/x86/hvm/dom0_build.c
> @@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d)
>  #undef MB1_PAGES
>  }
>  
> +static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf,
> +                           size_t size)
> +{
> +    paddr_t kernel_start = (paddr_t)elf->dest_base;
> +    paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size);
> +    unsigned int i;
> +
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size;
> +
> +        /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */
> +        if ( end <= MB(1) || d->arch.e820[i].type != E820_RAM )
> +            continue;
> +
> +        start = MAX(ROUNDUP(d->arch.e820[i].addr, PAGE_SIZE), MB(1));
> +
> +        if ( end <= kernel_start || start >= kernel_end )
> +            ; /* No overlap, nothing to do. */
> +        /* Deal with the kernel already being loaded in the region. */
> +        else if ( kernel_start <= start && kernel_end > start )

Since, according to your reply on v1, [kernel_start,kernel_end) is
a subset of [start,end), I understand that the <= could equally
well be == - do you agree? From this then ...

> +            /* Truncate the start of the region. */
> +            start = ROUNDUP(kernel_end, PAGE_SIZE);
> +        else if ( kernel_start <= end && kernel_end > end )

... it follows that you now have two off-by-1s here, as you changed
the right side of the && instead of the left one (the right side
could, as per above, use == again). Using == in both places would,
in lieu of a comment, imo make more visible to the reader that
there is this sub-range relationship between both ranges.

> +            /* Truncate the end of the region. */
> +            end = kernel_start;
> +        /* Pick the biggest of the split regions. */

Then again - wouldn't this part suffice? if start == kernel_start
or end == kernel_end, one side of the "split" region would simply
be empty.

> +        else if ( kernel_start - start > end - kernel_end )
> +            end = kernel_start;
> +        else
> +            start = ROUNDUP(kernel_end, PAGE_SIZE);
> +
> +        if ( end - start >= size )
> +            return start;
> +    }
> +
> +    return INVALID_PADDR;
> +}
> +
>  static int __init pvh_load_kernel(struct domain *d, const module_t *image,
>                                    unsigned long image_headroom,
>                                    module_t *initrd, void *image_base,
> @@ -546,7 +585,24 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image,
>          return rc;
>      }
>  
> -    last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE);
> +    /*
> +     * Find a RAM region big enough (and that doesn't overlap with the loaded
> +     * kernel) in order to load the initrd and the metadata. Note it could be
> +     * split into smaller allocations, done it as a single region in order to
> +     * simplify it.

I guess either "done" without "it" or "doing it"?

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH v2] x86/dom0: improve PVH initrd and metadata placement
Posted by Roger Pau Monné 4 years, 1 month ago
On Tue, Mar 03, 2020 at 04:40:36PM +0100, Jan Beulich wrote:
> On 03.03.2020 12:52, Roger Pau Monne wrote:
> > --- a/xen/arch/x86/hvm/dom0_build.c
> > +++ b/xen/arch/x86/hvm/dom0_build.c
> > @@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d)
> >  #undef MB1_PAGES
> >  }
> >  
> > +static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf,
> > +                           size_t size)
> > +{
> > +    paddr_t kernel_start = (paddr_t)elf->dest_base;
> > +    paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size);
> > +    unsigned int i;
> > +
> > +    for ( i = 0; i < d->arch.nr_e820; i++ )
> > +    {
> > +        paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size;
> > +
> > +        /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */
> > +        if ( end <= MB(1) || d->arch.e820[i].type != E820_RAM )
> > +            continue;
> > +
> > +        start = MAX(ROUNDUP(d->arch.e820[i].addr, PAGE_SIZE), MB(1));
> > +
> > +        if ( end <= kernel_start || start >= kernel_end )
> > +            ; /* No overlap, nothing to do. */
> > +        /* Deal with the kernel already being loaded in the region. */
> > +        else if ( kernel_start <= start && kernel_end > start )
> 
> Since, according to your reply on v1, [kernel_start,kernel_end) is
> a subset of [start,end), I understand that the <= could equally
> well be == - do you agree? From this then ...
> 
> > +            /* Truncate the start of the region. */
> > +            start = ROUNDUP(kernel_end, PAGE_SIZE);
> > +        else if ( kernel_start <= end && kernel_end > end )
> 
> ... it follows that you now have two off-by-1s here, as you changed
> the right side of the && instead of the left one (the right side
> could, as per above, use == again). Using == in both places would,
> in lieu of a comment, imo make more visible to the reader that
> there is this sub-range relationship between both ranges.

Right, I agree to both the above and have adjusted the conditions.

> > +            /* Truncate the end of the region. */
> > +            end = kernel_start;
> > +        /* Pick the biggest of the split regions. */
> 
> Then again - wouldn't this part suffice? if start == kernel_start
> or end == kernel_end, one side of the "split" region would simply
> be empty.

That's why it's using an else if construct, so that we only get
here if the kernel is loaded in the middle of the region, and there
are two regions left as part of the split.

> 
> > +        else if ( kernel_start - start > end - kernel_end )
> > +            end = kernel_start;
> > +        else
> > +            start = ROUNDUP(kernel_end, PAGE_SIZE);
> > +
> > +        if ( end - start >= size )
> > +            return start;
> > +    }
> > +
> > +    return INVALID_PADDR;
> > +}
> > +
> >  static int __init pvh_load_kernel(struct domain *d, const module_t *image,
> >                                    unsigned long image_headroom,
> >                                    module_t *initrd, void *image_base,
> > @@ -546,7 +585,24 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image,
> >          return rc;
> >      }
> >  
> > -    last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE);
> > +    /*
> > +     * Find a RAM region big enough (and that doesn't overlap with the loaded
> > +     * kernel) in order to load the initrd and the metadata. Note it could be
> > +     * split into smaller allocations, done it as a single region in order to
> > +     * simplify it.
> 
> I guess either "done" without "it" or "doing it"?

Fixed, thanks.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH v2] x86/dom0: improve PVH initrd and metadata placement
Posted by Jan Beulich 4 years, 1 month ago
On 04.03.2020 10:53, Roger Pau Monné wrote:
> On Tue, Mar 03, 2020 at 04:40:36PM +0100, Jan Beulich wrote:
>> On 03.03.2020 12:52, Roger Pau Monne wrote:
>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>> @@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d)
>>>  #undef MB1_PAGES
>>>  }
>>>  
>>> +static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf,
>>> +                           size_t size)
>>> +{
>>> +    paddr_t kernel_start = (paddr_t)elf->dest_base;
>>> +    paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size);
>>> +    unsigned int i;
>>> +
>>> +    for ( i = 0; i < d->arch.nr_e820; i++ )
>>> +    {
>>> +        paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size;
>>> +
>>> +        /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */
>>> +        if ( end <= MB(1) || d->arch.e820[i].type != E820_RAM )
>>> +            continue;
>>> +
>>> +        start = MAX(ROUNDUP(d->arch.e820[i].addr, PAGE_SIZE), MB(1));
>>> +
>>> +        if ( end <= kernel_start || start >= kernel_end )
>>> +            ; /* No overlap, nothing to do. */
>>> +        /* Deal with the kernel already being loaded in the region. */
>>> +        else if ( kernel_start <= start && kernel_end > start )
>>
>> Since, according to your reply on v1, [kernel_start,kernel_end) is
>> a subset of [start,end), I understand that the <= could equally
>> well be == - do you agree? From this then ...
>>
>>> +            /* Truncate the start of the region. */
>>> +            start = ROUNDUP(kernel_end, PAGE_SIZE);
>>> +        else if ( kernel_start <= end && kernel_end > end )
>>
>> ... it follows that you now have two off-by-1s here, as you changed
>> the right side of the && instead of the left one (the right side
>> could, as per above, use == again). Using == in both places would,
>> in lieu of a comment, imo make more visible to the reader that
>> there is this sub-range relationship between both ranges.
> 
> Right, I agree to both the above and have adjusted the conditions.
> 
>>> +            /* Truncate the end of the region. */
>>> +            end = kernel_start;
>>> +        /* Pick the biggest of the split regions. */
>>
>> Then again - wouldn't this part suffice? if start == kernel_start
>> or end == kernel_end, one side of the "split" region would simply
>> be empty.
> 
> That's why it's using an else if construct, so that we only get
> here if the kernel is loaded in the middle of the region, and there
> are two regions left as part of the split.

But my question is - do we really need the earlier parts of
this if/else-if chain? Won't this latter part deal find with
zero-sized ranges at head or tail of the region?

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH v2] x86/dom0: improve PVH initrd and metadata placement
Posted by Roger Pau Monné 4 years, 1 month ago
On Wed, Mar 04, 2020 at 11:00:18AM +0100, Jan Beulich wrote:
> On 04.03.2020 10:53, Roger Pau Monné wrote:
> > On Tue, Mar 03, 2020 at 04:40:36PM +0100, Jan Beulich wrote:
> >> On 03.03.2020 12:52, Roger Pau Monne wrote:
> >>> --- a/xen/arch/x86/hvm/dom0_build.c
> >>> +++ b/xen/arch/x86/hvm/dom0_build.c
> >>> @@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d)
> >>>  #undef MB1_PAGES
> >>>  }
> >>>  
> >>> +static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf,
> >>> +                           size_t size)
> >>> +{
> >>> +    paddr_t kernel_start = (paddr_t)elf->dest_base;
> >>> +    paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size);
> >>> +    unsigned int i;
> >>> +
> >>> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> >>> +    {
> >>> +        paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size;
> >>> +
> >>> +        /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */
> >>> +        if ( end <= MB(1) || d->arch.e820[i].type != E820_RAM )
> >>> +            continue;
> >>> +
> >>> +        start = MAX(ROUNDUP(d->arch.e820[i].addr, PAGE_SIZE), MB(1));
> >>> +
> >>> +        if ( end <= kernel_start || start >= kernel_end )
> >>> +            ; /* No overlap, nothing to do. */
> >>> +        /* Deal with the kernel already being loaded in the region. */
> >>> +        else if ( kernel_start <= start && kernel_end > start )
> >>
> >> Since, according to your reply on v1, [kernel_start,kernel_end) is
> >> a subset of [start,end), I understand that the <= could equally
> >> well be == - do you agree? From this then ...
> >>
> >>> +            /* Truncate the start of the region. */
> >>> +            start = ROUNDUP(kernel_end, PAGE_SIZE);
> >>> +        else if ( kernel_start <= end && kernel_end > end )
> >>
> >> ... it follows that you now have two off-by-1s here, as you changed
> >> the right side of the && instead of the left one (the right side
> >> could, as per above, use == again). Using == in both places would,
> >> in lieu of a comment, imo make more visible to the reader that
> >> there is this sub-range relationship between both ranges.
> > 
> > Right, I agree to both the above and have adjusted the conditions.
> > 
> >>> +            /* Truncate the end of the region. */
> >>> +            end = kernel_start;
> >>> +        /* Pick the biggest of the split regions. */
> >>
> >> Then again - wouldn't this part suffice? if start == kernel_start
> >> or end == kernel_end, one side of the "split" region would simply
> >> be empty.
> > 
> > That's why it's using an else if construct, so that we only get
> > here if the kernel is loaded in the middle of the region, and there
> > are two regions left as part of the split.
> 
> But my question is - do we really need the earlier parts of
> this if/else-if chain? Won't this latter part deal find with
> zero-sized ranges at head or tail of the region?

Oh, I misread your reply sorry. Yes you are right, I can achieve the
same just with this last part.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH v2] x86/dom0: improve PVH initrd and metadata placement
Posted by Andrew Cooper 4 years, 1 month ago
On 03/03/2020 11:52, Roger Pau Monne wrote:
> Don't assume there's going to be enough space at the tail of the
> loaded kernel and instead try to find a suitable memory area where the
> initrd and metadata can be loaded.
>
> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

I can confirm that this fixes the "failed to boot PVH" on my Rome system.

Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>

We've still got the excessive-time-to-construct issues to look at, but
this definitely brings things to a better position.

> diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
> index eded87eaf5..33520ec1bc 100644
> --- a/xen/arch/x86/hvm/dom0_build.c
> +++ b/xen/arch/x86/hvm/dom0_build.c
> @@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d)
>  #undef MB1_PAGES
>  }
>  
> +static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf,
> +                           size_t size)
> +{
> +    paddr_t kernel_start = (paddr_t)elf->dest_base;
> +    paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size);
> +    unsigned int i;
> +
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size;
> +
> +        /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */

The BDA is in mfn 0 so is special for other reasons*.  The EBDA and IBFT
are the problem datastructures.

~Andrew

[*] Thinking about it, how should a PVH hardware domain reconcile its
paravirtualised boot with finding itself on a BIOS or EFI system...

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH v2] x86/dom0: improve PVH initrd and metadata placement
Posted by Roger Pau Monné 4 years, 1 month ago
On Tue, Mar 03, 2020 at 06:47:35PM +0000, Andrew Cooper wrote:
> On 03/03/2020 11:52, Roger Pau Monne wrote:
> > Don't assume there's going to be enough space at the tail of the
> > loaded kernel and instead try to find a suitable memory area where the
> > initrd and metadata can be loaded.
> >
> > Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> I can confirm that this fixes the "failed to boot PVH" on my Rome system.
> 
> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>

Thanks!

> We've still got the excessive-time-to-construct issues to look at, but
> this definitely brings things to a better position.

Well, PV is always going to be faster to construct, since you only
need to allocate memory and create the initial page tables that cover
the kernel, the metadata and optionally the initrd IIRC.

On PVH we need to populate the full p2m, so I think it's safe to say
that PVH build time is always going to be worse than PV. That doesn't
mean we can't make it faster.
I have to 
Since I also have an AMD box that I can play with, how much memory are
you assigning to dom0?

> > diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
> > index eded87eaf5..33520ec1bc 100644
> > --- a/xen/arch/x86/hvm/dom0_build.c
> > +++ b/xen/arch/x86/hvm/dom0_build.c
> > @@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d)
> >  #undef MB1_PAGES
> >  }
> >  
> > +static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf,
> > +                           size_t size)
> > +{
> > +    paddr_t kernel_start = (paddr_t)elf->dest_base;
> > +    paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size);
> > +    unsigned int i;
> > +
> > +    for ( i = 0; i < d->arch.nr_e820; i++ )
> > +    {
> > +        paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size;
> > +
> > +        /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */
> 
> The BDA is in mfn 0 so is special for other reasons*.  The EBDA and IBFT
> are the problem datastructures.

Sure. Sorry I haven't added it to the comment.

> ~Andrew
> 
> [*] Thinking about it, how should a PVH hardware domain reconcile its
> paravirtualised boot with finding itself on a BIOS or EFI system...

I guess the same applies to PV which also boots using a PV path but
has access to firmware.

I have to admit I never looked closely at how Linux does that, for
FreeBSD it's fairly easy because at least when booting from BIOS the
kernel won't try to make any calls into the BIOS, and instead expect
the data it requires to be provided by the loader as part of the
metadata blob, together with the modules &c.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel